EnsemblCompara GeneTrees: Complete,
duplication-aware phylogenetic trees in vertebrates
Albert J. Vilella,1 Jessica Severin,1,3 Abel Ureta-Vidal,1,4 Li Heng,2 Richard Durbin,2
and Ewan Birney1,5
1EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom; 2Wellcome Trust Sanger Institute,
Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1HH, United Kingdom
We have developed a comprehensive gene-oriented phylogenetic resource, EnsemblCompara GeneTrees, based on
a computational pipeline to handle clustering, multiple alignment, and tree generation, including the handling of large
gene families. We developed two novel non-sequence-based metrics of gene tree correctness and benchmarked a number
of tree methods. The TreeBeST method from TreeFam shows the best performance in our hands. We also compared this
phylogenetic approach to clustering approaches for ortholog prediction, showing a large increase in coverage using the
phylogenetic approach. All data are made available in a number of formats and will be kept up to date with the Ensembl project.
[Supplemental material is available online at www.genome.org.]
The use of phylogenetic trees to describe the evolution of bi-
ological processes was established in the 1950s (Hennig 1952) and
remains a fundamental approach to understanding the evolution
of individual genes through to complete genomes; for example, in
the mouse (Mouse Genome Sequencing Consortium 2002), rat
(Gibbs et al. 2004), chicken (International Chicken Genome Se-
quencing Consortium 2004), and monodelphis (Mikkelsen et al.
2007) genome papers, and numerous papers on individual
sequences. Now routine, the determination of vertebrate genome
sequences provides a rich data source to understand evolution,
and using phylogenetic trees of the genes is one of the best ways to
organize these data. However, the increased set of genomes makes
the compute and engineering tasks to form all the gene trees
progressively more complex and harder for individual groups to
use. The Ensembl project provides an accurate and consistent
protein-coding gene set for all vertebrate genomes (International
Human Genome Sequencing Consortium 2001; Dehal et al. 2002;
Mouse Genome Sequencing Consortium 2002; Gibbs et al. 2004;
Xie et al. 2005; Mikkelsen et al. 2007; Rhesus Macaque Genome
Sequencing and Analysis Consortium 2007). Previously (until
April 2006), Ensembl provided a basic method for tracing ortho-
logs via the Best Reciprocal BLAST method, similar to approaches
used in other genome analyses, such as Drosophila melanogaster
(Adams et al. 2000) or human (International Human Genome
Sequencing Consortium 2001). In June 2006 (Hubbard et al.
2007), we replaced this system with a phylogenetically sound,
gene tree-based approach, providing a complete set of phyloge-
netic trees spanning 91% of genes across vertebrates. In addition
to the vertebrates we have included a few important non-verte-
brate species (fly, worm, and yeast) to act both as out groups and
provide links to these model organisms. In this paper we provide
and document the display and access methods for these trees.
There have been a number of methods proposed for routine
generation of genomewide orthology descriptions, including
Inparanoid (Remm et al. 2001), MSOAR (Fu et al. 2007), OrthoMCL
(Li et al. 2003), HomoloGene (Wheeler et al. 2008), TreeFam (Li
et al. 2006), PhyOP (Goodstadt and Ponting 2006), and PhiGs
(Dehal and Boore 2006). The first four, Inparanoid, MSOAR,
OrthoMCL, and HomoloGene, focus on providing clusters (or
linked clusters) of genes, without an explicit tree topology. PhyOP
(Goodstadt and Ponting 2006) uses a tree-based method, but be-
tween pairs of closely related species, resolving paralogs accurately
by using neutral substitution (as measured by dS, the synonymous
substitution rate). TreeFam provides an explicit gene tree across
multiple species, using dS, dN (nonsynonymous substitution
rate), nucleotide and protein distance measures, and the
standard species tree to balance duplications vs. deletions to in-
form the tree construction, using the program TreeBeST (http://
treesoft.sourceforge.net/treebest.shtml; L. Heng, A.J. Vilella, E.
Birney, and R. Durbin, in prep.).
The PhiGs method (Dehal and Boore 2006) is a leading
phylogenetic-based method that produced a comprehensive
phylogenetic resource for the genomes at the time it was run, and
the basic outline of its analysis, which was clustering of protein
sequences, followed by phylogenetic trees, is similar to the
method presented here. However, the PhiGs resource covered
a smaller number of species (23 vs. 45) and has been difficult to
keep up to date with the advances in gene sets and genomes.
Another major difference between PhiGs-based phylogenetic trees
and the phylogenetic trees presented here is that the former was
calculated using a single maximum likelihood method based on
protein evolution. In contrast, the Ensembl gene trees are calcu-
lated using a new method, TreeBeST, which integrates multiple
tree topologies, in particular both DNA level and protein level
models and combines this with a species-tree-aware penalization
of topologies that are inconsistent with known species relationships.
We show in this paper that this method produces trees
that are more consistent with synteny relationships and fewer
anomalous topologies than single protein-based phylogenetic
methods.
3RIKEN Yokohama Institute, Genomic Sciences Center (GSC), 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan; 4Eagle Genomics, 19 Forge End, Stapleford, Cambridge CB22 5BN, UK.
E-mail firstname.lastname@example.org; fax 44-1223-494919.
Article published online before print. Article and publication date are at
online through the Genome Research Open Access option.
19:327–335 © 2009 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/09; www.genome.org
There are also many single phylogenetic tree-building
approaches, many of them based on maximum likelihood meth-
ods; one leading method is PhyML (Guindon and Gascuel 2003).
It is unclear what is the best method to use, in particular in the
context of genome-wide tree building with constraints on com-
putational costs and the need to robustly handle many complex
scenarios usually involving large families with heterogeneous
phylogenetic depths. In this paper, we benchmark the tree programs
TreeBeST and PhyML in vertebrates, and compare the resulting trees to
basic best reciprocal hit (BRH) methods and cluster frameworks,
in particular Inparanoid and HomoloGene. We also benchmark against
a recent PhyOP data set. The PhyOP pipeline has recently switched
to use the same tree-building program (TreeBeST) that we use, but
differs in its input clusters. Although we adopted this same tree-
building method, we describe here considerable novel engineering
in the deployment of these methods across all vertebrates. Similar
to the PhiGs resource, we have used the dense coverage of
genomes to provide topologically based timings (i.e., the standard
use of outgroups vs. subsequent lineages to bracket a duplication),
in order to label duplication events.
A robust, computationally efficient pipeline for gene
We have built a fault-tolerant pipeline to run our orthology and
paralogy gene prediction analysis using TreeFam methodology.
The fault-tolerance works at two levels: first, we use a robust
compute scheduling engine (in our case, LSF, though other pack-
ages could substitute for this component) to schedule jobs, but
even with the use of LSF’s scheduling and job recovery, there can
be periodic network or disk failures, which result in apparent
successful LSF completion without data being successfully stored.
Our experience is that a second level of data tracking is required, in
particular due to the complex interdependence on compute
results in the pipeline, which is hard to express as a single static LSF-
based set of dependencies. Finally, the pipeline allows aggregation
of multiple highly similar compute tasks (in our case, BLAST
comparisons) into a single LSF task, which is important to allow
the granularity of the LSF tracking component to be optimized.
The pipeline can be divided into eight main steps that are presented
in the schema in Figure 1. These eight steps are described as follows.
1. Protein data set: For each species considered in the analysis, we
only consider protein coding genes. For each gene, we only
consider the longest protein translation.
2. BLASTP all vs. all: Each protein is queried using WUBLASTP
against each individual species protein database, including its
self-species protein database.
3. Graph construction: Connections (edges) between the nodes
(proteins) are retained when they satisfy either a best reciprocal
hit (BRH) or a BLAST score ratio (BSR) over 0.33.
A BSR for two proteins, P1 and P2, is defined as
$\mathrm{score}_{P_1 P_2} / \max(\text{self-score}_{P_1}, \text{self-score}_{P_2})$.
4. Clusters: We extract from the graph the connected components
(i.e., single linkage clusters). Each connected component represents
a cluster, i.e., a gene family. If the cluster has more
than 750 members, steps 3 and 4 are repeated at higher stringency
(see below).
5. Multiple alignments: Proteins in the same cluster are aligned
using MUSCLE (Edgar 2004) to obtain a multiple alignment.
6. Gene tree and reconciliation: The CDS backtranslated protein-
based multiple alignment is used as an input to the tree pro-
gram, TreeBeST, as well as the multifurcated species tree nec-
essary for the reconciliation and the duplication calls on
7. Inference of orthologs and paralogs: As many users like to use
ortholog look-up tables, we flatten the resulting trees into
ortholog and paralog tables of pairwise relationships between
genes. In the case of paralogs, this flattening also records the
timing of the duplication due to the presence of extant species
past the duplication, and thus implicitly outgroup lineages
before the duplication (see Supplemental Fig. 1A,B for a de-
8. Pairwise dN/dS (nonsynonymous substitutions/synonymous
substitutions): We calculate the pairwise dN/dS between pairs of
genes for closely related species using codeml from the PAML
package (Yang 2007).
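Steps 3 and 4 above can be sketched in a few lines. The following is a minimal illustration and not the EnsemblCompara implementation: the function names, the input dictionaries, and the toy scores are all assumptions made for the example, and the only values taken from the text are the BSR definition and the 0.33 threshold. It assumes every protein has a self-hit score available.

```python
from collections import defaultdict

BSR_THRESHOLD = 0.33  # threshold quoted in step 3


def bsr(score, self_a, self_b):
    """BLAST score ratio: pair score over the larger of the two self-scores."""
    return score / max(self_a, self_b)


def build_edges(scores, species, threshold=BSR_THRESHOLD, use_brh=True):
    """Keep an edge when a hit is a best reciprocal hit or its BSR exceeds threshold.

    scores:  {(query, target): score}, including self-hits (p, p)  [assumed layout]
    species: {protein: species name}                               [assumed layout]
    """
    # Best hit of each query within each target species (self-hits ignored).
    best = {}
    for (q, t), s in scores.items():
        if q != t:
            key = (q, species[t])
            if key not in best or s > best[key][1]:
                best[key] = (t, s)

    edges = set()
    for (q, t), s in scores.items():
        if q == t:
            continue
        is_brh = (use_brh
                  and best.get((q, species[t]), (None, 0))[0] == t
                  and best.get((t, species[q]), (None, 0))[0] == q)
        if is_brh or bsr(s, scores[(q, q)], scores[(t, t)]) > threshold:
            edges.add(frozenset((q, t)))
    return edges


def connected_components(nodes, edges):
    """Single-linkage clusters, computed with a small union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for edge in edges:
        a, b = tuple(edge)
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for n in nodes:
        clusters[find(n)].add(n)
    return list(clusters.values())


# Toy data (hypothetical proteins and scores).
species = {"h1": "human", "h2": "human", "m1": "mouse", "m2": "mouse"}
scores = {(p, p): 100.0 for p in species}
scores.update({("h1", "m1"): 80.0, ("m1", "h1"): 80.0,
               ("h2", "m2"): 20.0, ("m2", "h2"): 20.0,
               ("h1", "m2"): 10.0, ("m2", "h1"): 10.0})
clusters = connected_components(species, build_edges(scores, species))
```

In the toy data the h2/m2 pair falls below the BSR threshold but is still linked because it is a best reciprocal hit, which illustrates why step 3 accepts either criterion.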
For the Ensembl v41 assessment, step 6 was divided into step 6a,
using PhyML (Guindon and Gascuel 2003) to build the tree, and
step 6b, using RAP (Dufayard et al. 2005) for the tree reconciliation.
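The flattening described in step 7 can be illustrated with a small sketch. This is an assumption-laden toy, not the Ensembl code: the nested-dictionary tree layout, the `event` and `taxon` keys, and the taxon labels are hypothetical, and the rule used (a pair of genes is orthologous when its last common ancestor is a speciation node and paralogous when it is a duplication node) is the standard reading of a reconciled tree.

```python
from itertools import combinations


def leaves(node):
    """Return all leaf nodes (genes) below a node."""
    if "gene" in node:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]


def flatten(node, pairs=None):
    """Label every pair of genes by the event at its last common ancestor.

    Returns {frozenset({gene_a, gene_b}): (relationship, taxon_of_ancestor)}.
    The taxon of a duplication node gives the topological timing of the event.
    """
    if pairs is None:
        pairs = {}
    if "gene" in node:
        return pairs
    kind = "ortholog" if node["event"] == "speciation" else "paralog"
    child_leaves = [leaves(child) for child in node["children"]]
    # Genes in different child subtrees have this node as last common ancestor.
    for left, right in combinations(child_leaves, 2):
        for a in left:
            for b in right:
                pairs[frozenset((a["gene"], b["gene"]))] = (kind, node["taxon"])
    for child in node["children"]:
        flatten(child, pairs)
    return pairs


# Toy reconciled tree (illustrative taxon labels): one human gene and a
# rodent-specific duplication giving two mouse genes.
tree = {
    "event": "speciation", "taxon": "Euarchontoglires",
    "children": [
        {"gene": "INS", "species": "human"},
        {"event": "duplication", "taxon": "Murinae",
         "children": [{"gene": "Ins1", "species": "mouse"},
                      {"gene": "Ins2", "species": "mouse"}]},
    ],
}
pairs = flatten(tree)
```

Here the human gene is reported as an ortholog of both mouse genes, while the two mouse genes are paralogs whose duplication is dated to the taxon of the duplication node.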
Figure 1. Computational pipeline for the EnsemblCompara process.
At the end of step 4, if the cluster is large (currently parametrized
as containing more than 750 genes), the genes in this
cluster are then reinjected into step 3 (Fig. 1, dashed lines), with
only the BSR threshold condition to satisfy. If more iterations are
necessary, the BSR threshold is increased by 0.1 at each iteration.
The same kinds of iterations are applied at the end of steps 5 and 6
when MUSCLE or the tree-building program fails to process
a cluster. This iteration procedure is effectively a hierarchical
breakdown of the initial clustering to get more fine-grained sets of
clusters that can easily be processed. This iterative approach is
critical to generate sensible trees for complex large families, such
as those of zinc finger proteins or olfactory receptors. Although it
would be desirable to place all genes from these gene families into
a comprehensive single tree, there are numerous engineering, al-
gorithmic, and display problems associated with large trees, and
the hierarchical breakdown provides a pragmatic solution for such families.
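The hierarchical breakdown of oversized clusters can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the pipeline code: it models only the reinjection stage (BSR criterion alone, starting at 0.33 and rising by 0.1 per iteration, size limit of 750 as given in the text), and the function names and the `bsr_edges` input layout are hypothetical.

```python
def components(nodes, edges):
    """Single-linkage clusters via an iterative depth-first search."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, found = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adjacency[x] - comp)
        seen |= comp
        found.append(comp)
    return found


def split_large_clusters(nodes, bsr_edges, max_size=750, start=0.33, step=0.1):
    """Re-cluster any cluster above max_size at a progressively higher BSR threshold.

    bsr_edges: {(protein_a, protein_b): BSR value}  [assumed layout]
    """
    final, pending = [], [(set(nodes), start)]
    while pending:
        members, threshold = pending.pop()
        kept = [(a, b) for (a, b), value in bsr_edges.items()
                if a in members and b in members and value > threshold]
        for comp in components(members, kept):
            # A BSR is at most 1 for ordinary scores, so stop raising there.
            if len(comp) > max_size and threshold < 1.0:
                pending.append((comp, threshold + step))
            else:
                final.append(comp)
    return final


# Toy example with a tiny size limit so the split is visible.
toy_edges = {("a", "b"): 0.9, ("b", "c"): 0.4, ("c", "d"): 0.9}
result = split_large_clusters("abcd", toy_edges, max_size=2)
```

With a limit of two members, the weak b/c link survives the first pass at 0.33 but is dropped at 0.43, so the four-member cluster breaks into two pairs.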
The TreeBeST method has two new components. First, it runs
a number of independent phylogenetic methods on the same data, in
particular DNA, codon, and protein maximum likelihood models.
Second, the TreeBeST method then creates
a combined tree using a stochastic context free grammar approach
to integrate the different tree information with a model to pe-
nalize duplications and deletions relative to a known species tree.
The result is that TreeBeST will tend to use DNA- or codon-based
methods in the parts of the phylogeny that do not have saturated
DNA mutation rates (e.g., intramammalian comparisons), but
utilizes protein information at longer distances (e.g., comparisons
between mammals and fish). We developed two metrics to assess
the different methods.
Duplication consistency score
We developed a consistency score for
proposed duplications, where we measure
the number of species shared by the two
postduplication lineages (the intersection) over the number
of species present in either (the union);
a genuine duplication
should have the gene persisting at least in
an equally likely manner in subsequent
lineages. In contrast, incorrect topologies
will often have simply reordered a deep
node leading to usually a few species in
the topologically incorrect positions;
reconciliation to the species tree then
forces the prediction of duplication followed
by extensive loss in a precisely coordinated
manner in the daughter lineages. The duplication consistency
score captures this unbalanced
nature of poor topologies as the in-
tersection in subsequent lineages is low.
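The score can be written down directly as an intersection-over-union of species sets. The sketch below assumes that the two sets are the species found in the two child subtrees of a duplication node; the function name and the example species lists are illustrative only.

```python
def duplication_consistency(left_species, right_species):
    """Species shared by both child subtrees divided by species in either one."""
    left, right = set(left_species), set(right_species)
    union = left | right
    if not union:
        raise ValueError("a duplication node needs at least one descendant species")
    return len(left & right) / len(union)


# Both copies retained in most lineages: a well-supported duplication.
balanced = duplication_consistency(
    ["human", "mouse", "rat", "dog"], ["human", "mouse", "rat"])   # 0.75

# One copy seen in a single species: suggests a misplaced branch that the
# reconciliation has explained as a duplication plus extensive loss.
unbalanced = duplication_consistency(
    ["human", "mouse", "rat", "dog"], ["dog"])                     # 0.25
```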
Figure 2 shows clearly that the PhyML/
RAP approach made many more dupli-
cation nodes compared to TreeBeST, and
the vast majority of the additional
duplication nodes have
a low duplication consistency score. This
result is unsurprising, as TreeBeST takes
the species tree as input and explicitly
penalizes both duplication and deletion
of genes; in other words, the TreeBeST
program tends to produce duplication
nodes when the gene tree has extensive
extant members on each side of the
duplication. Although this metric fundamentally reflects the
difference in methodology between PhyML, a pure sequence-
based tree, and TreeBeST, which uses the species tree as input, it is
clear that the TreeBeST results are more biologically consistent,
given the assumption that gene duplication and deletion rates are
Gene synteny metric
We also developed an alternative metric that was not confounded
by the tree methodology using the fact that gene order and ori-
entation (informally called synteny) are conserved across species.
None of the tree approaches used synteny information in the tree
construction, though the old best reciprocal hit method extended
its range using synteny information. Supplemental Figure 2 shows
the difference in results between a strict BRH approach with no
syntenic information used, PhyML/RAP and TreeBeST. In both
cases, for perfect and good syntenic genes, the TreeBeST pipeline
shows better results. PhyML/RAP gave poorer results than BRH.
We believe this was mainly due to a large number of wrong gene
tree topologies and hence difficult tree reconciliations that over-
estimated duplication events. Such overestimation led to missed
Figure 2. A diagram of the duplication consistency score on an example tree showing unlikely co-ordinated deletions on the subsequent lineages. The histogram shows the distribution of consistency scores for both PhyML/RAP and TreeBeST methods. PhyML/RAP has both a higher absolute number of duplications and far more at low consistency values.
Comparison to bootstrap metrics
support of the duplication nodes from TreeBeST. As expected,
there is a strong correlation between the bootstrap support and the
duplication consistency measure, with most of the high duplication
consistency measure scores also having high bootstrap support
(Fig. 3), and low duplication consistency measures having a vari-
ety of bootstrap scores, but nearly always below the 80% support
level. Interestingly, there was a set of duplication nodes with low
bootstrap support but high duplication consistency (Figure 3, bottom
right), but not the inverse set of nodes with high bootstrap support and low
duplication consistency. These duplication nodes were not obviously
correlated with either internal aspects of the multiple alignment
or tree (e.g., length or average distance from the duplication node
to extant species) or external properties of the genes (e.g., Gene
Ontology [GO] term distribution, Pfam domain sets or the posi-
tion of the duplication node with respect to the vertebrate tree).
This set of genes might reflect the fact that the bootstrap statistic
is about the consistency of the tree across the columns of the
multiple alignment, and this consistency measure does not nec-
essarily have to apply to every gene equally. In contrast, the
duplication consistency measure, which is a property of the be-
havior of genes post-duplication, may be more consistent across
The conclusion of this investigation is that TreeBeST was the
best of these extensively tested methods on the criteria of duplication
consistency and synteny consistency. It is hard to have
entirely objective measures of the accuracy of trees (see Discussion
below). We also briefly investigated further tree programs and fur-
ther multiple alignment programs (e.g., ClustalW), but many of
these were not robust enough to work in
a large-scale compute environment with
the complex gene families present across
vertebrates. In the future these metrics
will permit the testing of both other tree
construction programs and other multiple
alignment programs, and we will con-
tinue to test and assess new robustly
engineered programs with a good chance
of improving the trees.
External benchmarking to other
Overlap of ortholog sets
Table 1 shows the overlap of ortholog sets
when comparing EnsemblCompara GeneTrees
v45 to Inparanoid, HomoloGene, PhyOP,
or TreeFamCurated for certain pairs of
species. In all our comparisons we have
taken genes as reference, and have
counted for each gene its best homology
prediction. The ranking from best to
worst favored one-to-one orthologs over
one-to-many orthologs, and both were
favored over paralogs. When a gene was
not involved in any homology relation, it
was labeled as unclassified.
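The per-gene counting rule above amounts to keeping the highest-ranked relationship seen for each reference gene. A minimal sketch follows; the category strings, the function name, and the input layout are assumptions for illustration and do not claim to match any database schema.

```python
# Lower rank is better: one-to-one over one-to-many, both over paralogs.
RANK = {"ortholog_one2one": 0, "ortholog_one2many": 1, "paralog": 2}


def best_homology(gene_ids, homologies):
    """Return {gene: best category}, with 'unclassified' for genes with no relation.

    homologies: iterable of (gene_id, category) pairs  [assumed layout]
    """
    best = {gene: "unclassified" for gene in gene_ids}
    for gene, category in homologies:
        if gene not in best:
            continue  # not one of the reference genes
        current = best[gene]
        if current == "unclassified" or RANK[category] < RANK[current]:
            best[gene] = category
    return best


summary = best_homology(
    ["g1", "g2", "g3"],
    [("g1", "paralog"), ("g1", "ortholog_one2one"),
     ("g2", "ortholog_one2many"), ("g2", "paralog")],
)
```

Counting the values of `summary` per category would give one row of a comparison table of the kind described here.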
In all the data shown in the tables,
EnsemblCompara always shows better or
similar coverage to any other method.
This is clearly visible in HomoloGene,
where one-third more human genes and
twofold more mouse genes are lost in
HomoloGene as compared with EnsemblCompara GeneTrees v45.
Part of this large difference is the absence of RefSeq (Pruitt et al.
2007) entries for particular human genes, i.e., a problem with gene
the genomes rather than the inability to create an orthology
relationship. As HomoloGene is a database, and not a method
that can be applied to a new data set, one cannot construct a perfectly
matched comparison set. We then restricted the 22,568 human protein
coding genes and 24,496 mouse protein coding genes present
in the Ensembl database to the common RefSeq set used as an
input to HomoloGene to compare the homology types as fairly
as possible between the two data sets. For this set there were
838 HomoloGene associations that could have been made in
EnsemblCompara, compared to 1519 EnsemblCompara cases be-
tween genes with RefSeq IDs, but no HomoloGene association.
Manual inspection of these cases shows some complex tree topologies,
but also clear cases of one-to-one orthology that had
been missed in HomoloGene (e.g., the MAGIX gene), whereas the
majority of the missing EnsemblCompara cases came from com-
plex scenarios with unclear correct tree topologies, such as Ig locus
When comparing our results with Inparanoid (Remm et al.
2001) we used a matched protein set (Ensembl v45) for both
methods. We observed that although they return very similar
results, EnsemblCompara has increased coverage with marginally
increased specificity (see below for specificity measure). The gain
in gene coverage in favor of EnsemblCompara v45 becomes clearer
when looking at more distant species, such as human/medaka or
Figure 3. A scatter plot of the duplication consistency score (x-axis) compared to the bootstrap value of duplication nodes (y-axis). Because of the large number of values, the density of points is shown using the smoothScatter kernel-based density function in R.
Table 1. Comparison of EnsemblCompara versus other methods
Comparisons shown: HomoloGene vs. EnsemblCompara v45; PhyOP vs. EnsemblCompara v45; Inparanoid vs. EnsemblCompara v45; TreeFamCurated vs. EnsemblCompara v45.
In each case, the different Ensembl categories are listed in the columns, whereas the comparison database is listed in rows. Each cell shows the number of gene IDs for the two species for the intersection of the two categories. As well as one-to-many and many-to-many relationships making the between-species numbers different, each homology program can make a different pairing of genes leading to different numbers in the one-to-one category. The “unclassified” column shows genes not captured at all by that method. Boldface indicates the roughly equivalent categories of homologous relationships between the methods and the unclassified category from each method.
The PhyOP pipeline has recently moved to using the same
tree-building program (TreeBeST) as TreeFam and EnsemblCompara.
This means that any difference is due to the input clusters. The
PhyOP pipeline shows marginally fewer unclassified genes than the
EnsemblCompara pipeline, i.e., having orthologous genes predicted
that were not present in EnsemblCompara. Examination of
these cases showed many genes involved in large families. Currently,
EnsemblCompara handles 35 species compared to the more
restricted set of six species in the PhyOP run, and it seems that in
breaking down the large families into appropriate clusters, some
genes in these large families can become orphaned. This is clearly
an area that can be improved in the future.
The TreeFamCurated entry corresponds to the comparison of
our data set against the curated set of TreeFam, with 1247 such
cases only. The curated trees in TreeFam incorporate expert
knowledge to change the topology of trees, for example, by using
information on the conservation of function. In Table 1 we show
that the concordance between our automated prediction set and
the manually curated TreeFam data set is very high. The only ex-
ception seems to be that our method tends to miss orthology
relationships in favor of within-species paralogs. We believe this is
mainly due to wrong tree topologies involving mispredicted
(merged/split/partial) genes for which automatic tree building has
difficulty placing the genes correctly. The manual curation in
TreeFam then corrects this problem and results in a better tree
topology. In the long term, the incorporation of more manual
curation into the human and mouse gene sets, coupled with more
improvements in the gene prediction methodology in Ensembl
should progressively remove these errors.
Comparison using the synteny metric
We were interested in looking at the differences in the synteny
of their specificity. The plot in Figure 4 shows the number of human
human syntenic genes. EnsemblCompara and PhyOP always per-
form better in terms of the number of covered human genes, but
there is a remarkably similar level of syntenous predictions between
all the different methods, with, in some cases, Inparanoid showing
a marginally higher rate of syntenous predictions. In the case of
on both coverage and specificity measures. The teleost genomes
represent a particular challenge for the clustering mechanism due to
the ancient duplication at the root of the teleost lineage, leading to
to capture using the clustering methods.
Display and access of orthologs
We provide different ways to access and visualize the orthology/
paralogy data and have used it in a number of ways in house.
The main entry points are GeneView (http://jun2007.archive.
and GeneTreeView (http://jun2007.archive.ensembl.org/Homo_sapiens/
genetreeview?db=core;gene=ENSG00000129965; Fig. 5).
In GeneView, we list the orthologous and within-species
paralogous gene predictions. In each case, the user has access to
MultiContigView, a display that shows the ortholog or paralog
relation in the genomic context of both species. The user can also
see the alignment between the two ortholog/paralog protein
sequences via AlignView.
GeneTreeView (Fig. 5) displays the gene tree and shows the
considered gene highlighted in red in the context of all its homologous
relations. Duplication nodes are colored red, whereas speciation
nodes are colored blue. The user can dump the multiple
alignment of this gene tree with the “Export” menu, as well as
a picture of the tree in different formats (PDF, PS, and SVG). Future
development will include zoom in/out at specific internal nodes to
display subtrees. We will also include the ability to dump the gene
list of the whole tree and a subtree, as well as the multiple align-
ments of the protein/CDS in a subtree.
Projection of GO terms via orthology links
One of the benefits of extensive and accurate prediction of
orthologs is that one can infer that they have (usually) retained
the same function in extant species. Using this methodology we
have automatically projected GO terms from the main two
mammalian sources, human and mouse, out across other verte-
brate species. When we project GO terms, we tag the evidence
as “inferred from electronic annotation” (IEA), consistent with
GO terms, and we only project from experimentally referenced
GO evidence in the source organism. After discussion with the GO
community we have only projected via one-to-one ortholog links,
though it is worth considering in the future a more flexible ap-
proach of projection through duplications for some terms (e.g.,
molecular function terms may rarely be changed by recent du-
plication structure, while biological process terms may change
more frequently). Table 2 shows the set of species for which we
have projected GO terms and the comparison with existing sets.
Even in the well annotated human and mouse genomes, this
projection provides a small increase in the overall number of genes
and a marked increase when not considering genes already IEA
annotated. Currently the bulk of IEA assignments come via domain
matching (e.g., Interpro2Go), and thus have to use quite
broad specificity terms, whereas our ortholog annotation can
provide far more detailed GO terms. Of course in less intensively
studied genomes this creates a large set (e.g., ~5000 previously
annotated genes for dog) of GO mappings.
Figure 4. A plot showing different methods in terms of coverage in human genes (x-axis) vs. number of genes in syntenic relationships (y-axis).
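The projection rule described in this section (project only across one-to-one orthologs, only from experimentally supported source annotations, and tag the result as IEA) can be sketched as follows. This is an illustration under assumptions, not the Ensembl implementation: the function name, the data layout, and the gene identifiers are hypothetical, and the set of evidence codes treated as experimental is an assumption based on the standard GO experimental codes.

```python
# Assumed set of experimental GO evidence codes; the text does not list them.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def project_go_terms(source_annotations, one_to_one_orthologs):
    """Project experimentally supported GO terms across one-to-one ortholog links.

    source_annotations:   {source_gene: [(go_id, evidence_code), ...]}
    one_to_one_orthologs: {source_gene: target_gene}
    Returns {target_gene: {(go_id, "IEA"), ...}}.
    """
    projected = {}
    for source_gene, target_gene in one_to_one_orthologs.items():
        for go_id, evidence in source_annotations.get(source_gene, []):
            if evidence in EXPERIMENTAL_CODES:
                # Projected terms are always tagged as electronic inferences.
                projected.setdefault(target_gene, set()).add((go_id, "IEA"))
    return projected


# Toy input with hypothetical gene identifiers.
human_go = {"INS": [("GO:0005179", "IDA"), ("GO:0005615", "IEA")]}
orthologs = {"INS": "dog_INS"}
projected = project_go_terms(human_go, orthologs)
```

In the toy input only the term with experimental evidence is carried over; the term that was itself an electronic inference in the source species is not propagated.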
Data mining using BioMart
BioMart is a flexible data mining application that can be addressed
using a user-friendly web page, programmatic access, web service access,
and the biomaRt package in the R statistical environment. biomaRt
enables the user to do bulk dumps of ortholog or paralog gene
pair lists given a species pair and to restrict this by any valid BioMart
query. It can also dump the peptide/cDNA sequence of the gene in
Raw dump accessible via ftp
We also provide dumps of the gene tree multiple alignments and
the trees themselves in “emf” format, described in more detail at
The tree is written down in newick format,
embedded in the emf format itself, such
that there is only one file representing the
entire data set. We are developing format
readers in the BioPerl libraries to ensure
easy integration of this flat file data into
other pipelines in a standalone manner.
Programmatic access using the Perl API
The data stored in an EnsemblCompara
database are finally also accessible in a
programmatic way using a Perl API. More
detailed documents and tutorials on how
to install and use the API can be found at
index.html. Supplemental text shows three examples of Perl scripts
using the API.
Figure 5. A screen shot of the gene tree page at Ensembl for the INS (insulin peptide) gene. This shows two independent duplications in rodents (giving rise to Ins1 and Ins2 genes) and teleost fish. Duplication nodes are shown as red squares whereas speciation nodes are in blue. The green bars to the right provide a graphical view of the multiple alignment, showing partial gene structures in hamster (Cavia p.), cat (Felis c.), and rabbit (Oryctolagus c.), due to their low coverage status.
Table 2. GO term projection
Number of genes associated with GO terms in different species.
a The number of genes with GO terms in total.
b The number of genes with non-“inferred by electronic annotation” (IEA) terms.
c The number of additional genes with a GO term added by projection.
d The number of genes with a GO term added including cases which previously only had IEA terms.
The orthology pipeline presented here is robust and provides
a framework in which we can assess different components in tree
generation. We have used three key metrics (coverage in homology
relationships, duplication consistency score, and consistency with
genome synteny) to assess different internal components of
our pipeline and to compare it with other orthology sets. In our assessments the
MUSCLE+TreeBeST system, which is the set of methods used in
TreeFam, performs best according to these metrics. One problem
in phylogenetic method development is that it is hard to have
access to objectively correct trees to assess methods. Simulation-
based assessment can explore the potential source of errors, and
TreeBeST performs well, and critically better than sequence-only
methods, with simulated data (L. Heng, A.J. Vilella, E. Birney, and
R. Durbin, in prep.). Of the three metrics used in this paper, the
first two (coverage and duplication consistency) are somewhat
arbitrary choices, which nevertheless correspond to observations
about “poor” trees from biological experts who use other information
(such as conservation of function) to infer orthology.
Such observations are necessarily anecdotal, but having methods
that produce trees with higher coverage and less duplication fol-
lowed by mirroring loss in the daughter lineages is more consis-
tent with biological expertise. This is shown by the comparison to
the curated TreeFam trees, which attempt to capture systemati-
cally this expert knowledge for a subset of trees. The third metric,
the conservation of synteny for orthologous genes in mammals, is
more principled. However, new methods integrating this in-
formation into phylogenetic methods, such as the MSOAR (Fu
et al. 2007) method, and (Jiang et al. 2007), could provide more
accurate trees at the expense of not being able to use synteny to
In comparison to other genome-wide frameworks, mainly
cluster based, these methods performed better in terms of coverage
with at least as good specificity, as measured by the synteny
metric. In particular, much improvement is seen in the teleost
lineage, where the complex ancient duplication structure, which
has been differentially lost in extant species, leads to more com-
plex phylogenies. In addition, this phylogenetic method provides
a far richer data set including the topological timings of duplica-
tions and the ability to implement other tree-dependent methods,
such as global dN/dS methods. Obviously, one can expect
improvements in both alignments and tree methods in the future,
and this framework is flexible enough both to assess and replace
the components we are using currently.
The phylogenetic information presented here is now a stan-
dard part of the Ensembl system, and will be present in all future
releases, as well as available through the Ensembl archives. This
gives an individual gene-specific biologist the opportunity
to explore the evolution of a gene family, discovering potentially
unappreciated ancestral duplications, and can draw his or her
attention to other lineages where a gene has been duplicated. As
the presence of recent lineage-specific duplications is often asso-
ciated with positive selection, this could lead a biologist to look
into the specific biology of a previously unappreciated species to
understand the functional role of a gene. More mundanely, the
presence of these accurate orthology links allows other groups in
Ensembl to provide appropriate projection of information across
the vertebrate lineages, using the concentration of information on
human and mouse to inform all of the species in the vertebrate
tree. This is a great boon when coupled with the GO annotation
dictionary, and also allows us to project the HGNC symbols
across species confidently to provide a useful visual tag for genes
in different species. Ensembl also includes the MCL (Enright et al.
2002) generated Ensembl families resource. This is a clustering-
based method designed to work at a far deeper phylogenetic dis-
tance (incorporating events in protein families that occurred
during early eukaryotic evolution) than the ortholog predic-
tion framework presented here. There are both conceptual prob-
lems due to large scale domain changes over this depth of
evolution, which in some cases involve genuine gene split
and merge events, and engineering problems due to the consid-
erable increase in family membership when working at this
scale. We are currently investigating ways both to deepen our
gene family clusters and to reconcile these deeper families to
broader protein family representation, such as TRIBE-MCL, but
currently both methods show complementary aspects of protein
Gene synteny metric
In order to assess the quality of our orthology predictions, we
developed a synteny metric that provides a measure of gene order
conservation. The main idea is that when a predicted ortholog
between two species is flanked (by distance criteria) by ortholo-
gous pairs on each side of each genome in an ordered manner, the
central orthologous link is considered to be consistent with synteny.
This measure can be applied both to one-to-one orthologs and to
one-to-many orthologs (where a recent tandem duplication has
duplicated a gene in one species), because the criterion for flanking
orthologous genes is based on distance, not gene order. Considering
two species (e.g., human and mouse), with one of them as the
reference (e.g., human), we called a perfect syntenic gene (on the
reference species) a gene that has an orthology relation for which
both upstream and downstream orthologies exist at 250 kb and are
colinear in both species. We called a good syntenic gene (on the
reference species) a gene that has an orthology relation for which
one orthology exists at 250 kb, either upstream or downstream,
that is colinear in both species. It is important to note that we are
using this metric to assess the quality of resulting trees, and not
directly as part of our tree-building procedure.
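As an illustration only, the classification above can be sketched in Python. This is not the pipeline's implementation. The sketch assumes that each gene is reduced to a single coordinate, that "at 250 kb" means "within 250 kb" in both genomes, and that two flanks are colinear when they lie on opposite sides of the central ortholog in both species (in either orientation, so that inverted blocks still count). A single qualifying flank is taken to be enough for a good syntenic gene. The names OrthologPair and classify_synteny are hypothetical and are not part of the EnsemblCompara API.

```python
from typing import List, NamedTuple

WINDOW_BP = 250_000  # distance criterion quoted in the text


class OrthologPair(NamedTuple):
    """One predicted orthology link (hypothetical record layout)."""
    ref_chrom: str
    ref_pos: int
    tgt_chrom: str
    tgt_pos: int


def classify_synteny(center: OrthologPair, pairs: List[OrthologPair],
                     window: int = WINDOW_BP) -> str:
    """Return 'perfect', 'good', or 'none' for one orthology link."""
    # Each flank is stored as (downstream in reference?, downstream in target?).
    sides = set()
    for p in pairs:
        if p == center:
            continue
        if p.ref_chrom != center.ref_chrom or p.tgt_chrom != center.tgt_chrom:
            continue
        d_ref = p.ref_pos - center.ref_pos
        d_tgt = p.tgt_pos - center.tgt_pos
        # Links that share a gene with the center are not flanks
        # (assumption: e.g. the other copies of a one-to-many orthology).
        if d_ref == 0 or d_tgt == 0:
            continue
        if abs(d_ref) <= window and abs(d_tgt) <= window:
            sides.add((d_ref > 0, d_tgt > 0))

    # Flanks on both sides, in the same or in the inverted orientation.
    same_orientation = (False, False) in sides and (True, True) in sides
    inverted = (False, True) in sides and (True, False) in sides
    if same_orientation or inverted:
        return "perfect"
    if sides:
        return "good"
    return "none"


# Toy coordinates, invented purely for the example.
a = OrthologPair("chr1", 100_000, "chr4", 500_000)
b = OrthologPair("chr1", 200_000, "chr4", 600_000)
c = OrthologPair("chr1", 300_000, "chr4", 700_000)
d = OrthologPair("chr2", 1_000_000, "chr9", 50_000)
toy_pairs = [a, b, c, d]

for pair in toy_pairs:
    print(pair.ref_chrom, pair.ref_pos, classify_synteny(pair, toy_pairs))
```

Because the flanking test uses distances rather than gene ranks, an extra tandem copy between two genes does not break the classification, which is the property the text relies on for one-to-many orthologs.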
Duplication consistency score
In order to assess the reliability of the duplication calls on internal
nodes of our tree, we developed a simple measure of the consis-
tency of lineages after a putative duplication node. This measure is
based on the assumption that duplication followed by reciprocal
complementary gene losses on the left and right branches of
a duplication node is an unlikely scenario (see main text):
Duplication score = (number of species shared by the left and right
branches) / (number of species in the union of the left and right
branches).
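The score is a ratio of set sizes (a Jaccard index over the species found under each child branch), which the following Python sketch makes concrete. The function name and the species lists are illustrative assumptions, not taken from the EnsemblCompara code.

```python
from typing import Iterable


def duplication_consistency_score(left_species: Iterable[str],
                                  right_species: Iterable[str]) -> float:
    """Species shared by both branches divided by species in either branch."""
    left, right = set(left_species), set(right_species)
    union = left | right
    if not union:
        raise ValueError("a duplication node needs at least one descendant species")
    return len(left & right) / len(union)


# Both copies retained in human and mouse, one copy lost in rat: 2 / 3.
well_supported = duplication_consistency_score(
    ["human", "mouse", "rat"], ["human", "mouse"])

# No species carries both copies: the score is 0.
dubious = duplication_consistency_score(["human", "mouse"], ["rat", "dog"])

print(well_supported, dubious)
```

A score near 1 means most species kept both copies after the putative duplication. A score of 0 corresponds to the reciprocal-loss scenario that the text describes as unlikely, so such nodes are the ones most worth re-examining.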
We created a fault-tolerant pipeline using Object-Oriented Perl
and a MySQL database. The EnsemblCompara schema API sits on
top of the main Ensembl schema and API, and links to BioPerl
(Stajich et al. 2002) objects for the main data types. The
EnsemblCompara GeneTrees are updated every 2 mo: they are
rebuilt from scratch for every Ensembl release over
a two-week period using a cluster of computers, generating about
50 GB of data.
Data sets used for assessing our pipeline
In order to compare the various pipeline implementations, we
have performed all analyses on an identical data set from Ensembl
v41 (October 2006). Assessment comprised implementations of
(1) BRH alone and the tree-based programs, (2) PhyML followed by
tree reconciliation with RAP, and (3) TreeBeST.
We also compared our Ensembl release 45 (June 2007) data set,
built using the TreeBeST approach, against other prediction
methods or databases such as HomoloGene, Inparanoid, PhyOP,
The species tree for RAP is an adapted tree from the ENCODE
Multiple Sequence Analysis paper (Margulies et al. 2007) and can be
found in the Supplemental materials. The species tree provided to
TreeBeST only requires topological constraints, so we have used an
adapted topology from the NCBI taxonomic tree, which can be found
in the supplementary information.
We thank Chris Ponting, Leo Goodstadt, and Andreas Heger for
providing the recent PhyOP run and many useful discussions. We
also thank the rest of the TreeFam and the Ensembl teams for
many insightful discussions. We thank Michael Hoffman for his
assistance in using R. We also thank the Sanger Institute computer
services for the reliable running of the compute farm. A.V., A.U.-V.,
and J.S. were supported by the Wellcome Trust; E.B. was supported
by EMBL; R.D. and L.H. were supported by the Wellcome Trust via
the Sanger Institute.
Adams, M.D., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D.,
Amanatides, P.G., Scherer, S.E., Li, P.W., Hoskins, R.A., Galle, R.F., et al.
2000. The genome sequence of Drosophila melanogaster. Science 287:
Dehal, P.S. and Boore, J.L. 2006. A phylogenomic gene cluster resource: The
Phylogenetically Inferred Groups (PhIGs) database. BMC Bioinformatics
7: 201. doi: 10.1186/1471-2105-7-201.
Dehal, P., Satou, Y., Campbell, R.K., Chapman, J., Degnan, B., De Tomaso,
A., Davidson, B., Di Gregorio, A., Gelpke, M., Goodstein, D.M., et al.
2002. The draft genome of Ciona intestinalis: Insights into chordate and
vertebrate origins. Science 298: 2157–2167.
Dufayard, J.F., Duret, L., Penel, S., Gouy, M., Rechenmann, F., and Perriere,
G. 2005. Tree pattern matching in phylogenetic trees: Automatic search
for orthologs or paralogs in homologous gene sequence databases.
Bioinformatics 21: 2596–2603.
Edgar, R.C. 2004. MUSCLE: A multiple sequence alignment method with
reduced time and space complexity. BMC Bioinformatics 5: 113. doi:
Enright, A.J., Van Dongen, S., and Ouzounis, C.A. 2002. An efficient
algorithm for large-scale detection of protein families. Nucleic Acids Res.
Fu, Z., Chen, X., Vacic, V., Nan, P., Zhong, Y., and Jiang, T. 2007. MSOAR: A
high-throughput ortholog assignment system based on genome
rearrangement. J. Comput. Biol. 14: 1160–1175.
Gibbs, R.A., Weinstock, G.M., Metzker, M.L., Muzny, D.M., Sodergren, E.J.,
Scherer, S., Scott, G., Steffen, D., Worley, K.C., Burch, P.E., et al. 2004.
Genome sequence of the Brown Norway rat yields insights into
mammalian evolution. Nature 428: 493–521.
Goodstadt, L. and Ponting, C.P. 2006. Phylogenetic reconstruction of
orthology, paralogy, and conserved synteny for dog and human. PLoS
Comput. Biol. 2: e133. doi: 10.1371/journal.pcbi.0020133.
Guindon, S. and Gascuel, O. 2003. A simple, fast, and accurate algorithm to
estimate large phylogenies by maximum likelihood. Syst. Biol. 52: 696–
Hennig, W. 1952. Grundzüge der Theorie der Phylogenetischen Systematik.
Deutscher Zentralverlag, Berlin.
Hubbard, T.J., Aken, B.L., Beal, K., Ballester, B., Caccamo, M., Chen, Y.,
Clarke, L., Coates, G., Cunningham, F., Cutts, T., et al. 2007. Ensembl
2007. Nucleic Acids Res. 35: D610–D617.
International Chicken Genome Sequencing Consortium. 2004. Sequence
and comparative analysis of the chicken genome provide unique
perspectives on vertebrate evolution. Nature 432: 695–716.
International Human Genome Sequencing Consortium. 2001. Initial
sequencing and analysis of the human genome. Nature 409: 860–921.
Jiang, Z., Tang, H., Ventura, M., Cardone, M.F., Marques-Bonet, T., She, X.,
Pevzner, P.A., and Eichler, E.E. 2007. Ancestral reconstruction of
segmental duplications reveals punctuated cores of human genome
evolution. Nat. Genet. 39: 1361–1368.
Li, L., Stoeckert Jr., C.J., and Roos, D.S. 2003. OrthoMCL: Identification of
ortholog groups for eukaryotic genomes. Genome Res. 13: 2178–2189.
Li, H., Coghlan, A., Ruan, J., Coin, L.J., Heriche, J.K., Osmotherly, L., Li, R.,
Liu, T., Zhang, Z., Bolund, L., et al. 2006. TreeFam: A curated database of
phylogenetic trees of animal gene families. Nucleic Acids Res. 34: D572–
Margulies, E.H., Cooper, G.M., Asimenos, G., Thomas, D.J., Dewey, C.N.,
Siepel, A., Birney, E., Keefe, D., Schwartz, A.S., Hou, M., et al. 2007.
Analyses of deep mammalian sequence alignments and constraint
predictions for 1% of the human genome. Genome Res. 17: 760–774.
Mikkelsen, T.S., Wakefield, M.J., Aken, B., Amemiya, C.T., Chang, J.L., Duke,
S., Garber, M., Gentles, A.J., Goodstadt, L., Heger, A., et al. 2007.
Genome of the marsupial Monodelphis domestica reveals innovation in
non-coding sequences. Nature 447: 167–177.
Mouse Genome Sequencing Consortium. 2002. Initial sequencing and
comparative analysis of the mouse genome. Nature 420: 520–562.
Pruitt, K.D., Tatusova, T., and Maglott, D.R. 2007. NCBI reference sequences
(RefSeq): A curated non-redundant sequence database of genomes,
transcripts and proteins. Nucleic Acids Res. 35: D61–D65.
Remm, M., Storm, C.E., and Sonnhammer, E.L. 2001. Automatic clustering
of orthologs and in-paralogs from pairwise species comparisons. J. Mol.
Biol. 314: 1041–1052.
Rhesus Macaque Genome Sequencing and Analysis Consortium. 2007.
Evolutionary and biomedical insights from the rhesus macaque
genome. Science 316: 222–234.
Stajich, J.E., Block, D., Boulez, K., Brenner, S.E., Chervitz, S.A., Dagdigian,
C., Fuellen, G., Gilbert, J.G., Korf, I., Lapp, H., et al. 2002. The Bioperl
toolkit: Perl modules for the life sciences. Genome Res. 12: 1611–1618.
Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K.,
Chetvernin, V., Church, D.M., Dicuccio, M., Edgar, R., Federhen, S., et al.
2008. Database resources of the National Center for Biotechnology
Information. Nucleic Acids Res. 36: D13–D21.
Xie, X., Lu, J., Kulbokas, E.J., Golub, T.R., Mootha, V., Lindblad-Toh, K.,
Lander, E.S., and Kellis, M. 2005. Systematic discovery of regulatory
motifs in human promoters and 3′ UTRs by comparison of several
mammals. Nature 434: 338–345.
Yang, Z. 2007. PAML 4: Phylogenetic analysis by maximum likelihood. Mol.
Biol. Evol. 24: 1586–1591.
Received October 26, 2007; accepted in revised form November 18, 2008.