The tree of genomes: An empirical comparison of genome phylogeny reconstruction methods.
Angela McCann, James Cotton, James McInerney
Journal Article: BMC Evolutionary Biology (impact factor: 4.29). 12/2008; 8(1):312. DOI: 10.1186/1471-2148-8-312
BioMed Central: BMC Evolutionary Biology
Open Access Research article
The tree of genomes: An empirical comparison of
genome-phylogeny reconstruction methods
Angela McCann, James A Cotton and James O McInerney*
Address: Bioinformatics laboratory, Department of Biology, National University of Ireland Maynooth, Maynooth, Co. Kildare, Ireland
Email: Angela McCann - angela.mccann@nuim.ie; James A Cotton - j.a.cotton@qmul.ac.uk; James O McInerney* - james.o.mcinerney@nuim.ie
* Corresponding author
Abstract
Background: In the past decade or more, the emphasis for reconstructing species phylogenies has
moved from the analysis of a single gene to the analysis of multiple genes and even completed
genomes. The simplest method of scaling up is to use familiar analysis methods on a larger scale
and this is the most popular approach. However, duplications and losses of genes along with
horizontal gene transfer (HGT) can lead to a situation where there is only an indirect relationship
between gene and genome phylogenies. In this study we examine five widely-used approaches and
their variants to see if indeed they are more-or-less saying the same thing. In particular, we focus
on Conditioned Reconstruction as it is a method that is designed to work well even if HGT is
present.
Results: We confirm a previous suggestion that this method has a systematic bias. We show that
no two methods produce the same results and most current methods of inferring genome
phylogenies produce results that are significantly different to other methods.
Conclusion: We conclude that genome phylogenies need to be interpreted differently, depending
on the method used to construct them.
Published: 12 November 2008
Received: 7 April 2008
Accepted: 12 November 2008
BMC Evolutionary Biology 2008, 8:312 doi:10.1186/1471-2148-8-312
This article is available from: http://www.biomedcentral.com/1471-2148/8/312
© 2008 McCann et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
Hundreds of genome sequencing projects have been completed [1],
providing us with an abundant source of data to reconstruct
phylogenetic relationships, but also with some novel problems in
interpreting these data. The evolutionary history of any genome
includes elements of gene duplication, gene loss, lineage sorting
and horizontal transfer of genes, all of which have the ability to
confound phylogeny reconstruction [2-4]. Against this background,
a variety of genome-phylogeny methods have been developed. These
vary in their approach, the input data they require and the
interpretation of the result. However, to date, no study has been
carried out that asks whether these methods are picking out
fundamentally different signals or if they are more-or-less finding
the same tree.
Current genome-level phylogeny methods can be split into two
categories: sequence-based methods and gene-content methods.
Analysis of sequence evolution predates gene-content methods
simply because data for individual genes were available before
data for completed genomes. Ubiquitously distributed ribosomal RNA
(rRNA) genes have usually been used as surrogates for larger
samples of individual genomes. These particular genes are popular
for phylogenetic studies due to their
plenitude, universally conserved structure and apparent
resistance to horizontal gene transfer (HGT) [5]. In con-
trast, some methods are designed to include information
from the evolutionary history of several individual genes.
The supertree approach, for instance, involves the creation
of individual trees from gene families and the amalgama-
tion of these into one final supertree. Another sequence-
based approach has involved the concatenation of align-
ments of several genes [6-8] usually with an effort being
made to remove sequences that have an obvious history of
HGT. Data concatenation should have the effect of mini-
mizing stochastic effects due to small sample size and
amplifying low signals, though gene concatenation is not
without its problems [8,9].
The second group of methods uses variation in gene-con-
tent as the basis for phylogeny reconstruction. These
approaches range from the use of similarity of gene-con-
tent [10-12] to the inclusion of the analysis of gene order
data [13]. Usually a pairwise analysis of genomes in a set
is carried out and a metric is computed that reflects
the similarity between the genomes. This can be done
using a maximum parsimony score, a threshold parsimony
score (e.g. [14]) or by deriving a phylogenetic distance.
Finally, phylogenetic hypotheses are generated based
upon these scores.
If the process of evolution is indeed hierarchical or tree-
like, then with increased sampling, all reasonable or con-
sistent methods should converge on the same tree. Recent
work has found a great deal of congruence between phyl-
ogenetic trees for different gene families in closely related
organisms but a lack of congruence between gene trees
from distant relatives [15,16]. This suggests that the pat-
tern of inheritance of genes may indeed be largely vertical,
or at least tree-like for parts of the reconstructable tree, but
that this pattern is difficult to identify for deep-level rela-
tionships [17]. In other words, parts of life's history may
not be reconstructable because of incorrect identification
of orthologs, hidden paralogy, horizontal gene transfer
events or the inability of methods based upon current
evolutionary models to correctly reconstruct deep-level
phylogenetic relationships. Ideally, a formal probabilistic
model describing all of the many processes involved
would allow us to both study these processes quantita-
tively and reconstruct phylogenetic relationships [18], but
no such unifying models exist, and any such model would
be complex and difficult to fit.
Given that a number of heuristic methods now exist for
the inference of phylogenetic histories from genomic
data, it is reasonable to ask whether these methods are
likely to give fundamentally different answers. In this
report, we have examined the similarity of the results we
obtain when we use a variety of different organismal phylogeny
reconstruction approaches on a real dataset. In reality, given the
enormous number of genome-phylogeny methods, we have not tried to
exhaustively explore all available methods. Instead, we have chosen
exemplar methods that use different kinds of data. If all
genome-phylogeny methods tend to return the same answers, then it
probably does not matter which one is used; if, on the other hand,
different methods give different results, the choice of method
becomes important.
We have used exploratory statistics in order to examine the
phylogeny of 22 diverse Archaea for which completed genome
sequences are available. In particular, we wished to explore
variation in the phylogenetic hypotheses from this dataset. Our
comparisons involved using five distinct phylogeny reconstruction
methods and their variants, giving a total of nine methods. Seven
of these methods use large portions of each genome: four variants
of the Conditioned Reconstruction (CR) method [11,19], two variants
of the SHOT method [20] and a Supertree approach [21]. We have also
examined reconstructions based on the 16S rRNA molecule and from a
concatenated alignment of 31 genes involved in translation (the
same genes used by [7]).
Particular attention is paid to CR in this report [11,19]. This is
a method that has not been used extensively but which forms the
basis of the "Ring of Life" hypothesis [11,19]. CR is based on the
analysis of shared gene-content, with closely related species
sharing a large proportion of genes whilst more distantly related
species have fewer genes in common. The method is based on the
formation of a matrix consisting of the four possible patterns of
joint presence (P) or absence (A) of genes between any two genomes
(PP, PA, AP, AA). The proportions of the first three are readily
determined, but the shared absences will pose a problem for any
Markov method using presence-absence data to infer phylogenetic
signal. This is due to the fact that the analysis will only be
carried out over all genes present in either genome one or genome
two. The authors' solution is to introduce a conditioning genome,
i.e. an additional genome that is used solely for reference
purposes. Genes that are present in the conditioning genome, but
absent in the two genomes under consideration, provide an estimate
of shared absence.
Along with their claim about insensitivity to HGT, the authors also
claim that the phylogenetic outcome of CR is not affected by the
choice of conditioning genome [11,19]. However, there has been
little work testing either of these conjectures, or examining the
performance of CR in comparison to other methods [22]. An analysis
was, however, performed to determine if CR could differentiate
between genome fusions or HGT events involved in the formation of
mixed genomes [22], as the authors claimed
it could. Bailey et al (2006) concluded that this was not
possible and that CR can actually induce bias in ortholog
sampling; they also showed using simulation studies that
different conditioning genomes can result in different
trees being derived for the same dataset. Another study of
CR carried out by Spencer et al (2007), also showed that
altering the conditioning genome in an analysis can affect
bootstrap support for different tree topologies.
A problem that we encounter in this study concerns the
issue of using different analysis methods and using differ-
ent kinds of data. When we consider a single alignment,
then it is usual to perform an initial analysis to find which
model best fits the data [23,24]. Then using this model,
the best phylogenetic tree or set of trees is found using
some optimization procedure [25]. Therefore, the two
variables will be the choice of alignment – whether to use
a single gene [26,27] or concatenate several genes together
[6-8] – and the choice of model. When using gene-content
data, the choices will include the method of encoding
gene-content and the way in which the encoded data is
analysed [10-12]. For an analysis that includes phyloge-
netic supertree construction, the choices will center on the
model that is used to construct phylogenetic trees from
the alignments of orthologs and the method of inferring
the supertree [28]. Clearly, it is difficult to carry out a
study that uses all combinations of methods and also it is
difficult to use an approach that holds all variables con-
stant while only changing one at a time. In this study, we
have chosen to use a representative sample of approaches
and our analysis involved comparing the final sets of phy-
logenetic hypotheses. The overall objective of this work is
not to identify which method yields the correct phylog-
eny, rather it is to ask whether the different methods are,
in general, producing the same phylogenies.
Methods
Ortholog identification
A total of 22 fully sequenced Archaeal genomes were
downloaded from the Cogent database [1]. A previously
described greedy algorithm was used to identify ortholo-
gous families in these genomes [16]. Briefly, a random
query sequence was chosen from the original database of
22 Archaeal genomes and homologous genes were identi-
fied as BLASTP hits [29] with an E-value of 10-8 or less. The
initial query sequence and all hits were then removed
from the original Archaeal database and the process con-
tinued iteratively with a new query until all genes were
assigned to a gene family. A total of 14,673 gene families
were identified. From this initial set, 1,655 paralogous
families were eliminated by removing all gene families
with more than one representative from any genome.
11,864 phylogenetically uninformative gene families
(fewer than four sequences) were also eliminated. Amino-acid
sequences for each gene family were then aligned using ClustalW
v1.83 [30] with all settings at their default values. Gblocks [31]
was then used to remove highly variable positions from the
alignment, as these are potentially fast evolving or poorly aligned
regions. In Gblocks [31], the maximum number of contiguous
non-conserved positions allowed was set to 15 and the minimum
length of a block was set to 8 amino acid positions. Following
Gblocks site removal, those alignments that now had fewer than 150
amino acid positions remaining were excluded from further analysis.
All remaining alignments were screened for the presence of
phylogenetic signal using the Permutation Tail Probability (PTP)
test [32,33]. Only 594 alignments passed the test (p < 0.01) and
were retained. The presence or absence of these remaining
single-copy gene families was scored in a matrix and provided the
input for the gene-content based phylogenetic methods.
Supertree
The 594 remaining alignments were analysed using MultiPhyl [23].
This software reconstructs maximum-likelihood (ML) phylogenies for
each gene family using the best-fitting empirical homogeneous model
of amino-acid substitution, according to the Akaike Information
Criterion (AIC). MultiPhyl [23] distributes the model selection and
tree search calculations across a network of computers. Tree space
was searched using Nearest Neighbor Interchange (NNI) branch
swapping and local branch length optimization, until convergence.
Following gene tree estimation, a supertree was inferred from these
gene trees using CLANN [21] with the Most Similar Supertree
Algorithm (MSSA/DFIT) criterion [15], using the default heuristic
search options. Non-parametric bootstrapping was carried out by
sampling-with-replacement 100 pseudoreplicates of individual gene
trees, using the default settings in CLANN [21], generating a
supertree for each of these replicates and summarizing the results
using a majority-rule consensus method.
Conditioned Reconstruction
A matrix of presence and absence of gene families (see ortholog
identification above) was analysed using a Java implementation of
the CR algorithm [11,19] (software available on request). Our
program implements a number of variants of the standard CR method.
The first approach uses only a single conditioning genome, as
originally proposed by the authors [11,19]. The conditioning genome
is specified by the user and a phylogeny is inferred using
paralinear/logdet distances [34,35]. The first variant of CR is
called averaged (Avg) CR and does not require the a priori
identification of a conditioning genome. In this case, every genome
is used as the conditioning genome. In other words, when working
out the distance between two genomes, every other genome acts as
the conditioning genome. The logdet distances derived using each
conditioning genome are summed and the mean of this value gives the
final distance between the two genomes of interest; this process is
repeated for all pairs of genomes in the analyses. The second
variant is an unconditioned distance approach (see [36]), which
involves including a pseudo-conditioning genome in which every gene
family is present (i.e. comprised entirely of the present state).
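The three distance variants described above (single conditioning genome, averaged CR and the unconditioned approach) can be sketched in Python as follows. This is a minimal illustration and not the authors' Java implementation. It assumes that genomes are 0/1 presence vectors over the same gene families, that the four patterns are counted only over families present in the conditioning genome, that the distance takes one common form of the paralinear distance, and that undefined values are skipped when averaging. All function names are hypothetical.

```python
import math

def joint_counts(x, y, cond):
    """Count PP, PA, AP, AA patterns for genomes x and y over the gene
    families present in the conditioning genome (all are 0/1 lists)."""
    pp = pa = ap = aa = 0
    for xi, yi, ci in zip(x, y, cond):
        if not ci:
            continue  # family absent from the conditioning genome: ignored
        if xi and yi:
            pp += 1
        elif xi:
            pa += 1
        elif yi:
            ap += 1
        else:
            aa += 1
    return pp, pa, ap, aa

def paralinear(pp, pa, ap, aa):
    """Paralinear distance from a 2x2 table of pattern counts.
    Returns None when the logarithm would be undefined."""
    n = pp + pa + ap + aa
    if n == 0:
        return None
    det_j = (pp * aa - pa * ap) / n ** 2          # determinant of joint frequencies
    marg = ((pp + pa) * (ap + aa) * (pp + ap) * (pa + aa)) / n ** 4
    if det_j <= 0 or marg <= 0:
        return None                               # log of a non-positive number
    return -math.log(det_j / math.sqrt(marg))

def cr_distance(x, y, cond):
    """Distance between x and y using a single conditioning genome."""
    return paralinear(*joint_counts(x, y, cond))

def averaged_cr_distance(genomes, i, j):
    """Mean distance between genomes i and j, with every other genome
    acting in turn as the conditioning genome (undefined values skipped,
    which is an assumption of this sketch)."""
    vals = [cr_distance(genomes[i], genomes[j], genomes[k])
            for k in range(len(genomes)) if k not in (i, j)]
    vals = [v for v in vals if v is not None]
    return sum(vals) / len(vals) if vals else None

def unconditioned_distance(x, y):
    """Use a pseudo-conditioning genome in which every family is present."""
    return cr_distance(x, y, [1] * len(x))
```

Returning None for an undefined distance keeps such cases visible to later steps rather than silently substituting a number.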
The final variant of CR analysed in this report employed
software created by Spencer et al. (2007). This program is
based on a modified BIONJ algorithm [37], adapted to
produce a supertree. The input to this program is a series
of distance matrices, each derived using a different condi-
tioning genome; in our case 22 different matrices were
used, therefore all genomes in this analysis acted as the
conditioning genome at one point. The modified BIONJ
algorithm operates by firstly choosing a pair of taxa from
each distance matrix that minimizes some criterion (see
[36]). The best such pair across all the distance matrices
is then selected and the subtrees containing these taxa
are aggregated in all distance matrices; finally the distance
matrices are updated. This process is continued iteratively
until every taxon has been aggregated in every matrix and
a supertree is produced. Spencer et al. (2007) provide two
different versions of the algorithm, both of which were
used in this report. The first method is a vote-counting
method that does not take into account differences in reli-
ability between conditioning genomes. The second is an
inverse-variance weighting scheme that does take into
account differences in reliability between conditioning
genomes.
In total four variants of CR are implemented in this study.
With the exception of the modified BIONJ approach
described above, all distance matrices were converted into
phylogenetic trees using the neighbor-joining algorithm
implemented in PHYLIP [38]. In addition, we constructed
100 bootstrap pseudoreplicates by resampling the origi-
nal presence and absence matrix (see ortholog identifica-
tion above).
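The bootstrap step can be sketched as below. The text does not state the resampling unit, so this sketch assumes that each pseudoreplicate is drawn by sampling gene families (columns of the presence and absence matrix) with replacement while keeping the genomes (rows) fixed. The function names and the fixed seed are illustrative only.

```python
import random

def bootstrap_columns(matrix, rng):
    """Resample gene families (columns) with replacement.
    `matrix` is a list of equal-length 0/1 rows, one row per genome."""
    n_cols = len(matrix[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return [[row[c] for c in picks] for row in matrix]

def bootstrap_replicates(matrix, n_reps=100, seed=0):
    """Return n_reps pseudoreplicate matrices of the same shape as the input."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    return [bootstrap_columns(matrix, rng) for _ in range(n_reps)]
```

Each pseudoreplicate can then be passed through the same distance and tree-building steps as the original matrix.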
SHOT method
Two distance matrices were derived from the matrix of all
orthologs (see ortholog identification) based on two vari-
ants of the SHOT method [20], by applying the formulae
below:
\[ d_1 = -\log\left( \frac{n_{pp}\,\sqrt{a^2 + b^2}}{\sqrt{2}\,ab} \right) \tag{1} \]
\[ d_2 = -\log\left( \frac{n_{pp}}{\min(a, b)} \right) \tag{2} \]
where $n_{pp}$ is the number of gene families in common between the
two genomes of interest and $a$ and $b$ are the number of gene
families in each of the two genomes individually.
Following derivation of a distance matrix, phylogenetic hypotheses
were derived using the neighbor-joining method as described above.
Bootstrap resampling was employed in order to examine variation in
estimates from these approaches.
Concatenated alignment
A concatenated alignment was built using the 31 genes used by
Ciccarelli et al. (2006; see table S2). These genes are largely
involved in translation and have been described as having
"indisputable orthology" in 191 species. The complete data matrix
was obtained from the iTOL website [39] and all non-archaeal
species were removed. Four genomes used in our study were absent
from the Ciccarelli et al (2006) data set. The genes from these
genomes were retrieved and aligned to the iTOL genes as a profile
alignment in ClustalW v1.83 [30]. Phylogenetic hypotheses based on
this alignment were then generated using the homogeneous
(unpartitioned) model selection, tree reconstruction and bootstrap
resampling capabilities of MultiPhyl [23].
Ribosomal RNA Tree
16S rRNA sequences were obtained from the Ribosomal Database
Project (RDP, [40]) or, when particular sequences were not
available in the RDP, they were retrieved from GenBank (see table
S1). All downloaded 16S rRNA genes were compared to the
corresponding genes in our downloaded genomes to ensure the correct
genes had been retrieved. The RDP alignment was used as a profile
to align GenBank sequences using ClustalW v1.83 [30]. According to
the AIC implemented in Modeltest [24], the best-fitting model of
nucleotide substitution was the General Time Reversible (GTR)
substitution model, with rates at variable sites sampled from a
gamma distribution. Phylogeny reconstruction was carried out using
the default TBR heuristic search in PAUP 4b10 [41]. Bootstrap
resampling was also carried out using PAUP 4b10.
Comparing trees and matrices
Pairwise Robinson-Foulds (RF) distances (symmetric-difference
distances) [42] between trees were calculated using PAUP 4b10 [41].
Phylogenetic trees were visualized using TreeView [43] and TreeMap
2.0β [44] (see Additional file 1). Comparisons between distance
matrices produced using the gene-content methods were performed by
calculating a sum-of-squares distance. The matrices were
transformed so that undefined values from the CR procedure were
replaced with the largest value in the matrix. The resulting
distance matrices could then be visualised using Principal
Components Analysis (PCA) in the R statistical programming language
(R Development Core Team, 2006).
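The matrix comparison just described can be illustrated with a short Python sketch. It assumes that undefined distances are stored as None, that both matrices are symmetric and list the genomes in the same order, and that the sum of squares is taken over the upper triangle only. The text does not specify these details, and the function names are hypothetical.

```python
def fill_undefined(matrix):
    """Replace undefined (None) entries with the largest defined value
    in the same matrix."""
    defined = [v for row in matrix for v in row if v is not None]
    largest = max(defined)  # raises ValueError if nothing is defined
    return [[largest if v is None else v for v in row] for row in matrix]

def sum_of_squares(m1, m2):
    """Sum of squared differences between two distance matrices,
    taken over each unordered pair of genomes once."""
    a, b = fill_undefined(m1), fill_undefined(m2)
    n = len(a)
    return sum((a[i][j] - b[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))
```

A table of these pairwise values, one per pair of distance matrices, is the kind of input that an ordination such as PCA can then summarise.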
Results & discussion
Our objective was to explore a variety of exemplar analysis
methods from each of the different categories of analysis
type in order to ascertain whether variation in the result-
ing trees is trivial or extensive, random or accompanied by
systematic bias. In the first instance, we used exploratory
statistics to examine variation in the distance matrices
produced by those methods that produced distance matri-
ces. We focused on the CR approach and specifically, the
effects that are seen with variation in the choice of condi-
tioning genome.
Variation within Conditioned Reconstruction approaches
In the analysis of the CR approach it became obvious that
the distance matrix that was recovered was very dependent
on the conditioning genome that was used in the analysis.
We inferred CR distance matrices using each of the 22
genomes in turn as the conditioning genome.
Another distance matrix was derived by taking a pair of
genomes and calculating the distance between them using
every other genome as a conditioning genome and then
averaging these distances (Avg CR). Another distance
matrix was produced using a synthetic conditioning
genome where every gene family was present (uncondi-
tioned approach). Two final matrices were produced
using the SHOT formulae in equations 1 and 2 above. We
used PCA in order to visualize the most important sources
of variation across the CR distance matrices as well as the
two matrices from the SHOT methods. Figure 1a shows
the most important axes following PCA of the distance
matrices. Each point on the diagram represents the loca-
tion of a distance matrix, with the relative closeness of
points to one another being indicative of the relative sim-
ilarities of the distance matrices. Beside each point is the
name of the conditioning genome that was used in order
to produce the matrix. In the case of the SHOT methods
or the various CR alternatives, these are labeled appropri-
ately. In this plot the points are proportional in size to the
size of the conditioning genome and the colours of the
shaded points are darker for Crenarchaeotes and lighter
for Euryarchaeotes.
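For reference, the two SHOT matrices included in this comparison come from equations 1 and 2. A minimal Python sketch of those two distances is given below. It reads equation 1 as the negative logarithm of the shared gene-family count divided by a weighted average genome size, sqrt(2)·a·b / sqrt(a² + b²), and equation 2 as the negative logarithm of the shared count divided by the smaller genome size. This reading and the function name are assumptions of the sketch, and genomes are taken to be 0/1 presence vectors over the same gene families.

```python
import math

def shot_distances(x, y):
    """Return (d1, d2) for two genomes given as 0/1 presence vectors.
    Returns (None, None) when the genomes share no gene families."""
    n_pp = sum(1 for xi, yi in zip(x, y) if xi and yi)  # shared families
    a, b = sum(x), sum(y)                               # families per genome
    if n_pp == 0:
        return None, None  # logarithm of zero is undefined
    d1 = -math.log(n_pp * math.sqrt(a * a + b * b) / (math.sqrt(2) * a * b))
    d2 = -math.log(n_pp / min(a, b))
    return d1, d2
```

Under this reading, d2 is zero whenever the smaller genome is entirely contained in the larger one, whereas d1 still penalises the difference in genome size.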
The two axes in this plot account for 74% of the total
amount of variation in the PCA. The first axis (the
abscissa) accounts for 52% of the variation and the sec-
ond axis (the ordinate) accounts for 22% of the total var-
iation. No other axis accounted for more than 6% of the
variation, therefore these two axes are by far the most
important correlates with variation in the distance matri-
ces.
Firstly, an analysis of this plot shows that the same distance
matrix is not produced every time and that the calculated distances
heavily depend on the conditioning genome or the data treatment
that is used. The most important trend in these data matrices
(axis 1) is correlated with choice of conditioning genome. When
conditioning genomes are used that are closely related, then the
resulting distance matrices will tend to be closely related. For
example, the distance matrices produced using the four Thermococci
(see table S3 for classification) as the conditioning genomes are
clustered together. A similar within-group clustering is observed
when, say, the Thermoplasmatales, the Crenarchaeota or the
Methanogens act as the conditioning genome. When Archaeoglobus
fulgidus was used as the conditioning genome the resulting distance
matrix also clustered with the Methanogens. Arc. fulgidus is a
sulphur metabolising archaeon with similar biochemistry to the
methanogens [45], so this placement is perhaps not surprising.
Therefore, taking an overall look at the results of using different
conditioning genomes, we can see that phylogenetic position is the
most important factor in inducing differences in the distance
matrices.
The outliers on axis 1 in this plot are the matrices where the four
Thermococci were used as conditioning genomes. These outliers
account for much of the variation in axis 1. There are two things
that can be said about these distance matrices. Firstly, three of
these four matrices contained the highest proportion of undefined
values in our analyses. Undefined values occur when attempting to
perform an operation on invalid operands, e.g. getting the
logarithm of a negative number. When one Thermococcus is chosen as
the conditioning genome the distances between the other Thermococci
and the rest of the genomes contain a high proportion of undefined
values. This point is backed up by the claim [36] that a
conditioning genome far from the taxon of interest is optimal. This
may be the reason that these are outliers. Another possible reason
is that these are close relatives, so perhaps they are outliers
because quite simply these four conditioning genomes have produced
matrices that are similar to one another but very different to the
rest of the conditioning genomes, and the fact that they have large
numbers of undefined values is purely incidental.
The second most important axis (axis 2) clearly defines the split
between the Crenarchaeota and Euryarchaeota. So, even the second
most important source of variation in the data is also related to
phylogenetic affiliations of the conditioning genomes.
In order to explore whether these matrices are significantly
different to one another we used bootstrap resampling of the data.
For each dataset, we produced 100 bootstrap samples and 100
corresponding distance matrices. We expected to find one of two
situations. Either the variation within 100 bootstrap replicates is
so great that there is no correlation between matrices produced
using a

