
The K tree score: Quantification of differences in the relative branch length and topology of phylogenetic trees



We introduce a new phylogenetic comparison method that measures overall differences in the relative branch length and topology of two phylogenetic trees. To do this, the algorithm first scales one of the trees to have a global divergence as similar as possible to the other tree. Then, the branch length distance, which takes differences in topology and branch lengths into account, is applied to the two trees. We thus obtain the minimum branch length distance or K tree score. Two trees with very different relative branch lengths get a high K score whereas two trees that follow a similar among-lineage rate variation get a low score, regardless of the overall rates in both trees. There are several applications of the K tree score, two of which are explained here in more detail. First, this score allows the evaluation of the performance of phylogenetic algorithms, not only with respect to their topological accuracy, but also with respect to the reproduction of a given branch length variation. In a second example, we show how the K score allows the selection of orthologous genes by choosing those that better follow the overall shape of a given reference tree. Availability:
Vol. 23 no. 21 2007, pages 2954–2956
BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btm466
Víctor Soria-Carrasco, Gerard Talavera, Javier Igea and Jose Castresana
Department of Physiology and Molecular Biodiversity, Institute of Molecular Biology of Barcelona, CSIC,
Jordi Girona 18, 08034 Barcelona, Spain
Received on June 26, 2007; revised on August 13, 2007; accepted on September 6, 2007
Advance Access publication September 22, 2007
Associate Editor: Keith Crandall
In phylogenetic reconstruction, the application of different
methods or the use of different genes may lead to the estimation
of different phylogenetic trees (Castresana, 2007; Hillis et al.,
2005; Huerta-Cepas et al., 2007). To assess whether the
resulting trees are congruent, it is fundamental to be able to
quantify differences between such trees. Normally, only
topology is taken into account for such a task, for example, by
means of the symmetric difference (Robinson and Foulds,
1981). Few methods have been developed that also take branch
length information into account (Hall, 2005; Kuhner and
Felsenstein, 1994). These methods have been successfully
applied to quantify the performance of different phylogenetic
methods in simulated alignments, but they have the drawback
that they are not directly applicable to trees with different
evolutionary rates. Here, we introduce a new phylogenetic
comparison measure that takes branch length information into
account after scaling the trees so that they have comparable
global evolutionary rates.
The basis of our method to compare two phylogenetic trees,
T and T′, is the branch length distance (BLD) introduced by
Kuhner and Felsenstein (Felsenstein, 2004; Kuhner and
Felsenstein, 1994). This distance is sensitive to the similarity
in branch lengths of both trees. Consider the set of partitions
present in both trees, that is, the whole set of partitions present
in T plus the set of partitions present in T′ but not in T.
Partitions for external branches are also included. For tree T,
we can define an array B of branch lengths associated to
each partition, $(b_1, b_2, \ldots, b_N)$. Branches that do not appear in
T (corresponding to partitions that are only present in T′) are
assigned 0 in this array. We can similarly define the array B′,
$(b'_1, b'_2, \ldots, b'_N)$, associated to tree T′. The BLD between
trees T and T′ is the square root of the sum of $(b_i - b'_i)^2$
for all partitions. However,
BLD depends on the absolute size of the trees being compared,
so that two trees with the same shape (topology and relative
branch length) but different global rates will give rise to a very
high BLD (Kuhner and Felsenstein, 1994).
In our method, we introduce a factor, K, to scale tree T′ so
that both trees, T and T′, have a similar global divergence.
Thus, we are interested in calculating BLD after scaling T′ by
a factor K:
\[ \mathrm{BLD}(K) = \sqrt{\sum_{i=1}^{N} (b_i - K b'_i)^2} \qquad (1) \]
To obtain the value of K that minimizes BLD we differentiate
Equation (1). It can be shown that the value of K that makes
this derivative zero is:
\[ K = \frac{\sum_{i=1}^{N} b_i b'_i}{\sum_{i=1}^{N} (b'_i)^2} \qquad (2) \]
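The differentiation step can be written out explicitly. The following is a short derivation sketch rather than the authors' own working: it writes $b_i$ and $b'_i$ for the branch lengths of partition $i$ in T and T′, and $N$ for the number of partitions (the index $N$ is notation assumed here). Because the square root is monotonic, minimizing BLD is equivalent to minimizing its square, $f(K)$:

```latex
\begin{align*}
f(K) &= \sum_{i=1}^{N} (b_i - K b'_i)^2 \\
\frac{df}{dK} &= -2 \sum_{i=1}^{N} b'_i \,(b_i - K b'_i) = 0
  \quad\Longrightarrow\quad
  K = \frac{\sum_{i=1}^{N} b_i b'_i}{\sum_{i=1}^{N} (b'_i)^2} \\
\frac{d^2 f}{dK^2} &= 2 \sum_{i=1}^{N} (b'_i)^2 > 0
\end{align*}
```

The second derivative is positive whenever T′ has at least one branch of nonzero length, so this stationary point is a minimum.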
© The Author 2007. Published by Oxford University Press. All rights reserved.
We then substitute this value of K in Equation (1) and obtain
the minimum branch length distance or K tree score. It should
be taken into account that the K tree score is not symmetric,
that is, the result from T to T′ may not be the same as that from
T′ to T, and, in consequence, the K score does not have the
mathematical properties of a distance. Thus, this score is
generally not useful to compare only two trees (although the K
factor of Equation (2) can be very valuable for scaling
purposes; see below). The K tree score is most useful when
there is a tree that serves as reference (T) and several other trees
(T′) that will be scaled and compared to T. In such cases, trees
that are similar in shape to T will receive a low K tree score
whereas those that are very different will get a relatively higher
K score, regardless of their overall rates.
The method that calculates the K tree score (as well as other
tree comparison measures) is implemented in a Perl program
called Ktreedist.
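The calculation described above can be illustrated with a minimal Python sketch. This is not the Ktreedist code (which is written in Perl) and makes several assumptions: each tree is given as a dictionary from a partition label to a branch length, the labels and the function name `k_tree_score` are hypothetical, and the example trees are invented for illustration. It shows that a tree with the same shape but a threefold rate gets a score near zero, and that swapping reference and compared tree changes the score.

```python
import math


def k_tree_score(ref, other):
    """Scale `other` onto `ref`; return (K, K tree score).

    Each tree is a dict mapping a partition label to its branch length.
    The labels are an assumption of this sketch: the same split must
    carry the same label in both trees.
    """
    # Union of partitions; a branch absent from a tree gets length 0.
    partitions = sorted(set(ref) | set(other))
    b = [ref.get(p, 0.0) for p in partitions]
    b_prime = [other.get(p, 0.0) for p in partitions]

    denom = sum(y * y for y in b_prime)
    if denom == 0.0:
        raise ValueError("the compared tree has no nonzero branch")

    # Equation (2): scale factor that minimizes BLD.
    k = sum(x * y for x, y in zip(b, b_prime)) / denom
    # Equation (1) evaluated at that K.
    score = math.sqrt(sum((x - k * y) ** 2 for x, y in zip(b, b_prime)))
    return k, score


# Hypothetical four-taxon trees (external branches plus one internal split).
reference = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4, "AB|CD": 0.05}
same_shape = {p: 3.0 * length for p, length in reference.items()}
other_shape = {"A": 0.1, "B": 0.2, "C": 0.6, "D": 0.4, "AC|BD": 0.05}

k_same, score_same = k_tree_score(reference, same_shape)
k_fwd, score_fwd = k_tree_score(reference, other_shape)
k_rev, score_rev = k_tree_score(other_shape, reference)

print(f"same shape, 3x rate: K={k_same:.4f}, score={score_same:.4f}")
print(f"reference vs other:  K={k_fwd:.4f}, score={score_fwd:.4f}")
print(f"other vs reference:  K={k_rev:.4f}, score={score_rev:.4f}")
```

In practice the partition labels would have to be derived from the tree files themselves, which this sketch leaves out.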
There are several applications of the K tree score. First, it can
be used to evaluate the quality of phylogenetic reconstructions
in simulated alignments by comparing the true tree to the
trees obtained with different phylogenetic methods. For
example, the reference tree shown in Figure 1A was used to
simulate with SeqGen (Rambaut and Grassly, 1997) 100
alignments of 1000 positions with a GTR model and gamma
rate heterogeneity (α = 1.5). We then constructed maximum-
likelihood (ML) trees from such simulations using Phyml
(Guindon and Gascuel, 2003) with two different conditions:
without and with rate heterogeneity. To facilitate the compar-
ison between both phylogenetic methods we imposed the
topology of the reference tree during the ML reconstructions.
After averaging the branch lengths of the 100 reconstructed
trees, we obtained one tree for each phylogenetic method. Both
trees differed in their overall rates (with the tree reconstructed
without rate heterogeneity not capturing all substitutions, leading
to a K scale factor > 1) but, importantly, they also differed in their shapes:
see, for example, the relative lengths of sp3, sp4 and sp5.
The differences in shape were reflected in the K scores: 0.197 for
the average tree without rate heterogeneity and 0.030 for the
average tree calculated with rate heterogeneity, indicating the
better performance of the latter method. (Differences also
appeared after averaging the K score from the 100 trees
obtained with each method although, in this case, the
magnitude of the difference was smaller.) Thus, the K tree
score can be used to quantify the different quality in
branch length reconstruction of different phylogenetic meth-
ods. The K score can also be used with trees that have different
topologies. In such cases, nonshared branches that are
relatively long will contribute to the K score much more
than small conflicting branches. This is different from the
symmetric difference (Robinson and Foulds, 1981), in which
all topological differences count the same.
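The contrast with the symmetric difference can be made concrete with a small Python sketch. It assumes trees are represented as dictionaries from hypothetical partition labels to branch lengths, and the five-taxon trees are invented for illustration. Both compared trees differ from the reference by exactly one internal split in each direction, so their symmetric difference is the same, but the conflict involves a short branch in one case and a long branch in the other.

```python
import math


def k_tree_score(ref, other):
    """Return (K, K tree score) for `other` scaled onto `ref`."""
    partitions = sorted(set(ref) | set(other))
    b = [ref.get(p, 0.0) for p in partitions]
    b_prime = [other.get(p, 0.0) for p in partitions]
    k = sum(x * y for x, y in zip(b, b_prime)) / sum(y * y for y in b_prime)
    score = math.sqrt(sum((x - k * y) ** 2 for x, y in zip(b, b_prime)))
    return k, score


def symmetric_difference(ref, other):
    """Count partitions present in exactly one of the two trees."""
    return len(set(ref) ^ set(other))


# External branches shared by all three hypothetical trees.
external = {taxon: 0.1 for taxon in "ABCDE"}

# Reference: one short and one long internal branch.
reference = {**external, "AB|CDE": 0.02, "ABC|DE": 0.30}
# Conflict on the short internal branch only.
short_conflict = {**external, "AC|BDE": 0.02, "ABC|DE": 0.30}
# Conflict on the long internal branch only.
long_conflict = {**external, "AB|CDE": 0.02, "ABD|CE": 0.30}

rf_short = symmetric_difference(reference, short_conflict)
rf_long = symmetric_difference(reference, long_conflict)
_, score_short = k_tree_score(reference, short_conflict)
_, score_long = k_tree_score(reference, long_conflict)

print(f"short conflict: symmetric difference={rf_short}, K score={score_short:.4f}")
print(f"long conflict:  symmetric difference={rf_long}, K score={score_long:.4f}")
```

The symmetric difference cannot separate the two cases, whereas the K score is much larger when the conflicting branch is long.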
Fig. 1. (A) Reference tree used to simulate 100 alignments and the average reconstructions obtained by ML without and with rate heterogeneity.
(B) Trees obtained with 472 concatenated introns (reference tree) and with two individual introns (intron 1 of BXDC5 and intron 3 of EGLN2).
In a second example, we show how the K tree score can be
used to make an accurate selection of orthologous genes.
Orthologs should reflect the same topology of the species tree
but they should also give rise, in principle, to a similar tree
shape. We extracted from the ENSEMBL database (Hubbard
et al., 2007) the tables of pairs of orthologous genes of seven
mammalian species. By matching the pairwise orthology tables,
we constructed a set of one-to-one orthologs, and we down-
loaded the corresponding genes. We then extracted the introns
and, after applying several filters (elimination of very long
introns, those with problematic alignments, etc.), we obtained
a set of 472 putative orthologous introns. Some of these introns
produced ML phylogenetic trees that were of unusual shape,
which could be due to different rates of evolution in different
lineages (heterotachy) or could indicate that they do not come
from orthologous genes (hidden paralogy). We then con-
structed a reference tree (Fig. 1B) with the concatenated
alignment of the 472 introns using the RAxML program
(Stamatakis, 2006), which can handle very long alignments,
with a GTR model of evolution and four rate categories. This
tree should reflect the average divergence of the seven genomes
and, as expected, rodents showed a higher acceleration in their
branches. We then calculated the K score of the trees of all
individual introns with respect to the reference tree. We show in
Figure 1B the trees of two putative orthologous introns. Intron
1 of BXDC5, despite having a high global rate, produced a
phylogeny with the same topology and a very similar tree shape
to the reference tree. This was reflected in a low K score: 0.049,
smaller than the mean of the distribution of K scores of all
individual introns (0.104), which is indicative of a very likely
ortholog. (The K score would also be low in a similar tree but
with a topological conflict affecting a small branch, which
would not affect the high probability of orthology.) Intron 3 of
EGLN2 also reproduced the reference topology. However,
this tree showed a relatively long basal branch in primates as
well as a long branch connecting Euarchontoglires and
Laurasiatherians. In consequence, the K score for this tree
with respect to the reference is much higher: 0.270. In fact, this
value is a clear outlier in the distribution of K scores. Although
heterotachy cannot be ruled out, the chances that the latter
gene contains hidden paralogs in some species are higher than
in the first gene. Thus, the K score can be used to establish
a certain threshold and make a more accurate selection of
orthologous genes.
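How such a threshold might be applied is sketched below in Python. The text does not specify a particular rule, so the upper Tukey fence (third quartile plus 1.5 times the interquartile range) is an assumption of this example, as is the function name `flag_outliers`. Only the two scores for BXDC5 and EGLN2 come from the text; the remaining values are hypothetical placeholders standing in for the other introns.

```python
import statistics


def flag_outliers(scores, fence_factor=1.5):
    """Return the upper fence and the genes whose K score exceeds it.

    The fence (Q3 + fence_factor * IQR) is an assumed rule, not one
    prescribed by the method itself.
    """
    q1, _, q3 = statistics.quantiles(scores.values(), n=4)
    fence = q3 + fence_factor * (q3 - q1)
    flagged = sorted(gene for gene, score in scores.items() if score > fence)
    return fence, flagged


k_scores = {
    "BXDC5_intron1": 0.049,   # reported in the text
    "EGLN2_intron3": 0.270,   # reported in the text
    "hypothetical_1": 0.070,  # placeholder values from here on
    "hypothetical_2": 0.080,
    "hypothetical_3": 0.090,
    "hypothetical_4": 0.100,
    "hypothetical_5": 0.110,
    "hypothetical_6": 0.120,
    "hypothetical_7": 0.130,
}

fence, flagged = flag_outliers(k_scores)
print(f"upper fence: {fence:.3f}")
print("flagged as possible hidden paralogs or heterotachy:", flagged)
```

Any cutoff of this kind only flags candidates; as noted above, a high score may also reflect heterotachy rather than paralogy.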
If orthology is ensured for a set of genes, then a high K tree
score with respect to a given reference will be indicative of trees
with very fast-evolving species or with a significant amount of
other types of heterotachy. Such trees are more difficult to
reconstruct, and thus the K tree score can be used to select
(in a similar way as above) a set of the most reliable genes for
estimating species phylogenies.
On a more practical side, the K scale factor [Equation (2)]
can be used in instances where it is necessary to scale trees to
have equivalent divergences. For example, the linearization of
trees by means of a method like nonparametric rate smoothing
produces trees with an arbitrary scale when no dates are known
for the tree nodes (Sanderson, 1997). In such cases, one can
make use of the K scale factor obtained from the comparison
between the linearized tree and the original (reference) tree:
the scaling of the linearized tree with this K factor will
re-establish a genetic distance scale equivalent to that of the
original tree.
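The rescaling step can be sketched in Python as follows. The trees are again dictionaries from hypothetical partition labels to branch lengths, the branch lengths are invented, and the helper names `k_factor` and `rescale` are assumptions of this example. The linearized tree is given in arbitrary units, and multiplying every branch by K brings it onto the scale of the original tree, after which the scale factor between the two is 1.

```python
def k_factor(ref, other):
    """Scale factor K of Equation (2) for `other` relative to `ref`."""
    partitions = set(ref) | set(other)
    numerator = sum(ref.get(p, 0.0) * other.get(p, 0.0) for p in partitions)
    denominator = sum(other.get(p, 0.0) ** 2 for p in partitions)
    return numerator / denominator


def rescale(tree, k):
    """Multiply every branch length of `tree` by k."""
    return {partition: k * length for partition, length in tree.items()}


# Hypothetical original tree, in substitutions per site.
original = {"A": 0.10, "B": 0.14, "C": 0.22, "D": 0.18, "AB|CD": 0.06}
# Hypothetical linearized version of the same topology, arbitrary units.
linearized = {"A": 20.0, "B": 20.0, "C": 30.0, "D": 30.0, "AB|CD": 10.0}

k = k_factor(original, linearized)
rescaled = rescale(linearized, k)
k_after = k_factor(original, rescaled)

print(f"K = {k:.6f}")
print(f"total length, original: {sum(original.values()):.4f}")
print(f"total length, rescaled: {sum(rescaled.values()):.4f}")
print(f"K after rescaling: {k_after:.6f}")
```

Rescaling leaves the relative branch lengths of the linearized tree unchanged; only the units change.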
J.C. is supported by grant number CGL2005-01341/BOS from
the Plan Nacional I+D+I of the MEC (Spain), cofinanced with
FEDER funds.
Conflict of Interest: none declared.
Castresana,J. (2007) Topological variation in single-gene phylogenetic trees.
Genome Biol., 8, 216.
Felsenstein,J. (2004) Inferring Phylogenies. Sinauer Associates, Sunderland,
Massachusetts, pp. 531–533.
Guindon,S. and Gascuel,O. (2003) A simple, fast, and accurate algorithm to
estimate large phylogenies by maximum likelihood. Syst. Biol., 52, 696–704.
Hall,B.G. (2005) Comparison of the accuracies of several phylogenetic methods
using protein and DNA sequences. Mol. Biol. Evol., 22, 792–802.
Hillis,D.M. et al. (2005) Analysis and visualization of tree space. Syst. Biol., 54,
Hubbard,T.J.P. et al. (2007) Ensembl 2007. Nucleic Acids Res., 35, D610–D617.
Huerta-Cepas,J. et al. (2007) The human phylome. Genome Biol., 8, R109.
Kuhner,M.K. and Felsenstein,J. (1994) A simulation comparison of phylogeny
algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol., 11,
Rambaut,A. and Grassly,N.C. (1997) Seq-Gen: an application for the Monte
Carlo simulation of DNA sequence evolution along phylogenetic trees.
Comput. Appl. Biosci., 13, 235–238.
Robinson,D.F. and Foulds,L.R. (1981) Comparison of phylogenetic trees. Math.
Biosci., 53, 131–147.
Sanderson,M.J. (1997) A nonparametric approach to estimating divergence times
in the absence of rate constancy. Mol. Biol. Evol., 14, 1218–1231.
Stamatakis,A. (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic
analyses with thousands of taxa and mixed models. Bioinformatics, 22,
V.Soria-Carrasco et al.
by guest on June 11, 2013 from
... Branch support values were obtained using nonparametric bootstrapping with 1,000 resamplings for the first three phylogenetic methods, and the posterior probabilities for the Bayesian approach were estimated on 16,806 samples with a Burn-in phase for the first 25% of tree samples. The best tree for the F gene was determined by calculating the minimum branch length distance (K tree score) between the phylogenetic trees by the Ktreedist program [42]. The complete F gene data set was also used to calculate the mean evolutionary distances within and between clusters. ...
... The robustness of the trees was assessed by comparison of the bootstrap values for NJ, MP, and ML and posterior probabilities of the Bayesian approach [38]. A K-score analysis was also done to compare MrB, ML, and NJ using the Ktreedist program [42]. From these comparisons, the performance of MrBayes was slightly better than ML, and both were clearly higher than NJ and MP. ...
... (ZIP) Table S4 Comparison of trees generated by MrBayes, Maximum Likelihood, and Neighbor Joining on 741 F gene sequences. The comparison was made by calculating the minimum branch length distance (K tree score) between the phylogenetic trees using the Ktreedist program [42]. The minimal score represents the best tree. ...
... Nous avons retrouvé en parallèle l'histoire évolutive des 77 espèces en utilisant l'outil en ligne TimeTree [235]. Les distances entre les différents arbres phylogénétiques ont été mesurées par le biais de l'indice de Robinson-Foulds, qui se base sur la topologie [236], et l'approche du score de K-tree, qui prend en compte les différences à l'échelle de la topologie et de la longueur de branches [237]. Après inférence phylogénétique à partir des deux jeux de données, nous avons calculé l'état ancestral des ancêtres communs les plus proches de chaque clade de PTBP à l'aide de RAxML. ...
... Importantly, all these instances of divergent behaviour 234 (except for the platypus PTBP3) are consistent with the deviations described above from the expected composition by 235 the mathematical modelling of the ortholog nucleotide composition. 236 Mammalian PTBP1s accumulate GC-enriching synonymous substitutions 237 We have shown that PTBP1 genes are GC-richer and specifically GC3-richer than the PTBP2 and PTBP3 paralogs 238 in the same genome, and that this enrichment is of a larger magnitude in placental PTBP1s. We have thus assessed 239 whether a directional mutational pattern underlies this enrichment, especially regarding synonymous mutations. ...
Full-text available
Au cours du processus cellulaire de traduction, la machinerie ribosomale synthétise une protéine au travers de la lecture successive des codons le long de l'ARN messager. À chaque codon lu, les ribosomes font appel aux ARN de transfert, chargés d'un acide aminé (l'unité de base des protéines). La complémentarité entre le codon de l'ARNm et l'anticodon de l'ARNt est évaluée, conduisant éventuellement à la polymérisation de l’acide aminé sur la protéine naissante. Il existe 64 codons associés classiquement à 20 acides aminés. Plusieurs codons, qualifiés de synonymes, peuvent donc être associés à un même acide aminé. Le biais d’usage des codons (CUB) désigne l'usage différentiel des codons synonymes à l’échelle d'un gène, d'une région génomique ou d'un génome. Le CUB peut être associé à des processus mutationnels, à l'origine de particularités locales de composition nucléotidique, mais aussi à des processus de sélection pour améliorer la dynamique de synthèse de protéines. L’influence de ces deux processus sur le CUB a été démontrée chez les procaryotes et chez certains eucaryotes. Cependant, il n’existe pas d’évidence forte d’une sélection agissant à l’échelle des gènes des Vertébrés, et plus précisément des mammifères. Considérons-nous donc le CUB dans ces espèces sous un angle correct ? Possédons-nous les outils nécessaires pour tirer de telles conclusions ? Pour répondre à ces questions, nous proposons ici une approche mathématique, informatique et analytique du CUB par le biais de l'analyse de paralogues et de virus humains.Nous avons conçu un nouvel indice de mesure du CUB appelé COUSIN (COdon Usage Similarity INdex), qui quantifie la distance entre le CUB d’une séquence et celui d’une référence, et qui se démarque des autres indices existants par sa clarté dans l’interprétation des résultats. 
Cet indice est implémenté au sein d’un un outil éponyme ( Dans un deuxième temps, nous avons effectué une étude de l’histoire évolutive et du CUB des gènes paralogues de Vertébrés Polypyrimidine Tract-Binding Protein (PTBP) dont l’expression tissu-spécifique pourrait être associée aux différences dans leur CUB. Nous montrons que les paralogues PTBP1 semblent soumis à un biais mutationnel vers un enrichissement en GC, alors que le CUB des PTBP2 pourrait refléter une sélection traductionnelle vers l’utilisation de codons rares dans le génome. Nous interprétons que l’évolution du CUB des PTBPs est compatible avec un scénario de sous-fonctionnalisation des paralogues par expression différentielle pendant le développement des Vertébrés. Finalement, nous avons étudié le CUB chez des virus humains au travers des polyomavirus humains (PyVs). Du fait de leur mode de vie obligatoirement parasite de la machinerie de traduction, le CUB des virus pourrait impacter la présentation clinique de l’infection. Notre choix des PyVs humains vient précisément de leur diversité génotypique ainsi que de leur multiplicité de manifestations cliniques. Les infections par PyVs humains sont fortement prévalentes et asymptomatiques mais, dans un contexte d’immunosuppression, elles peuvent provoquer des symptômes tissulaires importants et parfois mortels. Le Polyomavirus BK (BKPyV) est notamment connu pour provoquer des néphropathies chez des patients récipients d’une greffe de reins. Pour préparer l'analyse de données longitudinales de virémie, de virurie et génétiques sur des patients récipients d'une greffe de reins, nous avons construit deux pipelines permettant une analyse du génome des PyVs et de leur génotype. 
Afin de mieux comprendre la dynamique évolutive de BKPyV dans le cadre des néphropathies, nous avons analysé l’évolution et le CUB des PyVs dans le contexte de la relation hôte-parasite.Les résultats proposés au sein de cette thèse enrichissent les bases pour l’étude du CUB chez les Vertébrés, et alimentent le débat sur les approches tissu-spécifiques au travers de l'expression différentielle des gènes et du tropisme des virus.
... Mutations that co-occur most frequently, or over the shortest average distance along branches of the tree, can be viewed as epistatic. In the case of H3 hemagglutinin, the ancestral mutation A138S located in a known antigenic site (designated A) occurred far more frequently among non-conservative mutation pairs that H1N1 -1930 -2002 H1N1 -1933 -1989 H1N1 -1999 -2007 H1N1 -2007 -2009 (with H274Y) HxN1 -1934H1N1 -2002HxN1 -1976H1N1 -2009pandemic -2012H1N1 -1991-1998H1N1 -2007H5N1 -2002-2008 Mass Tree H1N1 -1930-2002H1N1 -1933-1989H1N1 -1999-2007H1N1 -2007(with H274Y) HxN1 -1934-2008H1N1 -1992HxN1 -1976H1N1 -2009pandemic -2012H1N1 -1991-1998H1N1 -2006-2008H5N1 -2002-2008 co-occur over the smallest branch distances across the tree. The most frequent descendant mutations occurred in other antigenic domains or elsewhere in the structure consistent with the findings of Kryazhimskiy and co-workers (Kryazhimskiy et al., 2011). ...
... The MAST algorithm (Gordon, 1980) returned low congruence indices (P values) of between 2 to 3, where a P value of 1 indicates the trees are identical. Similarly, the application of KTreedist (Soria-Carrasco et al., 2007) and TreeCmp (Bogdanowicz et al., 2012) for the much larger trees, that have been built in our more recent applications (Akand & Downard, 2020), revealed a high degree of similarity with the companion sequence trees. The trees were determined to be congruent based on the K tree scores and MS distances, respectively, when the size of the trees and the number of leaf nodes were taken into account. ...
Full-text available
An alternative, more rapid, sequence‐free approach to build phylogenetic trees has been conceived and implemented. Molecular phylogenetics has continued to mostly focus on improvement in tree construction based on gene sequence alignments. Here protein‐based phylogenies are constructed using numerical data sets (“phylonumerics”) representing the masses of peptide segments recorded in a mass mapping experiment. This truly sequence‐free method requires no gene sequences, nor their alignment, to build the trees affording a considerable time and cost‐saving to conventional phylogenetics methods. The approach also calculates single point amino acid mutations from a comparison of mass pairs from different maps in the data set and displays these at branch nodes across the tree together with their frequency. Studies of the consecutive, and near‐consecutive, ancestral and descendant mutations across interconnected branches of a mass tree allow putative adaptive, epistatic, and compensatory mutations to be identified in order to investigate mechanisms associated with evolutionary processes and pathways. A side‐by‐side comparison of this sequence‐free approach and conventional gene sequence phylogenetics is discussed.
... All rights reserved sequence evolution (Kimura 1980) was implemented, as it is most often used by the DNA barcode community and BOLD. The mini-barcode ML trees were then compared to the full-length reference trees using Ktreedist (Soria-Carrasco et al. 2007). K-scores (topology and branch length differences) and Robinson-Foulds symmetric differences (topological differences) were calculated for each dataset. ...
Full-text available
Metabarcoding to determine the species composition and diversity of marine zooplankton communities is a fast‐developing field in which the standardization of methods is yet to be fully achieved. The selection of genetic markers and primer choice are particularly important because they substantially influence species detection rates and accuracy. Validation is therefore an important step in the design of metabarcoding protocols. We developed taxon‐specific mini‐barcode primers for the cytochrome c oxidase subunit I (COI) gene region and used an experimental approach to test species detection rates and primer accuracy of the newly designed primers for prawns, shrimps and crabs and published primers for marine lobsters and fish. Artificially assembled mock communities (with known species ratios) and unsorted coastal tow net zooplankton samples were sequenced and the detected species were compared to those seeded in mock communities to test detection rates. Taxon‐specific primers increased detection rates of target taxa compared to a universal primer set. Primer cocktails (multiple primer sets) significantly increased species detection rates compared to single primer pairs and could detect up to 100% of underrepresented target taxa in mock communities. Taxon‐specific primers recovered fewer false positive or negative results than the universal primer. The methods used to design taxon‐specific mini‐barcodes and the experimental mock community validation protocols shown here can easily be applied to studies on other groups and will allow for a level of standardization among studies undertaken in different ecosystems or geographic locations.
... We identified 30 core genes (6% of the total core genes) with recombination signal, along with putative major parents, minor parents, and recombinants in each of these genes. Core genome phylogenetic tree with non-recombinant core genes (489 genes) was found to be congruent (K tree score = 0.0015) (Soria-Carrasco et al., 2007) with that of previous core genome tree, implying undetectable impact of recombination events on core genome divergence as a whole. As mobile genetic elements influence HGT as well as transposition, it is apparent that less susceptibility to the mobile genetic elements for S. acidocaldarius strains in comparison to S. islandicus and S. solfataricus strains resulted in higher preservation in S. acidocaldarius genomes (Brügger et al., 2002;Chen et al., 2005;Redder and Garrett, 2006;Quehenberger et al., 2017;Wagner et al., 2017). ...
Full-text available
Sulfolobaceae family, comprising diverse thermoacidophilic and aerobic sulfur-metabolizing Archaea from various geographical locations, offers an ideal opportunity to infer the evolutionary dynamics across the members of this family. Comparative pan-genomics coupled with evolutionary analyses has revealed asymmetric genome evolution within the Sulfolobaceae family. The trend of genome streamlining followed by periods of differential gene gains resulted in an overall genome expansion in some species of this family, whereas there was reduction in others. Among the core genes, both Sulfolobus islandicus and Saccharolobus solfataricus showed a considerable fraction of positively selected genes and also higher frequencies of gene acquisition. In contrast, Sulfolobus acidocaldarius genomes experienced substantial amount of gene loss and strong purifying selection as manifested by relatively lower genome size and higher genome conservation. Central carbohydrate metabolism and sulfur metabolism coevolved with the genome diversification pattern of this archaeal family. The autotrophic CO 2 fixation with three significant positively selected enzymes from S. islandicus and S. solfataricus was found to be more imperative than heterotrophic CO 2 fixation for Sulfolobaceae. Overall, our analysis provides an insight into the interplay of various genomic adaptation strategies including gene gain-loss, mutation, and selection influencing genome diversification of Sulfolobaceae at various taxonomic levels and geographical locations.
... The resulting phylogenetic trees, both from empirical and simulated data sets, were evaluated along four different axes: 1) percentage of nodes correctly resolved, 2) relative branch lengths differences using the K tree score (K) (Soria-Carrasco et al. 2007), 3) bootstrap values as an average of all nodes (for simulations only) and 4) degree of success in recovering monophyletic genera. ...
Full-text available
Taxa are frequently labeled incertae sedis when their placement is debated at ranks above the species level, such as their subgeneric, generic or subtribal placement. This is a pervasive problem in groups with complex systematics due to difficulties in identifying suitable synapomorphies. In this study, we propose combining DNA barcodes with a multi-locus backbone phylogeny in order to assign taxa to genus or other higher-level categories. This sampling strategy generates molecular matrices containing large amounts of missing data that are not distributed randomly: barcodes are sampled for all representatives, and additional markers are sampled only for a small percentage. We investigate the effects of the degree and randomness of missing data on phylogenetic accuracy using simulations for up to 100 markers in 1000-tips trees, as well as a real case: the subtribe Polyommatina (Lepidoptera: Lycaenidae), a large group including numerous species with unresolved taxonomy. Our simulation tests show that when a strategic and representative selection of species for higher-level categories has been made for multi-gene sequencing (approximately one per simulated genus), the addition of this multi-gene backbone DNA data for as few as 5-10% of the specimens in the total dataset can produce high-quality phylogenies, comparable to those resulting from 100% multi-gene sampling. In contrast, trees based exclusively on barcodes performed poorly. This approach was applied to a 1365-specimen dataset of Polyommatina (including ca. 80% of described species), with nearly 8% of representative species included in the multi-gene backbone and the remaining 92% included only by mitochondrial COI barcodes, a phylogeny was generated that highlighted potential misplacements, unrecognized major clades, and placement for insertae sedis taxa. We use this information to make systematic rearrangements within Polyommatina, and to describe two new genera. 
Finally, we propose a systematic workflow to assess higher-level taxonomy in hyperdiverse groups. This research identifies an additional, enhanced value of DNA barcodes for improvements in higher-level systematics using large datasets.
... This metric gives a measure of distance between two trees by counting the number of dissimilar partitions. The RF distance between the estimated and benchmark tree is found using KTreeDist [19]. Performance of simulated data is compared with that of an alignment-based method, GTRGAMMA model from RAxML. ...
Full-text available
Phylogenetic analysis i.e. construction of an accurate phylogenetic tree from genomic sequences of a set of species is one of the main challenges in bioinformatics. The popular approaches to this require aligning each pair of sequences to calculate pairwise distances or aligning all the sequences to construct a multiple sequence alignment. The computational complexity and difficulties in getting accurate alignments have led to development of alignment-free methods to estimate phylogenies. However, the alignment free approaches focus on computing distances between species and do not utilize statistical approaches for phylogeny estimation. Herein, we present a simple alignment free method for phylogeny construction based on contiguous sub-sequences of length k termed k -mers. The presence or absence of these k -mers are used to construct a phylogeny using a maximum likelihood approach. The results suggest our method is competitive with other alignment-free approaches, while outperforming them in some cases.
The circulation of the four dengue virus (DENV) serotypes has significantly increased in recent years, accompanied by an increase in viral genetic diversity. In order to conduct disease surveillance and understand DENV evolution and its effects on virus transmission and disease, efficient and accurate methods for phylogenetic classification are required. Phylogenetic analysis of different viral gene sequences is the most used method, the envelope gene (E) being the most frequently selected target. We explored the genetic variability of the four DENV serotypes throughout their complete coding sequence (CDS) using sequences available in GenBank, and used genomic regions with different variability rates to recapitulate the phylogeny obtained with the DENV CDS. Our results indicate that the use of high or low variable regions accurately recapitulates the phylogeny obtained with the CDS of sequences from different DENV genotypes. However, when analyzing the phylogeny of a single genotype, highly variable regions performed better in recapitulating the distance branch length, topology, and support of the CDS phylogeny. The use of three concatenated highly variable regions was not statistically different in distance branch length and support from that obtained in the CDS phylogeny.
• This study demonstrated the ability of highly variable regions of the DENV genome to recapitulate the phylogeny obtained with the full coding sequence (CDS).
• The use of genomic regions of high or low variability did not affect the performance in recapitulating the phylogeny obtained with the CDS from different genotypes. However, when phylogeny was analyzed for sequences from a single genotype, highly variable regions performed better in recapitulating the distance branch length, topology, and support of the CDS phylogeny.
• The use of concatenated highly variable genome regions represents a useful option for recapitulating genome-wide phylogenies in analyses of sequences belonging to the same DENV genotype.
Dengue fever is caused by four related dengue virus serotypes, DENV-1 to DENV-4, where each serotype comprises distinct genotypes and lineages. The last major outbreak in Mexico occurred during 2012 and 2013, when 112,698 confirmed cases were reported (DENV-1 and DENV-2 were predominant). Following partial E, NS2A and NS5 gene sequencing, based on the virus genome variability, we analyzed 396 DENV-1 and 248 DENV-2 gene sequences from serum samples from acute clinical dengue cases from 13 Mexican states. Mutations were identified and their genetic variability estimated, along with their evolutionary relationship with DENV sequences sampled globally. DENV-1 genotype V and DENV-2 Asian-American genotype V were the only genotypes circulating during the outbreak. Mutations in NS2A and NS5 proteins were widely disseminated and suggested local emergence of new lineages. Phylogeographic analysis suggested viral spread occurred from coastal regions and tourist destinations, such as Yucatan and Quintana Roo, which played important roles in disseminating these lineages.
Motivation: The recent widespread application of whole-genome sequencing (WGS) for microbial disease investigations has spurred the development of new bioinformatics tools, including a notable proliferation of phylogenomics pipelines designed for infectious disease surveillance and outbreak investigation. Transitioning the use of WGS data out of the research lab and into the front lines of surveillance and outbreak response requires user-friendly, reproducible, and scalable pipelines that have been well validated. Results: SNVPhyl (Single Nucleotide Variant Phylogenomics) is a bioinformatics pipeline for identifying high-quality SNVs and constructing a whole genome phylogeny from a collection of WGS reads and a reference genome. Individual pipeline components are integrated into the Galaxy bioinformatics framework, enabling data analysis in a user-friendly, reproducible, and scalable environment. We show that SNVPhyl can detect SNVs with high sensitivity and specificity and identify and remove regions of high SNV density (indicative of recombination). SNVPhyl is able to correctly distinguish outbreak from non-outbreak isolates across a range of variant-calling settings, sequencing-coverage thresholds, or in the presence of contamination. Availability: SNVPhyl is available as a Galaxy workflow, Docker and virtual machine images, and a Unix-based command-line application. SNVPhyl is released under the Apache 2.0 license and available at or at .
Using simulated data, we compared five methods of phylogenetic tree estimation: parsimony, compatibility, maximum likelihood, Fitch-Margoliash, and neighbor joining. For each combination of substitution rates and sequence length, 100 data sets were generated for each of 50 trees, for a total of 5,000 replications per condition. Accuracy was measured by two measures of the distance between the true tree and the estimate of the tree, one measure sensitive to accuracy of branch lengths and the other not. The distance-matrix methods (Fitch-Margoliash and neighbor joining) performed best when they were constrained from estimating negative branch lengths; all comparisons with other methods used this constraint. Parsimony and compatibility had similar results, with compatibility generally inferior; Fitch-Margoliash and neighbor joining had similar results, with neighbor joining generally slightly inferior. Maximum likelihood was the most successful method overall, although for short sequences Fitch-Margoliash and neighbor joining were sometimes better. Bias of the estimates was inferred by measuring whether the independent estimates of a tree for different data sets were closer to the true tree than to each other. Parsimony and compatibility had particular difficulty with inaccuracy and bias when substitution rates varied among different branches. When rates of evolution varied among different sites, all methods showed signs of inaccuracy and bias.
A biologically realistic method was used to simulate evolutionary trees. The method uses a real DNA coding sequence as the starting point, simulates mutation according to the mutational spectrum of Escherichia coli (including base substitutions, insertions, and deletions), and separates the processes of mutation and selection. Trees of 8, 16, 32, and 64 taxa were simulated with average branch lengths of 50, 100, 150, 200, and 250 changes per branch. The resulting sequences were aligned with ClustalX, and trees were estimated by Neighbor Joining, Parsimony, Maximum Likelihood, and Bayesian methods from both DNA sequences and the corresponding protein sequences. The estimated trees were compared with the true trees, and both topological and branch length accuracies were scored. Over the variety of conditions tested, Bayesian trees estimated from DNA sequences that had been aligned according to the alignment of the corresponding protein sequences were the most accurate, followed by Maximum Likelihood trees estimated from DNA sequences and Parsimony trees estimated from protein sequences.
We explored the use of multidimensional scaling (MDS) of tree-to-tree pairwise distances to visualize the relationships among sets of phylogenetic trees. We found the technique to be useful for exploring “tree islands” (sets of topologically related trees among larger sets of near-optimal trees), for comparing sets of trees obtained from bootstrapping and Bayesian sampling, for comparing trees obtained from the analysis of several different genes, and for comparing multiple Bayesian analyses. The technique was also useful as a teaching aid for illustrating the progress of a Bayesian analysis and as an exploratory tool for examining large sets of phylogenetic trees. We also identified some limitations to the method, including distortions of the multidimensional tree space into two dimensions through the MDS technique, and the definition of the MDS-defined space based on a limited sample of trees. Nonetheless, the technique is a useful approach for the analysis of large sets of phylogenetic trees.
Unlabelled: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML). Low-level technical optimizations, a modification of the search algorithm, and the use of the GTR+CAT approximation as replacement for GTR+Gamma yield a program that is between 2.7 and 52 times faster than the previous version of RAxML. A large-scale performance comparison with GARLI, PHYML, IQPNNI and MrBayes on real data containing 1000 up to 6722 taxa shows that RAxML requires at least 5.6 times less main memory and yields better trees in similar times than the best competing program (GARLI) on datasets up to 2500 taxa. On datasets of ≥4000 taxa it also runs 2-3 times faster than GARLI. RAxML has been parallelized with MPI to conduct parallel multiple bootstraps and inferences on distinct starting trees. The program has been used to compute ML trees on two of the largest alignments to date containing 25,057 (1463 bp) and 2182 (51,089 bp) taxa, respectively. Availability:
A new method for estimating divergence times when evolutionary rates are variable across lineages is proposed. The method, called nonparametric rate smoothing (NPRS), relies on minimization of ancestor-descendant local rate changes and is motivated by the likelihood that evolutionary rates are autocorrelated in time. Fossil information pertaining to minimum and/or maximum ages of nodes in a phylogeny is incorporated into the algorithms by constrained optimization techniques. The accuracy of NPRS was examined by comparison to a clock-based maximum-likelihood method in computer simulations. NPRS provides more accurate estimates of divergence times when (1) sequence lengths are sufficiently long, (2) rates are truly nonclocklike, and (3) rates are moderately to highly autocorrelated in time. The algorithms were applied to estimate divergence times in seed plants based on data from the chloroplast rbcL gene. Both constrained and unconstrained NPRS methods tended to produce divergence time estimates more consistent with paleobotanical evidence than did clock-based estimates.
A metric on general phylogenetic trees is presented. This extends the work of most previous authors, who constructed metrics for binary trees. The metric presented in this paper makes possible the comparison of the many nonbinary phylogenetic trees appearing in the literature. This provides an objective procedure for comparing the different methods for constructing phylogenetic trees. The metric is based on elementary operations which transform one tree into another. Various results obtained in applying these operations are given. They enable the distance between any pair of trees to be calculated efficiently. This generalizes previous work by Bourque to the case where interior vertices can be labeled, and labels may contain more than one element or may be empty.
Unpublished doctoral thesis. Universidad Autónoma de Madrid, Facultad de Ciencias, Departamento de Biología Molecular. Date of defense: 07-11-2008. Bibliography: pp. 95-100.
Motivation: Seq-Gen is a program that will simulate the evolution of nucleotide sequences along a phylogeny, using common models of the substitution process. A range of models of molecular evolution are implemented, including the general reversible model. Nucleotide frequencies and other parameters of the model may be given and site-specific rate heterogeneity can also be incorporated in a number of ways. Any number of trees may be read in and the program will produce any number of data sets for each tree. Thus, large sets of replicate simulations can be easily created. This can be used to test phylogenetic hypotheses using the parametric bootstrap. Availability: Seq-Gen can be obtained by WWW from http:/(/) + or by FTP from ftp:/(/) The package includes the source code, manual and example files. An Apple Macintosh version is available from the same sites.
The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximum-likelihood principle, which clearly satisfies these requirements. The core of this method is a simple hill-climbing algorithm that adjusts tree topology and branch lengths simultaneously. This algorithm starts from an initial tree built by a fast distance-based method and modifies this tree to improve its likelihood at each iteration. Due to this simultaneous adjustment of the topology and branch lengths, only a few iterations are sufficient to reach an optimum. We used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches. The reduction of computing time is dramatic in comparison with other maximum-likelihood packages, while the likelihood maximization ability tends to be higher. For example, only 12 min were required on a standard personal computer to analyze a data set consisting of 500 rbcL sequences with 1,428 base pairs from plant plastids, thus reaching a speed of the same order as some popular distance-based and parsimony algorithms. This new method is implemented in the PHYML program, which is freely available on our web page: