Fast calculation of the quartet distance between trees of arbitrary degrees.

Department of Computer Science, University of Aarhus, Aabogade 34, DK-8200 Århus N, Denmark.
Algorithms for Molecular Biology (Impact Factor: 1.61). 02/2006; 1:16. DOI: 10.1186/1748-7188-1-16
Source: DBLP

ABSTRACT A number of algorithms have been developed for calculating the quartet distance between two evolutionary trees on the same set of species. The quartet distance is the number of quartets (sub-trees induced by four leaves) that differ between the trees. Most of these algorithms are restricted to binary trees, but recently we have developed algorithms that work on trees of arbitrary degree.
We present a fast algorithm for computing the quartet distance between trees of arbitrary degree. Given input trees T and T', the algorithm runs in time O(n + |V|·|V'|·min{id, id'}) and space O(n + |V|·|V'|), where n is the number of leaves in the two trees, V and V' are the sets of non-leaf nodes in T and T', respectively, and id and id' are the maximal number of non-leaf nodes adjacent to a non-leaf node in T and T', respectively. The fastest algorithms previously published for arbitrary degree trees run in O(n^3) (independent of the degree of the tree) and O(|V|·|V'|…), respectively. We experimentally compare the algorithm with existing algorithms for computing the quartet distance for general trees.
We present a new algorithm for computing the quartet distance between two trees of arbitrary degree. The new algorithm improves the asymptotic running time for computing the quartet distance, compared to previous methods, and experimental results indicate that the new method also performs significantly better in practice.
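To make the definition of the quartet distance concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper: it enumerates all sets of four leaves explicitly and so takes time on the order of n^4. The adjacency-dictionary tree representation and the function names (tree_from_edges, leaf_distances, quartet_topology, quartet_distance) are assumptions made for illustration. The topology of each quartet is read off from hop counts between leaves: the pairing with the strictly smallest sum of within-pair distances is the induced split, and a three-way tie means the quartet is unresolved (a star), which is how trees of arbitrary degree are handled here.

```python
from collections import deque
from itertools import combinations


def tree_from_edges(edges):
    """Build an undirected adjacency dict from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def leaf_distances(adj):
    """Run a BFS from every leaf; returns {leaf: {node: hop count}}."""
    leaves = [v for v, nbrs in adj.items() if len(nbrs) == 1]
    dist = {}
    for source in leaves:
        seen = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen[w] = seen[u] + 1
                    queue.append(w)
        dist[source] = seen
    return dist


def quartet_topology(dist, a, b, c, d):
    """Return the split induced on {a, b, c, d}, or "star" if unresolved."""
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    sums = [dist[p[0]][p[1]] + dist[q[0]][q[1]] for p, q in pairings]
    smallest = min(sums)
    if sums.count(smallest) != 1:
        return "star"  # all three sums tie: no internal edge separates pairs
    p, q = pairings[sums.index(smallest)]
    return frozenset({frozenset(p), frozenset(q)})


def quartet_distance(adj1, adj2):
    """Count the four-leaf subsets whose induced topologies differ."""
    d1, d2 = leaf_distances(adj1), leaf_distances(adj2)
    if set(d1) != set(d2):
        raise ValueError("trees must have the same leaf set")
    return sum(
        1
        for quartet in combinations(sorted(d1), 4)
        if quartet_topology(d1, *quartet) != quartet_topology(d2, *quartet)
    )


# Hypothetical example trees on the leaves a..e; x, y, z, s are inner nodes.
t1 = tree_from_edges([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
                      ("y", "z"), ("d", "z"), ("e", "z")])
t2 = tree_from_edges([("a", "x"), ("c", "x"), ("x", "y"), ("b", "y"),
                      ("y", "z"), ("d", "z"), ("e", "z")])
star = tree_from_edges([(leaf, "s") for leaf in "abcde"])
print(quartet_distance(t1, t2), quartet_distance(t1, star))
```

In this example the two resolved trees differ only on the quartets that contain a, b and c together, while the star tree resolves none of the five quartets and so differs from the binary tree on all of them.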

  • Source
    ABSTRACT: Distance measures between trees are useful for comparing trees in a systematic manner, and several different distance measures have been proposed. The triplet and quartet distances, for rooted and unrooted trees, respectively, are defined as the number of subsets of three or four leaves, respectively, where the topologies of the induced subtrees differ. These distances can trivially be computed by explicitly enumerating all sets of three or four leaves and testing if the topologies are different, but this leads to time complexities at least of the order n^3 or n^4 just for enumerating the sets. The different topologies can be counted implicitly, however, and in this paper, we review a series of algorithmic improvements that have been used during the last decade to develop more efficient algorithms by exploiting two different strategies for this: one based on dynamic programming and another based on coloring leaves in one tree and updating a hierarchical decomposition of the other.
    Biology 01/2013; 2(4):1189-209.
  • Source
    ABSTRACT: This paper discusses phylogenetic reticulation using linguistic data from the Automated Similarity Judgment Program or ASJP (Holman et al., 2008; Wichmann et al., 2010a). It contributes methodologically to the examination of two measures of reticulation in distance-based phylogenetic data, specifically the δ score of Holland et al. (2002) and the more recent Q-residuals of Gray et al. (2010). It is shown that the δ score is a more adequate measure of reticulation. Our empirical analyses examine possible correlations between δ and (a) the size (number of languages), (b) age, and (c) heterogeneity of language groups, (d) linguistic isolation of individual languages within their respective phylogenies, and (e) the status of speech forms as dialects or recently emerged languages. Among these, only (d) is significantly correlated with δ. Our interpretation is that δ is a realistic measure of reticulation and sensitive to effects of socio-historical events such as language extinction. Finally, we correlate average δ scores for different language families with the goodness of fit between ASJP and expert classifications, showing that the δ scores explain much of the variance.
  • Source
    ABSTRACT: Character matrices with extensive missing data are frequently used in phylogenomics, with potentially detrimental effects on the accuracy and robustness of tree inference. Therefore, many investigators select taxa and genes with high data coverage. A drawback of these selections is their exclusive reliance on data coverage without consideration of the actual signal in the data, so they might not deliver optimal data matrices in terms of potential phylogenetic signal. In order to circumvent this problem, we have developed a heuristic, implemented in software called mare, which (1) assesses the information content of genes in supermatrices using a measure of potential signal combined with data coverage and (2) reduces supermatrices with a simple hill climbing procedure to submatrices with high total information content. We conducted simulation studies using matrices of 50 taxa x 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10% and 30%. With these matrices, Maximum Likelihood (ML) tree reconstructions failed to recover correct trees. Selecting a data subset with the proposed approach increased the chance of recovering correct partial trees more than 10-fold. The selection of data subsets with the proposed simple hill climbing procedure performed well whether it considered the information content or just simple presence/absence information of genes. We also applied our approach to an empirical data set, addressing questions of vertebrate systematics. With this empirical data set, selecting a data subset with high information content and supporting a tree with high average bootstrap support was most successful if the information content of genes was considered.
    Our analyses of simulated and empirical data demonstrate that sparse supermatrices can be reduced on a formal basis, outperforming the commonly used simple selections of taxa and genes with high data coverage.
    BMC Bioinformatics 12/2013; 14(1):348. · 3.02 Impact Factor
