Fast calculation of the quartet distance between trees of arbitrary degrees.

Department of Computer Science, University of Aarhus, Aabogade 34, DK-8200 Aarhus N, Denmark.
Algorithms for Molecular Biology (Impact Factor: 1.61). 02/2006; 1:16. DOI:10.1186/1748-7188-1-16
Source: DBLP

ABSTRACT A number of algorithms have been developed for calculating the quartet distance between two evolutionary trees on the same set of species. The quartet distance is the number of quartets (sub-trees induced by four leaves) that differ between the trees. Most of these algorithms are restricted to binary trees, but recently we have developed algorithms that work on trees of arbitrary degree.
We present a fast algorithm for computing the quartet distance between trees of arbitrary degree. Given input trees T and T', the algorithm runs in time O(n + |V| · |V'| · min{id, id'}) and space O(n + |V| · |V'|), where n is the number of leaves in the two trees, V and V' are the non-leaf nodes in T and T', respectively, and id and id' are the maximal number of non-leaf nodes adjacent to a non-leaf node in T and T', respectively. The fastest algorithms previously published for arbitrary degree trees run in O(n^3) (independent of the degree of the tree) and O(|V| · |V'| · id^2 · id'^2), respectively. We experimentally compare the algorithm with existing algorithms for computing the quartet distance for general trees.
We present a new algorithm for computing the quartet distance between two trees of arbitrary degree. The new algorithm improves the asymptotic running time for computing the quartet distance, compared to previous methods, and experimental results indicate that the new method also performs significantly better in practice.
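
To make the definition of the quartet distance concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm presented in the paper: it enumerates all quartets directly and is intended only as an illustration for small trees. The nested-tuple tree encoding and the function names (clusters, quartet_topology, quartet_distance) are assumptions made for this example. An unresolved (star) quartet is treated as its own topology, so it differs from any resolved one.

```python
from itertools import combinations


def clusters(tree):
    """Return (leaf set, list of leaf sets found under each internal node).

    A tree is a nested tuple of leaf names (strings); this encoding is an
    assumption of this sketch. Each internal node's leaf set, together with
    its complement, is one split of the unrooted tree.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def quartet_topology(quartet, cluster_list):
    """Return the induced split {{a, b}, {c, d}}, or None for a star quartet."""
    q = frozenset(quartet)
    for cluster in cluster_list:
        side = cluster & q
        if len(side) == 2:  # this split separates the quartet two against two
            return frozenset([side, q - side])
    return None  # no split separates the four leaves: unresolved quartet


def quartet_distance(t1, t2):
    """Count the quartets whose induced topology differs between t1 and t2."""
    leaves1, c1 = clusters(t1)
    leaves2, c2 = clusters(t2)
    if leaves1 != leaves2:
        raise ValueError("trees must be on the same set of leaves")
    return sum(
        quartet_topology(q, c1) != quartet_topology(q, c2)
        for q in combinations(sorted(leaves1), 4)
    )


if __name__ == "__main__":
    t_a = (("a", "b"), "c", ("d", "e"))
    t_b = (("a", "c"), "b", ("d", "e"))
    print(quartet_distance(t_a, t_b))
```

Because it visits every one of the n-choose-4 quartets and scans all splits for each, this sketch is far slower than any of the running times quoted above; it is useful mainly as a reference when checking a faster implementation on small inputs.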

  • ABSTRACT: Recent advances in automated assessment of basic vocabulary lists allow the construction of linguistic phylogenies useful for tracing dynamics of human population expansions, reconstructing ancestral cultures, and modeling transition rates of cultural traits over time. Here we investigate the Tupi expansion, a widely-dispersed language family in lowland South America, with a distance-based phylogeny based on 40-word vocabulary lists from 48 languages. We coded 11 cultural traits across the diverse Tupi family including traditional warfare patterns, post-marital residence, corporate structure, community size, paternity beliefs, sibling terminology, presence of canoes, tattooing, shamanism, men's houses, and lip plugs. The linguistic phylogeny supports a Tupi homeland in west-central Brazil with subsequent major expansions across much of lowland South America. Consistently, ancestral reconstructions of cultural traits over the linguistic phylogeny suggest that social complexity has tended to decline through time, most notably in the independent emergence of several nomadic hunter-gatherer societies. Estimated rates of cultural change across the Tupi expansion are on the order of only a few changes per 10,000 years, in accord with previous cultural phylogenetic results in other language families around the world, and indicate a conservative nature to much of human culture.
    PLoS ONE 01/2012; 7(4):e35025. · 3.73 Impact Factor
  • ABSTRACT: This paper discusses phylogenetic reticulation using linguistic data from the Automated Similarity Judgment Program or ASJP (Holman et al., 2008; Wichmann et al., 2010a). It contributes methodologically to the examination of two measures of reticulation in distance-based phylogenetic data, specifically the δ score of Holland et al. (2002) and the more recent Q-residuals of Gray et al. (2010). It is shown that the δ score is a more adequate measure of reticulation. Our empirical analyses examine possible correlations between δ and (a) the size (number of languages), (b) age, and (c) heterogeneity of language groups, (d) linguistic isolation of individual languages within their respective phylogenies, and (e) the status of speech forms as dialects or recently emerged languages. Among these, only (d) is significantly correlated with δ. Our interpretation is that δ is a realistic measure of reticulation and sensitive to effects of socio-historical events such as language extinction. Finally, we correlate average δ scores for different language families with the goodness of fit between ASJP and expert classifications, showing that the δ scores explain much of the variance.
  • ABSTRACT: Character matrices with extensive missing data are frequently used in phylogenomics, with potentially detrimental effects on the accuracy and robustness of tree inference. Therefore, many investigators select taxa and genes with high data coverage. The drawback of such selections is their exclusive reliance on data coverage without consideration of the actual signal in the data, so they might not deliver optimal data matrices in terms of potential phylogenetic signal. To circumvent this problem, we have developed a heuristic, implemented in software called mare, which (1) assesses the information content of genes in supermatrices using a measure of potential signal combined with data coverage and (2) reduces supermatrices with a simple hill climbing procedure to submatrices with high total information content. We conducted simulation studies using matrices of 50 taxa x 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10% and 30%. With these matrices, Maximum Likelihood (ML) tree reconstructions failed to recover correct trees. Selecting a data subset with the proposed approach increased the chance of recovering correct partial trees more than 10-fold. The selection of data subsets with the proposed hill climbing procedure performed well whether it considered the information content or just simple presence/absence information of genes. We also applied our approach to an empirical data set, addressing questions of vertebrate systematics. With this empirical data set, selecting a data subset with high information content and supporting a tree with high average bootstrap support was most successful if the information content of genes was considered. Our analyses of simulated and empirical data demonstrate that sparse supermatrices can be reduced on a formal basis, outperforming the commonly used simple selections of taxa and genes with high data coverage.
    BMC Bioinformatics 12/2013; 14(1):348. · 3.02 Impact Factor

Available from: Chris Christiansen