Algorithms for Rapid Error Correction for the Gene Duplication Problem
DOI: 10.1007/978-3-642-21260-4_23 Conference: Bioinformatics Research and Applications - 7th International Symposium, ISBRA 2011, Changsha, China, May 27-29, 2011. Proceedings
Gene tree - species tree reconciliation problems infer the patterns and processes of gene evolution within the context of
an organismal phylogeny. In one example, the gene duplication problem seeks the evolutionary scenario that implies the minimum
number of gene duplications needed to reconcile a gene tree and a species tree. While the gene duplication problem can effectively
link gene and species evolution, error in gene trees can profoundly bias the results. We describe novel algorithms that rapidly
search local Subtree Prune and Regraft (SPR) or Tree Bisection and Reconnection (TBR) neighborhoods of a gene tree to find
a topology that implies the fewest duplications. These algorithms improve on the current solutions by a factor of n for searching SPR neighborhoods and n^2
for searching TBR neighborhoods, where n is the number of vertices in the given gene tree. They provide a fast error correction protocol for gene trees, in which
we allow small gene tree rearrangements to improve the reconciliation cost. We tested the SPR tree rearrangement algorithm
on a collection of 1201 plant gene trees, and in every case, the SPR algorithm identified an alternate topology that implied
at least one fewer duplication. We also demonstrate a simple method to use the gene rearrangement algorithm to improve gene
tree parsimony phylogenetic analyses, which infer a species tree based on the gene duplication problem.
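
As an illustration of the duplication cost that the abstract refers to, the following is a minimal sketch of the standard LCA-mapping count: each gene tree node is mapped to the lowest species tree node covering the species below it, and a node counts as a duplication when it maps to the same species node as one of its children. The nested-tuple tree encoding, the `species_of` dictionary, and the function names are assumptions made for this example. This is a naive per-tree computation, not the fast SPR/TBR neighborhood search described in the paper.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple binary tree."""
    if isinstance(tree, str):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)


def lca_cluster(species_tree, wanted):
    """Return the leaf set of the lowest species-tree node covering `wanted`."""
    node = species_tree
    while not isinstance(node, str):
        left, right = node
        if wanted <= leaves(left):
            node = left
        elif wanted <= leaves(right):
            node = right
        else:
            break  # `wanted` spans both children, so `node` is the LCA
    return frozenset(leaves(node))


def count_duplications(gene_tree, species_tree, species_of):
    """Count gene-tree nodes that map to the same species node as a child.

    `species_of` (hypothetical name) maps each gene label to its species.
    """
    def visit(node):
        if isinstance(node, str):
            return frozenset({species_of[node]}), 0
        left_map, left_dups = visit(node[0])
        right_map, right_dups = visit(node[1])
        here = lca_cluster(species_tree, left_map | right_map)
        is_dup = 1 if here in (left_map, right_map) else 0
        return here, left_dups + right_dups + is_dup

    return visit(gene_tree)[1]


# Toy data: three species, one gene sampled from each.
species_tree = (("A", "B"), "C")
species_of = {"a1": "A", "b1": "B", "c1": "C"}
congruent = (("a1", "b1"), "c1")
conflicting = (("a1", "c1"), "b1")
print(count_duplications(congruent, species_tree, species_of))
print(count_duplications(conflicting, species_tree, species_of))
```

A gene tree that matches the species tree needs no duplications, whereas a single misplaced leaf already forces one. This is why small topological errors in gene trees can inflate the inferred duplication count.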
Available from: Vincent Berry
ABSTRACT: We propose a reconciliation heuristic accounting for gene duplications, losses and horizontal transfers that specifically takes into account the uncertainties in the gene tree. Rearrangements are tried for gene tree edges that are weakly supported, and are accepted whenever they improve the reconciliation cost. We prove useful properties of the dynamic programming matrix used to compute reconciliations, which allow us to speed up the tree space exploration when rearrangements are generated by Nearest Neighbor Interchange (NNI) edit operations. Experimental results on simulated and real data confirm that running times are greatly reduced when using the above-mentioned optimization in comparison to the naïve rearrangement procedure. Results also show that gene trees modified by such NNI rearrangements are closer to the correct (simulated) trees and lead to more correct event predictions on average. The program is available at
Available from: Vincent Ranwez
Reconciliation methods compare gene trees and species trees to recover evolutionary events such as duplications, transfers and losses explaining the history and composition of genomes. It is well known that gene trees inferred from molecular sequences can be partly erroneous due to incorrect sequence alignments as well as phylogenetic reconstruction artifacts such as long branch attraction. In practice, this leads reconciliation methods to overestimate the number of evolutionary events. Several methods have been proposed to circumvent this problem, by collapsing the unsupported edges and then resolving the obtained multifurcating nodes, or by directly rearranging the binary gene trees. Yet these methods have been defined for models of evolution accounting only for duplications and losses, i.e., they cannot be applied to prokaryotic gene families.
We propose a reconciliation method accounting for gene duplications, losses and horizontal transfers that specifically takes into account the uncertainties in gene trees by rearranging their weakly supported edges. Rearrangements are performed on edges having a low confidence value, and are accepted whenever they improve the reconciliation cost. We prove useful properties of the dynamic programming matrix used to compute reconciliations, which allow us to speed up the tree space exploration when rearrangements are generated by Nearest Neighbor Interchange (NNI) edit operations. Experiments on synthetic data show that gene trees modified by such NNI rearrangements are closer to the correct simulated trees and lead to better event predictions on average. Experiments on real data demonstrate that the proposed method leads to a decrease in the reconciliation cost and the number of inferred events. Finally, on a dataset of 30k gene families, this reconciliation method yields a ranking of prokaryotic phyla by transfer rates identical to that proposed by a different approach dedicated to transfer detection [BMCBIOINF 11:324, 2010, PNAS 109(13):4962–4967, 2012].
Prokaryotic gene trees can now be reconciled with their species phylogeny while accounting for the uncertainty of the gene tree. More accurate and more precise reconciliations are obtained with respect to previous parsimony algorithms not accounting for such uncertainties [LNCS 6398:93–108, 2010, BIOINF 28(12): i283–i291, 2012].
Software implementing the method is freely available at http://www.atgc-montpellier.fr/Mowgli/.
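
The accept-if-better rearrangement loop described in this abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Mowgli implementation: trees are rooted binary nested tuples, every internal edge is tried (edge support values are ignored), and `cost` is a hypothetical caller-supplied function standing in for the reconciliation cost. The dynamic-programming speed-up mentioned in the abstract is not reproduced here.

```python
def nni_neighbors(tree):
    """Yield every tree one rooted NNI move away from a nested-tuple tree."""
    if isinstance(tree, str):
        return
    left, right = tree
    if not isinstance(left, str):
        a, b = left
        # Swap the sibling subtree `right` with one grandchild under `left`.
        yield ((a, right), b)
        yield ((b, right), a)
    if not isinstance(right, str):
        c, d = right
        # Swap the sibling subtree `left` with one grandchild under `right`.
        yield ((left, c), d)
        yield ((left, d), c)
    # Apply moves deeper in the tree, keeping the other side unchanged.
    for alt in nni_neighbors(left):
        yield (alt, right)
    for alt in nni_neighbors(right):
        yield (left, alt)


def improve_by_nni(tree, cost):
    """Greedy descent: accept an NNI move whenever it strictly lowers `cost`.

    `cost` is an assumed callable (tree -> number), e.g. a reconciliation cost.
    """
    best, best_cost = tree, cost(tree)
    improved = True
    while improved:
        improved = False
        for candidate in nni_neighbors(best):
            candidate_cost = cost(candidate)
            if candidate_cost < best_cost:
                best, best_cost, improved = candidate, candidate_cost, True
                break  # restart the search from the improved tree
    return best, best_cost
```

Restricting the candidate moves to weakly supported edges, as the abstract describes, would amount to filtering which edges `nni_neighbors` is allowed to touch. That step needs per-edge support values, which this toy encoding does not carry.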
ABSTRACT: The use of genomic data sets for phylogenetics is complicated by the fact that evolutionary processes such as gene duplication and loss, or incomplete lineage sorting (deep coalescence) cause incongruence among gene trees. One well-known approach that deals with this complication is gene tree parsimony, which, given a collection of gene trees, seeks a species tree that requires the smallest number of evolutionary events to explain the incongruence of the gene trees. However, a lack of efficient algorithms has limited the use of this approach. Here, we present efficient algorithms for SPR and TBR-based local search heuristics for gene tree parsimony under the 1) duplication, 2) loss, 3) duplication-loss, and 4) deep coalescence reconciliation costs. These novel algorithms improve upon the time complexities of previous algorithms for these problems by a factor of n, where n is the number of species in the collection of gene trees. Our algorithms provide a substantial improvement in runtime and scalability compared to previous implementations and enable large-scale gene tree parsimony analyses using any of the four reconciliation costs. Our algorithms have been implemented in the software packages DupTree and iGTP, and have already been used to perform several compelling phylogenetic studies.