Efficient Algorithms for the Reconciliation Problem with Gene Duplication, Horizontal Transfer and Loss

Computer Science and Artificial Intelligence Laboratory, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
Bioinformatics (Impact Factor: 4.98). 06/2012; 28(12):i283-91. DOI: 10.1093/bioinformatics/bts225
Source: PubMed


Gene family evolution is driven by evolutionary events such as speciation, gene duplication, horizontal gene transfer and gene loss, and inferring these events in the evolutionary history of a given gene family is a fundamental problem in comparative and evolutionary genomics with numerous important applications. Solving this problem requires the use of a reconciliation framework, where the input consists of a gene family phylogeny and the corresponding species phylogeny, and the goal is to reconcile the two by postulating speciation, gene duplication, horizontal gene transfer and gene loss events. This reconciliation problem is referred to as duplication-transfer-loss (DTL) reconciliation and has been extensively studied in the literature. Yet, even the fastest existing algorithms for DTL reconciliation are too slow for reconciling large gene families and for use in more sophisticated applications such as gene tree or species tree reconstruction.
We present two new algorithms for the DTL reconciliation problem that are dramatically faster than existing algorithms, both asymptotically and in practice. We also extend the standard DTL reconciliation model by considering distance-dependent transfer costs, which allow for more accurate reconciliation, and give an efficient algorithm for DTL reconciliation under this extended model. We implemented our new algorithms and demonstrated up to 100 000-fold speed-up over existing methods, using both simulated and biological datasets. This dramatic improvement makes it possible to use DTL reconciliation for performing rigorous evolutionary analyses of large gene families and enables its use in advanced reconciliation-based gene and species tree reconstruction methods.
Our programs can be freely downloaded from http://compbio.mit.edu/ranger-dtl/.
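
To make the reconciliation framework described above concrete, the following is a minimal sketch of a parsimony-style dynamic program for undated DTL reconciliation. It is not the authors' implementation and does not reproduce their algorithms. It assumes binary trees written as nested tuples, a hypothetical `leaf_map` from gene leaves to species leaves, and illustrative event costs (`dup`, `transfer`, `loss`); speciation is assumed to cost nothing. It uses a fixed transfer cost rather than the distance-dependent costs of the extended model, and it does not check that the inferred transfers are time-consistent.

```python
import math

INF = math.inf


def index_tree(tree):
    """Flatten a nested-tuple binary tree into postorder arrays.

    Leaves are strings; internal nodes are 2-tuples. Returns (children, label),
    where children[i] is None for a leaf or a pair of child ids, and label[i]
    is the leaf name (None for internal nodes). The root gets the last id.
    """
    children, label = [], []

    def visit(node):
        if isinstance(node, str):
            children.append(None)
            label.append(node)
        else:
            left, right = visit(node[0]), visit(node[1])
            children.append((left, right))
            label.append(None)
        return len(children) - 1

    visit(tree)
    return children, label


def dtl_cost(gene_tree, species_tree, leaf_map, dup=2, transfer=3, loss=1):
    """Minimum DTL reconciliation cost (illustrative costs, undated model)."""
    s_children, s_label = index_tree(species_tree)
    g_children, g_label = index_tree(gene_tree)
    n = len(s_children)
    species_id = {name: i for i, name in enumerate(s_label) if name is not None}

    in_loss = []  # in_loss[g][s]: best cost with g mapped at or below s, losses counted
    out = []      # out[g][s]: best cost with g mapped to a node incomparable to s
    cost = []
    for g, g_kids in enumerate(g_children):          # postorder over the gene tree
        if g_kids is None:
            home = species_id[leaf_map[g_label[g]]]
            cost = [0 if s == home else INF for s in range(n)]
        else:
            a, b = g_kids
            cost = [INF] * n
            for s in range(n):
                # Duplication at s: both child lineages stay at or below s.
                best = dup + in_loss[a][s] + in_loss[b][s]
                # Transfer at s: one child stays, the other leaves the lineage of s.
                best = min(best,
                           transfer + in_loss[a][s] + out[b][s],
                           transfer + in_loss[b][s] + out[a][s])
                # Speciation at s: children go to different child lineages of s.
                if s_children[s] is not None:
                    l, r = s_children[s]
                    best = min(best,
                               in_loss[a][l] + in_loss[b][r],
                               in_loss[a][r] + in_loss[b][l])
                cost[s] = best

        # Fold c(g, .) into the helper tables used by the parent of g.
        best_in, best_plain = cost[:], cost[:]
        for s, s_kids in enumerate(s_children):      # postorder over the species tree
            if s_kids is not None:
                l, r = s_kids
                best_in[s] = min(cost[s], loss + best_in[l], loss + best_in[r])
                best_plain[s] = min(cost[s], best_plain[l], best_plain[r])
        best_out = [INF] * n
        for s in range(n - 1, -1, -1):               # parents before children
            if s_children[s] is not None:
                l, r = s_children[s]
                best_out[l] = min(best_out[s], best_plain[r])
                best_out[r] = min(best_out[s], best_plain[l])
        in_loss.append(best_in)
        out.append(best_out)

    return min(cost)  # cost row of the gene tree root, minimised over species nodes


species = (("A", "B"), "C")
genes = (("a", "c"), "b")  # gene tree that disagrees with the species tree
print(dtl_cost(genes, species, {"a": "A", "b": "B", "c": "C"}))
```

With these illustrative costs, the incongruent gene tree in the usage lines is explained most cheaply by a single transfer, whereas a duplication-and-loss explanation would cost more. Changing the relative costs changes which scenario is preferred.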



Available from: Manolis Kellis
    • "Additional support for transfers, as well as alternative scenarios were explored by sampling the multiple optimal reconciliations with DTL-RANGER (Bansal et al. 2012). Here, the support for duplication/speciation/transfer events, as well as mapping of individual events on the species tree were annotated on basis of (a) frequency of inferred events/mappings across different transfer cost values (b) highest transfer cost where transfer event is indicated (only for inferred HGT events), with fixed duplication and loss costs (DTL-RANGER dated version with parameters L={1,2,3}, Δ=4, Ө {5,...,40}). "
    ABSTRACT: In recent years, the influx of newly sequenced fungal genomes has enabled sampling of secondary metabolite biosynthesis on an unprecedented scale. However, explanations of extant diversity which take into account both large-scale phylogeny reconstructions and knowledge gained from multiple genome projects are still lacking. We analysed the evolutionary sources of genetic diversity in aromatic polyketide biosynthesis in over a hundred model fungal genomes. By reconciling the history of over four hundred non-reducing polyketide synthases with corresponding species history, we demonstrate that extant fungal NR-PKSs are clades of distant siblings, originating from a burst of duplications in early Pezizomycotina and thinned by extensive losses. The capability of higher fungi to biosynthesise the simplest precursor molecule (orsellinic acid) is highlighted as an ancestral trait underlying biosynthesis of aromatic compounds. This base activity was modified during early evolution of filamentous fungi, towards divergent reaction schemes associated with biosynthesis of e.g. aflatoxins and fusarubins (C4-C9 cyclisation) or various anthraquinone derivatives (C6-C11 cyclisation). The functional plasticity is further shown to have been supplemented by modularisation of domain architecture into discrete pieces (conserved splice junctions within product template domain), as well as tight linkage of key accessory enzyme families and divergence in employed transcriptional factors. While the majority of discord between species and gene history is explained by ancient duplications, this landscape has been altered by more recent duplications, as well as multiple Horizontal Gene Transfers. The 25 detected transfers include previously undescribed events leading to emergence of e.g. fusarubin biosynthesis in Fusarium genus. Both the underlying data and the results of present analysis (including alternative scenarios revealed by sampling multiple reconciliation optima) are maintained as a freely available web-based resource: http://cropnet.pl/metasites/sekmet/nrpks_2014.
    Genome Biology and Evolution 11/2015; 7(11). DOI:10.1093/gbe/evv204 · 4.23 Impact Factor
    • "Reconciliations are computed with an implementation of the ILP approach and compared with the results of Jane 4 [14], TreeMap 3b [2], NOTUNG 2.8 Beta [13], and Ranger-DTL [12]. For all tools the same simulated data sets were reconciled using the respective default parameters. "
    ABSTRACT: In this paper we present an integer linear programming (ILP) approach, called CoRe-ILP, for finding an optimal time consistent cophylogenetic host-parasite reconciliation under the cophylogenetic event model with the events cospeciation, duplication, sorting, host switch, and failure to diverge. Instead of assuming event costs, a simplified model is used, maximizing primarily for cospeciations and secondarily minimizing host switching events. Duplications, sortings, and failure to diverge events are not explicitly scored. Different from existing event based reconciliation methods, CoRe-ILP can use (approximate) phylogenetic branch lengths for filtering possible ancestral host-parasite interactions. Experimentally, it is shown that CoRe-ILP can successfully use branch length information and performs well for biological and simulated data sets. The results of CoRe-ILP are compared with the results of the reconciliation tools Jane 4, TreeMap 3b, NOTUNG 2.8 Beta, and Ranger-DTL. Algorithm CoRe-ILP is implemented using IBM ILOG CPLEX™ Optimizer 12.6 and is freely available from http://pacosy.informatik.uni-leipzig.de/core-ilp.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 01/2015; DOI:10.1109/TCBB.2015.2430336 · 1.44 Impact Factor
    • "Some parsimony methods (e.g. Bansal et al., 2012) do not need information on the order of speciations in time. This allows a more efficient recursion over reconciliations, but at the cost of considering reconciliations that contain transfer events that are not consistent with any ordering of the species tree (Tofigh et al., 2011). "
    ABSTRACT: Motivation: Traditionally, gene phylogenies have been reconstructed solely on the basis of molecular sequences; this, however, often does not provide enough information to distinguish between statistically equivalent relationships. To address this problem, several recent methods have incorporated information on the species phylogeny in gene tree reconstruction, leading to dramatic improvements in accuracy. Probabilistic methods are able to estimate all model parameters but are computationally expensive, whereas parsimony methods, which are generally computationally more efficient, require a prior estimate of parameters and of the statistical support. Results: Here, we present the Tree Estimation using Reconciliation (TERA) algorithm, a parsimony-based, species-tree-aware method for gene tree reconstruction based on a scoring scheme combining duplication, transfer and loss costs with an estimate of the sequence likelihood. TERA explores all reconciled gene trees that can be amalgamated from a sample of gene trees. Using a large-scale simulated dataset, we demonstrate that TERA achieves the same accuracy as the corresponding probabilistic method while being faster, and outperforms other parsimony-based methods in both accuracy and speed. Running TERA on a set of 1099 homologous gene families from complete cyanobacterial genomes, we find that incorporating knowledge of the species tree results in a two-thirds reduction in the number of apparent transfer events.
    Bioinformatics 11/2014; 31(6). DOI:10.1093/bioinformatics/btu728 · 4.98 Impact Factor