Large-scale assignment of orthology: back to phylogenetics? Genome Biol 9:235

Bioinformatics and Genomics Program, Center for Genomic Regulation, Doctor Aiguader 88, Barcelona, Spain.
Genome Biology (Impact Factor: 10.81). 11/2008; 9(10):235. DOI: 10.1186/gb-2008-9-10-235
Source: PubMed

ABSTRACT Reliable orthology prediction is central to comparative genomics. Although orthology is defined by phylogenetic criteria, most automated prediction methods are based on pairwise sequence comparisons. Recently, automated phylogeny-based orthology prediction has emerged as a feasible alternative for genome-wide studies.

  • Source
    • "However, the concepts of orthology and paralogy may not always be clearly distinct in practice, owing to incomplete lineage sorting (Mallo et al. 2014), and detecting orthology without a phylogeny is problematic (Gabaldón 2008). Fitch (2000) suggested that 'there are no proven cases of genic analogy' (p. "
    ABSTRACT: Sequence alignment is just as much a part of phylogenetics as is tree building, although it is often viewed solely as a necessary tool to construct trees. However, alignment for the purpose of phylogenetic inference is primarily about homology, as it is the procedure that expresses homology relationships among the characters, rather than the historical relationships of the taxa. Molecular homology is rather vaguely defined and understood, despite its importance in the molecular age. Indeed, homology has rarely been evaluated with respect to nucleotide sequence alignments, in spite of the fact that nucleotides are the only data that directly represent genotype. All other molecular data represent phenotype, just as do morphology and anatomy. Thus, efforts to improve sequence alignment for phylogenetic purposes should involve a more refined use of the homology concept at a molecular level. To this end, we present examples of molecular-data levels at which homology might be considered, and arrange them in a hierarchy. The concept that we propose has many levels, which link directly to the developmental and morphological components of homology. Of note, there is no simple relationship between gene homology and nucleotide homology. We also propose terminology with which to better describe and discuss molecular homology at these levels. Our over-arching conceptual framework is then used to shed light on the multitude of automated procedures that have been created for multiple-sequence alignment. Sequence alignment needs to be based on aligning homologous nucleotides, without necessary reference to homology at any other level of the hierarchy. In particular, inference of nucleotide homology involves deriving a plausible scenario for molecular change among the set of sequences. 
    Our clarifications should allow the development of a procedure that specifically addresses homology, which is required when performing alignment for phylogenetic purposes, but which does not yet exist.
    Australian Systematic Botany 01/2015; 28:46-62. DOI:10.1071/SB15001 · 1.08 Impact Factor
  • Source
    • "A number of other alternative orthology inference pipelines also suffer from using similarity measurements as approximations to directly infer orthology (Li et al. 2003; Roure et al. 2007; Schreiber et al. 2009; Altenhoff et al. 2011, 2013). Given the incomplete and noisy nature of transcriptomic and low-coverage genomic data, orthology is best inferred by using phylogenies to separate paralogs and orthologs after homology has been established (Gabaldón 2008). A variety of tree-based orthology inference methods have been developed."
    ABSTRACT: Orthology inference is central to phylogenomic analyses. Phylogenomic data sets commonly include transcriptomes and low-coverage genomes that are incomplete and contain errors and isoforms. These properties can severely violate the underlying assumptions of orthology inference with existing heuristics. We present a procedure that uses phylogenies for both homology and orthology assignment. The procedure first uses similarity scores to infer putative homologs that are then aligned, constructed into phylogenies, and pruned of spurious branches caused by deep paralogs, misassembly, frame shifts, or recombination. These final homologs are then used to identify orthologs. We explore four alternative tree-based orthology inference approaches, of which two are new. These accommodate gene and genome duplications as well as gene tree discordance. We demonstrate these methods in three published data sets including the grape family, Hymenoptera, and millipedes with divergence times ranging from ca. 100 Ma to over 400 Ma. The procedure significantly increased the completeness and accuracy of the inferred homologs and orthologs. We also found that data sets that are more recently diverged and/or include more high-coverage genomes had more complete sets of orthologs. To explicitly evaluate sources of conflicting phylogenetic signals, we applied serial jackknife analyses of gene regions keeping each locus intact. The methods described here can scale to over 100 taxa. They have been implemented in python with independent scripts for each step, making it easy to modify or incorporate them into existing pipelines. All scripts are available from
    Molecular Biology and Evolution 08/2014; 31(11). DOI:10.1093/molbev/msu245 · 9.11 Impact Factor
  • Source
    • "Amino acid sequences were aligned using MAFFT, and gene trees were obtained with RAxML as described earlier. To overcome common biases related to poorly resolved phylogenies (Hahn 2007), we used an approach similar to that described as the species-overlap method (Gabaldon 2008). When faced with disagreement between the gene and species trees, we used a conservative criterion that takes into account short branch lengths and the known problems of incomplete lineage sorting that lead to inconsistencies across genes in the position of D. willistoni (Tamura et al. 2004; Obbard et al. 2012) and the relationships among D. yakuba, D. erecta, and the melanogaster cluster (Pollard et al. 2006). "
    ABSTRACT: Gene turnover rates and the evolution of gene family sizes are important aspects of genome evolution. Here, we use curated sequence data of the major chemosensory gene families from Drosophila - the gustatory receptor (GR), odorant receptor (OR), ionotropic receptor (IR), and odorant binding protein (OBP) families - to conduct a comparative analysis among families, exploring different methods to estimate gene birth and death rates, including an ad hoc simulation study. Remarkably, we found that the state-of-the-art methods may produce very different rate estimates, which may lead to disparate conclusions regarding the evolution of chemosensory gene family sizes in Drosophila. Among biological factors, we found that a peculiarity of D. sechellia's gene turnover rates was a major source of bias in global estimates, whereas gene conversion had negligible effects for the families analyzed herein. Turnover rates vary considerably among families, subfamilies and ortholog groups, although all analyzed families were quite dynamic in terms of gene turnover. Computer simulations showed that the methods that use ortholog group information appear to be the most accurate for the Drosophila chemosensory families. Most importantly, these results reveal the potential of rate heterogeneity among lineages to severely bias some turnover rate estimation methods and the need of further evaluating the performance of these methods in a more diverse sampling of gene families and phylogenetic contexts. Using branch-specific codon substitution models, we find further evidence of positive selection in recently duplicated genes, which attests to a non-neutral aspect of the gene birth-and-death process.
    Genome Biology and Evolution 06/2014; 6(7). DOI:10.1093/gbe/evu130 · 4.23 Impact Factor
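
The species-overlap idea mentioned in the last excerpt can be illustrated with a minimal sketch. This is not the code used in any of the papers listed here. It assumes a rooted, binary gene tree written as nested Python tuples and a hypothetical leaf naming convention of "<species>_<gene id>". An internal node is treated as a duplication when its two child subtrees share at least one species, and as a speciation otherwise; two genes are reported as orthologs when their last common ancestor is a speciation node.

```python
from itertools import product


def leaves(tree):
    """Return the leaf labels under a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def species_of(gene):
    # Assumed (hypothetical) naming convention: "<species>_<gene id>".
    return gene.split("_", 1)[0]


def orthologs_by_species_overlap(tree):
    """Collect gene pairs whose last common ancestor is a speciation node."""
    if isinstance(tree, str):
        return set()
    left, right = tree
    # Pairs already found inside each child subtree.
    pairs = orthologs_by_species_overlap(left) | orthologs_by_species_overlap(right)
    left_leaves, right_leaves = leaves(left), leaves(right)
    left_species = {species_of(g) for g in left_leaves}
    right_species = {species_of(g) for g in right_leaves}
    if left_species.isdisjoint(right_species):
        # No shared species between the two sides: call this node a speciation,
        # so every cross pair of genes is orthologous.
        pairs |= {frozenset(p) for p in product(left_leaves, right_leaves)}
    # Otherwise the node is called a duplication and cross pairs are paralogs.
    return pairs
```

For a toy tree such as `((("human_A1", "mouse_A1"), ("human_A2", "mouse_A2")), "fly_A")`, the node joining the A1 and A2 clades has human and mouse on both sides and is therefore called a duplication, so `human_A1` and `mouse_A2` are not paired, while `human_A1` and `mouse_A1` are. The sketch recomputes leaf lists at every node for clarity, which is quadratic in the worst case; a genome-scale implementation would cache them.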