ABSTRACT: This chapter first describes the empirical codon model presented by Schneider et al. in 2005, demonstrating its advantages over amino acid models for aligning coding sequences. It then outlines two models, proposed in 2007, that introduce parameters to empirical models in order to combine the advantages of empirical and parametric models. Doron-Faigenboim and Pupko (2007) presented a 'combined empirical and mechanistic' model of codon evolution in which empirical transition rates between amino acids formed the basis for a parametric codon model. Kosiol et al. (2007) presented another combination of empirical and parametric models. The empirical codon matrix underlying their model was derived by directly estimating a rate matrix, using an expectation maximization algorithm, from a set of multiple sequence alignments and trees. The chapter concludes with a study that used unsupervised learning to determine the relevant parameters in a codon model.
ABSTRACT: This background chapter summarizes some of the basic concepts used in molecular evolution, such as the use of Markov models in modelling sequence evolution as well as maximum-likelihood and Bayesian estimation. It also discusses comparisons of models and methods using likelihood-based tests, simulations, and empirical tests. The goal of this chapter is to provide consistent notation and common definitions, and to give the reader an introduction to those topics while avoiding repeated descriptions of the same concepts in several chapters.
ABSTRACT: OMA (Orthologous MAtrix) is a database that identifies orthologs among publicly available, complete genomes. Initiated in
2004, the project is at its 11th release. It now includes 1000 genomes, making it one of the largest resources of its kind.
Here, we describe recent developments in terms of species covered, the algorithmic pipeline (in particular regarding the treatment
of alternative splicing), and new features of the web (OMA Browser) and programming interface (SOAP API). In the second part,
we review the various representations provided by OMA and their typical applications. The database is publicly accessible
Full-text · Article · Jan 2011 · Nucleic Acids Research
ABSTRACT: Published estimates of the proportion of positively selected genes (PSGs) in human vary over three orders of magnitude. In
mammals, estimates of the proportion of PSGs cover an even wider range of values. We used 2,980 orthologous protein-coding
genes from human, chimpanzee, macaque, dog, cow, rat, and mouse as well as an established phylogenetic topology to infer the
fraction of PSGs in all seven terminal branches. The inferred fraction of PSGs ranged from 0.9% in human through 17.5% in
macaque to 23.3% in dog. We found three factors that influence the fraction of genes that exhibit telltale signs of positive
selection: the quality of the sequence, the degree of misannotation, and ambiguities in the multiple sequence alignment. The
inferred fraction of PSGs in sequences that are deficient in all three criteria of coverage, annotation, and alignment is
7.2 times higher than that in genes with high trace sequencing coverage, “known” annotation status, and perfect alignment
scores. We conclude that some estimates of the prevalence of positive Darwinian selection in the literature may be inflated
and should be treated with caution.
Full-text · Article · Jun 2009 · Genome Biology and Evolution
ABSTRACT: The Microbe browser is a web server providing comparative microbial genomics data. It offers comprehensive, integrated data
from GenBank, RefSeq, UniProt, InterPro, Gene Ontology and the Orthologs Matrix Project (OMA) database, displayed along with
gene predictions from five software packages. The Microbe browser is updated daily from the source databases and includes
all completely sequenced bacterial and archaeal genomes. The data are displayed in an easy-to-use, interactive website based
on Ensembl software. The Microbe browser is available at http://microbe.vital-it.ch/. Programmatic access is available through the OMA application programming interface (API) at http://microbe.vital-it.ch/api.
Full-text · Article · May 2009 · Nucleic Acids Research
ABSTRACT: It is known that the accuracy of phylogenetic reconstruction decreases when more distant outgroups are used. We quantify this phenomenon with a novel scoring method, the outgroup score pOG. This score expresses whether the support for a particular branch of a tree decreases with increasingly distant outgroups. Large-scale simulations confirmed that the outgroup support follows this expectation and that the pOG score captures this pattern. The score often identifies the correct topology even when the primary reconstruction methods fail, particularly in the presence of model violations. In simulations of problematic phylogenetic scenarios such as rate variation among lineages (which can lead to long-branch attraction artifacts) and quartet-based reconstruction, the pOG analysis outperformed the primary reconstruction methods. Because the pOG method does not make any assumptions about the evolutionary model (besides the decreasing support from increasingly distant outgroups), it can detect cases of violations not treated by a specific model or too strong to be fully corrected. When used as an optimization criterion in the construction of a tree of 23 mammals, the outgroup signal confirmed many well-accepted mammalian orders and superorders. It supports Atlantogenata, a clade of Afrotheria and Xenarthra, and suggests an Artiodactyla-Chiroptera clade.
Full-text · Article · Mar 2009 · Molecular Biology and Evolution
ABSTRACT: Measuring evolutionary distances between DNA or protein sequences forms the basis of many applications in computational biology and evolutionary studies. Of particular interest are distances based on synonymous substitutions, since these substitutions are considered to be under very little selection pressure and are therefore assumed to accumulate in an almost clock-like manner. SynPAM, the method presented here, allows the estimation of distances between coding DNA sequences based on synonymous codon substitutions. The problem of estimating an accurate distance from the observed substitution pattern is solved by maximum likelihood, with empirical codon substitution matrices employed for the underlying Markov model. Comparisons with established measures of synonymous distance indicate that SynPAM has less variance and yields useful results over a longer time range.
Full-text · Article · Nov 2007 · IEEE/ACM Transactions on Computational Biology and Bioinformatics
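The idea of estimating a distance by maximum likelihood under a Markov substitution model can be sketched in a few lines. The example below is not SynPAM: it swaps the empirical codon matrix for the Jukes-Cantor nucleotide model, whose transition probabilities have a closed form, so that the sketch stays self-contained. The function names and the golden-section search are illustrative assumptions.

```python
import math

def jc_log_likelihood(t, n_same, n_diff):
    """Log-likelihood of distance t under Jukes-Cantor (a stand-in model).

    n_same / n_diff are the numbers of aligned sites with identical /
    different bases; base frequencies are uniform (1/4 each).
    """
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e   # P(base stays the same after distance t)
    p_diff = 0.25 - 0.25 * e   # P(change to one specific other base)
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)

def ml_distance(n_same, n_diff, lo=1e-6, hi=10.0, tol=1e-9):
    """Maximize the log-likelihood over t with a golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if jc_log_likelihood(c, n_same, n_diff) > jc_log_likelihood(d, n_same, n_diff):
            b = d   # maximum lies in [a, d]
        else:
            a = c   # maximum lies in [c, b]
    return (a + b) / 2.0

if __name__ == "__main__":
    # 100 aligned sites, 10 of which differ (made-up counts for illustration).
    print(ml_distance(90, 10))
```

For this toy model the numerical optimum can be checked against the closed-form Jukes-Cantor distance, -3/4 · ln(1 - 4p/3) with p the observed fraction of differing sites. With an empirical codon matrix no such closed form is available, which is why a numerical maximization is needed.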
ABSTRACT: Inference of the evolutionary relation between proteins, in particular the identification of orthologs, is a central problem in comparative genomics. Several large-scale efforts with various methodologies and scope tackle this problem, including OMA (the Orthologous MAtrix project).
Based on the results of the OMA project, we introduce here the OMA Browser, a web-based tool allowing the exploration of orthologous relations over 352 complete genomes. Orthologs can be viewed as groups across species, but also at the level of sequence pairs, allowing the distinction among one-to-one, one-to-many and many-to-many orthologs.
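The distinction among one-to-one, one-to-many and many-to-many orthologs at the level of sequence pairs can be illustrated with a short sketch. It is not the OMA Browser's implementation; the function name, the gene identifiers and the rule of labelling each pair by how many orthologs each gene has in the other genome are assumptions made for the example.

```python
from collections import defaultdict

def classify_ortholog_pairs(pairs):
    """Label each (gene_a, gene_b) pair as 1:1, 1:many, many:1 or many:many.

    `pairs` lists pairwise orthologs between two genomes A and B. The label
    for a pair depends on how many orthologs each of its genes has in the
    other genome.
    """
    partners_of_a = defaultdict(set)
    partners_of_b = defaultdict(set)
    for a, b in pairs:
        partners_of_a[a].add(b)
        partners_of_b[b].add(a)

    labels = {}
    for a, b in pairs:
        # Left side: number of genes in A that are orthologous to b.
        a_side = "1" if len(partners_of_b[b]) == 1 else "many"
        # Right side: number of genes in B that are orthologous to a.
        b_side = "1" if len(partners_of_a[a]) == 1 else "many"
        labels[(a, b)] = f"{a_side}:{b_side}"
    return labels

if __name__ == "__main__":
    # Hypothetical gene identifiers, for illustration only.
    example = [("a1", "b1"), ("a2", "b2"), ("a2", "b3"), ("a3", "b4"), ("a4", "b4")]
    for pair, label in classify_ortholog_pairs(example).items():
        print(pair, label)
```

In this sketch a gene duplicated in genome B after the speciation shows up as two `1:many` pairs, whereas a pair is `many:many` only when both of its genes have more than one ortholog in the other genome.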
ABSTRACT: A probabilistic sequence (PS) is a sequence in which each position, instead of having a single character (amino acid, nucleotide, or codon), has a vector describing the probability of each symbol being the character at that position. A probabilistic ancestral sequence (PAS) is a reconstructed PS for the common ancestor of several sequences. This chapter presents a formalism to compute the probabilities of each character at each position of the biological sequence for the internal nodes in a given phylogenetic tree using a Markov model of evolution. From this model, the probability of an evolutionary configuration can be computed. In addition, efficient algorithms for computing the likelihood score of aligning a character with a character, a character with a probabilistic character, or two probabilistic characters are derived. These scores can then be used in direct string matching or dynamic programming alignments of probabilistic sequences with insertions and deletions. Applications for these alignments, including long-distance homology searching and multiple sequence alignment construction, are shown.
ABSTRACT: In recent years the phylogenetic relationship of mammalian orders has been addressed in a number of molecular studies. These analyses have frequently yielded inconsistent results with respect to some basal ordinal relationships. For example, the relative placement of primates, rodents, and carnivores has differed in various studies. Here, we attempt to resolve this phylogenetic problem by using data from completely sequenced nuclear genomes to base the analyses on the largest possible amount of data. To minimize the risk of reconstruction artifacts, the trees were reconstructed under different criteria: distance, parsimony, and likelihood. For the distance trees, distance metrics that measure independent phenomena (amino acid replacement, synonymous substitution, and gene reordering) were used, as it is highly improbable that all of the trees would be affected the same way by any reconstruction artifact. In contradiction to the currently favored classification, our results based on full-genome analysis of the phylogenetic relationship between human, dog, and mouse yielded overwhelming support for a primate-carnivore clade with the exclusion of rodents.
Full-text · Article · Feb 2007 · PLoS Computational Biology
ABSTRACT: Observing differences between DNA or protein sequences and estimating the true amount of substitutions from them is a prominent problem in molecular evolution, as many analyses are based on distance measures between biological sequences. Since the relationship between the observed and the actual amount of mutations is very complex, more than four decades of research have been spent improving molecular distance measures. In this article we present a method called SynPAM, which can be used to estimate the amount of synonymous change between sequences of coding DNA. The method is novel in that it is based on an empirical model of codon evolution and that it uses a maximum-likelihood formalism to measure synonymous change in terms of codon substitutions, while reducing the need for assumptions about DNA evolution to an absolute minimum. We compared the SynPAM method with two established methods for measuring synonymous sequence divergence. Our results suggest that this new method not only shows less variance, but is also able to capture weaker phylogenetic signals than the other methods.
ABSTRACT: The estimation of the difference between two evolutionary distances within a triplet of homologs is a common operation that is used, for example, to determine which of two sequences is closer to a third one. The most accurate method is currently maximum likelihood over the entire triplet. However, this approach is relatively time consuming.
We show that an alternative estimator, based on pairwise estimates and therefore much faster to compute, has almost the same statistical power as the maximum likelihood estimator. We also provide a numerical approximation for its variance, which could otherwise only be estimated through an expensive re-sampling approach such as bootstrapping. An extensive simulation demonstrates that the approximation delivers precise confidence intervals. To illustrate the possible applications of these results, we show how they improve the detection of asymmetric evolution, and the identification of the closest relative to a given sequence in a group of homologs.
The results presented in this paper constitute a basis for large-scale protein cross-comparisons of pairwise evolutionary distances.
Full-text · Article · Feb 2006 · BMC Bioinformatics
ABSTRACT: The OMA project is a large-scale effort to identify groups of orthologs from complete genome data, currently 150 species.
The algorithm relies solely on protein sequence information and does not require any human supervision. It has several original
features, in particular a verification step that detects paralogs and prevents them from being clustered together. Consistency
checks and verification are performed throughout the process. The resulting groups, whenever a comparison could be made, are
highly consistent both with EC assignments, and with assignments from the manually curated database HAMAP. A highly accurate
set of orthologous sequences constitutes the basis for several other investigations, including phylogenetic analysis and protein
ABSTRACT: Codon substitution probabilities are used in many types of molecular evolution studies such as determining Ka/Ks ratios, creating ancestral DNA sequences or aligning coding DNA. Until the recent dramatic increase in genomic data enabled construction of empirical matrices, researchers relied on parameterized models of codon evolution. Here we present the first empirical codon substitution matrix entirely built from alignments of coding sequences from vertebrate DNA and thus provide an alternative to parameterized models of codon evolution.
A set of 17,502 alignments of orthologous sequences from five vertebrate genomes yielded 8.3 million aligned codons, from which the number of substitutions between codons was counted. From these data, both a probability matrix and a matrix of similarity scores were computed. They are 64 x 64 matrices describing the substitutions between all codons. Substitutions from sense codons to stop codons are not considered, resulting in block diagonal matrices consisting of 61 x 61 entries for the sense codons and 3 x 3 entries for the stop codons.
The amount of genomic data currently available allowed for the construction of an empirical codon substitution matrix. However, more sequence data is still needed to construct matrices from different subsets of DNA, specific to kingdoms, evolutionary distance or different amounts of synonymous change. Codon mutation matrices have advantages for alignments up to medium evolutionary distances and for usages that require DNA, such as ancestral reconstruction of DNA sequences and the calculation of Ka/Ks ratios.
Full-text · Article · Feb 2005 · BMC Bioinformatics
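The counting step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, sequences are assumed to be in-frame and already aligned, each observed pair is counted in both directions (a common convention when the direction of change is unknown), and the derivation of similarity scores is omitted.

```python
from itertools import product

BASES = "TCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]  # all 64 codons
STOPS = {"TAA", "TAG", "TGA"}                             # standard genetic code
INDEX = {codon: i for i, codon in enumerate(CODONS)}

def count_substitutions(aligned_pairs):
    """Count codon pairs observed in pairwise, in-frame aligned sequences."""
    counts = [[0.0] * 64 for _ in range(64)]
    for seq_a, seq_b in aligned_pairs:
        for k in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
            ca, cb = seq_a[k:k + 3], seq_b[k:k + 3]
            if ca not in INDEX or cb not in INDEX:
                continue  # skip gaps and ambiguous codons
            if (ca in STOPS) != (cb in STOPS):
                continue  # sense <-> stop substitutions are not considered
            i, j = INDEX[ca], INDEX[cb]
            counts[i][j] += 1  # assumption: count symmetrically, since the
            counts[j][i] += 1  # direction of the change is not observed
    return counts

def to_probabilities(counts):
    """Normalize each row of the count matrix to sum to 1 (empty rows stay 0)."""
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs

if __name__ == "__main__":
    # Two made-up aligned coding sequences, three codons each.
    pairs = [("ATGTTTTAA", "ATGTTCTAA")]
    probs = to_probabilities(count_substitutions(pairs))
    print(probs[INDEX["TTT"]][INDEX["TTC"]])
```

Because sense-to-stop pairs are skipped, every entry linking one of the 61 sense codons to one of the 3 stop codons stays zero. Reordering the rows and columns so that the stop codons come last makes the 61 x 61 and 3 x 3 blocks mentioned in the abstract visible.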