Choosing Appropriate Substitution Models for the Phylogenetic Analysis of Protein-Coding Sequences
ABSTRACT Although phylogenetic inference of protein-coding sequences continues to dominate the literature, few analyses incorporate evolutionary models that consider the genetic code. This problem is exacerbated by the exclusion of codon-based models from commonly employed model selection techniques, presumably due to the computational cost associated with codon models. We investigated an efficient alternative to standard nucleotide substitution models, in which codon position (CP) is incorporated into the model. We determined the most appropriate model for alignments of 177 RNA virus genes and 106 yeast genes, using 11 substitution models including one codon model and four CP models. The majority of analyzed gene alignments are best described by CP substitution models, rather than by standard nucleotide models, and without the computational cost of full codon models. These results have significant implications for phylogenetic inference of coding sequences as they make it clear that substitution models incorporating CPs not only are a computationally realistic alternative to standard models but may also frequently be statistically superior.
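As a minimal sketch of what "incorporating codon position" means in practice, the following Python splits an in-frame coding alignment into two partitions, to each of which a separate nucleotide model could then be fitted. The grouping used here (1st + 2nd positions versus 3rd) is only one possible CP scheme and is an assumption for illustration. The function names and the toy sequences are hypothetical and not taken from the paper.

```python
def split_by_codon_position(seq):
    """Split an in-frame coding sequence into (positions 1+2, position 3)."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    # Take the first two bases of every codon, then every third base.
    first_second = "".join(seq[i:i + 2] for i in range(0, len(seq), 3))
    third = seq[2::3]
    return first_second, third


def partition_alignment(alignment):
    """Partition a dict of {taxon: aligned in-frame sequence} by codon position."""
    cp12, cp3 = {}, {}
    for name, seq in alignment.items():
        cp12[name], cp3[name] = split_by_codon_position(seq)
    return cp12, cp3


# Toy, made-up alignment (not real data): the two sequences differ only
# at third codon positions.
toy = {"taxon_a": "ATGGCCAAA", "taxon_b": "ATGGCTAAG"}
cp12, cp3 = partition_alignment(toy)
```

In the toy alignment all differences fall into the third-position partition, which is the kind of rate heterogeneity a CP model is meant to capture.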
- Source available from: Samad Amini-Bavil-Olyaee
Journal of Clinical Virology 02/2015; 63:38-41. DOI:10.1016/j.jcv.2014.12.010 · 3.47 Impact Factor
- "Preliminary analysis using Path-O-Gen (http://tree.bio.ed.ac.uk/software/pathogen/) indicated that temporal signal of each subgenotype dataset was limited (D2: R² = 0.233 and D3: R² = 0.16). To maximize the temporal signal, each subgenotype was allowed to have an independent tree, while sharing the same underlying substitution model, which allows for different rates at the 1st + 2nd and 3rd codon positions. The uncorrelated lognormal molecular clock model was applied to take into account rate variation along the tree branches."
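The R² values quoted in the excerpt above come from a regression of root-to-tip genetic distance against sampling date. A minimal sketch of that calculation is given below, assuming the root-to-tip distances have already been extracted from a rooted tree. The function name and the numbers are hypothetical, made-up illustrations, not data from the cited study.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple linear regression of y on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    # For simple linear regression, R^2 equals the squared correlation.
    return sxy ** 2 / (sxx * syy)


# Made-up sampling years and root-to-tip distances (substitutions/site).
sampling_years = [2000, 2004, 2008, 2012]
root_to_tip = [0.010, 0.018, 0.015, 0.026]
signal = r_squared(sampling_years, root_to_tip)
```

A value near 1 suggests clock-like accumulation of substitutions over the sampling interval, while a low value, as reported in the excerpt, indicates limited temporal signal.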
- "The analysis was conducted separately on the concatenated MLST loci and on the 56-kDa protein gene sequences using BEAST v1.6 (Drummond and Rambaut, 2007). Several models were used for each set of sequences (Table 3 and Table 4) with a different substitution rate for each codon position (Shapiro et al., 2006) and both strict and relaxed (uncorrelated lognormal) molecular clocks. All Markov Chain Monte Carlo chains were run for sufficient time at 100 million steps to ensure statistical convergence, with 10% removed as burn-in."
ABSTRACT: Orientia tsutsugamushi is the causative agent of scrub typhus, a major cause of febrile illness in rural areas of the Asia-Pacific region. A multi-locus sequence typing (MLST) analysis was performed on strains isolated from human patients from 3 countries in Southeast Asia: Cambodia, Vietnam and Thailand. The phylogeny of the 56-kDa protein-encoding gene was analyzed on the same strains and showed a structured topology with genetically distinct clusters. MLST analysis did not lead to the same conclusion. DNA polymorphism and phylogeny of individual gene loci indicated a significant level of recombination and genetic diversity, whereas the ST distribution indicated the presence of isolated patches. No correlation was found with the geographic origin. This work suggests that weak divergence in the core genome and ancestral haplotypes are maintained by permanent recombination in mites, while the 56-kDa protein gene is diverging at a higher rate due to selection by the mammalian immune system.
Infection Genetics and Evolution 01/2015; 31. DOI:10.1016/j.meegid.2015.01.005 · 3.26 Impact Factor
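The burn-in step mentioned in the BEAST excerpt above (10% of each chain removed) can be sketched as follows. This is a simplified illustration on a plain list of sampled values, not the procedure used by BEAST's own tools. The function name and the toy trace are hypothetical.

```python
def discard_burn_in(samples, fraction=0.10):
    """Return the samples that remain after dropping the leading burn-in fraction."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    cut = int(len(samples) * fraction)  # number of leading samples to drop
    return samples[cut:]


# Toy trace of 20 logged states (made-up values standing in for a parameter).
trace = list(range(20))
post_burn_in = discard_burn_in(trace, fraction=0.10)
```

Summary statistics such as posterior means would then be computed on `post_burn_in` only.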
- "Evolutionary rates were estimated using the Bayesian Markov Chain Monte Carlo (MCMC) inference method implemented in BEAST v1.7.5 (Drummond et al., 2012). These analyses employed an SRD06 nucleotide substitution model (two independent HKY+Γ substitution models, one for the first and second codon positions and one for the third), an uncorrelated lognormal relaxed molecular clock model, and a Bayesian skyline plot coalescent model (Shapiro et al., 2006). Nucleotide frequencies were estimated from the data."
ABSTRACT: HCV genotype 4 is prevalent in many African countries, yet little is known about the genotype's epidemic history on the continent. We present a comprehensive study of the molecular epidemiology of genotype 4. To address the deficit of data from the Democratic Republic of the Congo (DRC) we PCR amplified 60 new HCV isolates from the DRC, resulting in 33 core- and 48 NS5B-region sequences. Our data, together with genotype 4 database sequences, were analysed using Bayesian phylogenetic approaches. We find three well-supported intra-genotypic lineages and estimate that the genotype 4 common ancestor existed around 1733 (1650–1805). We show that genotype 4 originated in central Africa and that multiple lineages have been exported to north Africa since ~1850, including subtype 4a which dominates the epidemic in Egypt. We speculate on the causes of the historical intra-continental spread of genotype 4, including population movements during World War 2.
Virology 01/2015; 464-465. DOI:10.1016/j.virol.2014.07.006 · 3.35 Impact Factor