Choosing Appropriate Substitution Models for the Phylogenetic Analysis of Protein-Coding Sequences

University of Oxford, Oxford, England, United Kingdom
Molecular Biology and Evolution (Impact Factor: 9.11). 02/2006; 23(1):7-9. DOI: 10.1093/molbev/msj021
Source: PubMed


Although phylogenetic inference of protein-coding sequences continues to dominate the literature, few analyses incorporate evolutionary models that consider the genetic code. This problem is exacerbated by the exclusion of codon-based models from commonly employed model selection techniques, presumably due to the computational cost associated with codon models. We investigated an efficient alternative to standard nucleotide substitution models, in which codon position (CP) is incorporated into the model. We determined the most appropriate model for alignments of 177 RNA virus genes and 106 yeast genes, using 11 substitution models including one codon model and four CP models. The majority of analyzed gene alignments are best described by CP substitution models, rather than by standard nucleotide models, and without the computational cost of full codon models. These results have significant implications for phylogenetic inference of coding sequences as they make it clear that substitution models incorporating CPs not only are a computationally realistic alternative to standard models but may also frequently be statistically superior.
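As a rough illustration of what "incorporating codon position" means in practice, the sketch below splits an in-frame coding alignment into two partitions (first plus second codon positions, and third codon positions), so that each partition could be given its own nucleotide substitution model. It is a minimal sketch, not the authors' implementation: the function names and the toy sequences are hypothetical, and it assumes every sequence is in frame, starts at codon position 1, and has a length that is a multiple of three.

```python
def split_by_codon_position(seq):
    """Split one in-frame coding sequence into (positions 1+2, position 3)."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    # Take the first two bases of every codon, then every third base.
    cp12 = "".join(seq[i:i + 2] for i in range(0, len(seq), 3))
    cp3 = seq[2::3]
    return cp12, cp3


def partition_alignment(alignment):
    """Apply the split to every sequence in a {name: sequence} alignment."""
    cp12, cp3 = {}, {}
    for name, seq in alignment.items():
        cp12[name], cp3[name] = split_by_codon_position(seq)
    return cp12, cp3


# Hypothetical toy alignment: three codons per taxon.
toy = {"taxonA": "ATGGCCAAA", "taxonB": "ATGGCTAAG"}
cp12, cp3 = partition_alignment(toy)
print(cp12)
print(cp3)
```

In this toy case the two taxa are identical at the first and second positions and differ only at third positions, which is the kind of rate heterogeneity a CP model is meant to capture by fitting the partitions separately.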

    • "Preliminary analysis using Path-O-Gen ( software/pathogen/) indicated that temporal signal of each subgenotype dataset was limited (D2: R² = 0.233 and D3: R² = 0.16). To maximize the temporal signal, each subgenotype was allowed to have an independent tree, while sharing the same underlying substitution model [16], which allows for different rates at the 1st + 2nd and 3rd codon positions. The uncorrelated lognormal molecular clock model was applied to take into account rate variation along the tree branches [17]."
    ABSTRACT: Background: Hepatitis B virus (HBV) has been classified into eight genotypes and forty subgenotypes. Genotype D is the most widely distributed HBV genotype worldwide, and HBV subgenotype D1 has been isolated from Iranian patients. Objective: To characterize for the first time complete genomes of recently emerged non-D1 strains in Iran. Study design: HBV complete genomes isolated from 9 Iranian HBV carriers were sequenced. The diversity of the ORFs was mapped and evolutionary relationships were investigated. Results: Phylogenetic analysis identified four D2 subgenotypes and five D3 subgenotypes of HBV in the studied patients. Of note, D2 strains clustered with strains from Lebanon and Syria. The time of the most recent common ancestor (TMRCA) of the first cluster of D2 was dated at 1953 (BCI=1926, 1976) while the second cluster was dated at 1947 (BCI=1911, 1978). All five Iranian D3 strains formed a monophyletic cluster with an Indian strain and dated back to 1967 (BCI=1946, 1987). Surprisingly, two D3 strains had an adw2 subtype. Interestingly, more than 80% of the present strains showed precore mutations, while two isolates carried basal core promoter variation. Conclusion: Iranian D2 isolates were introduced into Iran on at least two occasions and D3 isolates on at least one, diverging from west and south Asian HBV strains, respectively. Considering the impact of the different (sub)genotypes on clinical outcome, exploring the distinct mutational patterns of Iranian D1 and non-D1 strains is of clinical importance.
    Journal of Clinical Virology 02/2015; 63:38-41. DOI:10.1016/j.jcv.2014.12.010 · 3.02 Impact Factor
    • "The analysis was conducted separately on the concatenated MLST loci and on the 56-kDa protein gene sequences using BEAST v 1.6 (Drummond and Rambaut, 2007). Several models were used for each set of sequences (Table 3 and Table 4) with different substitution rate for each codon position (Shapiro et al., 2006) and both strict and relaxed (uncorrelated lognormal) molecular clocks. All Markov Chain Monte Carlo chains were run for sufficient time at 100 million steps to ensure statistical convergence, with 10% removed as burn-in. "
    ABSTRACT: Orientia tsutsugamushi is the causative agent of scrub typhus, a major cause of febrile illness in rural areas of the Asia-Pacific region. A multi-locus sequence typing (MLST) analysis was performed on strains isolated from human patients from 3 countries in Southeast Asia: Cambodia, Vietnam and Thailand. The phylogeny of the 56-kDa protein encoding gene was analyzed on the same strains and showed a structured topology with genetically distinct clusters. MLST analysis did not lead to the same conclusion. DNA polymorphism and phylogeny of individual gene loci indicated a significant level of recombination and genetic diversity whereas the ST distribution indicated the presence of isolated patches. No correlation was found with the geographic origin. This work suggests that weak divergence in the core genome and ancestral haplotypes are maintained by permanent recombination in mites, while the 56-kDa protein gene is diverging at a higher rate due to selection by the mammalian immune system.
    Infection Genetics and Evolution 01/2015; 31. DOI:10.1016/j.meegid.2015.01.005 · 3.02 Impact Factor
    • "Evolutionary rates were estimated using the Bayesian Markov Chain Monte Carlo (MCMC) inference method implemented in BEAST v1.7.5 (Drummond et al., 2012). These analyses employed a SDR06 nucleotide substitution model (two independent HKY+Γ substitution models—one for the first and second codon positions, and one for the third), an uncorrelated lognormal relaxed molecular clock model, and a Bayesian skyline plot coalescent model (Shapiro et al., 2006). Nucleotide frequencies were estimated from the data."
    ABSTRACT: HCV genotype 4 is prevalent in many African countries, yet little is known about the genotype's epidemic history on the continent. We present a comprehensive study of the molecular epidemiology of genotype 4. To address the deficit of data from the Democratic Republic of the Congo (DRC) we PCR amplified 60 new HCV isolates from the DRC, resulting in 33 core- and 48 NS5B-region sequences. Our data, together with genotype 4 database sequences, were analysed using Bayesian phylogenetic approaches. We find three well-supported intra-genotypic lineages and estimate that the genotype 4 common ancestor existed around 1733 (1650–1805). We show that genotype 4 originated in central Africa and that multiple lineages have been exported to north Africa since ~1850, including subtype 4a which dominates the epidemic in Egypt. We speculate on the causes of the historical intra-continental spread of genotype 4, including population movements during World War 2.
    Virology 01/2015; 464-465(100). DOI:10.1016/j.virol.2014.07.006 · 3.35 Impact Factor