HIV-Specific Probabilistic Models of Protein Evolution

Department of Microbiology, University of Washington School of Medicine, Seattle, Washington, United States of America.
PLoS ONE (Impact Factor: 3.53). 02/2007; 2(6):e503. DOI: 10.1371/journal.pone.0000503
Source: PubMed

ABSTRACT Comparative sequence analyses, including such fundamental bioinformatics techniques as similarity searching, sequence alignment and phylogenetic inference, have become a mainstay for researchers studying type 1 Human Immunodeficiency Virus (HIV-1) genome structure and evolution. Implicit in comparative analyses is an underlying model of evolution, and the chosen model can significantly affect the results. In general, evolutionary models describe the probabilities of replacing one amino acid character with another over a period of time. The most widely used evolutionary models for protein sequences have been derived from curated alignments of hundreds of proteins, usually based on mammalian genomes. It is unclear to what extent these empirical models are generalizable to a very different organism, such as HIV-1, the most extensively sequenced organism in existence. We developed a maximum likelihood model-fitting procedure, applied it to a collection of HIV-1 alignments sampled from different viral genes, and inferred two empirical substitution models, suitable for describing between- and within-host evolution. Our procedure pools the information from multiple sequence alignments, and the provided software implementation can be run efficiently in parallel on a computer cluster. We describe how the inferred substitution models can be used to generate scoring matrices suitable for alignment and similarity searches. Our models had a consistently superior fit relative to the best existing models and to parameter-rich data-driven models when benchmarked on independent HIV-1 alignments, demonstrating evolutionary biases in amino-acid substitution that are unique to HIV and that are not captured by the existing models. The scoring matrices derived from the models showed a marked difference from common amino-acid scoring matrices. The use of an appropriate evolutionary model recovered a known viral transmission history, whereas a poorly chosen model introduced phylogenetic error. We argue that our model derivation procedure is immediately applicable to other organisms with extensive sequence data available, such as Hepatitis C and Influenza A viruses.
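
The abstract's two central ideas, that a substitution model gives the probability of one residue replacing another over a period of time, and that such a model can be turned into a scoring matrix, can be illustrated with a minimal sketch. This is not the authors' software. It assumes a time-reversible model defined by symmetric exchangeabilities and equilibrium frequencies, uses a toy 3-letter alphabet instead of 20 amino acids, and all numeric values and function names (build_rate_matrix, transition_matrix, log_odds_scores) are invented for illustration.

```python
import math

# Hypothetical toy model: 3 residues instead of 20; values are made up.
EXCH = [[0.0, 2.0, 0.5],
        [2.0, 0.0, 1.0],
        [0.5, 1.0, 0.0]]          # symmetric exchangeabilities
FREQS = [0.5, 0.3, 0.2]           # equilibrium residue frequencies


def build_rate_matrix(exch, freqs):
    """Reversible rate matrix Q with Q[i][j] = exch[i][j] * freqs[j] for i != j."""
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exch[i][j] * freqs[j]
        q[i][i] = -sum(q[i])      # each row of Q sums to zero
    # Rescale so that one time unit equals one expected substitution per site.
    rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / rate for x in row] for row in q]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=12):
    """P(t) = exp(Q t), via a Taylor series with scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[x * scale for x in row] for row in q]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def log_odds_scores(p, freqs):
    """Score S[i][j] = log2(P[i][j](t) / freqs[j]), a log-odds of homology vs chance."""
    n = len(freqs)
    return [[math.log2(p[i][j] / freqs[j]) for j in range(n)] for i in range(n)]


Q = build_rate_matrix(EXCH, FREQS)
P = transition_matrix(Q, t=0.5)
SCORES = log_odds_scores(P, FREQS)
for row in SCORES:
    print(" ".join(f"{s:6.2f}" for s in row))
```

Because the assumed model is reversible, the resulting scores are symmetric, and the chosen divergence t plays the role that evolutionary distance plays in families of scoring matrices: a larger t flattens the scores toward zero.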


Available from: Laura Heath, Jul 07, 2015
  • Source
    ABSTRACT: In this paper we present a de novo assembly of the transcriptome of the damselfly Enallagma hageni through the use of 454 pyrosequencing. E. hageni is a member of the suborder Zygoptera within the order Odonata, and the Odonata are the basal lineage of the winged insects (Pterygota). To date, sequence data used in phylogenetic analysis of Enallagma species have been derived from either mtDNA or ribosomal nuclear DNA. This transcriptome contained 31,661 contigs that were assembled and translated into 14,813 individual open reading frames. Using these data, we constructed an extensive dataset of 634 orthologous nuclear protein-coding genes across 11 species of Arthropoda, and used Bayesian techniques to elucidate Enallagma's place in the arthropod phylogenetic tree. Additionally, we demonstrate that the Enallagma transcriptome contains 169 genes that are evolving at rates that differ relative to the rest of the transcriptome (29 accelerated and 140 decreased), and through multiple Gene Ontology searches and clustering methods, we present the first functional annotation of any palaeopteran's transcriptome in the literature.
    G3-Genes Genomes Genetics 03/2013; 3(4). DOI:10.1534/g3.113.005637 · 2.51 Impact Factor
  • Source
    ABSTRACT: Many methods exist for reconstructing phylogenies from molecular sequence data, but few phylogenies are known and can be used to check their efficacy. Simulation remains the most important approach to testing the accuracy and robustness of phylogenetic inference methods. However, current simulation programs are limited, especially concerning realistic models for simulating insertions and deletions. We implement a portable and flexible application, named INDELible, for generating nucleotide, amino acid and codon sequence data by simulating insertions and deletions (indels) as well as substitutions. Indels are simulated under several models of indel-length distribution. The program implements a rich repertoire of substitution models, including the general unrestricted model and nonstationary nonhomogeneous models of nucleotide substitution, mixture, and partition models that account for heterogeneity among sites, and codon models that allow the nonsynonymous/synonymous substitution rate ratio to vary among sites and branches. With its many unique features, INDELible should be useful for evaluating the performance of many inference methods, including those for multiple sequence alignment, phylogenetic tree inference, and ancestral sequence, or genome reconstruction.
    Molecular Biology and Evolution 06/2009; 26(8):1879-88. DOI:10.1093/molbev/msp098 · 14.31 Impact Factor
  • Source
    ABSTRACT: We develop a model-based phylogenetic maximum likelihood test for evidence of preferential substitution toward a given residue at individual positions of a protein alignment: directional evolution of protein sequences (DEPS). DEPS can identify both the target residue and sites evolving toward it, help detect selective sweeps and frequency-dependent selection (scenarios that confound most existing tests for selection), and achieve good power and accuracy on simulated data. We applied DEPS to alignments representing different genomic regions of influenza A virus (IAV), sampled from avian hosts (H5N1 serotype) and human hosts (H3N2 serotype), and identified multiple directionally evolving sites in 5/8 genomic segments of H5N1 and H3N2 IAV. We propose a simple descriptive classification of directionally evolving sites into 5 groups based on the temporal distribution of residue frequencies and document known functional correlates, such as immune escape or host adaptation.
    Molecular Biology and Evolution 06/2008; 25(9):1809-24. DOI:10.1093/molbev/msn123 · 14.31 Impact Factor