A novel empirical mutual information approach to identify co-evolving amino acid positions of influenza A viruses

Graduate Institute of Electrical Engineering, Chang Gung University, Taoyuan, Taiwan.
Computational biology and chemistry (Impact Factor: 1.12). 06/2012; 39:20-8. DOI: 10.1016/j.compbiolchem.2012.06.004
Source: PubMed


Mutual information (MI) is commonly used to estimate the evolutionary correlation between two amino acid sites. Although several MI methods exist, prior to our contribution no systematic method had been developed to assess their performance or to establish numerical thresholds for detecting co-evolving amino acid sites. In the current study, we ran a Markov chain Monte Carlo (MCMC) algorithm on influenza viral sequences to capture their evolutionary characteristics. A consensus maximum clade credibility (MCC) tree was estimated from the samples, together with their amino acid substitution statistics, from which we generated synthetic sequences of known dependent and independent paired amino acid sites. A pair-to-pair and influenza-specific amino acid substitution matrix (P2PFLU) incorporated into Bayesian Evolutionary Analysis Sampling Trees (BEAST) was used to generate these synthetic sequences. The sequences inherited evolutionary features and co-varying characteristics from the real viral sequences, rendering these synthetic data ideal for exploring their co-evolving features. For the MI measure, we proposed a novel metric called the empirical MI (MI(Em)), which outperformed other MI measures in receiver operating characteristic (ROC) analysis. We applied our approach to 1086 all-time PB2 sequences of influenza A H5N1 viruses, in which we found 97 sites exhibiting co-evolutionary substitution with one or more other amino acid sites. In particular, PB2 451, along with eight other PB2 sites of various MI(Em) scores, was found to co-evolve with PB2 627, a known species-associated amino acid residue which plays a critical role in influenza virus replication.
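To make the underlying quantity concrete, the following is a minimal sketch of the standard plug-in (frequency-based) Shannon MI between two columns of a sequence alignment. It is not the MI(Em) metric proposed in the paper, whose definition is not given in the abstract; the function name `mutual_information` and the toy alignment are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Plug-in Shannon MI (in bits) between two alignment columns.

    Assumption: this is the textbook estimator, not the paper's MI(Em).
    """
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_x)
    if n == 0:
        raise ValueError("columns must not be empty")

    count_x = Counter(col_x)               # residue counts at site X
    count_y = Counter(col_y)               # residue counts at site Y
    count_xy = Counter(zip(col_x, col_y))  # joint residue-pair counts

    mi = 0.0
    for (a, b), n_ab in count_xy.items():
        # p(a, b) / (p(a) * p(b)) simplifies to n_ab * n / (n_a * n_b)
        mi += (n_ab / n) * log2(n_ab * n / (count_x[a] * count_y[b]))
    return mi


# Toy alignment (hypothetical sequences, not influenza data).
alignment = ["AKE", "AKD", "TRE", "TRD"]
columns = list(zip(*alignment))  # transpose: one tuple per site

mi_01 = mutual_information(columns[0], columns[1])  # sites vary together
mi_02 = mutual_information(columns[0], columns[2])  # sites vary independently
```

In this toy case the first two sites change in lockstep and the third varies independently of the first, so the first pair scores higher. On real viral alignments, raw MI is also inflated by shared phylogeny and finite sample size, which is the kind of effect the synthetic-sequence benchmark described above is intended to calibrate against.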

  • ABSTRACT: The NCBI Influenza Virus Resource provides an integrated tool for influenza virus sequence retrieval and analysis. Every sequence in the database is curated by an automatic procedure, and NCBI staff make sure that the information presented in the database is complete, accurate, and up to date. The database has open access to all users and has been used as the backbone of several other influenza virus sequence databases, such as the Influenza Virus Database (3), the BioHealthBase BRC ( decorator=influenza), the Influenza Virus Genotype Tool (http://www.flugenome.org/index.php), and the Influenza Primer Design Resource (http://www.ipdr.mcw.edu/fludb/search). Sequence analysis tools such as multiple-sequence alignment and clustering of protein sequences are integrated with the database and allow users to quickly modify a data set to optimize the analysis. Using these tools offers a convenient way for preliminary sequence analyses. The influenza virus genome annotation tool makes sequence submission to GenBank much easier and will greatly promote data sharing among the influenza virus research community.
    Journal of Virology 02/2008; 82(2):596-601. DOI:10.1128/JVI.02005-07 · 4.44 Impact Factor
  • ABSTRACT: An important element of the developing field of proteomics is to understand protein-protein interactions and other functional links amongst genes. Across-species correlation methods for detecting functional links work on the premise that functionally linked proteins will tend to show a common pattern of presence and absence across a range of genomes. We describe a maximum likelihood statistical model for predicting functional gene linkages. The method detects independent instances of the correlated gain or loss of pairs of proteins on phylogenetic trees, reducing the high rates of false positives observed in conventional across-species methods that do not explicitly incorporate a phylogeny. We show, in a dataset of 10,551 protein pairs, that the phylogenetic method improves by up to 35% on across-species analyses at identifying known functionally linked proteins. The method shows that protein pairs with at least two to three correlated events of gain or loss are almost certainly functionally linked. Contingent evolution, in which one gene's presence or absence depends upon the presence of another, can also be detected phylogenetically, and may identify genes whose functional significance depends upon its interaction with other genes. Incorporating phylogenetic information improves the prediction of functional linkages. The improvement derives from having a lower rate of false positives and from detecting trends that across-species analyses miss. Phylogenetic methods can easily be incorporated into the screening of large-scale bioinformatics datasets to identify sets of protein links and to characterise gene networks.
    PLoS Computational Biology 07/2005; 1(1):e3. DOI:10.1371/journal.pcbi.0010003 · 4.62 Impact Factor
  • ABSTRACT: Recent advances in structural proteomics call for development of fast and reliable automatic methods for prediction of functional surfaces of proteins with known three-dimensional structure, including binding sites for known and unknown protein partners as well as oligomerization interfaces. Despite significant progress the problem is still far from being solved. Most existing methods rely, at least partially, on evolutionary information from multiple sequence alignments projected on protein surface. The common drawback of such methods is their limited applicability to the proteins with a sparse set of sequential homologs, as well as inability to detect interfaces in evolutionary variable regions. In this study, the authors developed an improved method for predicting interfaces from a single protein structure, which is based on local statistical properties of the protein surface derived at the level of atomic groups. The proposed Protein IntErface Recognition (PIER) method achieved the overall precision of 60% at the recall threshold of 50% at the residue level on a diverse benchmark of 490 homodimeric, 62 heterodimeric, and 196 transient interfaces (compared with 25% precision at 50% recall expected from random residue function assignment). For 70% of proteins in the benchmark, the binding patch residues were successfully detected with precision exceeding 50% at 50% recall. The calculation only took seconds for an average 300-residue protein. The authors demonstrated that adding the evolutionary conservation signal only marginally influenced the overall prediction performance on the benchmark; moreover, for certain classes of proteins, using this signal actually resulted in a deteriorated prediction. Thorough benchmarking using other datasets from literature showed that PIER yielded improved performance as compared with several alignment-free or alignment-dependent predictions. The accuracy, efficiency, and dependence on structure alone make PIER a suitable tool for automated high-throughput annotation of protein structures emerging from structural proteomics projects.
    Proteins Structure Function and Bioinformatics 05/2007; 67(2):400-17. DOI:10.1002/prot.21233 · 2.63 Impact Factor