Article

Whole-proteome phylogeny of large dsDNA viruses and parvoviruses through a composition vector method related to dynamical language model

Department of Biology, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China.
BMC Evolutionary Biology (Impact Factor: 3.41). 06/2010; 10:192. DOI: 10.1186/1471-2148-10-192
Source: PubMed

ABSTRACT Background: The vast sequence divergence among virus groups poses a great challenge to alignment-based analysis of virus phylogeny. Because of the uncertainty in alignment, existing phylogenetic tools based on multiple alignment cannot be directly applied to whole-genome comparison and phylogenomic studies of viruses. There has been growing interest in alignment-free methods for phylogenetic analysis using complete genome data. Among these, a dynamical language (DL) method proposed by our group has been successfully applied to the phylogenetic analysis of bacterial and chloroplast genomes.
Results: In this paper, the DL method is used to analyze the whole-proteome phylogeny of 124 large dsDNA viruses and 30 parvoviruses, two data sets that differ greatly in genome size. The trees from our analyses are in good agreement with the latest classification of large dsDNA viruses and parvoviruses by the International Committee on Taxonomy of Viruses (ICTV).
Conclusions: The present method provides a new way of recovering the phylogeny of large dsDNA viruses and parvoviruses, and also offers some insights into the affiliation of a number of unclassified viruses. In comparison, some alignment-free methods such as the CVTree method can recover the phylogeny of large dsDNA viruses, but they are not suitable for resolving the phylogeny of parvoviruses, which have much smaller genomes.
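
To make the idea of an alignment-free, whole-proteome comparison concrete, the following is a minimal sketch and not the DL method itself. It assumes plain overlapping k-mer frequencies as the composition vector and a cosine-based distance D = (1 - C) / 2; methods such as CVTree and the DL approach additionally correct for an expected background of k-mer frequencies, which this sketch omits. The function names, the choice k = 3, and the toy sequences are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kmer_vector(proteome, k=3):
    """Relative frequencies of overlapping k-mers over all proteins in a proteome."""
    counts = Counter()
    for protein in proteome:
        for i in range(len(protein) - k + 1):
            counts[protein[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def cosine_distance(u, v):
    """Return (1 - C) / 2, where C is the cosine of the angle between u and v."""
    dot = sum(x * v.get(kmer, 0.0) for kmer, x in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.5  # assumption: treat an empty vector as uncorrelated
    return (1.0 - dot / (norm_u * norm_v)) / 2.0


def distance_matrix(proteomes, k=3):
    """Pairwise distances keyed by (name_a, name_b) with name_a < name_b."""
    vectors = {name: kmer_vector(seqs, k) for name, seqs in proteomes.items()}
    return {
        (a, b): cosine_distance(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


# Made-up toy "proteomes" for illustration only; not real viral sequences.
toy_proteomes = {
    "virus_A": ["MKTAYIAKQR", "MKTAYLAKQR"],
    "virus_B": ["MKTAYIAKQR", "MKSAYIAKQR"],
    "virus_C": ["GGSPLWWTNE", "HHCPLWWTNE"],
}

distances = distance_matrix(toy_proteomes, k=3)
for pair, d in sorted(distances.items()):
    print(pair, round(d, 4))
```

A distance matrix of this kind would then be passed to a tree-building method such as neighbor joining. No sequence alignment is needed at any step, which is why approaches of this family can be applied to highly divergent genomes.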

Available from: Ka Hou Chu, Jun 25, 2015
  • ABSTRACT: An improved definition of the similarity metric, based on the research of Professor Shihyen Chen, is proposed to solve the problem that the distances derived from mutual information and angle cosine do not satisfy the triangle inequality. The properties of the proposed similarity metric are studied, including its relationship with the distance. The phylogenetic trees of the genomes of 64 vertebrates are constructed by the composition vector method with distances based on the similarity metric and mutual information or angle cosine. The results show that distances of mutual information and angle cosine can be constructed from the improved definition of the similarity metric, and that the phylogenetic trees reconstructed with these distances are in agreement with the currently known phylogenies of the species.
    2013 6th International Conference on Biomedical Engineering and Informatics (BMEI); 12/2013
  • ABSTRACT: There has been growing interest in alignment-free methods for phylogenetic analysis using complete genome data. Among them, the CVTree method, the feature frequency profiles method and the dynamical language approach have been used to investigate the whole-proteome phylogeny of large dsDNA viruses. Using the data set of large dsDNA viruses from Gao and Qi (BMC Evol. Biol. 2007), the phylogenetic results based on the CVTree method and the dynamical language approach were compared in Yu et al. (BMC Evol. Biol. 2010). In this paper, we first apply the dynamical language approach to the data set of large dsDNA viruses from Wu et al. (Proc. Natl. Acad. Sci. USA 2009) and compare our phylogenetic results with those based on the feature frequency profiles method. We then construct the whole-proteome phylogeny of the larger data set that combines the above two data sets. The trees from our analyses are in good agreement with the latest classification of large dsDNA viruses reported by the International Committee on Taxonomy of Viruses (ICTV).
    01/2012; DOI:10.1109/ICNC.2012.6234564
  • ABSTRACT: The generation of a correlation matrix from a large set of long gene sequences is a common requirement in many bioinformatics problems such as phylogenetic analysis. The generation is not only computationally intensive but also requires significant memory resources as, typically, few gene sequences can be simultaneously stored in primary memory. The standard practice in such computation is to use frequent input/output (I/O) operations. Therefore, minimizing the number of these operations will yield much faster run-times. This paper develops an approach for the faster and scalable computing of large-size correlation matrices through the full use of available memory and a reduced number of I/O operations. The approach is scalable in the sense that the same algorithms can be executed on different computing platforms with different amounts of memory and can be applied to different problems with different correlation matrix sizes. The significant performance improvement of the approach over the existing approaches is demonstrated through benchmark examples.
    Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2013 IEEE Symposium on; 01/2013