Whole-proteome phylogeny of large dsDNA viruses and parvoviruses through a composition vector method related to dynamical language model

Department of Biology, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China.
BMC Evolutionary Biology (Impact Factor: 3.37). 06/2010; 10:192. DOI: 10.1186/1471-2148-10-192
Source: PubMed


Background: The vast sequence divergence among different virus groups presents a great challenge to alignment-based analysis of virus phylogeny. Because of the uncertainty in alignment, existing tools for phylogenetic analysis based on multiple alignment cannot be directly applied to whole-genome comparison and phylogenomic studies of viruses. There has been growing interest in alignment-free methods for phylogenetic analysis using complete genome data. Among the alignment-free methods, a dynamical language (DL) method proposed by our group has been successfully applied to the phylogenetic analysis of bacterial and chloroplast genomes.
Results: In this paper, the DL method is used to analyze the whole-proteome phylogeny of 124 large dsDNA viruses and 30 parvoviruses, two data sets that differ greatly in genome size. The trees from our analyses are in good agreement with the latest classification of large dsDNA viruses and parvoviruses by the International Committee on Taxonomy of Viruses (ICTV).
Conclusions: The present method provides a new way of recovering the phylogeny of large dsDNA viruses and parvoviruses, and also offers some insight into the affiliation of a number of unclassified viruses. In comparison, some alignment-free methods such as the CV Tree method can be used to recover the phylogeny of large dsDNA viruses, but they are not suitable for resolving the phylogeny of parvoviruses, which have a much smaller genome size.

Available from: Ka Hou Chu, Sep 30, 2015
  • ABSTRACT: The composition vector (CV) method is an alignment-free method for phylogenetics. Because it is simpler than alignment-based methods, it has been widely discussed in recent years. The CV method has four main steps: (1) count the frequency of each k-string in the sequence; (2) construct the composition vector for the sequence; (3) compute the distance between every pair of composition vectors to form a distance matrix; and (4) construct the phylogenetic tree. In this paper, we review several developments of the CV method.
  • ABSTRACT: Sequence alignment is not directly applicable to whole-genome phylogeny, since events such as rearrangements make full-length alignments impossible. Here, a novel alignment-free method derived from information theory is proposed and used to construct the whole-genome phylogeny of 218 dsDNA viruses from 13 viral families. The method is based on information correlation (IC) and partial information correlation (PIC). We observe that (i) the IC-PIC tree segregates the population into clades whose membership is remarkably consistent with biologists' systematics, with few exceptions; (ii) the IC-PIC tree reveals potential evolutionary relationships among some viral families; and (iii) the IC-PIC tree predicts the taxonomic positions of certain "unclassified" viruses. Our approach provides a new way of recovering the phylogeny of viruses, and has practical applications in developing alignment-free methods for sequence classification.
    Gene 11/2011; 492(1):309-14. DOI:10.1016/j.gene.2011.11.004 · 2.14 Impact Factor
  • ABSTRACT: Pathogens like HIV-1, which evolve over a short time scale into many closely related variants with differing infectivity and evolutionary dynamics, require fast and accurate classification. Conventional alignment-based methods for whole-genome sequences are computationally expensive and involve complex analysis. Alignment-free methodologies are increasingly being used to differentiate genomic variations between viral species. Multifractal analysis, which explores the self-similar nature of genomes, is an alignment-free methodology that has been applied to study such variations. However, whether multifractal analysis can quantify variations between closely related genomes, such as the HIV-1 subtypes, is an open question. Here we address this question by applying multifractal analysis to four retroviral genomes (HIV-1, HIV-2, SIVcpz, and HTLV-1), and demonstrate that individual multifractal properties can easily differentiate between retrovirus types. However, the individual multifractal measures do not resolve within-group variation among the known subtypes of the HIV-1 M group. We show that these known subtypes can instead be classified correctly using a combination of the crucial multifractal measures. This method is simple and computationally fast in comparison with conventional alignment-based methods for whole-genome phylogenetic analysis.
    Molecular Phylogenetics and Evolution 12/2011; 62(2):756-63. DOI:10.1016/j.ympev.2011.11.017 · 3.92 Impact Factor
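
The four steps of the CV method listed in the first related abstract above can be sketched roughly as follows. This is a minimal illustration, not the procedure of any of the papers listed here: the abstracts do not specify the vector definition, the distance, or the tree-building algorithm, so the raw k-string frequencies, the cosine-based distance, the average-linkage (UPGMA) clustering, and all function names are assumptions. Published CV variants typically also subtract a background expectation from the raw frequencies, which is omitted here.

```python
from collections import Counter
from itertools import combinations
import math


def kmer_frequencies(seq, k):
    """Steps 1-2: relative frequency of each k-string, stored as a sparse vector."""
    n = len(seq) - k + 1
    if n <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(n))
    return {kmer: c / n for kmer, c in counts.items()}


def cosine_distance(u, v):
    """Step 3 (assumed distance): (1 - cosine similarity) / 2 of two sparse vectors."""
    dot = sum(x * v.get(kmer, 0.0) for kmer, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return (1.0 - dot / (norm_u * norm_v)) / 2.0


def distance_matrix(seqs, k):
    """Step 3: pairwise distances, keyed by (name_a, name_b) with name_a < name_b."""
    vectors = {name: kmer_frequencies(s, k) for name, s in seqs.items()}
    return {
        (a, b): cosine_distance(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


def upgma(names, dist):
    """Step 4 (stand-in): average-linkage clustering, returned as nested tuples."""
    clusters = {name: [name] for name in names}  # subtree -> its leaf names

    def leaf_distance(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def average_distance(c1, c2):
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(leaf_distance(a, b) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average leaf-to-leaf distance.
        c1, c2 = min(combinations(list(clusters), 2),
                     key=lambda pair: average_distance(*pair))
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters))


if __name__ == "__main__":
    # Toy protein-like strings, invented purely for illustration.
    toy = {"a": "MKVLAAGMKVLAAG", "b": "MKVLAAGMKVLTAG", "c": "WWPPHHWWPPHHWW"}
    d = distance_matrix(toy, k=2)
    print(d)
    print(upgma(sorted(toy), d))
```

In this sketch the two toy sequences that share most of their 2-strings are joined first, and the sequence that shares none joins last. A real analysis would use whole proteomes, a longer k, and a dedicated tree-building tool.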