[Show abstract][Hide abstract] ABSTRACT: Comparing DNA or protein sequences plays an important role in the functional analysis of genomes. Despite many methods available for sequences comparison, few methods retain the information content of sequences. We propose a new approach, the Yau-Hausdorff method, which considers all translations and rotations when seeking the best match of graphical curves of DNA or protein sequences. The complexity of this method is lower than that of any other two dimensional minimum Hausdorff algorithm. The Yau-Hausdorff method can be used for measuring the similarity of DNA sequences based on two important tools: the Yau-Hausdorff distance and graphical representation of DNA sequences. The graphical representations of DNA sequences conserve all sequence information and the Yau-Hausdorff distance is mathematically proved as a true metric. Therefore, the proposed distance can preciously measure the similarity of DNA sequences. The phylogenetic analyses of DNA sequences by the Yau-Hausdorff distance show the accuracy and stability of our approach in similarity comparison of DNA or protein sequences. This study demonstrates that Yau-Hausdorff distance is a natural metric for DNA and protein sequences with high level of stability. The approach can be also applied to similarity analysis of protein sequences by graphic representations, as well as general two dimensional shape matching.
PLoS ONE 09/2015; 10(9):e0136577. DOI:10.1371/journal.pone.0136577 · 3.23 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: According to the WHO, ebolaviruses have resulted in 8818 human deaths in West Africa as of January 2015. To better understand the evolutionary relationship of the ebolaviruses and infer virulence from the relationship, we applied the alignment-free natural vector method to classify the newest ebolaviruses. The dataset includes three new Guinea viruses as well as 99 viruses from Sierra Leone. For the viruses of the family of Filoviridae, both genus label classification and species label classification achieve an accuracy rate of 100%. We represented the relationships among Filoviridae viruses by Unweighted Pair Group Method with Arithmetic Mean (UPGMA) phylogenetic trees and found that the filoviruses can be separated well by three genera. We performed the phylogenetic analysis on the relationship among different species of Ebolavirus by their coding-complete genomes and seven viral protein genes (glycoprotein [GP], nucleoprotein [NP], VP24, VP30, VP35, VP40, and RNA polymerase [L]). The topology of the phylogenetic tree by the viral protein VP24 shows consistency with the variations of virulence of ebolaviruses. The result suggests that VP24 be a pharmaceutical target for treating or preventing ebolaviruses.
DNA and cell biology 03/2015; 34(6). DOI:10.1089/dna.2014.2678 · 2.06 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Multiple sequence alignments (MSA) is a prominent method for classification of DNA sequences, yet it is hampered with inherent limitations in computational complexity. Alignment-free methods have been developed over past decade for more efficient comparison and classification of DNA sequences than MSA. However, most alignment-free methods may lose structural and functional information of DNA sequences because they are based on feature extractions. Therefore, they may not fully reflect the actual differences among DNA sequences. Alignment-free methods with information conservation are needed for more accurate comparison and classification of DNA sequences. We propose a new alignment-free similarity measure of DNA sequences using the Discrete Fourier Transform (DFT). In this method, we map DNA sequences into four binary indicator sequences and apply DFT to the indicator sequences to transform them into frequency domain. The Euclidean distance of full DFT power spectra of the DNA sequences is used as similarity distance metric. To compare the DFT power spectra of DNA sequences with different lengths, we propose an even scaling method to extend shorter DFT power spectra to equal the longest length of the sequences compared. After the DFT power spectra are evenly scaled, the DNA sequences are compared in the same DFT frequency space dimensionality. We assess the accuracy of the similarity metric in hierarchical clustering using simulated DNA and virus sequences. The results demonstrate that the DFT based method is an effective and accurate measure of DNA sequence similarity.
[Show abstract][Hide abstract] ABSTRACT: A nonlinear Tracking-Differentiator is one-input-two-output system that can
generate smooth approximation of measured signals and get the derivatives of
the signals. The nonlinear tracking-Differentiator is explored to denoise and
generate the derivatives of the walks of the 3-periodicity of DNA sequences. An
improved algorithm for gene finding is presented using the nonlinear
Tracking-Differentiator. The gene finding algorithm employs the 3-base
periodicity of coding region. The 3-base periodicity DNA walks are denoised and
tracked using the nonlinear Tracking-Differentiator. Case studies demonstrate
that the nonlinear Tracking-Differentiator is an effective method to improve
the accuracy of the gene finding algorithm.
[Show abstract][Hide abstract] ABSTRACT: A genome space is a moduli space of genomes. In this space, each point corresponds to a genome. The natural distance between two points in the genome space reflects the biological distance between these two genomes. Currently, there is no method to represent genomes by a point in a space without losing biological information. Here, we propose a new graphical representation for DNA sequences. The breakthrough of the subject is that we can construct the moment vectors from DNA sequences using this new graphical method and prove that the correspondence between moment vectors and DNA sequences is one-to-one. Using these moment vectors, we have constructed a novel genome space as a subspace in R(N). It allows us to show that the SARS-CoV is most closely related to a coronavirus from the palm civet not from a bird as initially suspected, and the newly discovered human coronavirus HCoV-HKU1 is more closely related to SARS than to any other known member of group 2 coronavirus. Furthermore, we reconstructed the phylogenetic tree for 34 lentiviruses (including human immunodeficiency virus) based on their whole genome sequences. Our genome space will provide a new powerful tool for analyzing the classification of genomes and their phylogenetic relationships.
DNA Research 04/2010; 17(3):155-68. DOI:10.1093/dnares/dsq008 · 5.48 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: The indispensable prerequisites in characterizing information content of DNA molecules by computational methods are the numerical representations of symbolic DNA sequences. Current numerical representation methods for DNA sequences do not contain the genetic code context information, which may play an important role in defining protein coding regions. We propose a novel numerical representation of DNA sequences based on genetic code context within DNA sequences and explore the feasibility of applying this method to identify protein coding regions in genomes. Computational experiments indicate that incorporating genetic code information into numerical representations is a promising approach in which DNA sequences are uniquely represented and more information is represented so that digital processing tools can be applied to the periodicity analysis in DNA sequences effectively.
Computational Intelligence in Bioinformatics and Computational Biology, 2008. CIBCB '08. IEEE Symposium on; 10/2008
[Show abstract][Hide abstract] ABSTRACT: With the exponential growth of genomic sequences, there is an increasing demand to accurately identify protein coding regions (exons) from genomic sequences. Despite many progresses being made in the identification of protein coding regions by computational methods during the last two decades, the performances and efficiencies of the prediction methods still need to be improved. In addition, it is indispensable to develop different prediction methods since combining different methods may greatly improve the prediction accuracy. A new method to predict protein coding regions is developed in this paper based on the fact that most of exon sequences have a 3-base periodicity, while intron sequences do not have this unique feature. The method computes the 3-base periodicity and the background noise of the stepwise DNA segments of the target DNA sequences using nucleotide distributions in the three codon positions of the DNA sequences. Exon and intron sequences can be identified from trends of the ratio of the 3-base periodicity to the background noise in the DNA sequences. Case studies on genes from different organisms show that this method is an effective approach for exon prediction.
[Show abstract][Hide abstract] ABSTRACT: In this paper, a new algorithm to predict protein-coding regions by tracking the 3-base periodicity of DNA sequences is presented. A nonlinear Tracking-Differentiator (TD) is explored to generate smooth approximation of the trajectories of the 3-periodicity of DNA sequences and compute the derivatives, based on which protein-coding and non-coding regions can be identified. Case studies demonstrate that the nonlinear TD is an effective method to improve the accuracy of the gene finding algorithm.
Proceedings of the IEEE Conference on Decision and Control 01/2006; DOI:10.1109/CDC.2006.377109
[Show abstract][Hide abstract] ABSTRACT: The 3-base periodicity, identified as a pronounced peak at the frequency N/3 (N is the length of the DNA sequence) of the Fourier power spectrum of protein coding regions, is used as a marker in gene-finding algorithms to distinguish protein coding regions (exons) and noncoding regions (introns) of genomes. In this paper, we reveal the explanation of this phenomenon which results from a nonuniform distribution of nucleotides in the three coding positions. There is a linear correlation between the nucleotide distributions in the three codon positions and the power spectrum at the frequency N/3. Furthermore, this study indicates the relationship between the length of a DNA sequence and the variance of nucleotide distributions and the average Fourier power spectrum, which is the noise signal in gene-finding methods. The results presented in this paper provide an efficient way to compute the Fourier power spectrum at N/3 and the noise signal in gene-finding methods by calculating the nucleotide distributions in the three codon positions.