ABSTRACT: A nonlinear tracking-differentiator is a one-input, two-output system that
generates a smooth approximation of a measured signal together with its
derivative. Here the nonlinear tracking-differentiator is used to denoise the
3-base periodicity walks of DNA sequences and to compute their derivatives. An
improved gene-finding algorithm based on the nonlinear tracking-differentiator
is presented. The algorithm exploits the 3-base periodicity of coding regions:
the 3-base periodicity DNA walks are denoised and tracked using the nonlinear
tracking-differentiator. Case studies demonstrate that the nonlinear
tracking-differentiator improves the accuracy of the gene-finding algorithm.
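To illustrate the one-input, two-output behaviour described above, the following is a minimal sketch of a discrete-time nonlinear tracking-differentiator. It assumes Han's `fhan` time-optimal form, which the abstract does not specify. The names `fhan` and `track` and the parameters `r` (speed factor), `h` (sampling step), and `h0` (filter factor) are illustrative assumptions, not the authors' implementation.

```python
import math


def fhan(x1, x2, r, h):
    """Han's discrete time-optimal control function (assumed form)."""
    d = r * h
    d0 = h * d
    y = x1 + h * x2
    a0 = math.sqrt(d * d + 8.0 * r * abs(y))
    if abs(y) > d0:
        # Outside the boundary layer: follow the nonlinear switching curve.
        a = x2 + math.copysign((a0 - d) / 2.0, y)
    else:
        # Inside the boundary layer: linear blend.
        a = x2 + y / h
    if abs(a) > d:
        return -math.copysign(r, a)
    return -r * a / d


def track(signal, r=50.0, h=0.01, h0=None):
    """Return (smoothed signal, derivative estimate) for a sampled signal.

    r  : speed factor (larger tracks faster, filters less)
    h  : sampling step of the input samples
    h0 : filter factor passed to fhan; defaults to h
    """
    if h0 is None:
        h0 = h
    x1, x2 = signal[0], 0.0  # start on the first sample with zero slope
    smooth, deriv = [], []
    for v in signal:
        u = fhan(x1 - v, x2, r, h0)
        x1, x2 = x1 + h * x2, x2 + h * u
        smooth.append(x1)
        deriv.append(x2)
    return smooth, deriv


# Usage: a slow ramp of slope 0.2 sampled every 0.01 time units.
ramp = [0.2 * 0.01 * k for k in range(300)]
smooth, deriv = track(ramp)
```

On this ramp the second output settles at the slope of the input, while the first output trails the input by a small lag. In Han's formulation, choosing `h0` larger than `h` is commonly used to suppress noise more strongly at the cost of additional lag.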
ABSTRACT: A genome space is a moduli space of genomes. In this space, each point corresponds to a genome, and the natural distance between two points reflects the biological distance between the two genomes. Currently, there is no method to represent a genome as a point in a space without losing biological information. Here, we propose a new graphical representation for DNA sequences. The key result is that moment vectors can be constructed from DNA sequences using this graphical method, and we prove that the correspondence between moment vectors and DNA sequences is one-to-one. Using these moment vectors, we have constructed a novel genome space as a subspace of R^N. It allows us to show that SARS-CoV is most closely related to a coronavirus from the palm civet, not from a bird as initially suspected, and that the newly discovered human coronavirus HCoV-HKU1 is more closely related to SARS-CoV than to any other known group 2 coronavirus. Furthermore, we reconstructed the phylogenetic tree for 34 lentiviruses (including human immunodeficiency virus) based on their whole-genome sequences. Our genome space provides a powerful new tool for classifying genomes and analyzing their phylogenetic relationships.
DNA Research 04/2010; 17(3):155-68. · 4.43 Impact Factor
ABSTRACT: With the exponential growth of genomic sequence data, there is an increasing demand for accurate identification of protein-coding regions (exons) in genomic sequences. Despite much progress in computational identification of protein-coding regions over the last two decades, the performance and efficiency of prediction methods still need to be improved. In addition, it is important to develop different prediction methods, since combining methods may greatly improve prediction accuracy. This paper develops a new method to predict protein-coding regions, based on the fact that most exon sequences have a 3-base periodicity, while intron sequences do not. The method computes the 3-base periodicity and the background noise of stepwise DNA segments of the target sequence using the nucleotide distributions in the three codon positions. Exon and intron sequences can then be identified from trends in the ratio of the 3-base periodicity to the background noise. Case studies on genes from different organisms show that this method is an effective approach to exon prediction.
Journal of Theoretical Biology 09/2007; 247(4):687-94. · 2.35 Impact Factor
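The ratio of 3-base periodicity to background noise described in this abstract can be sketched from nucleotide counts alone. The sketch below is illustrative and makes several assumptions that the abstract does not confirm: a fixed-length sliding window stands in for the "stepwise DNA segments", the periodicity signal is the summed spectral power at frequency N/3 expressed through codon-position counts, and the noise is the mean power over the nonzero frequencies obtained from Parseval's identity. All function names and the `window` and `step` defaults are hypothetical.

```python
def codon_position_counts(seq):
    """counts[x][j] = occurrences of nucleotide x at indices n with n % 3 == j."""
    counts = {x: [0, 0, 0] for x in "ACGT"}
    for n, base in enumerate(seq.upper()):
        if base in counts:  # ambiguous symbols such as 'N' are skipped
            counts[base][n % 3] += 1
    return counts


def periodicity_signal(counts):
    """Summed power at frequency N/3, written in terms of position counts."""
    total = 0
    for f1, f2, f3 in counts.values():
        total += f1 * f1 + f2 * f2 + f3 * f3 - (f1 * f2 + f1 * f3 + f2 * f3)
    return total


def background_noise(counts, n):
    """Mean power over frequencies 1..n-1 (assumed noise definition).

    For one indicator sequence with N_x ones, Parseval gives a total power
    of n * N_x, of which N_x**2 sits in the zero-frequency term.
    """
    return sum(sum(f) * (n - sum(f)) for f in counts.values()) / (n - 1)


def stepwise_ratio(seq, window=120, step=3):
    """Signal-to-noise ratio for each window; window should be a multiple of 3."""
    ratios = []
    for start in range(0, len(seq) - window + 1, step):
        counts = codon_position_counts(seq[start:start + window])
        noise = background_noise(counts, window)
        ratios.append(periodicity_signal(counts) / noise if noise > 0 else 0.0)
    return ratios


# Usage: a strongly 3-periodic stretch followed by a period-4 stretch.
ratios = stepwise_ratio("ATG" * 40 + "ACGT" * 30)
```

In this toy input the ratio is high over the 3-periodic stretch and falls to zero once the window covers only the period-4 stretch, which is the kind of trend the abstract uses to separate exons from introns.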
ABSTRACT: In this paper, a new algorithm is presented that predicts protein-coding regions by tracking the 3-base periodicity of DNA sequences. A nonlinear tracking-differentiator (TD) is used to generate a smooth approximation of the 3-base periodicity trajectories of DNA sequences and to compute their derivatives, from which protein-coding and non-coding regions can be identified. Case studies demonstrate that the nonlinear TD improves the accuracy of the gene-finding algorithm.
ABSTRACT: The 3-base periodicity, seen as a pronounced peak at frequency N/3 (where N is the length of the DNA sequence) in the Fourier power spectrum of protein-coding regions, is used as a marker in gene-finding algorithms to distinguish protein-coding regions (exons) from noncoding regions (introns) of genomes. In this paper, we explain this phenomenon, which results from a nonuniform distribution of nucleotides in the three codon positions. There is a linear correlation between the nucleotide distributions in the three codon positions and the power spectrum at frequency N/3. Furthermore, this study establishes the relationship between the length of a DNA sequence, the variance of its nucleotide distributions, and the average Fourier power spectrum, which is the noise signal in gene-finding methods. These results provide an efficient way to compute the Fourier power spectrum at N/3 and the noise signal in gene-finding methods by calculating the nucleotide distributions in the three codon positions.
Journal of Computational Biology 12/2005; 12(9):1153-65. · 1.56 Impact Factor
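The link between codon-position counts and the spectral peak at N/3 can be checked numerically. For a binary indicator sequence of one nucleotide with counts F1, F2, F3 in the three codon positions, and N a multiple of 3, the squared magnitude of the discrete Fourier transform at frequency N/3 equals F1^2 + F2^2 + F3^2 - (F1*F2 + F1*F3 + F2*F3). The sketch below compares that expression with a direct transform. The function names and the example sequence are illustrative assumptions, not taken from the paper.

```python
import cmath


def power_at_third_dft(seq):
    """Summed power at frequency N/3 from a direct DFT of each indicator sequence."""
    n = len(seq)  # assumed to be a multiple of 3
    k = n // 3
    total = 0.0
    for x in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * k * m / n)
            for m, base in enumerate(seq)
            if base == x
        )
        total += abs(coeff) ** 2
    return total


def power_at_third_counts(seq):
    """The same quantity computed only from codon-position counts."""
    total = 0
    for x in "ACGT":
        f = [0, 0, 0]
        for m, base in enumerate(seq):
            if base == x:
                f[m % 3] += 1
        f1, f2, f3 = f
        total += f1 * f1 + f2 * f2 + f3 * f3 - (f1 * f2 + f1 * f3 + f2 * f3)
    return total


# Usage: both routes agree up to floating-point rounding.
example = "ATGGCCATTGTAATGGGCCGCTGA"  # hypothetical 24-nucleotide sequence
difference = abs(power_at_third_dft(example) - power_at_third_counts(example))
```

The count-based route needs a single pass over the sequence and no complex arithmetic, which is what makes it attractive when the quantity has to be recomputed over many segments.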