Shuanhu Wu

Georgia State University, Atlanta, Georgia, United States

Publications (18), 12.78 Total impact

  • ABSTRACT: Please cite the following paper if you use this resource: A.W.C. Liew, J. Xian, S. Wu, D. Smith, and H. Yan, "Spectral Estimation in Unevenly Sampled Space of Periodically Expressed Microarray Time Series Data", BMC Bioinformatics, 8:137, 24 April 2007.
  • ABSTRACT: In this paper, we present a new modeling strategy for the recognition and prediction of promoter regions. Our model rests on two considerations: (1) a promoter region comprises a number of binding sites (consensus sequences) to which RNA polymerase II can bind to start transcription of the gene, and different promoters can be characterized by different combinations of binding sites; (2) the spacing of these binding sites is not always consistent, and there is some nucleotide variation at some positions across genes and species. Based on these considerations, we first split the promoter region into equal intervals and, from the training sets, calculate for each interval the occurrence probability of each word that is assumed to be a binding-site sequence. We combine these interval probabilities into one matrix, which we refer to as the Interval Position Weight Matrix (IPWM); a new promoter modeling strategy and feature extraction method are then introduced based on a maximal probability model and the IPWM. Tests on large genomic sequences and comparisons with several well-known algorithms show that our algorithm is efficient, with higher sensitivity and specificity.
    Eighth International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2011, 26-28 July 2011, Shanghai, China; 01/2011
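The interval-wise word probabilities described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `interval_word_probabilities`, the parameters `word_len`, `n_intervals` and `pseudocount`, and the toy training sequences are all assumptions, and the sketch covers only the probability table, not the maximal probability model built on top of it.

```python
from collections import Counter
from itertools import product


def interval_word_probabilities(sequences, word_len=4, n_intervals=2, pseudocount=1.0):
    """Return one {word: probability} table per equal-length interval.

    Hypothetical helper: assumes aligned training sequences of equal length
    over the alphabet ACGT, smoothed with an additive pseudocount.
    """
    seq_len = len(sequences[0])
    if any(len(s) != seq_len for s in sequences):
        raise ValueError("all training sequences must have the same length")
    interval_len = seq_len // n_intervals
    if interval_len < word_len:
        raise ValueError("intervals are shorter than the word length")
    words = ["".join(p) for p in product("ACGT", repeat=word_len)]
    matrix = []
    for i in range(n_intervals):
        start = i * interval_len
        counts = Counter()
        for seq in sequences:
            # Words are counted inside the interval only; they never span
            # an interval boundary in this sketch.
            segment = seq[start:start + interval_len]
            for j in range(len(segment) - word_len + 1):
                counts[segment[j:j + word_len]] += 1
        total = sum(counts[w] for w in words) + pseudocount * len(words)
        matrix.append({w: (counts[w] + pseudocount) / total for w in words})
    return matrix


# Toy data (made up): "TATA" occurs only in the first interval.
training = ["TATACCCC", "TATAGGCC"]
ipwm = interval_word_probabilities(training)
print(ipwm[0]["TATA"], ipwm[1]["TATA"])
```

Each row of the returned table sums to one, so a candidate sequence could be scored interval by interval against it; how the scores are combined is left open here.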
  • ABSTRACT: In this paper, an effective promoter identification algorithm is proposed. The algorithm is based on the following features of promoters: (I) Promoter regions include binding sites where RNA polymerase II binds and where transcription starts. These binding sites include core-promoter elements such as the TATA-box and GC-box. However, the spacing structure of binding sites is not always consistent, and the same kind of binding site often differs in structure between promoter regions because of nucleotide variation. (II) The positions of binding sites in the gene are not fixed; instead, they tend to fluctuate within an approximate region. Based on these two features, we first disregard structural differences between binding sites caused by nucleotide variation. In other words, binding motifs that are similar in structure but appear in different forms because of nucleotide variation are treated as one binding motif. Second, we divide promoter regions into equal-length intervals and calculate the occurrence probability of binding sites in each interval. We introduce a new concept, the “Interval Weight Matrix (IWM)”, to reflect the relationship between intervals and the occurrence probability of binding sites. A new promoter identification system is then proposed. Tests on large sequences and comparisons with other well-known systems show that our algorithm performs much better than those systems in reducing false positives (FP).
    01/2011;
  • ABSTRACT: Gene expression data analysis is very important for research on gene regulatory mechanisms. Genes that exhibit similar patterns are often functionally related. In this paper, a novel bicluster detection method is proposed. Its advantage is that it not only makes use of traditional data clustering methods but also forms a systematic architecture. The procedure is divided into two parts. The first uses an existing clustering method to cluster all the 2-combinations of the data matrix in the direction where the dimensionality is smaller; based on the clustering results, binary tables are created. The second part verifies the concatenated quasi-biclusters. Since the data in the same bicluster are highly correlated with each other, an efficient verification method based on principal component analysis (PCA) is applied, which can also work in a noisy environment. The whole process aims to find all possible biclusters, from large to small.
    FSKD 2009, Sixth International Conference on Fuzzy Systems and Knowledge Discovery, Tianjin, China, 14-16 August 2009, 6 Volumes; 01/2009
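One way to picture the PCA-based verification step in the abstract above is to ask how much of the variance of a candidate block is captured by its first principal component. The sketch below is an assumption-laden illustration rather than the paper's procedure: the function name `first_component_ratio`, the row standardisation, and the use of power iteration are all choices made here, and no acceptance threshold from the paper is implied.

```python
import math
import random


def first_component_ratio(rows, iterations=200, seed=0):
    """Share of total variance captured by the first principal component.

    Each row (e.g. one gene across conditions) is centred and scaled to unit
    norm, so a value near 1.0 means the rows are almost perfectly correlated
    (or anti-correlated).
    """
    z = []
    for row in rows:
        mean = sum(row) / len(row)
        centred = [x - mean for x in row]
        norm = math.sqrt(sum(x * x for x in centred))
        if norm == 0.0:
            raise ValueError("constant rows cannot be standardised")
        z.append([x / norm for x in centred])
    n = len(z)
    # Gram matrix of the standardised rows; its trace equals n.
    gram = [[sum(a * b for a, b in zip(z[i], z[j])) for j in range(n)]
            for i in range(n)]
    # Power iteration for the largest eigenvalue, from a seeded random start.
    rng = random.Random(seed)
    v = [rng.uniform(0.5, 1.5) for _ in range(n)]
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    gv = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
    rayleigh = sum(a * b for a, b in zip(v, gv)) / sum(a * a for a in v)
    return rayleigh / n


# Toy blocks (made up): the first has perfectly correlated rows.
coherent = [[1, 2, 3, 4], [2, 4, 6, 8], [0, 1, 2, 3]]
incoherent = [[1, 2, 3, 4], [1, -1, 1, -1], [1, 1, -1, -1]]
print(first_component_ratio(coherent), first_component_ratio(incoherent))
```

A block whose ratio stays close to 1.0 after concatenation would be a plausible bicluster under this reading; noise lowers the ratio gradually rather than abruptly.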
  • ABSTRACT: Microarray data biclustering is very helpful for research on gene regulatory mechanisms. Genes exhibiting similar expression patterns provide useful clues for studying their possible functions. In this paper, a novel bicluster detection method is proposed. Unlike other approaches, biclusters are not detected directly from the whole experimental data matrix; instead, they are verified from the concatenation of small biclusters, which are first detected using a conventional clustering method such as K-means, so as to make full use of the rich and powerful existing data clustering methods. In this way, the effect of the high dimensionality of the data is greatly reduced. Since the data within a bicluster are highly correlated with each other, an efficient verification method based on principal component analysis is applied to concatenate small biclusters into a larger one. Some experimental results on simulated data are presented.
    Proceedings of the 2nd International Conference on BioMedical Engineering and Informatics, BMEI 2009, October 17-19, 2009, Tianjin, China; 01/2009
  • ABSTRACT: Cluster analysis is an important tool for discovering the structures and patterns hidden in gene expression data. In this paper, a new algorithm for clustering gene expression profiles is proposed. In this method, we find natural clusters in the data based on a competitive learning strategy. Using partially known modes as constraints, we reduce the sensitivity of the clustering procedure to the algorithm initialization and produce more reliable results. The proposed algorithm can also give a correct estimate of the number of clusters in the data. Experiments on simulated and real gene expression data demonstrate the robustness of our method. Comparative studies with several other clustering algorithms illustrate the effectiveness of our method.
    International Conference on Intelligent Computation Technology and Automation; 10/2008; 1:44-51.
  • Xudong Xie, Shuanhu Wu, Hong Yan
    Machine Learning in Bioinformatics, 04/2008: pages 301-319; ISBN: 9780470397428
  • ABSTRACT: We propose a new strategy to analyse the periodicity of gene expression profiles using Singular Spectrum Analysis (SSA) and Autoregressive (AR) model based spectral estimation. By combining the advantages of SSA and AR modelling, more periodic genes are extracted in the Plasmodium falciparum data set, compared with the classical Fourier analysis technique. We are able to identify more gene targets for new drug discovery, and by checking against the seven well-known malaria vaccine candidates, we have found five additional genes that warrant further biological verification.
    International Journal of Bioinformatics Research and Applications 01/2008; 4(3):337-49.
  • ABSTRACT: Computational prediction of eukaryotic promoters is one of the most elusive problems in DNA sequence analysis. Although considerable effort has been devoted to this problem and a number of algorithms have been developed in the last few years, their performance still needs to improve. In this work, we developed a new algorithm called PPFB for promoter prediction based on the following hypothesis: a promoter is determined by some motifs or word patterns, and different promoters are determined by different motifs. We select the most promising motifs (i.e. features) by the divergence distance between two classes and construct a classifier by feature boosting. Unlike other classifiers, we adopt a different training and classification strategy. Computational results on large genomic sequences and comparisons with several leading algorithms show that our method is efficient, with better sensitivity and specificity.
    Advances in Neural Networks - ISNN 2008, 5th International Symposium on Neural Networks, ISNN 2008, Beijing, China, September 24-28, 2008, Proceedings, Part I; 01/2008
  • ABSTRACT: Eukaryotic promoter prediction is one of the most important problems in DNA sequence analysis, but also a very difficult one. Although a number of algorithms have been proposed, their performance is still limited by low sensitivity and high false positive rates. We present a method for improving the performance of promoter region prediction. We focus on the selection of the most effective features for different functional regions in DNA sequences. Our feature selection algorithm is based on relative entropy, or Kullback-Leibler divergence, and a system combined with position-specific information for promoter region prediction is developed. The results of testing on large genomic sequences and comparisons with PromoterInspector and Dragon Promoter Finder show that our algorithm is efficient, with higher sensitivity and specificity in predicting promoter regions.
    Physical Review E 05/2007; 75(4 Pt 1):041908. · 2.31 Impact Factor
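Divergence-based feature selection of the kind mentioned in the abstract above can be illustrated by ranking words by their contribution to a divergence between two classes. The sketch is hypothetical throughout: the function names, the additive pseudocount, the toy sequences, and the choice of the symmetrised per-word term (p - q) * log(p / q) are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter
from itertools import product


def word_distribution(sequences, k, pseudocount=1.0):
    """Smoothed probability of each length-k word over the alphabet ACGT."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def rank_words_by_divergence(positives, negatives, k=2):
    """Sort words by their term in the symmetrised KL divergence.

    Each term (p - q) * log(p / q) is non-negative, and the terms sum to
    KL(p || q) + KL(q || p).
    """
    p = word_distribution(positives, k)
    q = word_distribution(negatives, k)
    scores = {w: (p[w] - q[w]) * math.log(p[w] / q[w]) for w in p}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy classes (made up): CG-rich "promoter-like" vs AT-rich background.
promoter_like = ["CGCGCG", "CGCGAT"]
background = ["ATATAT", "TTATAA"]
for word, score in rank_words_by_divergence(promoter_like, background)[:3]:
    print(word, round(score, 4))
```

Keeping only the top-ranked words gives a compact feature set; the position-specific information the abstract mentions would need a separate table per region and is not modelled here.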
  • ABSTRACT: Periodogram analysis of time series is widespread in biology. A new challenge in analyzing microarray time series data is to identify genes that are periodically expressed. This challenge arises because the observed time series usually exhibit non-idealities, such as noise, short length, and unevenly sampled time points. Most methods used in the literature operate on evenly sampled time series and are not suitable for unevenly sampled time series. For evenly sampled data, methods based on the classical Fourier periodogram are often used to detect periodically expressed genes. Recently, the Lomb-Scargle algorithm has been applied to unevenly sampled gene expression data for spectral estimation. However, since the Lomb-Scargle method assumes that there is a single stationary sinusoid wave with infinite support, it introduces spurious periodic components in the periodogram for data with a finite length. In this paper, we propose a new spectral estimation algorithm for unevenly sampled gene expression data. The new method is based on signal reconstruction in a shift-invariant signal space, where a direct spectral estimation procedure is developed using the B-spline basis. Experiments on simulated noisy gene expression profiles show that our algorithm is superior to the Lomb-Scargle algorithm and the classical Fourier periodogram based method in detecting periodically expressed genes. We have applied our algorithm to the Plasmodium falciparum and yeast gene expression data, and the results show that the algorithm is able to detect biologically meaningful periodically expressed genes. We have proposed an effective method for identifying periodic genes in unevenly sampled space of microarray time series gene expression data. The method can also be used as an effective tool for gene expression time series interpolation or resampling.
    BMC Bioinformatics 02/2007; 8:137. · 3.02 Impact Factor
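For orientation, the Lomb-Scargle baseline that the abstract above compares against can be written in a few lines. This sketch implements the classical normalised Lomb-Scargle periodogram only; it does not attempt the B-spline, shift-invariant-space method the paper proposes. The function name `lomb_scargle`, the synthetic sampling times, and the candidate periods are assumptions chosen for illustration.

```python
import math


def lomb_scargle(times, values, angular_freqs):
    """Classical normalised Lomb-Scargle power at each angular frequency."""
    n = len(times)
    mean = sum(values) / n
    x = [v - mean for v in values]
    variance = sum(c * c for c in x) / (n - 1)
    powers = []
    for w in angular_freqs:
        # Time offset tau that makes the sine and cosine terms orthogonal.
        tau = math.atan2(sum(math.sin(2 * w * t) for t in times),
                         sum(math.cos(2 * w * t) for t in times)) / (2 * w)
        cos_terms = [math.cos(w * (t - tau)) for t in times]
        sin_terms = [math.sin(w * (t - tau)) for t in times]
        power = 0.0
        for basis in (cos_terms, sin_terms):
            denom = sum(b * b for b in basis)
            if denom > 0.0:
                num = sum(xi * b for xi, b in zip(x, basis))
                power += num * num / denom
        powers.append(power / (2 * variance))
    return powers


# Synthetic, unevenly sampled profile with a true period of 10 time units.
times = [j + 0.3 * math.sin(1.7 * j) for j in range(40)]
values = [math.sin(2 * math.pi * t / 10.0) for t in times]
periods = [4.0, 6.0, 10.0, 16.0, 25.0]
powers = lomb_scargle(times, values, [2 * math.pi / p for p in periods])
best_period = periods[powers.index(max(powers))]
print(best_period)
```

On short or noisy profiles the same routine can show the spurious peaks the abstract describes, which is the motivation given there for a reconstruction-based estimator.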
  • ABSTRACT: In this paper, an effective promoter detection algorithm, which is called PromoterExplorer, is proposed. In our approach, various features, i.e. local distribution of pentamers, positional CpG island features and digitized DNA sequence, are combined to build a high-dimensional input vector. A cascade AdaBoost based learning procedure is adopted to select the most "informative" or "discriminating" features to build a sequence of weak classifiers. A number of weak classifiers construct a strong classifier, which can achieve a better performance. In order to reduce false positives, a cascade structure is used for detection. PromoterExplorer is tested on large-scale DNA sequences from different databases, including EPD, GenBank and human chromosome 22. The proposed method consistently outperforms PromoterInspector and Dragon Promoter Finder.
    Proceedings of 5th Asia-Pacific Bioinformatics Conference, APBC 2007, 15-17 January 2007, Hong Kong, China; 01/2007
  • ABSTRACT: MOTIVATION: Promoter prediction is important for the analysis of gene regulation. Although a number of promoter prediction algorithms have been reported in the literature, significant improvement in prediction accuracy remains a challenge. In this paper, an effective promoter identification algorithm, which is called PromoterExplorer, is proposed. In our approach, we analyze the different roles of various features, that is, local distribution of pentamers, positional CpG island features and digitized DNA sequence, and then combine them to build a high-dimensional input vector. A cascade AdaBoost-based learning procedure is adopted to select the most 'informative' or 'discriminating' features to build a sequence of weak classifiers, which are combined to form a strong classifier so as to achieve a better performance. The cascade structure used for identification can also reduce false positives. RESULTS: PromoterExplorer is tested on large-scale DNA sequences from different databases, including the EPD, DBTSS, GenBank and human chromosome 22. Experimental results show that consistent and promising performance can be achieved.
    Bioinformatics 12/2006; 22(22):2722-8. · 5.47 Impact Factor
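The way a cascade of boosted classifiers can cut false positives, as described in the abstract above, is easiest to see at prediction time. The sketch below shows inference only, with hand-written weak classifiers standing in for trained ones: the tuple layout `(feature_index, cutoff, polarity, alpha)`, the stage thresholds, and the feature values are all hypothetical and are not taken from PromoterExplorer.

```python
def strong_classify(features, weak_classifiers, threshold):
    """Weighted vote of decision stumps, as in an AdaBoost strong classifier.

    Each weak classifier is a hypothetical tuple
    (feature_index, cutoff, polarity, alpha); it votes +1 when
    polarity * (feature - cutoff) > 0 and -1 otherwise.
    """
    score = 0.0
    for index, cutoff, polarity, alpha in weak_classifiers:
        vote = 1 if polarity * (features[index] - cutoff) > 0 else -1
        score += alpha * vote
    return score >= threshold


def cascade_classify(features, stages):
    """Accept only if every stage accepts; reject at the first failing stage."""
    for weak_classifiers, threshold in stages:
        if not strong_classify(features, weak_classifiers, threshold):
            return False
    return True


# Made-up two-stage cascade: a cheap first stage, a stricter second stage.
stages = [
    ([(0, 0.5, 1, 1.0)], 0.0),
    ([(1, 0.2, 1, 0.7), (2, 0.9, -1, 0.4)], 0.0),
]
print(cascade_classify([0.8, 0.3, 0.1], stages))
print(cascade_classify([0.1, 0.3, 0.1], stages))
```

Because most non-promoter windows are rejected by the early stages, later stages see fewer candidates, and a window must pass every stage to be reported.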
  • ABSTRACT: Spectral analysis of DNA microarray gene expression time series data is important for understanding the regulation of gene expression and gene function of Plasmodium falciparum in the intraerythrocytic developmental cycle. In this paper, we propose a new strategy to analyze the cell cycle regulation of gene expression profiles based on the combination of singular spectrum analysis (SSA) and autoregressive (AR) spectral estimation. Using SSA, we extract the dominant trend of the data and reduce the effect of noise. Based on the AR analysis, high-resolution spectra can be produced. Experimental results show that our method can extract more genes, and the information can be useful for new drug design.
    01/2006;
  • ABSTRACT: In this paper, a new feature extraction method and clustering scheme in spectral space for gene expression data is proposed. We model each member of a cluster as the sum of the cluster's representative term and an experimental artifact term. More compact clusters, and hence better clustering results, can be obtained by extracting essential features or reducing experimental artifacts. In view of the periodicity of gene expression profile data, feature extraction is performed in the DCT domain by a soft-thresholding de-noising method. The clustering process is based on the OPTOC competitive learning strategy. Results on clustering real gene expression profiles show that our method is better than clustering directly in the original space.
    01/2005;
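Soft-thresholding in the DCT domain, as mentioned in the abstract above, can be sketched with an orthonormal DCT-II and its inverse. This is an illustrative sketch under stated assumptions: the function names, the example profile, and the threshold value are made up, and the paper's rule for choosing the threshold is not reproduced.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a list of floats."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, x in enumerate(signal)))
    return out


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    out = []
    for i in range(n):
        out.append(sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c
                       * math.cos(math.pi * (i + 0.5) * k / n)
                       for k, c in enumerate(coeffs)))
    return out


def soft_threshold(value, threshold):
    """Shrink a coefficient towards zero by `threshold`, keeping its sign."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def denoise(profile, threshold):
    """Transform, shrink every DCT coefficient, and transform back."""
    return idct([soft_threshold(c, threshold) for c in dct(profile)])


# Made-up expression profile; the threshold of 0.5 is arbitrary.
profile = [1.0, 2.5, 0.5, -1.0, -2.0, 0.3, 1.8, 2.2]
print([round(v, 3) for v in denoise(profile, 0.5)])
```

The de-noised profiles, or the retained coefficients themselves, could then be passed to a clustering routine; the OPTOC step is outside the scope of this sketch.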
  • ABSTRACT: Cluster analysis of gene expression data from a cDNA microarray is useful for identifying biologically relevant groups of genes. However, finding the natural clusters in the data and estimating the correct number of clusters are still two largely unsolved problems. In this paper, we propose a new clustering framework that is able to address both these problems. By using the one-prototype-take-one-cluster (OPTOC) competitive learning paradigm, the proposed algorithm can find natural clusters in the input data, and the clustering solution is not sensitive to initialization. In order to estimate the number of distinct clusters in the data, we propose a cluster splitting and merging strategy. We have applied the new algorithm to simulated gene expression data for which the correct distribution of genes over clusters is known a priori. The results show that the proposed algorithm can find natural clusters and give the correct number of clusters. The algorithm has also been tested on real gene expression changes during yeast cell cycle, for which the fundamental patterns of gene expression and assignment of genes to clusters are well understood from numerous previous studies. Comparative studies with several clustering algorithms illustrate the effectiveness of our method.
    IEEE Transactions on Information Technology in Biomedicine 04/2004; 8(1):5-15. · 1.98 Impact Factor
  • ABSTRACT: Cluster analysis of gene expression data is useful for identifying biologically relevant groups of genes. However, finding the correct clusters in the data and estimating the correct number of clusters are still two largely unsolved problems. In this paper, we propose a new clustering framework that is able to address both these problems. By using the one-prototype-take-one-cluster (OPTOC) competitive learning paradigm, the proposed algorithm can find natural clusters in the input data, and the clustering solution is not sensitive to initialization. In order to estimate the number of distinct clusters in the data, an over-clustering and merging strategy is proposed. For validation, we applied the new algorithm to both simulated gene expression data and real gene expression data (expression changes during yeast cell cycle). The results clearly indicate the effectiveness of our method.
    Proceedings of the 2003 Joint Conference of the Fourth International Conference on Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia; 01/2004

Publication Stats

115 Citations
12.78 Total Impact Points

Institutions

  • 2008
    • Georgia State University
      Atlanta, Georgia, United States
  • 2005–2008
    • Yantai University
      • School of Computer Science & Technology
      Chifu, Shandong Sheng, China
  • 2007
    • Griffith University
      • School of Information and Communication Technology (ICT)
      Southport, Queensland, Australia
  • 2004–2007
    • The University of Hong Kong
      • Department of Electrical and Electronic Engineering
      Hong Kong, Hong Kong
  • 2004–2006
    • City University of Hong Kong
      • Department of Electronic Engineering
      Chiu-lung, Kowloon City, Hong Kong