Lei Zhu

Hangzhou Dianzi University, Hang-hsien, Zhejiang Sheng, China

Publications (26) · 10.39 Total impact

  • Bin Han · Ruifei Xie · Shixiu Wu · Lihua Li · Lei Zhu
    ABSTRACT: Variation in the expression of genes arises from a variety of sources. It is important to remove sources of variation between arrays of non-biological origin. Non-biological variation, caused by lurking confounding factors, usually attracts little attention, although it may substantially influence the expression profile of genes. In this study, we proposed a method which is able to identify the potential confounding factors and highlight the non-biological variations. We also developed methods and statistical tests to study the confounding factors and their influence on the homogeneity of microarray data, gene selection, and disease classification. We explored an ovarian cancer gene expression profile and showed that data batches and arraying conditions are two confounding factors. Their influence on the homogeneity of data, gene selection, and disease classification was statistically analyzed. Experiments showed that after normalization, their influences were removed. Comparative studies further showed that the data became more homogeneous and the classification quality was improved. This research demonstrated that identifying and reducing the impact of confounding factors is paramount in making sense of gene-disease association analysis.
    No preview · Article · Mar 2015 · Cancer biomarkers: section A of Disease markers
  • Bin Han · Haifeng Lai · Ruifei Xie · Lihua Li · Lei Zhu
    ABSTRACT: Identifying glioma cancer-alerted genetic markers through analysis of microarray data allows us to detect tumours at the genome-wide level. To this end, we propose to identify glioma gene markers based primarily on their correlation with the glioma diagnostic outcomes, rather than merely on the classification quality or differential expression levels, because what matters most is not the classification quality or expression level per se but the selection of biologically relevant biomarkers. With the help of singular value decomposition, microarray data are decomposed and the eigenvectors corresponding to the biological effect of diagnostic outcomes are identified. Genes that play important roles in determining this biological effect are thus detected. Therefore, genes are essentially identified in terms of their strength of association with diagnostic outcomes. Monte Carlo simulations are then used to fine-tune the selected gene set in terms of classification accuracy. Experiments show that the proposed method achieves better classification accuracies and is independent of the data set. Graph-based statistical analysis showed that the selected genes have close relationships with glioma diagnostic outcomes. Further biological database and literature study confirms that the identified genes are biologically relevant.
    Full-text · Article · Nov 2014 · International Journal of Data Mining and Bioinformatics
  • Bin Han · Haifeng Lai · Ruifei Xie · Lihua Li · Lei Zhu
    ABSTRACT: Identifying glioma cancer-alerted genetic markers through analysis of microarray data allows us to detect tumours at the genome-wide level. To this end, we propose to identify glioma gene markers based primarily on their correlation with the glioma diagnostic outcomes, rather than merely on the classification quality or differential expression levels, because what matters most is not the classification quality or expression level per se but the selection of biologically relevant biomarkers. With the help of singular value decomposition, microarray data are decomposed and the eigenvectors corresponding to the biological effect of diagnostic outcomes are identified. Genes that play important roles in determining this biological effect are thus detected. Therefore, genes are essentially identified in terms of their strength of association with diagnostic outcomes. Monte Carlo simulations are then used to fine-tune the selected gene set in terms of classification accuracy. Experiments show that the proposed method achieves better classification accuracies and is independent of the data set. Graph-based statistical analysis showed that the selected genes have close relationships with glioma diagnostic outcomes. Further biological database and literature study confirms that the identified genes are biologically relevant.
    No preview · Article · Jan 2014 · International Journal of Data Mining and Bioinformatics
  • Bin Han · Ruifei Xie · Lihua Li · Lei Zhu · Shen Wang
    ABSTRACT: Extracting significant features from high-dimension and small sample size biological data is a challenging problem. Recently, Michał Draminski proposed the Monte Carlo feature selection (MC) algorithm, which was able to search over large feature spaces and achieved better classification accuracies. However, in MC the information of feature rank variations is not utilized and the ranks of features are not dynamically updated. Here, we propose a novel feature selection algorithm which integrates the ideas of professional tennis player rankings, such as seed players and dynamic ranking, into Monte Carlo simulation. Seed players make the feature selection game more competitive and selective. The strategy of dynamic ranking ensures that the current best players always take part in each competition. The proposed algorithm is tested on 8 biological datasets. Results demonstrate that the proposed method is computationally efficient, stable and has favorable performance in classification.
    No preview · Article · Oct 2013 · Computer methods and programs in biomedicine
  • Yan'e Li · Bin Han · Lihua Li · Lei Zhu

    No preview · Article · Jan 2013
  • Liyan Jin · Bin Han · Lihua Li · Lei Zhu · Shuangxi Fan

    No preview · Article · Jan 2013
  • Xiaodong Guo · Qi Dai · Bin Han · Lei Zhu · Lihua Li
    ABSTRACT: Several algorithms exist for analyzing the similarity of DNA sequences, but the problem remains challenging. This paper presents a novel way to analyze DNA sequences, based on LZ complexity and a dynamic programming algorithm. A DNA sequence can be broken into a word set with the LZ complexity. Motivated by the dynamic programming algorithm, we then analyze the similarity of DNA sequences by measuring shared information among their word sets. Finally, the proposed method was tested by analyzing the similarity of the first exon of the β-globin gene of eleven different species, and its performance was compared with that of the LZ complexity and multiple sequence alignment. The reasonable result verifies the validity of the proposed method.
    No preview · Article · May 2011
  • Li Wu · Qi Dai · Bin Han · Lei Zhu · Lihua Li
    ABSTRACT: The structural class of a protein is important in understanding its folding patterns. Effective and reliable computational methods are needed for prediction of protein structural class. In this paper, a novel method for prediction of protein structural class was proposed, which combined protein sequence information and predicted secondary structural feature, and used support vector machine classifier to classify attributes of protein. Jackknife cross-validation was used to evaluate the performance of the proposed method on three benchmark datasets. Results demonstrate that the proposed method combining the predicted secondary structural feature with sequence information is more efficient than the existing methods, which indicates the necessity to extract more information to improve protein structural class prediction.
    No preview · Article · Jan 2011
  • Ruifei Xie · Bin Han · Lihua Li · Qing Wang · Lei Zhu · Qi Dai
    ABSTRACT: The prediction of chemotherapy response is paramount for personalized ovarian cancer treatment. In this paper, we propose to use Monte Carlo simulation to select gene features for ovarian cancer chemotherapy response prediction with microarray data. Results show that the selected genes not only achieve comparatively higher classification rates, independent of the classifier used, but also have biological significance. Genes such as FCN3, HSD3B2, BRCA1/2, SLC5A5, ERRS, GPR4 and Rnh1 demonstrate a direct relationship with the formation and development of ovarian cancer and are worthy of further biological investigation.
    No preview · Article · Jan 2011
  • Ruifei Xie · Bin Han · Lihua Li · Juan Zhang · Lei Zhu
    ABSTRACT: Extracting significant features from high-dimensional and small sample-size microarray data is a challenging problem. Departing from wrapper and filter methods, we propose a novel feature selection algorithm which integrates the ideas of professional tennis player rankings, such as seed players and dynamic ranking, with Monte Carlo simulation. Seed players make the ‘game’ more competitive and selective, and hence improve the selection efficiency. In addition, the ranks of features are dynamically updated, which ensures that the current best players always take part in each competition. The proposed algorithm is tested on widely used public datasets. Results demonstrate that the proposed method converges comparatively faster, is more stable, and performs well in classification, and is therefore an efficient algorithm for feature selection.
    No preview · Conference Paper · Jan 2011
  • ABSTRACT: Mass spectrometry (MS) data has been widely analyzed for the detection of early stage cancers. Its potential for seeking proteomic biomarkers has received a great deal of attention in recent years. In the sparse representation classification (SRC) framework, a testing sample is represented as a sparse linear combination of training samples. The coefficient vector of representation is obtained by an ℓ1-norm regularized least squares method. Classification results are achieved by defining discriminant functions from the coefficient vector for each category. In this paper, a novel feature selection method based on SRC was proposed. To investigate its performance, the proposed method was tested and evaluated on the ovarian cancer databases OC-WCX2a and OC-WCX2b. The experimental results showed that SRC is efficient for tumor classification. Feature selection based on sparse representation (SRFS) can select highly predictive representative feature sets.
    No preview · Article · Dec 2010
  • Bin Han · Lihua Li · Yan Chen · Lei Zhu · Qi Dai
    ABSTRACT: With advances in microarray technology, many biomarker selection approaches have been proposed for cancer diagnosis. Marker sets are selected by scoring genes for how well they can discriminate between different classes of diseases [1-4] or are ranked by significance analysis without reference to classification tasks. However, there is a pressing need for methods integrating prior biological knowledge in the gene selection process. In this study, we proposed to identify genes primarily in terms of diagnostic outcome relevance. Because gene expression is a combined effect, the microarray data are decomposed with the help of SVD, and the eigenvectors corresponding to the biological effect of clinical outcomes are identified. Genes which play important roles in determining this biological effect are detected. Therefore, genes are essentially identified in terms of the strength of association with clinical outcomes and the relationship of genes and clinical outcomes is analyzed. Monte Carlo simulations are then used to fine-tune the selected gene set in terms of classification accuracy. The approach was tested on four public data sets. Comparative studies show that the selected genes achieved higher classification accuracies. Graphical analysis shows that they have a close relationship with the cancer class. Statistical simulation shows that the gene set found by the proposed method is also less variable and comparatively invariant to external influences. The biological relevance of the selected genes is further discussed and validated with the literature study and analysis of biological databases.
    Preview · Article · Dec 2010 · Journal of Biomedical Informatics
  • ABSTRACT: Protein mass spectrometry has become a popular tool for cancer diagnosis. Feature selection and classification techniques play an important role in the identification of protein biomarkers. In this paper, based on the protein spectrum of cancer classification, an efficient combination of wavelet features and Recursive Null Space LDA algorithm for feature selection is proposed. Firstly, the multi-resolution wavelet decomposition is used to extract the detail features of the protein spectrum data. Then, in order to reduce the dimension of the features, we use T-test for screening the data sets. Thirdly, the Recursive Null Space LDA algorithm is adopted to screen out the most discriminative protein features. Finally, according to the optimal feature set, we use nearest neighbor classifier to estimate the performance. The experimental results on public ovarian cancer data set OC-WCX2a show the promising performance of the proposed algorithm.
    No preview · Article · Jun 2010
  • ABSTRACT: Protein mass spectrometry has become a popular tool for cancer diagnosis. This article describes a novel proteomic pattern analysis algorithm for tumor classification using SELDI-TOF mass spectrometry. Different from the traditional pattern analysis methods, sparse representation accepts a new frame. Firstly, the MS data is preprocessed. Secondly, the proposed method seeks the sparse representation of the test sample on the training sample set. Then 2-fold cross validation is performed to evaluate classification ability. The proposed method was tested and evaluated on the ovarian cancer databases OC-WCX2a and OC-WCX2b and the prostate cancer database PC-H4. The experimental results show the good performance of the sparse representation method.
    No preview · Article · Jan 2010
  • Li Wu · Qi Dai · Bin Han · Lei Zhu · Lihua Li
    ABSTRACT: Knowledge of structural classes is useful in understanding of folding patterns in proteins. Although numerous methods were proposed and achieved promising results in structural class prediction, some problems in using protein-sequence information have impeded the development. In this paper, a combined representation of protein-sequence information is proposed for prediction of protein structural class, which combines word frequencies, word position information and physicochemical properties of amino acids. Then the support vector machine classifier is adopted to classify attributes of protein. To check the validity, we use three benchmark datasets and jackknife cross-validation to evaluate the proposed method. Results show that the proposed combined representation of protein-sequence information is more efficient, which indicates that protein structural class prediction methods need to extract as much information as possible.
    No preview · Article · Jan 2010
  • Shufei Chen · Bin Han · Lihua Li · Lei Zhu · Haifeng Lai · Qi Dai
    ABSTRACT: Ovarian Carcinoma (OvCa) is the most lethal type of gynecological cancer. Studies show that about 90% of patients could be saved if they are treated in the early stage. In this study, a novel biomarker selection approach is proposed which combines singular value decomposition (SVD) and a Monte Carlo strategy for early OvCa detection. Unlike supervised classification methods or differential expression detection based methods, the biomarkers are identified in terms of their relevance to the clinical outcomes and stability. Comparative study and statistical analysis show that the proposed method outperforms SVM-RFE and T-test methods, which are the typical supervised classification and differential expression detection based feature selection methods, in feature set stability and achieves a satisfying classification result (88.9%) as well. The reliability of the identified biomarkers is also biologically validated and supported by other biological research.
    No preview · Conference Paper · Jan 2010
  • ABSTRACT: Many studies have been proposed to identify gene markers that are associated with cancers, but the found markers are approach dependent. For example, the results are correlated with classifiers in supervised feature selection, and many of them didn't consider the influences of other factors, such as the grades or stages of cancers. In this study, we proposed a supervised SVD approach to extract the gene features linked to chemotherapy response in ovarian cancer patients, and applied across-factor normalization to remove the influences of the factors. The chi-square test is used to detect whether the factors affect the distribution of chemotherapy response, and the quantile-quantile plot is used to detect the distribution of chemotherapy response samples. The experimental results show that the influences of the factors are removed effectively, and the classification performance of gene markers selected by the proposed method outperforms that of SVMRFE and T-test in seven classifiers except for the JRip classifier and the NaiveBayes classifier.
    No preview · Conference Paper · Jul 2009
  • ABSTRACT: Ovarian cancer (OvCa) has become one of the most lethal gynecological cancers in the world. The identification of ovarian cancer linked biomarkers will provide the basis of diagnoses and treatment. In this study, we proposed to combine singular value decomposition (SVD) and Monte Carlo method to analyze the OvCa data and predict the outcomes of samples. A supervised SVD was proposed to weight biomarkers according to their relative importance in sample clustering, and the candidate biomarkers were selected. Biomarkers were further selected with Monte Carlo method from candidate biomarkers over different classifiers. With the selected biomarkers, more than 90% classification accuracy was achieved over classifiers. These results are also supported by independent biological studies.
    No preview · Article · Jan 2009
  • Qi Dai · Xiaoqing Liu · Lihua Li · Yuhua Yao · Bin Han · Lei Zhu
    ABSTRACT: One of the major tasks in biological sequence analysis is to compare biological sequences, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Numerous efficient methods have been developed for sequence comparison, but challenges remain. In this article, we proposed a novel method to compare biological sequences based on a Gaussian model. Instead of comparing the frequencies of k-words in biological sequences directly, we considered the k-word frequency distribution under a Gaussian model, which gives the different expression levels of k-words. The proposed method was tested by similarity search, evaluation on functionally related genes, and phylogenetic analysis. The performance of our method was further compared with alignment-based and alignment-free methods. The results demonstrate that the Gaussian model provides more information about k-word frequencies and improves the efficiency of sequence comparison.
    No preview · Article · Jan 2009 · Journal of Computational Chemistry
  • ABSTRACT: Early detection of cancer is crucial for successful treatments. High throughput and high resolution mass spectrometry are increasingly used for disease classification. In this paper a novel cancer classification method called null space based linear discriminant analysis (NS-LDA) is proposed. NS-LDA first extracts the first order derivative information of the mass spectrometry profiles. Based on the null-space strategy, NS-LDA then reduces the dimension of the data and extracts the discriminant features simultaneously. The method was tested and evaluated on the ovarian cancer database OC-WCX2a and the prostate cancer database PC-H4. The experimental results on these two real-life cancer databases show that the NS-LDA method outperforms the PCA and LDA methods in the analysis of mass spectrometry data.
    No preview · Conference Paper · Jan 2009