A new classification method using array Comparative Genome Hybridization data, based on the concept of Limited Jumping Emerging Patterns

Faculty of Electronics and Information Technology of Warsaw University of Technology, Institute of Computer Science, Nowowiejska 15/19, Warsaw, 00-665, Poland.
BMC Bioinformatics (Impact Factor: 2.58). 02/2009; 10 Suppl 1(Suppl 1):S64. DOI: 10.1186/1471-2105-10-S1-S64
Source: PubMed


Classification using aCGH data is an important and insufficiently investigated problem in bioinformatics. In this paper we propose a new method for classifying DNA copy number data, based on the concept of limited Jumping Emerging Patterns (JEPs). We compare our limJEPClassifier with SVM, which is considered the most successful classifier for high-throughput data.
Our results show that the classification performance of limJEPClassifier is significantly higher than that of the other methods. Furthermore, we show that applying limited JEPs can significantly improve classification when strongly unbalanced data are given.
aCGH has become a very important tool in research on cancer or genomic disorders. Improving the classification of aCGH data can therefore have a great impact on many medical issues, such as the process of diagnosis and the finding of disease-related genes. The performed experiment shows that Jumping Emerging Patterns can be effective in the classification of high-dimensional data, including those from aCGH experiments.
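
For readers unfamiliar with the underlying concept, the following is a minimal sketch of an ordinary Jumping Emerging Pattern: an itemset that occurs in at least one sample of a target class and in no sample of the other class. It does not reproduce the limited-JEP variant or the limJEPClassifier described in the paper. The function names, the brute-force search, the `max_len` cap, and the toy "probe=state" encoding of discretized aCGH values are all assumptions made for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def is_jep(itemset, target, background):
    """True if `itemset` occurs in `target` but never in `background`."""
    return support(itemset, target) > 0 and support(itemset, background) == 0


def minimal_jeps(target, background, max_len=2):
    """Brute-force search for minimal JEPs of at most `max_len` items.

    Hypothetical helper: exponential in the number of items, so it is
    only suitable for toy data, not for real aCGH arrays.
    """
    items = sorted(set().union(*target))
    found = []
    for size in range(1, max_len + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            # A superset of an already found JEP is not minimal; skip it.
            if any(jep <= candidate for jep in found):
                continue
            if is_jep(candidate, target, background):
                found.append(candidate)
    return found


# Toy data (invented): each sample is a set of "probe=state" items.
tumor = [
    frozenset({"p1=gain", "p2=loss"}),
    frozenset({"p1=gain", "p3=normal"}),
]
normal = [
    frozenset({"p1=normal", "p2=loss"}),
    frozenset({"p2=loss", "p3=normal"}),
]

jeps = minimal_jeps(tumor, normal)
print(jeps)
```

In this toy example the single item `p1=gain` appears in every tumor sample and in no normal sample, so it is the only minimal JEP found for the tumor class.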



Available from: Krzysztof Walczak, Dec 21, 2013
  • Source
    • "An extra advantage of the feature selection process is that the majority of the irrelevant features are discarded and the few remaining can be indicators of possible biomarkers related to the observed disease. Feature selection has already been shown to significantly benefit the classification accuracy of aCGH data [9], [12], [22]. Thus, it is important to design effective feature selection method for identifying the DNA copy number biomarkers. "
    ABSTRACT: Array comparative genomic hybridization (aCGH) is a newly introduced method for the detection of copy number abnormalities associated with human diseases, with a special focus on cancer. Specific patterns in DNA copy number variations (CNVs) can be associated with certain disease types and can facilitate prognosis and progress monitoring of the disease. Machine learning techniques have been used to model the problem of tissue typing as a classification problem. Feature selection is an important part of the classification process, because many biological features are not related to the diseases and confuse the classification tasks. Multiple feature selection methods have been proposed in the different domains where classification has been applied. In this work, we present a new feature selection method based on structured sparsity-inducing norms to identify the informative aCGH biomarkers that can help classify different disease subtypes. To validate the performance of the proposed method, we experimentally compare it with existing feature selection methods on four publicly available aCGH data sets. In all empirical results, the proposed sparse-learning-based feature selection method consistently outperforms other related approaches. More importantly, we carefully investigate the aCGH biomarkers selected by our method, and the biological evidence in the literature strongly supports our results.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 01/2014; 11(1):168-181. DOI:10.1109/TCBB.2013.141 · 1.44 Impact Factor
  • Source
    ABSTRACT: In this paper, we propose a new classification algorithm based on the Jumping Emerging Patterns (JEPs) that have the highest impact on classification accuracy. The core idea of our method is the application of a new "REAL/ALL" coefficient, which is used to compare the discriminating power of various groups of JEPs. The efficacy of the proposed approach was confirmed by tests performed on both synthetic and real data sets. The results show that our method may significantly improve classification performance in comparison to other classifiers based on JEPs.
    Proceedings of the International Multiconference on Computer Science and Information Technology, IMCSIT 2009, Mragowo, Poland, 12-14 October 2009; 01/2009
  • ABSTRACT: Computational methods for predicting protein subcellular localization have used various types of features, including N-terminal sorting signals, amino acid compositions, and text annotations from protein databases. Our approach does not use biological knowledge such as the sorting signals or homologues, but uses only protein sequence information. The method divides a protein sequence into short $k$-mer sequence fragments, which can be mapped to word features in document classification. A large number of class association rules are mined from the protein sequence examples that range from the N-terminus to the C-terminus. Then, a boosting algorithm is applied to those rules to build up a final classifier. Experimental results using benchmark datasets show that our method is excellent in terms of both the classification performance and the test coverage. The result also implies that the $k$-mer sequence features which determine subcellular locations do not necessarily exist in specific positions of a protein sequence. An online prediction service implementing our method is available at
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 09/2011; 9(2). DOI:10.1109/TCBB.2011.131 · 1.44 Impact Factor
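
The $k$-mer decomposition mentioned in the last abstract above can be illustrated with a short sketch. It shows only how a sequence might be split into overlapping fragments and counted like words in a document; the rule mining and boosting steps are not covered. The function name `kmer_counts` and the example sequence are invented for illustration and are not taken from the cited work.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers in `sequence` (hypothetical helper)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Slide a window of width k over the sequence, one position at a time.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Invented amino acid string, used purely as toy input.
counts = kmer_counts("MKVLAAGMKV", 3)
print(counts.most_common(3))
```

Each distinct fragment then plays the role of a word feature, so a sequence shorter than `k` simply yields no features.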