András Kocsor

University of Szeged, Szeged, Csongrad megye, Hungary

Publications (19) · 39.12 Total impact

  • Article: Detecting atypical examples of known domain types by sequence similarity searching: the SBASE domain library approach.
    ABSTRACT: SBASE is a project initiated to detect known domain types and predict domain architectures using sequence similarity searching (Simon et al., Protein Seq Data Anal, 5: 39-42, 1992, Pongor et al, Nucl. Acids. Res. 21:3111-3115, 1992). The current approach uses a curated collection of domain sequences - the SBASE domain library - and standard similarity search algorithms, followed by postprocessing based on simple statistics of the domain similarity network (http://hydra.icgeb.trieste.it/sbase/). It is especially useful in detecting rare, atypical examples of known domain types which are sometimes missed even by more sophisticated methodologies. This approach does not require multiple alignment or machine learning techniques, and can be a useful complement to other domain detection methodologies. This article gives an overview of the project history as well as of the concepts and principles developed within the project.
    Current Protein and Peptide Science 11/2010; 11(7):538-49. · 2.89 Impact Factor
  • Article: ROC analysis: applications to the classification of biological sequences and 3D structures.
    Paolo Sonego, András Kocsor, Sándor Pongor
    ABSTRACT: ROC ('receiver operator characteristics') analysis is a visual as well as numerical method used for assessing the performance of classification algorithms, such as those used for predicting structures and functions from sequence data. This review summarizes the fundamental concepts of ROC analysis and the interpretation of results using examples of sequence and structure comparison. We overview the available programs and provide evaluation guidelines for genomic/proteomic data, with particular regard to applications to large and heterogeneous databases used in bioinformatics.
    Briefings in Bioinformatics 06/2008; 9(3):198-209. · 5.20 Impact Factor
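As a rough illustration of the numerical side of ROC analysis described in the abstract above, the following minimal sketch computes the area under the ROC curve (AUC) from a ranked list of scores. The function name `roc_auc` and the toy data are assumptions made for this example and are not taken from the review; the pairwise (Mann-Whitney) formulation is one standard way to obtain the AUC.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    The AUC equals the probability that a randomly chosen positive
    receives a higher score than a randomly chosen negative; ties
    count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical similarity scores of six database entries against a query;
# label 1 marks members of the query's class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
```

The quadratic loop is kept for readability; for large databases a rank-based computation would be preferable.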
  • Article: Balanced ROC analysis (BAROC) protocol for the evaluation of protein similarities.
    ABSTRACT: Identification of problematic protein classes (domain types, protein families) that are difficult to predict from sequence is a key issue in genome annotation. ROC (Receiver Operating Characteristic) analysis is routinely used for the evaluation of protein similarities; however, its results - the area under curve (AUC) values - are differentially biased for the various protein classes that are highly different in size. We show the bias can be compensated for by adjusting the length of the top list in a class-dependent fashion, so that the number of negatives within the top list will be equal to (or proportional to) the size of the positive class. Using this balanced protocol the problematic classes can be identified by their AUC values, or by a scatter diagram in which the AUC values are plotted against the positive/negative ratio of the top list. With the use of likelihood-ratio scoring (Kaján et al, Bioinformatics,22, 2865-2869, 2007), the bias caused by class imbalance can be further decreased.
    Journal of Biochemical and Biophysical Methods 05/2008; 70(6):1210-4. · 2.33 Impact Factor
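The idea of cutting the top list in a class-dependent way can be sketched as follows. This is only one possible reading of the abstract above, not the authors' implementation: the ranked list is truncated once the number of negatives reaches `ratio` times the size of the positive class, and the area is normalized as in a truncated ROC measure. The function name `balanced_roc`, the `ratio` parameter and the normalization are assumptions made for this example.

```python
from typing import Sequence


def balanced_roc(scores: Sequence[float], labels: Sequence[int],
                 ratio: float = 1.0) -> float:
    """Truncated AUC with a class-dependent top-list length (assumed form).

    The list is cut after max_neg negatives, where max_neg is ratio times
    the number of positives (capped at the number of negatives available).
    Ties between scores are not treated specially in this sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y == 1)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    max_neg = min(n_neg, max(1, round(ratio * n_pos)))
    positives_seen = 0
    negatives_seen = 0
    area = 0
    for _, y in ranked:
        if y == 1:
            positives_seen += 1
        else:
            negatives_seen += 1
            area += positives_seen  # positives ranked above this negative
            if negatives_seen == max_neg:
                break
    return area / (max_neg * n_pos)


# Hypothetical small positive class (2 members) among 8 negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
balanced = balanced_roc(scores, labels)           # top list cut after 2 negatives
full = balanced_roc(scores, labels, ratio=4.0)    # all 8 negatives, ordinary AUC
```

In this toy case the long tail of easily rejected negatives raises the ordinary AUC, while the balanced value depends only on the top of the list.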
  • Article: Benchmarking protein classification algorithms via supervised cross-validation.
    ABSTRACT: Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. 
    Journal of Biochemical and Biophysical Methods 05/2008; 70(6):1215-23. · 2.33 Impact Factor
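Supervised cross-validation, as described in the abstract above, selects test and train sets according to known subtypes rather than at random. The sketch below illustrates the principle under simplifying assumptions: each item carries a class label and a subtype label, each subtype of the positive class is held out in turn, and negative subtypes are assigned alternately to train and test so that no subtype appears on both sides. The function name `supervised_splits`, the alternating rule for negatives and the toy labels are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from typing import Iterator, List, Tuple

Item = Tuple[str, str, str]   # (identifier, class label, subtype label)
Labeled = Tuple[str, int]     # (identifier, 1 for positive / 0 for negative)


def supervised_splits(items: List[Item], positive_class: str
                      ) -> Iterator[Tuple[str, List[Labeled], List[Labeled]]]:
    """Yield (held_out_subtype, train, test) for each positive subtype."""
    pos_by_subtype = defaultdict(list)
    neg_by_subtype = defaultdict(list)
    for ident, cls, subtype in items:
        if cls == positive_class:
            pos_by_subtype[subtype].append(ident)
        else:
            neg_by_subtype[(cls, subtype)].append(ident)
    if len(pos_by_subtype) < 2:
        raise ValueError("positive class needs at least two subtypes")

    # Assumption: negative subtypes alternate between train and test.
    neg_train: List[str] = []
    neg_test: List[str] = []
    for i, key in enumerate(sorted(neg_by_subtype)):
        (neg_test if i % 2 else neg_train).extend(neg_by_subtype[key])

    for held_out in sorted(pos_by_subtype):
        train = [(x, 1) for s in sorted(pos_by_subtype) if s != held_out
                 for x in pos_by_subtype[s]]
        train += [(x, 0) for x in neg_train]
        test = [(x, 1) for x in pos_by_subtype[held_out]]
        test += [(x, 0) for x in neg_test]
        yield held_out, train, test


# Hypothetical toy database: one superfamily with two families, plus negatives.
items = [
    ("p1", "globin", "famA"), ("p2", "globin", "famA"),
    ("p3", "globin", "famB"),
    ("q1", "kinase", "famC"), ("q2", "kinase", "famD"),
]
splits = list(supervised_splits(items, "globin"))
```

Because the held-out subtype never appears in the training set, each split tests generalization to a subtype the classifier has not seen.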
  • Article: Protein classification based on propagation of unrooted binary trees.
    ABSTRACT: We present two efficient network propagation algorithms that operate on a binary tree, i.e., a sparse-edged substitute of an entire similarity network. TreeProp-N is based on passing increments between nodes while TreeProp-E employs propagation to the edges of the tree. Both algorithms improve protein classification efficiency.
    Protein and Peptide Letters 02/2008; 15(5):428-34. · 1.94 Impact Factor
  • Chapter: Tree-Based Algorithms for Protein Classification
    ABSTRACT: The problem of protein sequence classification is one of the crucial tasks in the interpretation of genomic data. Many high-throughput systems were developed with the aim of categorizing proteins based only on their sequences. However, modelling how the proteins have evolved can also help in the classification of sequenced data. Hence phylogenetic analysis has gained importance in the field of protein classification. This approach does not rely on sequence similarities alone, but also considers the phylogenetic information stored in a tree (e.g. in a phylogenetic tree). Eisen was the first to use phylogenetic trees in protein classification, and his work has revived the discipline of phylogenomics. In this chapter we provide an overview of this area and propose two algorithms that are well suited to this task. Both are based on a weighted binary tree representation of protein similarity data. TreeInsert assigns the class label to the query by determining the minimum cost necessary to insert the query into the (precomputed) trees representing the various classes. TreeNN assigns the label to the query based on an analysis of the query’s neighborhood within a binary tree containing members of the known classes. The algorithms were tested in combination with various sequence similarity scoring methods (BLAST, Smith-Waterman, Local Alignment Kernel as well as various compression-based distance scores) using a large number of classification tasks representing various degrees of difficulty. At the expense of a modest computational overhead, both TreeNN and TreeInsert exceed the performance of simple similarity search (1NN) as determined by ROC analysis. Combined with a fast tree-building method, both algorithms are suitable for web-based server applications.
    01/2008: pages 165-182;
  • Article: Extracting Human Protein Information from MEDLINE Using a Full-Sentence Parser.
    Róbert Busa-Fekete, András Kocsor
    Acta Cybern. 01/2008; 18:391-402.
  • Chapter: Whitening-Based Feature Space Transformations in a Speech Impediment Therapy System
    ABSTRACT: It is quite common to use feature extraction methods prior to classification. Here we deal with three algorithms defining uncorrelated features. The first one is the so-called whitening method, which transforms the data so that the covariance matrix becomes an identity matrix. The second method, the well-known Fast Independent Component Analysis (FastICA), searches for orthogonal directions along which the value of the non-Gaussianity measure is large in the whitened data space. The third one, the Whitening-based Springy Discriminant Analysis (WSDA), is a novel combination of methods, which provides orthogonal directions for better class separation. We compare the effects of the above methods on a real-time vowel classification task. Based on the results we conclude that the WSDA transformation is especially suitable for this task.
    08/2007: pages 222-229;
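The whitening step mentioned in the abstract above, transforming the data so that its covariance matrix becomes the identity, can be illustrated with a small pure-Python sketch. The chapter does not state which whitening transform it uses; the Cholesky-based variant below is a substitute chosen because it needs no external libraries (an eigendecomposition-based transform is another common choice). The function names and the synthetic data are assumptions made for this example.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def sample_cov(rows: Matrix) -> Matrix:
    """Unbiased sample covariance matrix of row-wise observations."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def whiten(rows: Matrix) -> Matrix:
    """Cholesky whitening: z = L^{-1} (x - mean), where cov = L L^T."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = sample_cov(rows)
    # Lower-triangular Cholesky factor (requires a positive definite cov).
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = cov[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    whitened = []
    for r in rows:
        c = [r[j] - mean[j] for j in range(d)]
        z: List[float] = []
        for i in range(d):  # forward substitution solves L z = c
            z.append((c[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
        whitened.append(z)
    return whitened


# Synthetic, correlated two-dimensional features (hypothetical data).
random.seed(0)
data = []
for _ in range(200):
    a, b = random.gauss(0, 1), random.gauss(0, 1)
    data.append([2 * a + 1, a + 0.5 * b])

cov_after = sample_cov(whiten(data))
```

After the transformation, `cov_after` is the identity matrix up to floating-point error, which is the property the abstract attributes to whitening.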
  • Article: A Protein Classification Benchmark collection for machine learning.
    ABSTRACT: Protein classification by machine learning algorithms is now widely used in structural and functional annotation of proteins. The Protein Classification Benchmark collection (http://hydra.icgeb.trieste.it/benchmark) was created in order to provide standard datasets on which the performance of machine learning methods can be compared. It is primarily meant for method developers and users interested in comparing methods under standardized conditions. The collection contains datasets of sequences and structures, and each set is subdivided into positive/negative, training/test sets in several ways. There is a total of 6405 classification tasks, 3297 on protein sequences, 3095 on protein structures and 10 on protein coding regions in DNA. Typical tasks include the classification of structural domains in the SCOP and CATH databases based on their sequences or structures, as well as various functional and taxonomic classification problems. In the case of hierarchical classification schemes, the classification tasks can be defined at various levels of the hierarchy (such as classes, folds, superfamilies, etc.). For each dataset there are distance matrices available that contain all vs. all comparison of the data, based on various sequence or structure comparison methods, as well as a set of classification performance measures computed with various classifier algorithms.
    Nucleic Acids Research 02/2007; 35(Database issue):D232-6. · 8.03 Impact Factor
  • Article: Kalman filtering for disease-state estimation from microarray data.
    ABSTRACT: MOTIVATION: In this paper, we propose using the Kalman filter (KF) as a pre-processing step in microarray-based molecular diagnosis. Incorporating the expression covariance between genes is important in such classification problems, since this represents the functional relationships that govern tissue state. Failing to fulfil such requirements may result in biologically implausible class prediction models. Here, we show that employing the KF to remove noise (while retaining meaningful covariance and thus being able to estimate the underlying biological state from microarray measurements) yields linearly separable data suitable for most classification algorithms. RESULTS: We demonstrate the utility and performance of the KF as a robust disease-state estimator on publicly available binary and multi-class microarray datasets in combination with the most widely used classification methods to date. Moreover, using popular graphical representation schemes we show that our filtered datasets also have an improved visualization capability.
    Bioinformatics 01/2007; 22(24):3047-53. · 5.47 Impact Factor
  • Article: Application of a simple likelihood ratio approximant to protein sequence classification.
    ABSTRACT: MOTIVATION: Likelihood ratio approximants (LRA) have been widely used for model comparison in statistics. The present study was undertaken in order to explore their utility as a scoring (ranking) function in the classification of protein sequences. RESULTS: We used a simple LRA based on the maximal similarity (or minimal distance) scores of the two top-ranking sequence classes. The scoring methods (Smith-Waterman, BLAST, local alignment kernel and compression-based distances) were compared on datasets designed to test sequence similarities between proteins distantly related in terms of structure or evolution. It was found that LRA-based scoring can significantly outperform simple scoring methods.
    Bioinformatics 01/2007; 22(23):2865-9. · 5.47 Impact Factor
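The abstract above describes the approximant only in words: it is built from the maximal similarity scores of the two top-ranking classes. The sketch below shows one plausible form, the ratio of the best class's maximal similarity to that of the runner-up. The exact formula is not given in the abstract, so the ratio, the function name `lra_score` and the toy scores are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def lra_score(similarities: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return (best_class, ratio) for one query sequence.

    similarities maps each class name to the similarity scores between
    the query and the known members of that class. The ratio of the top
    two per-class maxima is an assumed form of the approximant.
    """
    per_class = {c: max(v) for c, v in similarities.items() if v}
    if len(per_class) < 2:
        raise ValueError("need scores for at least two classes")
    ranked = sorted(per_class.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    if second <= 0:
        raise ValueError("similarity scores must be positive")
    return best, top / second


# Hypothetical alignment scores of one query against three classes.
query_scores = {
    "globin": [120.0, 80.0],
    "kinase": [40.0, 35.0],
    "protease": [30.0],
}
best_class, ratio = lra_score(query_scores)
```

A ratio close to 1 signals that the two best classes are nearly indistinguishable for this query, whereas a large ratio indicates a confident assignment.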
  • Conference Proceeding: Counter-Example Generation-Based One-Class Classification.
    Machine Learning: ECML 2007, 18th European Conference on Machine Learning, Warsaw, Poland, September 17-21, 2007, Proceedings; 01/2007
  • Conference Proceeding: An Automatic Retraining Method for Speaker Independent Hidden Markov Models.
    Text, Speech and Dialogue, 10th International Conference, TSD 2007, Pilsen, Czech Republic, September 3-7, 2007, Proceedings; 01/2007
  • Conference Proceeding: Whitening-Based Feature Space Transformations in a Speech Impediment Therapy System.
    Text, Speech and Dialogue, 10th International Conference, TSD 2007, Pilsen, Czech Republic, September 3-7, 2007, Proceedings; 01/2007
  • Conference Proceeding: A Multi-Stack Based Phylogenetic Tree Building Method.
    Bioinformatics Research and Applications, Third International Symposium, ISBRA 2007, Atlanta, GA, USA, May 7-10, 2007, Proceedings; 01/2007
  • Conference Proceeding: Equivalence Learning in Protein Classification.
    Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, 2007, Proceedings; 01/2007
  • Article: Application of compression-based distance measures to protein sequence classification: a methodological study.
    ABSTRACT: MOTIVATION: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences. RESULTS: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith-Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaean, bacterial and eukaryotic species, as well as low- and high-complexity sequence segments of the human proteome. CBM values show a dependence on the length and the complexity of the sequences compared. In classification tasks CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith-Waterman algorithm and two hidden Markov model-based algorithms.
    Bioinformatics 03/2006; 22(4):407-12. · 5.47 Impact Factor
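To make the notion of a compression-based distance concrete, the sketch below computes a normalized compression distance between two sequences. It is an illustration under stated assumptions: the abstract above does not give the formula used in the study, so the widely used normalized form is shown, and Python's `zlib` (a DEFLATE compressor) stands in for the Lempel-Ziv and PPMZ compressors named in the abstract. The random sequences are synthetic.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Values near 0 suggest that one
    sequence is well predicted by the other.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic sequences over the 20-letter amino acid alphabet.
random.seed(1)
alphabet = "ACDEFGHIKLMNPQRSTVWY"
seq_a = "".join(random.choice(alphabet) for _ in range(300))
seq_b = "".join(random.choice(alphabet) for _ in range(300))

d_same = ncd(seq_a, seq_a)    # sequence compared with itself
d_other = ncd(seq_a, seq_b)   # two unrelated random sequences
```

As the abstract notes for CBMs in general, such values depend on sequence length and complexity, so very short or low-complexity sequences may need separate treatment.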
  • Article: Benchmarking protein classification algorithms via supervised cross-validation
    ABSTRACT: Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. 
    The datasets are available at http://hydra.icgeb.trieste.it/benchmark.
    Journal of Biochemical and Biophysical Methods.
  • Article: Locally Linear Embedding and its Variants for Feature Extraction
    Róbert Busa-Fekete, András Kocsor
    ABSTRACT: Many problems in machine learning are hard to manage without applying some pre-processing or feature extraction method. Two popular forms of dimensionality reduction are the methods of principal component analysis (PCA) [2] and multidimensional scaling (MDS) [18]. In this paper we examine Locally Linear Embedding (LLE), which is an unsupervised, non-linear dimension reduction method that was originally proposed for visualisation. We will show that LLE is capable of feature extraction if we choose the right parameter values. In addition, we extend the original algorithm for more efficient classification. Afterwards we apply the methods to several databases that are available at the UCI repository, and then show that there is a significant improvement in classification performance.