Classification and Error Estimation for Discrete Data

Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77845, USA.
Current Genomics (Impact Factor: 2.34). 11/2009; 10(7):446-62. DOI: 10.2174/138920209789208228
Source: PubMed


Discrete classification is common in Genomic Signal Processing applications, in particular in the classification of discretized gene expression data, in discrete gene expression prediction, and in the inference of Boolean genomic regulatory networks. Once a discrete classifier is obtained from sample data, its performance must be evaluated through its classification error. In practice, error estimation methods must be employed to obtain reliable estimates of the classification error from the available data. In Genomics, both classifier design and error estimation are complicated by the prevalence of small-sample data sets. This paper presents a broad review of the methodology of classification and error estimation for discrete data in the context of Genomics, focusing on performance in small-sample scenarios as well as on asymptotic behavior.
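As a rough illustration of what error estimation for a discrete classifier involves, the sketch below assumes the discrete histogram rule (majority vote within each bin) and compares two common estimators, resubstitution and leave-one-out. These choices, the function names, the tie-breaking toward label 0, and the toy data are all assumptions made for illustration; the abstract above does not prescribe them.

```python
from collections import Counter, defaultdict


def fit_histogram_rule(X, y):
    """Discrete histogram rule: assign each observed bin its majority label.

    Ties are broken toward label 0 (an assumption of this sketch).
    """
    counts = defaultdict(Counter)
    for x, label in zip(X, y):
        counts[x][label] += 1
    return {b: int(c[1] > c[0]) for b, c in counts.items()}


def predict(rule, x):
    # Bins never seen in training default to label 0 (also an assumption).
    return rule.get(x, 0)


def resubstitution_error(X, y):
    """Error rate of the designed classifier on its own training data."""
    rule = fit_histogram_rule(X, y)
    return sum(predict(rule, x) != label for x, label in zip(X, y)) / len(y)


def leave_one_out_error(X, y):
    """Hold out each point in turn, redesign the classifier, test on the held-out point."""
    errors = 0
    for i in range(len(y)):
        rule = fit_histogram_rule(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        errors += predict(rule, X[i]) != y[i]
    return errors / len(y)


# Hypothetical toy sample: X holds bin indices, y holds binary class labels.
X = [0, 0, 0, 1, 1, 2, 2, 3]
y = [0, 0, 1, 1, 1, 0, 1, 1]

print(resubstitution_error(X, y))  # 0.25
print(leave_one_out_error(X, y))   # 0.5
```

On this eight-point sample the resubstitution estimate is lower than the leave-one-out estimate, which is consistent with the optimistic bias of resubstitution in small samples; a single toy sample says nothing about the variance of either estimator.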



Available from: Ulisses M Braga-Neto
  • Source
    • "This method provides a measure of the network's learning ability; yet it is not preferable for performance evaluation tasks as it is known to be optimistically biased. However, as shown in [52] [53], in discrete classification problems with large-sample categorical datasets, like the classification problem of this study, resubstitution can be significantly accurate relative to more complex error estimation schemes, since the optimistic bias and the variance of the method tend to vanish as the sample size increases, provided that classifier complexity is not too high. For this reason we decided to take into consideration the performance of the final models when the training and the validation sets are used for testing."
    ABSTRACT: Nowadays, there are molecular biology techniques providing information related to cervical cancer and its cause: the human papillomavirus (HPV), including DNA microarrays identifying HPV subtypes, mRNA techniques such as nucleic acid based amplification or flow cytometry identifying E6/E7 oncogenes, and immunocytochemistry techniques such as overexpression of p16. Each one of these techniques has its own performance, limitations and advantages, thus a combinatorial approach via computational intelligence methods could exploit the benefits of each method and produce more accurate results. In this article we propose a clinical decision support system (CDSS), composed of artificial neural networks, intelligently combining the results of classic and ancillary techniques for diagnostic accuracy improvement. We evaluated this method on 740 cases with complete series of cytological assessment, molecular tests, and colposcopy examination. The CDSS demonstrated high sensitivity (89.4%), high specificity (97.1%), high positive predictive value (89.4%), and high negative predictive value (97.1%), for detecting cervical intraepithelial neoplasia grade 2 or worse (CIN2+). In comparison to the tests involved in this study and their combinations, the CDSS produced the most balanced results in terms of sensitivity, specificity, PPV, and NPV. The proposed system may reduce the referral rate for colposcopy and guide personalised management and therapeutic interventions.
    Full-text · Article · Apr 2014
  • ABSTRACT: The coefficient of determination (CoD) has significant applications in genomics, for example, in the inference of gene regulatory networks. We study several CoD estimators, based upon the resubstitution, leave-one-out, cross-validation, and bootstrap error estimators. We present an exact formulation of performance metrics for the resubstitution and leave-one-out CoD estimators, assuming the discrete histogram rule. Numerical experiments are carried out using a parametric Zipf model, where we compute exact performance metrics of resubstitution and leave-one-out CoD estimators using the previously derived equations, for varying actual CoD, sample size, and bin size. These results are compared to approximate performance metrics of 10-repeated 2-fold cross-validation and 0.632 bootstrap CoD estimators, computed via Monte Carlo sampling. The numerical results lead to a perhaps surprising conclusion: for the Zipf model considered, and for moderate and large values of the actual CoD, the resubstitution CoD estimator is the least biased and least variable among all CoD estimators, especially with a small number of predictors. We also observed that the leave-one-out and cross-validation CoD estimators tend to perform the worst, whereas the performance of the bootstrap CoD estimator is intermediate, despite its high computational complexity.
    No preview · Article · Jan 2010 · Journal on Advances in Signal Processing
  • ABSTRACT: The binary Coefficient of Determination (CoD) is a key component of inference methods in Genomic Signal Processing. Assuming a stochastic logic model, we introduce a new sample CoD estimator based upon maximum likelihood (ML) estimation. Experiments have been conducted to assess how the ML CoD estimator performs in recovering predictors in multivariate prediction settings. Performance is compared with the traditional nonparametric CoD estimators based on resubstitution, leave-one-out, bootstrap and cross-validation. The results show that the ML CoD estimator is the estimator of choice if prior knowledge is available about the logic relationships in the model, even if this knowledge is incomplete.
    No preview · Article · Nov 2011 · Circuits, Systems and Computers, 1977. Conference Record. 1977 11th Asilomar Conference on