Controlling Informative Features for Improved Accuracy and Faster Predictions in Omentum Cancer Models

Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly. Here, a novel feature detection and engineering machine-learning framework is presented to address this need. First, the Rip Curl process is applied, which generates a set of 10 additional features. Second, we rank all features, including the Rip Curl features; the top-ranked features will most likely contain the most informative features for prediction of the underlying biological classes. The top-ranked features are used in model building. This process creates more expressive features, which are captured in models with an eye towards the model learning from increasing sample amounts and the accuracy/time results. The performance of the proposed Rip Curl classification framework was tested on omentum cancer data. Rip Curl outperformed other more sophisticated classification methods in terms of prediction accuracy, minimum number of classification markers, and computational time.
Research Article
Volume 1 Issue 2 - January 2017
Curr Trends Biomedical Eng & Biosci
Copyright © All rights are reserved by Damian R Mingle
Damian R Mingle*
WPC Healthcare, Nashville, USA
Submission: December 09, 2016; Published: January 03, 2017
*Corresponding author: Damian Mingle, Chief Data Scientist, WPC Healthcare, 1802 Williamson Ct, Brentwood, TN 37027, USA
In recent years, the dawn of technologies like microarrays,
proteomics, and next-generation sequencing has transformed
life science. The data from these experimental approaches
deliver a comprehensive depiction of the complexity of biological
systems at different levels. A challenge within the “-omics” data
is identifying the information
relevant to a particular question, such as biomarkers that can
accurately classify phenotypic outcomes [1]. This is certainly
true in the fold of peritoneum connecting the stomach with
other abdominal organs known as the omentum. Numerous
machine learning techniques and methods have been proposed
to identify biomarkers that accurately classify these outcomes
by learning the elusive pattern latent in the data. To date, there
have been three categories that assist in biomarker selection:
A. Filters
B. Wrappers
C. Embedding
In practice, time-to-prediction and accuracy of prediction
matter a great deal.
Filtering methods are generally chosen to minimize
time-to-prediction and can be used to
decide which are the most informative features in relation
to the biological target [2]. Filtering computes each marker's degree of
correlation with a given phenotype and then ranks the markers
in a given dataset. Many researchers acknowledge the weakness
of such methods and take careful note to observe the selection
    
allow for interactions between biomarkers. An example of a
model, wrapper methods iteratively perform combinatorial
biomarker search. Since this combinatorial optimization process
is computationally complex (an NP-hard problem), many heuristics
have been proposed, for example, to reduce the search space
and thus reduce the computational burden of the biomarker
selection [4].
With the exception of performing feature selection and
wrapper methods. Recursive feature elimination support vector
machine (SVM-RFE) is a widely used technique for analysis of
microarray data [5,6]. The SVM-RFE procedure constructs a
Curr Trends Biomedical Eng & Biosci 1(2): CTBEB.MS.ID.555559 (2017) 001
Current Trends in Biomedical
Engineering & Biosciences
 
Keywords: Omentum, Cancer, Data science, Machine learning, Biomarkers, Phenotype, Personalized medicine
How to cite this article: Damian R M. Controlling Informative Features for Improved Accuracy and Faster Predictions in Omentum Cancer Models. Curr
Trends Biomedical Eng & Biosci. 2017; 1(2): 555559.
       
informative features being pruned from the model. This process
continues iteratively until a model has learned the minimum
number of features that are useful. In the case of “-omics” data,
this process becomes impractical when considering a large
feature space.
Our research used a hybrid approach between user and
machine that dramatically reduced the computational time
required by similar approaches while increasing prediction
accuracy compared with other state-of-the-art machine
learning techniques. Our proposed framework includes:
A. Ranking and pruning attributes using information gain
to extract the most informative features, thus greatly
reducing the number of data dimensions;
B. Having a user view histograms of attributes where the
information gain is 0.80 or higher and create new binary
features from continuous values;
C. Re-ranking both the original features and the newly
constructed features; and
D. Using the number of instances to determine how many
top-n informative features should be used in modeling.
The Rip Curl framework can be used to construct a high-
      
dependencies among the attributes for analysis of complex
biological -omics datasets containing dependencies of
features. The performance of the proposed four-step
microarray. The proposed framework was compared with
SVM-RFE in terms of area under the ROC curve (AUC) and the
Results and Discussion
Using the omentum dataset we conducted the Rip Curl
process of setting the target feature (in this case it was one-
versus-all), characterized the target variable, loaded and
prepared the omentum data, saved the target and partitioning
information, analyzed the omentum features, created cross-
validation and hold-out partitions, and conducted exploratory
data analysis.
Table 1: Different types of descriptive features.
Type | Description
Predictive | A predictive descriptive feature provides information that is useful in estimating the correct value of a target feature.
Interacting | By itself, an interacting descriptive feature is not informative about the value of the target feature. In conjunction with one or more other features, however, it becomes informative.
Redundant | A descriptive feature is redundant if it has a strong correlation with another descriptive feature.
Irrelevant | An irrelevant descriptive feature does not provide information that is useful in estimating the value of the target feature.
In an effort to increase performance and accuracy we opted
for an approach of feature selection to help reduce the number
of descriptive features in the omentum dataset to just the subset
that is most useful for prediction. Before we begin our discussion
of approaches to feature selection, it is useful to distinguish
between different types of descriptive features (Table 1):
The goal of any feature selection approach is to identify the
smallest subset of descriptive features that maintains overall
model performance. Ideally a feature selection approach will
return the subset of features that includes the predictive
and interacting features while excluding the irrelevant and
redundant features.
         
ideal subset of descriptive features used to train an omentum
model. Consider $d$ features. There are $2^d$ different possible feature
subsets, which is far too many to evaluate unless $d$ is very small.
For example, with the descriptive features represented in the
omentum dataset, there are $2^{10,960}$ possible feature subsets,
which is a 3,300-digit integer.
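The size of this search space can be checked with a few lines of arithmetic. The sketch below assumes d = 10,960 descriptive features, which is the exponent the text appears to give; the function name is illustrative only.

```python
def count_feature_subsets(d: int) -> int:
    """Number of distinct subsets of d features (including the empty subset)."""
    return 2 ** d


# Assumed feature count, read from the exponent quoted in the text.
d = 10_960
n_subsets = count_feature_subsets(d)

# Count the decimal digits of the result rather than printing the full integer.
n_digits = len(str(n_subsets))
print(f"2^{d} has {n_digits} decimal digits")
```

Even a tiny case shows how quickly this grows: 20 features already yield more than a million candidate subsets, which is why exhaustive search is not practical here.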
Material and Methods
The dataset used in the experiments was provided by the
Gene Expression Machine Learning Repository (GEMLeR) [7].
GEMLeR contains microarray data from 9 different tissue types
including colon, breast, endometrium, kidney, lung, omentum,
ovary, prostate, and uterus, with each sample classified
as tumor or normal. The data from this repository were collated
into 9 one-tissue-type versus all-other-types (OVA) datasets
where the second class is labeled as “other.” All GEMLeR
microarray datasets have been analyzed by SVM-RFE, the results
of which are available from the same resource.
Figure 1: Represent protein and LDH level in different types of
Figure 1 demonstrates the Rip Curl framework and its
dependencies on the prior stage.
In applying the Rip Curl framework, we initially ran the
omentum microarray data through to gain informative feature
feedback and then rank those features from most to least
     
the dataset and then applied 1% (1,545 instances × 0.01 ≈ 15
features) to discover how many top informative features we
would make use of in our framework. Where the features were
both in the top 1% and expressed informativeness at or above
80%, we created unique features that followed some meaningful
thresholds which grouped biomarker data into bins of “0” or “1”.
Finally, we sent the enhanced data back through the informative
feature test to reduce the feature space to the top 1% and then
removed all other features, modeling this subset using Random
Forest (Entropy).
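The instance-driven cutoff described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the `fraction` parameter, and the made-up scores are assumptions, and the informativeness scores are taken as already computed. With 1,545 instances, a 1% rule keeps 15 features, which matches the feature count reported for the Rip Curl model in Table 2.

```python
def top_n_by_instances(scores: dict[str, float], n_instances: int,
                       fraction: float = 0.01) -> list[str]:
    """Keep the top-n features, where n is a fraction of the instance count."""
    n_keep = max(1, int(n_instances * fraction))
    # Rank feature names from most to least informative.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


# Hypothetical informativeness scores for 40 made-up features.
scores = {f"gene_{i}": round(1 - i / 40, 3) for i in range(40)}

selected = top_n_by_instances(scores, n_instances=1545)
print(len(selected), selected[:3])
```

Tying n to the number of instances rather than to the number of features keeps the model size proportional to the evidence available, which is the stated intent of step D of the framework.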
Our general approach to model selection was to run several
model types and to select the best performing based on the
highest AUC from the cross-validation results. Once those models
        
models based on sample sizes of 16%, 32%, and 64% of the data,
which are reported in Figure 2. If a model's score
did not increase with each larger sample size, the model was discarded as a non-learning
model (Figure 2).
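The learning check can be expressed as a simple monotonicity test. The sketch below assumes the score being tracked is AUC and that "increase" means a strict increase from each sample size to the next; the function name and the example values are hypothetical.

```python
def is_learning_model(auc_by_sample_fraction: dict[float, float]) -> bool:
    """Return True if the score strictly increases as the sample fraction grows."""
    fractions = sorted(auc_by_sample_fraction)
    aucs = [auc_by_sample_fraction[f] for f in fractions]
    return all(later > earlier for earlier, later in zip(aucs, aucs[1:]))


# Made-up scores at the three sample sizes named in the text.
improving = {0.16: 0.91, 0.32: 0.93, 0.64: 0.95}
stalling = {0.16: 0.93, 0.32: 0.92, 0.64: 0.95}

print(is_learning_model(improving))
print(is_learning_model(stalling))
```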
Figure 2: A demonstration of a proper learning model.
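We chose area under the ROC curve for its ease of
interpretation and calculated it as follows:
$$\text{ROC index} = \sum_{i=2}^{|T|} \frac{\left(FPR(T[i]) - FPR(T[i-1])\right) \times \left(TPR(T[i]) + TPR(T[i-1])\right)}{2} \quad (1)$$
where T is a set of thresholds, |T| is the number of thresholds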
tested, and TPR(T[i] ) and FPR(T[i] ) are the true positive and
false positive rates at threshold i respectively. The ROC index is
quite robust in the presence of imbalanced data, which makes
it a common choice for practitioners, especially when multiple
modeling techniques are being compared to one another.
In addition, we ran a second evaluation measure, Gini Norm,
which is calculated as follows:
$$\text{Gini Norm} = (2 \times \text{ROC index}) - 1 \quad (2)$$
Higher values indicate better model performance.
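Both measures are easy to compute once true positive and false positive rates are available at each threshold. The sketch below assumes the ROC index is approximated with the trapezoidal rule over the threshold points, which is consistent with the definitions of T, TPR, and FPR given in the text; the function names and the example rates are illustrative.

```python
def roc_index(fpr: list[float], tpr: list[float]) -> float:
    """Trapezoidal area under the ROC curve from per-threshold FPR/TPR pairs."""
    points = sorted(zip(fpr, tpr))
    area = 0.0
    for (f_prev, t_prev), (f_curr, t_curr) in zip(points, points[1:]):
        # Width of the strip times the average height of its two sides.
        area += (f_curr - f_prev) * (t_curr + t_prev) / 2
    return area


def gini_norm(auc: float) -> float:
    """Gini Norm as defined in the text: (2 x ROC index) - 1."""
    return 2 * auc - 1


# Made-up rates at four thresholds.
fpr = [0.0, 0.0, 0.5, 1.0]
tpr = [0.0, 0.5, 1.0, 1.0]

auc = roc_index(fpr, tpr)
print(auc, gini_norm(auc))
```

As a consistency check against the reported results, applying `gini_norm` to the first AUC value in Table 2 (0.9520) gives 0.9040, the corresponding Gini Norm in the same row.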
In our experiment with the omentum microarray data, we
wanted to pay particular attention to reducing complexity and
thereby improving time-to-prediction. This is especially true
with an M × N dimensional dataset, where M is the number of
samples and N is the number of features, and
where N is orders of magnitude greater than M, as is the case in
our experiment.
We selected Random Forest to represent the general
technique of random decision forests, an ensemble learning
method. Random
Forest operates by constructing a multitude of decision trees
at training time and outputting the class that is the mode of
the classes
of the individual trees. Random decision forests correct for
decision trees' tendency to overfit their training set. Figure
3 demonstrates visually the increase in predictive accuracy as it
relates to complexity of that prediction.
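The "mode of the individual trees" step is simply a majority vote. The sketch below shows only that aggregation step, with made-up per-tree predictions; it does not train any trees, and the function name is illustrative.

```python
from collections import Counter


def forest_vote(tree_predictions: list[str]) -> str:
    """Return the most common class among the individual tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for one sample.
votes = ["omentum", "other", "omentum", "omentum", "other"]
print(forest_vote(votes))
```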
Figure 3: Comparison of Rip Curl vs other methods.
Table 2: Comparative analysis of model performance.
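Data Model | Features | AUC (Cross-Validation) | AUC | AUC (Hold-Out) | Gini Norm (Cross-Validation) | Gini Norm | Gini Norm (Hold-Out) | Time
 | 10,935 | 0.9520 | 0.9427 | 0.9269 | 0.9040 | 0.8855 | 0.8537 | 10,859.25
 | 8,165 | 0.9592 | 0.9492 | 0.9232 | 0.9184 | 0.8984 | 0.8463 | 8,233.39
 | 8,283 | 0.9520 | 0.9427 | 0.9269 | 0.9040 | 0.8855 | 0.8537 | 10,732.36
Rip Curl (Top 1% of Features) | 15 | 0.9379 | 0.9201 | 0.9344 | 0.8757 | 0.8401 | 0.8689 | 3,374.21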
Table 2 emphasizes the disparity of different results in
prediction and time by holding constant the model type, Random
Forest (Entropy).
We observed that Rip Curl (Top 1% of Features) made use
of the best parameters, which we found to be max_depth: None,
max_features: 0.2, max_leaf_nodes: 50, min_samples_leaf: 5, and
min_samples_split: 10. Rip Curl improved time-to-prediction
by a range of 59.02% to 68.93%, increased the hold-out AUC
by 0.81% to 1.21%, and increased the hold-out Gini Norm by a
range of 1.78% to 2.67%.
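These ranges can be reproduced from the values in Table 2. The sketch below assumes that the third AUC and third Gini Norm value in each row are the hold-out figures and that the last column is time; the row labels of the three comparison models did not survive in the table, so they are left unnamed here.

```python
# Hold-out AUC, hold-out Gini Norm, and time copied from Table 2.
# The three comparison rows are unlabeled in the extracted table.
comparison_rows = [
    {"auc": 0.9269, "gini": 0.8537, "time": 10859.25},
    {"auc": 0.9232, "gini": 0.8463, "time": 8233.39},
    {"auc": 0.9269, "gini": 0.8537, "time": 10732.36},
]
rip_curl = {"auc": 0.9344, "gini": 0.8689, "time": 3374.21}


def pct_gain(new: float, old: float) -> float:
    """Relative increase of new over old, in percent."""
    return (new - old) / old * 100


def pct_reduction(new: float, old: float) -> float:
    """Relative decrease of new below old, in percent."""
    return (old - new) / old * 100


auc_gains = [pct_gain(rip_curl["auc"], row["auc"]) for row in comparison_rows]
gini_gains = [pct_gain(rip_curl["gini"], row["gini"]) for row in comparison_rows]
time_cuts = [pct_reduction(rip_curl["time"], row["time"]) for row in comparison_rows]

print(f"hold-out AUC gain: {min(auc_gains):.2f}% to {max(auc_gains):.2f}%")
print(f"hold-out Gini Norm gain: {min(gini_gains):.2f}% to {max(gini_gains):.2f}%")
print(f"time reduction: {min(time_cuts):.2f}% to {max(time_cuts):.2f}%")
```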
       
Entropy is a weighted sum of the logs of the probabilities of each possible outcome
when we make a random selection from a set. The weights used
in the sum are the probabilities of the outcomes themselves so
that outcomes with high probabilities contribute more to the
overall entropy of a set than outcomes with low probabilities:
$$H(t) = -\sum_{i=1}^{l} P(t=i) \times \log_s\left(P(t=i)\right)$$
where P(t=i) is the probability that the outcome of randomly
selecting an element t is the type i, l is the number of different
types of things in the set, and s is an arbitrary logarithmic base
(which we selected as 2) (Shannon, 1948).
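A short sketch of this entropy measure, together with the information gain it supports for ranking features, is given below. It assumes a categorical (for example, binary) feature and base-2 logarithms as in the text; the function names and toy labels are illustrative, and the paper's own informativeness score may be scaled differently.

```python
import math
from collections import Counter


def entropy(labels: list[str], base: float = 2) -> float:
    """Weighted sum of log-probabilities of each class in the set."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())


def information_gain(feature_values: list[int], labels: list[str],
                     base: float = 2) -> float:
    """Entropy of the labels minus the entropy remaining after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for v, lab in zip(feature_values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset, base)
    return entropy(labels, base) - remainder


# Toy one-versus-all labels and two made-up binary features.
labels = ["omentum", "omentum", "other", "other"]
perfect_feature = [1, 1, 0, 0]
useless_feature = [1, 0, 1, 0]

print(entropy(labels))
print(information_gain(perfect_feature, labels))
print(information_gain(useless_feature, labels))
```

A feature that separates the classes perfectly removes all of the entropy, while a feature unrelated to the target removes none, which is what makes the measure usable for ranking.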
Once we established the informativeness of each feature, we
visually explored the histograms of each variable that expressed
          
was most concentrated within this biomarker (Figure 4).
Figure 4: Histogram of Gene 206067_s_at in raw expression.
In an effort to concentrate the omentum tissue signal, we
generated a rule that stated
This rule rendered a new feature that generated an additional
histogram (Figure 5).
Figure 5: Demonstration of Rip Curl feature engineering using
visual thresholds.
This allowed us to pass a different, possibly more understandable
context to our algorithm.
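A rule of this kind can be sketched as a simple threshold test on raw expression values. The rule itself is not shown in the text, so the sketch below is an assumption: the threshold of 1800 and the strict "greater than" comparison are inferred from the engineered feature name >1800_206067_s_at, and the expression values are made up.

```python
def binarize(values: list[float], threshold: float) -> list[int]:
    """Map each expression value to 1 if it exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical raw expression values for gene 206067_s_at.
expression_206067_s_at = [250.0, 1799.9, 1800.0, 2400.5, 5120.0]

# Threshold assumed from the engineered feature name, not stated in the text.
bin_1800_206067_s_at = binarize(expression_206067_s_at, threshold=1800)
print(bin_1800_206067_s_at)
```

The threshold is chosen visually from the histogram, so in practice it is a judgment call made on the training data only.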
We repeated the process above, applying rules based on our
observation of the training data.
Some additional features were simple descriptive statistics
such as Min and Mode, while others were a bit unconventional,
such as
where X is a gene feature and i represents the placement
within the feature index.
Binsum was another engineered feature that was simply
$$\text{Binsum} = \sum_{i} bin_i$$
where bin is one of the generated binary features and i is the
index of the bin within the omentum training data. In an effort
to develop greater context for the omentum data and the new
features that were engineered, we analyzed key values and their
respective informativeness (or importance) [8] (Table 3):
Table 3: Rip Curl Feature engineering statistics.
Feature Name Importance Unique Missing Mean SD Median Min Max
Binsum 92.88 9 0 1.46 2.07 1 0 8
1800_206067_s_at 84.93 2 0 0.15 0.35 0 0 1
400_216953_s_at 82.15 2 0 0.14 0.34 0 0 1
1100_219454_at 82.09 2 0 0.22 0.42 0 0 1
1300_214844_s_at 78.42 2 0 0.13 0.34 0 0 1
4000_227195_at 77.04 2 0 0.21 0.41 0 0 1
1100_213518_at 76.07 2 0 0.31 0.46 0 0 1
3500_204457_s_at 75.71 2 0 0.21 0.40 0 0 1
900_219778_at 71.29 2 0 0.09 0.29 0 0 1
Lensum 55.36 3 0 13.82 2.98 16 8 16
Mode 53.04 837 0 214 413 19.80 1.60 4152
Min 51.85 81 0 2.42 1.41 2.20 0.20 15.40
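Binsum can be sketched as a row-wise sum over the engineered binary features. This reading is an assumption based on the feature's name and on Table 3: Binsum ranges from 0 to 8, there are eight binary features, and the means of those eight features add up to the reported Binsum mean of 1.46. The sample rows below are made up.

```python
def binsum(binary_rows: list[list[int]]) -> list[int]:
    """Sum the binary engineered features for each sample (row)."""
    return [sum(row) for row in binary_rows]


# Hypothetical samples, each with eight binary engineered features.
rows = [
    [1, 1, 1, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
]
print(binsum(rows))

# Means of the eight binary features as reported in Table 3.
binary_means = [0.15, 0.14, 0.22, 0.13, 0.21, 0.31, 0.21, 0.09]
print(round(sum(binary_means), 2))
```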
Figure 6: Variable importance rank generated from the Rip Curl
Figure 6 demonstrates visually the variable importance
of the Rip Curl model. We observed that 20% of the top
informative features were generated through the Rip Curl
framework: Binsum, >1800_206067_s_at, and >400_216953_s_at,
with a range of importance between 27% and 93%.
the omentum dataset. In their experiments designed to generate
the state-of-the-art benchmark, all measurements were
performed using WEKA machine learning environment. They
opted to make use of one of the most popular machine learning
methods for gene expression analysis, Support Vector Machines
– Recursive Feature Elimination (SVM-RFE) feature selection
      
was done inside a 10-fold cross-validation loop on the omentum
dataset to avoid so-called selection bias [9]. Figure 7 demonstrates
their approach.
Figure 7: SVM-RFE Process.
Head-to-Head Comparison with SVM-RFE
Table 4: Comparison results of international benchmark and Rip Curl.
Model AUC
SVM-RFE (Benchmark) 0.703
Rip Curl 0.934
Table 4 shows a comparison of the SVM-RFE benchmark
established in with the Rip Curl framework, and the following
results were observed [10] (Table 4):
Rip Curl represents a 32.92% gain in prediction accuracy
over the GEMLeR benchmark for the same omentum dataset
The Rip Curl    
state-of-the-art benchmark (SVM-RFE) in the GEMLeR omentum
cancer experiment. Since the Rip Curl 
 
framework is very low permitting analysis of data with many
features. Future research should include comparisons beyond
the omentum cancer data and exploration of other one-versus-all
experiments in the areas of breast, colon, endometrium,
kidney, lung, ovary, prostate, and uterus [12-14].
We would like to acknowledge GEMLeR for making this
important dataset available to researchers and WPC Healthcare
for supporting our work. Finally, the authors would like to thank
the donors who participated in this study.
References
1. Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y (2010) Robust
biomarker identification for cancer diagnosis with ensemble feature
selection methods. Bioinformatics 26(3): 392-398.
2. Mingle D (2015) A Discriminative Feature Space for Detecting and
Recognizing Pathologies of the Vertebral Column. International Journal
of Biomedical Data Mining.
3.   
 
Comput Biol Bioinform 7(1): 108-117.
4.    
causing genes using microarray data mining and Gene Ontology. BMC
medical genomics 4(1): 1.
5. Ding Y, Wilkins D (2006) Improving the performance of SVM-RFE to
select genes in microarray data. BMC bioinformatics 7(Suppl 2): S12.
6. Balakrishnan S, Narayanaswamy R, Savarimuthu N, Samikannu R
(2008) SVM ranking with backward search for feature selection in type
II diabetes databases. IEEE pp. 2628-2633.
7. Stiglic G, Kokol P (2010) Stability of ranked gene lists in large
microarray analysis studies. BioMed Research International 2010: ID
8. Breiman L (2001) Random forests. Machine learning 45(1): 5-32.
9. Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on
the basis of microarray gene-expression data. Proc Natl Acad Sci U S A
99(10): 6562-6566.
10. Duan K, Rajapakse JC (2004) SVM-RFE peak selection for cancer
11. Hu ZZ, Huang H, Wu CH, Jung M, Dritschilo A, et al. (2011) Omics-based
12. Shannon CE (1948) A note on the concept of entropy. Bell System Tech
J 27: 379-423.
13. 
feature sets in genomics and proteomics. Neurocomputing 73(13):
14. Zervakis M, Blazadonakis ME, Tsiliki G, Danilatou V, Tsiknakis M, et al.
(2009) Outcome prediction based on microarray analysis: a critical
perspective on methods. BMC bioinformatics 10: 53.
... The objective of our research is to make a machine learning model, which will predict the hospital readmission of a patient, and to identify the most significant features [14] [5], which contribute most to the readmission of a patient. To execute this research we require knowledge about healthcare domain related to readmission [6], diabetes and all the tests associated with it. Which will help us to understand the attributes and to find the most significant features [13] then we can reduce the hospital readmission up to a major extent [7]. ...
Conference Paper
Hospital Readmission is considered as an effective measurement of service and care provided within the hospital. Emergency readmission to hospital is frequently used as a measure of the quality of a hospital because a high proportion of readmissions should be preventable if the preceding care is adequate. The objective of this study to develop a model to predict 30-day hospital readmission. We have data of 1-lac diabetes patients with 50 features. We used machine learning algorithms: Logistic Regression, Decision Tree, Random Forest, Adaboost and XGBoost for prediction. We achieved the highest accuracy 94% using Random forest among all other algorithms. The results from this study are encouraging and can help healthcare providers to improve their services.
... We ranked the top 5 features for this model by informativeness how informative they are relative to the other features of that age group. (Mingle, 2017). The "Number of emergency" feature and the feature engineered by dividing the number of emergency visits by the sum of lab procedures plus medications were both unique among all three models: ...
Full-text available
Hospital readmission is considered an effective measurement of care provided within healthcare. Being able to risk identify patients facing a high likelihood of unplanned hospital readmission in the next 30-days could allow for further investigation and possibly prevent the readmission. Current models, such as LACE, sacrifice accuracy in order to allow for end-users to have a straight forward and simple experience. This study acknowledges that while HbA1c is important, it may not be critical in predicting readmissions. It also investigates the hypothesis that using machine learning on a wide feature, making use of model diversity, and blending prediction will improve the accuracy of readmission risk predictions compared with existing techniques. A dataset originally containing 100,000 admissions and 56 features was used to evaluate the hypothesis. The results from the study are encouraging and can help healthcare providers improve inpatient diabetic care.
Full-text available
Each year it has become more and more difficult for healthcare providers to determine if a patient has a pathology related to the vertebral column. There is great potential to become more efficient and effective in terms of quality of care provided to patients through the use of automated systems. However, in many cases automated systems can allow for misclassification and force providers to have to review more causes than necessary. In this study, we analyzed methods to increase the True Positives and lower the False Positives while comparing them against state-of-the-art techniques in the biomedical community. We found that by applying the studied techniques of a data-driven model, the benefits to healthcare providers are significant and align with the methodologies and techniques utilized in the current research community.
Conference Paper
Full-text available
We studied two cancer classification problems with mass spectrometry data and used SVM-RFE to se- lect a small subset of peaks as input variables for the classification. Our study shows that, SVM-RFE can select a good small subset of peaks with which the classifier achieves high prediction accuracy and the performance is much better than with the feature subset selected by T-statistics. We also found that, the best peak subset selected by SVM-RFE always have in the top ranked peaks by T-statistics while it includes some peaks that are ranked low by T-statistics. However, these peaks together give much better classification performance than the same number of most top ranked peaks by T-statistics. Our experimental comparison of the performance of Support Vector Machine classification algorithm with and without peak selection also consolidates the importance of peak selection for cancer classification with mass spectrometry data. Selecting a small subset of peaks not only improves the efficiency of the classification algorithms, but also improves the cancer classification accuracy, even for classifica- tion algorithms like Support Vector Machines, which are capable of handling large number of input variables. In the last decade or so, mass spectrometry (MS) has increasingly become the method of choice for analysis of complex protein samples. Mass spectrometry measures two prop- erties of ion mixtures in the gas phase under a vacuum environment: the mass-to-charge ratio (m/z) of ions in the mixture; and the number of ions present at different m/z values. The output is a mass spectrum or chart with a series of spike peaks, each representing the ion(s) of a specific m/z value present in the sample. The heights of the peaks are related to the abundances of the ions in the sample. The heights of peaks and the m/z values of peaks are a fingerprint of the sample. 
For protein samples, mass spectrometry measures the mass-to-charge ratio of the ionized proteins (or protein fragments) and their abundances in the sample. The recent advances in mass spectrometry technology are starting to enable high-throughput profiling of the protein content of complex samples. While mass spectrometry has been used intensively on purified, digested samples to identify proteins via peptide mass fingerprints, 1 recently, it has also found promising appli- cations in cancer classification. 2-4 Proteins vary between individuals, between cell types, and in the same cell under different stimuli or different disease states. Thus, the protein
Full-text available
Genomic, proteomic, and other omic-based approaches are now broadly used in biomedical research to facilitate the understanding of disease mechanisms and identification of molecular targets and biomarkers for therapeutic and diagnostic development. While the Omics technologies and bioinformatics tools for analyzing Omics data are rapidly advancing, the functional analysis and interpretation of the data remain challenging due to the inherent nature of the generally long workflows of Omics experiments. We adopt a strategy that emphasizes the use of curated knowledge resources coupled with expert-guided examination and interpretation of Omics data for the selection of potential molecular targets. We describe a downstream workflow and procedures for functional analysis that focus on biological pathways, from which molecular targets can be derived and proposed for experimental validation.
Full-text available
One of the best and most accurate methods for identifying disease-causing genes is monitoring gene expression values in different samples using microarray technology. One of the shortcomings of microarray data is that they provide a small quantity of samples with respect to the number of genes. This problem reduces the classification accuracy of the methods, so gene selection is essential to improve the predictive accuracy and to identify potential marker genes for a disease. Among numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVMRFE) has become one of the leading methods, but its performance can be reduced because of the small sample size, noisy data and the fact that the method does not remove redundant genes. We propose a novel framework for gene selection which uses the advantageous features of conventional methods and addresses their weaknesses. In fact, we have combined the Fisher method and SVMRFE to utilize the advantages of a filtering method as well as an embedded method. Furthermore, we have added a redundancy reduction stage to address the weakness of the Fisher method and SVMRFE. In addition to gene expression values, the proposed method uses Gene Ontology which is a reliable source of information on genes. The use of Gene Ontology can compensate, in part, for the limitations of microarrays, such as having a small number of samples and erroneous measurement results. The proposed method has been applied to colon, Diffuse Large B-Cell Lymphoma (DLBCL) and prostate cancer datasets. The empirical results show that our method has improved classification performance in terms of accuracy, sensitivity and specificity. In addition, the study of the molecular function of selected genes strengthened the hypothesis that these genes are involved in the process of cancer growth. 
The proposed method addresses the weakness of conventional methods by adding a redundancy reduction stage and utilizing Gene Ontology information. It predicts marker genes for colon, DLBCL and prostate cancer with a high accuracy. The predictions made in this study can serve as a list of candidates for subsequent wet-lab verification and might help in the search for a cure for cancers.
Full-text available
This paper presents an empirical study that aims to explain the relationship between the number of samples and stability of different gene selection techniques for microarray datasets. Unlike other similar studies where number of genes in a ranked gene list is variable, this study uses an alternative approach where stability is observed at different number of samples that are used for gene selection. Three different metrics of stability, including a novel metric in bioinformatics, were used to estimate the stability of the ranked gene lists. Results of this study demonstrate that the univariate selection methods produce significantly more stable ranked gene lists than the multivariate selection methods used in this study. More specifically, thousands of samples are needed for these multivariate selection methods to achieve the same level of stability any given univariate selection method can achieve with only hundreds.
Full-text available
Filters and wrappers are two prevailing approaches for gene selection in microarray data analysis. Filters make use of statistical properties of each gene to represent its discriminating power between different classes. The computation is fast but the predictions are inaccurate. Wrappers make use of a chosen classifier to select genes by maximizing classification accuracy, but the computation burden is formidable. Filters and wrappers have been combined in previous studies to maximize the classification accuracy for a chosen classifier with respect to a filtered set of genes. The drawback of this single-filter-single-wrapper (SFSW) approach is that the classification accuracy is dependent on the choice of specific filter and wrapper. In this paper, a multiple-filter-multiple-wrapper (MFMW) approach is proposed that makes use of multiple filters and multiple wrappers to improve the accuracy and robustness of the classification, and to identify potential biomarker genes. Experiments based on six benchmark data sets show that the MFMW approach outperforms SFSW models (generated by all combinations of filters and wrappers used in the corresponding MFMW model) in all cases and for all six data sets. Some of MFMW-selected genes have been confirmed to be biomarkers or contribute to the development of particular cancers by other studies.
Biomarker discovery is an important topic in biomedical applications of computational biology, including applications such as gene and SNP selection from high-dimensional data. Surprisingly, the stability with respect to sampling variation or robustness of such selection processes has received attention only recently. However, robustness of biomarkers is an important issue, as it may greatly influence subsequent biological validations. In addition, a more robust set of markers may strengthen the confidence of an expert in the results of a selection method. Our first contribution is a general framework for the analysis of the robustness of a biomarker selection algorithm. Secondly, we conducted a large-scale analysis of the recently introduced concept of ensemble feature selection, where multiple feature selections are combined in order to increase the robustness of the final set of selected features. We focus on selection methods that are embedded in the estimation of support vector machines (SVMs). SVMs are powerful classification models that have shown state-of-the-art performance on several diagnosis and prognosis tasks on biological data. Their feature selection extensions also offered good results for gene selection tasks. We show that the robustness of SVMs for biomarker discovery can be substantially increased by using ensemble feature selection techniques, while at the same time improving upon classification performances. The proposed methodology is evaluated on four microarray datasets showing increases of up to almost 30% in robustness of the selected biomarkers, along with an improvement of approximately 15% in classification performance. The stability improvement with ensemble methods is particularly noticeable for small signature sizes (a few tens of genes), which is most relevant for the design of a diagnosis or prognosis model from a gene signature. Supplementary data are available at Bioinformatics online.
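The ensemble idea in this abstract, combining multiple feature selections to obtain a more robust final set, can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the paper embeds selection in SVM estimation, whereas this sketch swaps in a simple univariate scorer (absolute difference of class means) as the base selector, draws bootstrap samples, and aggregates by summed rank position. The function names and the aggregation rule are hypothetical.

```python
import random
from statistics import mean


def base_ranking(X, y):
    """Base selector (assumed): rank features by absolute difference of class means."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def ensemble_ranking(X, y, n_bootstraps=25, seed=0):
    """Aggregate base rankings over bootstrap samples by summed rank position."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    rank_sums = [0] * p
    used = 0
    for _ in range(n_bootstraps):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # skip degenerate bootstraps that contain a single class
        for position, feature in enumerate(base_ranking(Xb, yb)):
            rank_sums[feature] += position
        used += 1
    if used == 0:
        raise ValueError("no bootstrap sample contained both classes")
    # A lower summed position means the feature ranked higher more consistently.
    return sorted(range(p), key=lambda j: rank_sums[j])


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, features 1 and 2 do not.
    y = [0] * 6 + [1] * 6
    X = [[10.0 * label, (i % 3) * 0.1, 0.0] for i, label in enumerate(y)]
    print(ensemble_ranking(X, y)[:2])
```

Taking the first few entries of the aggregated ranking yields a small signature, the regime in which the abstract reports the stability gain to be most noticeable.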
Clinical databases have accumulated large quantities of information about patients and their clinical history. Data mining is the search for relationships and patterns within this data that could provide useful knowledge for effective decision-making. Classification analysis is one of the widely adopted data mining techniques for healthcare applications to support medical diagnosis, improve quality of patient care, etc. Medical databases are usually high dimensional in nature. If a training dataset contains irrelevant features (i.e., attributes), classification analysis may produce less accurate results. Data pre-processing is required to prepare the data for data mining and machine learning to increase the predictive accuracy. Feature selection is a preprocessing technique commonly used on high-dimensional data and its purposes include reducing dimensionality, removing irrelevant and redundant features, reducing the amount of data needed for learning, improving algorithms' predictive accuracy, and increasing the constructed models' comprehensibility. Much research work in data mining has gone into improving the predictive accuracy of classifiers by applying feature selection techniques. The importance of feature selection in medical data mining is appreciable, as the diagnosis of the disease could be done in this patient-care activity with a minimum number of features. Feature selection may provide us with the means to reduce the number of clinical measures made while still maintaining or even enhancing accuracy and reducing false negative rates. In medical diagnosis, reduction in false negative rate can, literally, be the difference between life and death. In this paper we propose a feature selection approach for finding an optimum feature subset that enhances the classification accuracy of the Naive Bayes classifier. Experiments were conducted on the Pima Indian Diabetes Dataset to assess the effectiveness of our approach. The results confirm that the SVM Ranking with Backward Search approach leads to promising improvement in feature selection and enhances classification accuracy.
The classification of genomic and proteomic data in extremely high-dimensional datasets is a well-known problem which requires appropriate classification techniques. Classification methods are usually combined with gene selection techniques to provide optimal classification conditions, i.e., a lower-dimensional classification environment. Another reason for reducing the dimensionality of such datasets is their interpretability, as it is much easier to interpret a small set of ranked genes than 20 thousand genes. This paper evaluates the classification performance of the Rotation Forest classifier on small subsets of ranked genes for two dataset collections consisting of 47 genomic and proteomic classification problems. Robustness and high classification accuracy are shown to be important features of Rotation Forest when applied to small sets of genes.