Article

Characterization of the effectiveness of reporting lists of small feature sets relative to the accuracy of the prior biological knowledge.

Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.
Cancer Informatics 01/2010; 9:49-60.
Source: PubMed

ABSTRACT When confronted with a small sample, feature-selection algorithms often fail to find good feature sets, a problem exacerbated for high-dimensional data and large feature sets. The problem is compounded by the fact that, if one obtains a feature set with a low error estimate, the estimate is unreliable because training-data-based error estimators typically perform poorly on small samples, exhibiting optimistic bias or high variance. One way around the problem is to limit the number of features being considered, restrict feature sets to sizes such that all feature sets can be examined by exhaustive search, and report a list of the best-performing feature sets. If the list is short, then it greatly restricts the possible feature sets to be considered as candidates; however, one can expect the lowest error estimates obtained to be optimistically biased, so there may not be a close-to-optimal feature set on the list. This paper provides a power analysis of this methodology; in particular, it examines the kind of results one should expect to obtain relative to the length of the list and the number of discriminating features among those considered. Two measures are employed. The first is the probability that there is at least one feature set on the list whose true classification error is within some given tolerance of that of the best feature set, and the second is the expected number of feature sets on the list whose true errors are within the given tolerance of that of the best feature set. These values are plotted as functions of the list length to generate power curves. The results show that, if the number of discriminating features is not too small (that is, the prior biological knowledge is not too poor), then one should expect, with high probability, to find good feature sets.
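
The two measures described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's experimental setup: the Gaussian model with identity covariance, the nearest-mean classifier, the resubstitution error estimator, the use of a large independent test sample as a stand-in for the true error, and every name and default value below (`simulate_power`, `nearest_mean_error`, `delta`, and so on) are assumptions chosen only to keep the example short. "Best feature set" is taken here to mean the candidate with the lowest true error in a given trial.

```python
import itertools
import numpy as np


def nearest_mean_error(m0, m1, x0, x1):
    """Balanced error of a nearest-mean classifier with class means m0, m1."""
    def closer_to_m1(x):
        return ((x - m1) ** 2).sum(axis=1) < ((x - m0) ** 2).sum(axis=1)
    err0 = closer_to_m1(x0).mean()      # class-0 points labelled as class 1
    err1 = (~closer_to_m1(x1)).mean()   # class-1 points labelled as class 0
    return 0.5 * (err0 + err1)


def simulate_power(n_features=8, n_discriminating=3, set_size=2,
                   n_per_class=15, list_length=5, tolerance=0.05,
                   delta=1.0, n_test=2000, n_trials=50, seed=0):
    """Estimate (P[at least one good set on list], E[# good sets on list])."""
    rng = np.random.default_rng(seed)
    subsets = [list(s) for s in
               itertools.combinations(range(n_features), set_size)]
    # Assumed model: only the first n_discriminating features separate classes.
    shift = np.zeros(n_features)
    shift[:n_discriminating] = delta

    def sample(n):
        x0 = rng.standard_normal((n, n_features))
        x1 = rng.standard_normal((n, n_features)) + shift
        return x0, x1

    hits, good_counts = 0, []
    for _ in range(n_trials):
        tr0, tr1 = sample(n_per_class)   # small training sample
        te0, te1 = sample(n_test)        # large sample approximating true error
        est, true = [], []
        for s in subsets:                # exhaustive search over all subsets
            m0, m1 = tr0[:, s].mean(axis=0), tr1[:, s].mean(axis=0)
            est.append(nearest_mean_error(m0, m1, tr0[:, s], tr1[:, s]))
            true.append(nearest_mean_error(m0, m1, te0[:, s], te1[:, s]))
        est, true = np.array(est), np.array(true)
        # Rank by estimated error; a random pre-shuffle breaks ties fairly.
        order = rng.permutation(len(subsets))
        top = order[np.argsort(est[order], kind="stable")][:list_length]
        n_good = int(np.sum(true[top] <= true.min() + tolerance))
        hits += n_good > 0
        good_counts.append(n_good)
    return float(hits / n_trials), float(np.mean(good_counts))
```

Calling `simulate_power` for a range of `list_length` values and plotting the two returned numbers against the list length gives curves of the kind the abstract calls power curves; lowering `n_discriminating` mimics poorer prior knowledge under these assumptions.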

  • Article: Feature selection: evaluation, application, and small sample performance
    ABSTRACT: A large number of algorithms have been proposed for feature subset selection. Our experimental results show that the sequential forward floating selection algorithm, proposed by Pudil et al. (1994), dominates the other algorithms tested. We study the problem of choosing an optimal feature set for land use classification based on SAR satellite images using four different texture models. Pooling features derived from different texture models, followed by feature selection, results in a substantial improvement in the classification accuracy. We also illustrate the dangers of using feature selection in small sample size situations.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 03/1997; · 4.91 Impact Factor
  • Article: Comparison of algorithms that select features for pattern classifiers
    ABSTRACT: A comparative study of algorithms for large-scale feature selection (where the number of features is over 50) is carried out. In the study, the goodness of a feature subset is measured by the leave-one-out correct-classification rate of a nearest-neighbor (1-NN) classifier, and many practical problems are used. A unified way is given to compare algorithms having dissimilar objectives. Based on the results of many experiments, we give guidelines for the use of feature selection algorithms. In particular, it is shown that sequential floating search methods are suitable for small- and medium-scale problems and genetic algorithms are suitable for large-scale problems.
    Pattern Recognition. 01/2000;
  • Article: A review of feature selection techniques in bioinformatics.
    ABSTRACT: Feature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques. In this article, we make the interested reader aware of the possibilities of feature selection, providing a basic taxonomy of feature selection techniques, and discussing their use, variety and potential in a number of both common as well as upcoming bioinformatics applications.
    Bioinformatics 11/2007; 23(19):2507-17. · 5.47 Impact Factor

Keywords

close-to-optimal feature
discriminating features
expected number
feature sets
feature-selection algorithms
features
good feature sets
high-dimensional data
large feature sets
list length
low error estimate
lowest error estimates
performing feature sets
possible feature sets
prior biological knowledge
problem exacerbated
small sample
training-data-based error estimators
true classification error