A review of feature selection techniques in bioinformatics
Yvan Saeys 1, Iñaki Inza 2 and Pedro Larrañaga 2

1 Department of Plant Systems Biology, VIB, B-9052 Ghent, Belgium and Bioinformatics and Evolutionary Genomics group, Department of Molecular Genetics, Ghent University, B-9052 Ghent, Belgium
2 Department of Computer Science and Artificial Intelligence, Computer Science Faculty, University of the Basque Country, Paseo Manuel de Lardizabal 1, 20018 Donostia - San Sebastián, Spain
ABSTRACT
Feature selection techniques have become an apparent need in
many bioinformatics applications. In addition to the large pool of
techniques that have already been developed in the machine learning
and data mining fields, specific applications in bioinformatics have led
to a wealth of newly proposed techniques.
In this paper, we make the interested reader aware of the possibilities
of feature selection, providing a basic taxonomy of feature selection
techniques, and discussing their use, variety and potential in
a number of both common as well as upcoming bioinformatics
applications.
Contact: yvan.saeys@psb.ugent.be
Companion website: http://bioinformatics.psb.ugent.be/supplementary_data/yvsae/fsreview
1 INTRODUCTION
During the last decade, the motivation for applying feature selection
(FS) techniques in bioinformatics has shifted from being an
illustrative example to becoming a real prerequisite for model
building. In particular, the high dimensional nature of many
modelling tasks in bioinformatics, going from sequence analysis
over microarray analysis to spectral analyses and literature mining
has given rise to a wealth of feature selection techniques being
presented in the field.
In this review, we focus on the application of feature selection
techniques. In contrast to other dimensionality reduction techniques
like those based on projection (e.g. principal component analysis)
or compression (e.g. using information theory), feature selection
techniques do not alter the original representation of the variables,
but merely select a subset of them. Thus, they preserve the
original semantics of the variables, hence offering the advantage of
interpretability by a domain expert.
While feature selection can be applied to both supervised
and unsupervised learning, we focus here on the problem of
supervised learning (classification), where the class labels are
known beforehand. The interesting topic of feature selection for
unsupervised learning (clustering) is a more complex issue, and
research into this field is recently getting more attention in several
communities [79, 122].
The main aim of this review is to make practitioners aware of
the benefits, and in some cases even the necessity of applying
feature selection techniques. Therefore, we provide an overview
of the different feature selection techniques for classification: we
illustrate them by reviewing the most important application fields
in the bioinformatics domain, highlighting the efforts done by
the bioinformatics community in developing novel and adapted
procedures. Finally, we also point the interested reader to some
useful data mining and bioinformatics software packages that can
be used for feature selection.
2 FEATURE SELECTION TECHNIQUES
As many pattern recognition techniques were originally not
designed to cope with large amounts of irrelevant features,
combining them with FS techniques has become a necessity in many
applications [43, 78, 79]. The objectives of feature selection are
manifold, the most important ones being: a) to avoid overfitting and
improve model performance, i.e. prediction performance in the case
of supervised classification and better cluster detection in the case
of clustering, b) to provide faster and more cost-effective models,
and c) to gain a deeper insight into the underlying processes that
generated the data. However, the advantages of feature selection
techniques come at a certain price, as the search for a subset of
relevant features introduces an additional layer of complexity in
the modeling task. Instead of just optimizing the parameters of the
model for the full feature subset, we now need to find the optimal
model parameters for the optimal feature subset, as there is no
guarantee that the optimal parameters for the full feature set are
equally optimal for the optimal feature subset [20]. As a result,
the search in the model hypothesis space is augmented by another
dimension: the one of finding the optimal subset of relevant features.
Feature selection techniques differ from each other in the way they
incorporate this search in the added space of feature subsets in the
model selection.
In the context of classification, feature selection techniques can be
organized into three categories, depending on how they combine the
feature selection search with the construction of the classification
model: filter methods, wrapper methods, and embedded methods.
Table 1 provides a common taxonomy of feature selection methods,
showing for each technique the most prominent advantages and
disadvantages, as well as some examples of the most influential
techniques.
Filter techniques assess the relevance of features by looking only
at the intrinsic properties of the data. In most cases a feature
relevance score is calculated, and low scoring features are removed.
Afterwards, this subset of features is presented as input to the
classification algorithm. Advantages of filter techniques are that
they easily scale to very high-dimensional datasets, they are
computationally simple and fast, and they are independent of the
classification algorithm. As a result, feature selection needs to be
performed only once, and then different classifiers can be evaluated.
A common disadvantage of filter methods is that they ignore the interaction with the classifier (the search in the feature subset space is separated from the search in the hypothesis space), and that most proposed techniques are univariate. This means that each feature is considered separately, thereby ignoring feature dependencies, which may lead to worse classification performance when compared to other types of feature selection techniques. In order to overcome the problem of ignoring feature dependencies, a number of multivariate filter techniques were introduced, aiming at the incorporation of feature dependencies to some degree.

Table 1. A taxonomy of feature selection techniques. For each feature selection type, we highlight a set of characteristics which can guide the choice for a technique suited to the goals and resources of practitioners in the field.

Filter, univariate
Advantages: fast; scalable; independent of the classifier
Disadvantages: ignores feature dependencies; ignores interaction with the classifier
Examples: Chi-square; Euclidean distance; t-test; information gain, gain ratio [6]

Filter, multivariate
Advantages: models feature dependencies; independent of the classifier; better computational complexity than wrapper methods
Disadvantages: slower and less scalable than univariate techniques; ignores interaction with the classifier
Examples: correlation based feature selection (CFS) [45]; Markov blanket filter (MBF) [62]; fast correlation based feature selection (FCBF) [136]

Wrapper, deterministic
Advantages: simple; interacts with the classifier; models feature dependencies; less computationally intensive than randomized methods
Disadvantages: risk of overfitting; more prone than randomized algorithms to getting stuck in a local optimum (greedy search); classifier dependent selection
Examples: sequential forward selection (SFS) [60]; sequential backward elimination (SBE) [60]; plus q take-away r [33]; beam search [106]

Wrapper, randomized
Advantages: less prone to local optima; interacts with the classifier; models feature dependencies
Disadvantages: computationally intensive; classifier dependent selection; higher risk of overfitting than deterministic algorithms
Examples: simulated annealing; randomized hill climbing [110]; genetic algorithms [50]; estimation of distribution algorithms [52]

Embedded
Advantages: interacts with the classifier; better computational complexity than wrapper methods; models feature dependencies
Disadvantages: classifier dependent selection
Examples: decision trees; weighted naive Bayes [28]; feature selection using the weight vector of SVM [44, 125]
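As an illustration of the filter paradigm, the minimal sketch below scores each feature independently and keeps the top k before any classifier sees the data. The use of scikit-learn and the synthetic dataset are our own assumptions for the example, not part of the reviewed work.

```python
# A minimal sketch of the univariate filter paradigm: score every feature
# independently, keep the top k, then hand the reduced matrix to any classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 100 samples x 2,000 features, of which only 20 are informative
X, y = make_classification(n_samples=100, n_features=2000,
                           n_informative=20, random_state=0)

selector = SelectKBest(score_func=f_classif, k=50)   # ANOVA F-score per feature
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                     # (100, 50)
print(selector.get_support(indices=True))  # indices of the retained features
```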
Whereas filter techniques treat the problem of finding a good feature
subset independently of the model selection step, wrapper methods
embed the model hypothesis search within the feature subset search.
In this setup, a search procedure in the space of possible feature
subsets is defined, and various subsets of features are generated and
evaluated. The evaluation of a specific subset of features is obtained
by training and testing a specific classification model, rendering
this approach tailored to a specific classification algorithm. To
search the space of all feature subsets, a search algorithm is
then “wrapped” around the classification model. However, as the
space of feature subsets grows exponentially with the number of
features, heuristic search methods are used to guide the search for an
optimal subset. These search methods can be divided in two classes:
deterministic and randomized search algorithms. Advantages of
wrapper approaches include the interaction between feature subset
search and model selection, and the ability to take into account
feature dependencies. A common drawback of these techniques is
that they have a higher risk of overfitting than filter techniques
and are very computationally intensive, especially if building the
classifier has a high computational cost.
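The wrapper paradigm can be sketched as a greedy sequential forward search in which each candidate subset is scored by the cross-validated accuracy of the classifier itself; the k-NN classifier and synthetic data below are illustrative assumptions, not a specific published method.

```python
# A minimal sketch of a deterministic wrapper: greedy sequential forward
# selection, scoring each candidate subset with the downstream classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=50,
                           n_informative=5, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3)

selected, remaining, best_score = [], set(range(X.shape[1])), 0.0
while remaining and len(selected) < 5:
    # evaluate every one-feature extension of the current subset
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:    # stop when no extension improves
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = scores[f_best]

print(selected, round(best_score, 3))
```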
In a third class of feature selection techniques, termed embedded
techniques, the search for an optimal subset of features is built
into the classifier construction, and can be seen as a search in the
combined space of feature subsets and hypotheses. Just like wrapper
approaches, embedded approaches are thus specific to a given
learning algorithm. Embedded methods have the advantage that they
include the interaction with the classification model, while at the
same time being far less computationally intensive than wrapper
methods.
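As a hedged illustration of the embedded paradigm, the sketch below trains a random forest once and uses its impurity-based importances, a by-product of classifier construction, to rank features; the data and estimator choice are our own assumptions.

```python
# A minimal sketch of embedded selection: feature relevance falls out of
# building the classifier itself, with no separate subset search.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
print(ranking[:10])   # the ten features the forest found most useful
```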
3 APPLICATIONS IN BIOINFORMATICS
3.1 Feature selection for sequence analysis
Sequence analysis has a long standing tradition in bioinformatics.
In the context of feature selection, two types of problems can be
distinguished: content and signal analysis. Content analysis focuses
on the broad characteristics of a sequence, such as tendency to
code for proteins or fulfillment of a certain biological function.
Signal analysis on the other hand focuses on the identification of
important motifs in the sequence, such as gene structural elements
or regulatory elements.
Apart from the basic features that just represent the nucleotide or
amino acid at each position in a sequence, many other features,
such as higher order combinations of these building blocks (e.g. k-
mer patterns) can be derived, their number growing exponentially
with the pattern length k. As many of them will be irrelevant or
redundant, feature selection techniques are then applied to focus on
the subset of relevant variables.
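As a hedged illustration of this step, the sketch below derives k-mer count features from toy sequences and ranks them with a Chi-square filter; the sequences, labels and scikit-learn usage are our own assumptions (real applications would start from FASTA input).

```python
# A minimal sketch of k-mer feature construction followed by a univariate
# Chi-square filter on hypothetical sequences.
from itertools import product
import numpy as np
from sklearn.feature_selection import chi2

seqs = ["ATGCGT", "ATGAAA", "CCGCGT", "CCGAAA"]   # hypothetical sequences
y = np.array([1, 1, 0, 0])

k = 2
kmers = ["".join(p) for p in product("ACGT", repeat=k)]  # all 4^k patterns
X = np.array([[sum(s[i:i + k] == m for i in range(len(s) - k + 1))
               for m in kmers] for s in seqs])           # k-mer counts

scores, _ = chi2(X, y)                 # relevance of each k-mer feature
scores = np.nan_to_num(scores)         # k-mers absent everywhere score 0
top = np.argsort(scores)[::-1]
print([kmers[i] for i in top[:3]])     # most class-discriminative k-mers
```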
3.1.1 Content analysis The prediction of subsequences that code
for proteins (coding potential prediction) has been a focus of
interest since the early days of bioinformatics. Because many
features can be extracted from a sequence, and most dependencies
occur between adjacent positions, many variations of Markov
models were developed. To deal with the large number of possible features, and the often limited number of samples, [101] introduced
the interpolated Markov model (IMM), which used interpolation
between different orders of the Markov model to deal with small
sample sizes, and a filter method (Chi-square) to select only relevant
features. In further work, [24] extended the IMM framework
to also deal with non-adjacent feature dependencies, resulting in
the interpolated context model (ICM), which crosses a Bayesian
decision tree with a filter method (Chi-square) to assess feature
relevance. Recently, the avenue of FS techniques for coding
potential prediction was further pursued by [100], who combined
different measures of coding potential prediction, and then used the
Markov blanket multivariate filter approach (MBF) to retain only
the relevant ones.
A second class of techniques focuses on the prediction of protein
function from sequence. The early work of [16], who combined a genetic algorithm with the Gamma test to score
feature subsets for classification of large subunits of rRNA, inspired
researchers to use FS techniques to focus on important subsets of
amino acids that relate to the protein’s functional class [1]. An
interesting technique is described in [137], using selective kernel
scaling for support vector machines (SVM) as a way to assess feature
weights, and subsequently remove features with low weights.
The use of FS techniques in the domain of sequence analysis is
also emerging in a number of more recent applications, such as
the recognition of promoter regions [18], and the prediction of
microRNA targets [59].
3.1.2 Signal analysis Many sequence analysis methodologies
involve the recognition of short, more or less conserved signals in
the sequence, representing mainly binding sites for various proteins
or protein complexes. A common approach to find regulatory motifs is to relate motifs to gene expression levels using a
regression approach. Feature selection can then be used to search for
the motifs that maximize the fit to the regression model [58, 116].
In [109], a classification approach is chosen to find discriminative
motifs. The method is inspired by [7] who use the threshold number
of misclassification (TNoM, see further in the section on microarray
analysis) to score genes for relevance to tissue classification.
From the TNoM score, a p-value is calculated that represents the
significance of each motif. Motifs are then sorted according to their
p-value.
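To make the TNoM idea concrete, here is a minimal sketch (our own implementation of the idea described in [7], not the authors' code): for one gene or motif score, find the expression threshold, in either orientation, that minimizes the number of misclassified training samples.

```python
# A minimal sketch of the threshold number of misclassifications (TNoM).
import numpy as np

def tnom(x, y):
    """x: expression values for one gene; y: binary labels (0/1)."""
    y_sorted = y[np.argsort(x)]
    n, n1 = len(y), int(y.sum())
    best = min(n1, n - n1)              # degenerate 'all one class' thresholds
    left_ones = 0
    for i, label in enumerate(y_sorted):
        left_ones += label
        left_zeros = (i + 1) - left_ones
        # split after position i: left predicted class 0, right class 1 ...
        err_a = left_ones + (n - (i + 1)) - (n1 - left_ones)
        # ... or the opposite orientation
        err_b = left_zeros + (n1 - left_ones)
        best = min(best, err_a, err_b)
    return best

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 20), rng.normal(2, 1, 20)])
y = np.array([0] * 20 + [1] * 20)
print(tnom(x, y))   # low score = few misclassifications = relevant gene
```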
Another line of research is performed in the context of the gene
prediction setting, where structural elements such as the translation
initiation site (TIS) and splice sites are modelled as specific
classification problems. The problem of feature selection for
structural element recognition was pioneered in [23] for the problem
of splice site prediction, combining a sequential backward method
together with an embedded SVM evaluation criterion to assess
feature relevance. In [99] an estimation of distribution algorithm
(EDA, a generalization of genetic algorithms) was used to gain more
insight in the relevant features for splice site prediction. Similarly,
the prediction of TIS is a suitable problem to apply feature selection
techniques. In [76], the authors demonstrate the advantages of using
feature selection for this problem, using the feature-class entropy as
a filter measure to remove irrelevant features.
In future research, FS techniques can be expected to be useful for a
number of challenging prediction tasks, such as identifying relevant
features related to alternative splice sites and alternative TIS.
3.2 Feature selection for microarray analysis
During the last decade, the advent of microarray datasets stimulated
a new line of research in bioinformatics. Microarray data pose a
great challenge for computational techniques, because of their large
dimensionality (up to several tens of thousands of genes) and their
small sample sizes [112]. Furthermore, additional experimental
complications like noise and variability render the analysis of
microarray data an exciting domain.
In order to deal with these particular characteristics of microarray
data, the obvious need for dimension reduction techniques was
realized [2, 7, 40, 97], and soon their application became a de
facto standard in the field. Whereas in 2001, the field of microarray
analysis was still claimed to be in its infancy [31], a considerable and valuable effort has since been made to contribute new and adapt known FS methodologies [53]. A general overview of the
most influential techniques, organized according to the general FS
taxonomy of Section 2, is shown in Table 2.
3.2.1 The univariate filter paradigm: simple yet efficient
Because of the high dimensionality of most microarray analyses,
fast and efficient FS techniques such as univariate filter methods
have attracted most attention. These univariate techniques have dominated the field and, up to now, comparative evaluations of different classification and FS techniques over DNA microarray datasets have focused only on the univariate case [29, 64, 72, 113]. This domination of the univariate approach can be explained by a number of reasons:
- the output provided by univariate feature rankings is intuitive and easy to understand;
- the gene ranking output could fulfill the objectives and expectations that bio-domain experts have when they want to subsequently validate the results by laboratory techniques or explore the literature; such experts may not feel the need for selection techniques that take gene interactions into account;
- subgroups of gene expression domain experts may be unaware of data analysis techniques that select genes in a multivariate way;
- multivariate gene selection techniques require extra computation time.
Some of the simplest heuristics for the identification of differentially
expressed genes include setting a threshold on the observed fold-
change differences in gene expression between the states under
study, and the detection of the threshold point in each gene
that minimizes the number of training sample misclassifications (threshold number of misclassifications, TNoM [7]). However, a
wide range of new or adapted univariate feature ranking techniques
has since been developed. These techniques can be divided
into two classes: parametric and model-free methods (see Table 2).
Parametric methods assume a given distribution from which the
samples (observations) have been generated. The two-sample t-test and ANOVA are among the most widely used techniques in microarray studies, although the usage of their basic form, possibly without justification of their main assumptions, is not advisable [53]. Modifications of the standard t-test to better deal
with the small sample size and inherent noise of gene expression
datasets include a number of t- or t-test like statistics (differing
primarily in the way the variance is estimated) and a number of
Bayesian frameworks [4, 35]. Although Gaussian assumptions have
dominated the field, other types of parametrical approaches can also
be found in the literature, such as regression modelling approaches
[117] and Gamma distribution models [85].
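As a minimal illustration of parametric univariate gene ranking, the sketch below applies the two-sample t-test per gene; SciPy and the synthetic samples-by-genes matrix are our own assumptions for the example.

```python
# A minimal sketch of t-test based univariate gene ranking.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1000))     # 30 samples, 1,000 genes
X[:15, :25] += 1.5                  # 25 genes up-regulated in group 1
y = np.array([1] * 15 + [0] * 15)

t, p = stats.ttest_ind(X[y == 1], X[y == 0], axis=0)
ranking = np.argsort(p)             # smallest p-value first
print(ranking[:10])                 # top-ranked genes
```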
Table 2. Key references for each type of feature selection technique in the microarray domain.

Filter, univariate, parametric: t-test [53]; ANOVA [53]; Bayesian [4, 35]; regression [117]; Gamma [85]
Filter, univariate, model-free: Wilcoxon rank sum [117]; BSS/WSS [29]; rank products [12]; random permutations [31, 87, 88, 121]; TNoM [7]
Filter, multivariate: bivariate [10]; CFS [124, 131]; MRMR [26]; USC [132]; Markov blanket [38, 82, 128]
Wrapper: sequential search [51, 129]; genetic algorithms [56, 71, 86]; estimation of distribution algorithms [9]
Embedded: random forest [25, 55]; weight vector of SVM [44]; weights of logistic regression [81]

Due to the uncertainty about the true underlying distribution of many gene expression scenarios, and the difficulty of validating distributional assumptions on small sample sizes, non-parametric or model-free methods have been widely proposed
as an attractive alternative to make less stringent distributional
assumptions [120]. Many model-free metrics, frequently borrowed
from the statistics field, have demonstrated their usefulness in many
gene expression studies, including the Wilcoxon rank-sum test
[117], the between-within classes sum of squares (BSS/WSS) [29]
and the rank products method [12].
A specific class of model-free methods estimates the reference
distribution of the statistic using random permutations of the data,
allowing the computation of a model-free version of the associated
parametric tests. These techniques have emerged as a solid
alternative to deal with the specificities of DNA microarray data, and
do not depend on strong parametric assumptions [31, 87, 88, 121].
Their permutation principle partly alleviates the problem of small
sample sizes in microarray studies, enhancing the robustness against
outliers.
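A minimal sketch of this permutation principle for a single gene follows; the difference-of-means statistic and synthetic data are our own illustrative choices, not a specific published test.

```python
# A minimal sketch of a permutation-based, model-free p-value: the null
# distribution of the statistic is estimated by shuffling the class labels.
import numpy as np

def perm_pvalue(x, y, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = abs(x[y == 1].mean() - x[y == 0].mean())
    exceed = 0
    for _ in range(n_perm):
        yp = rng.permutation(y)     # break any real association
        if abs(x[yp == 1].mean() - x[yp == 0].mean()) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)   # add-one keeps p-values above zero

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 15), rng.normal(1, 1, 15)])
y = np.array([0] * 15 + [1] * 15)
print(perm_pvalue(x, y))
```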
We also mention promising types of non-parametric metrics which,
instead of trying to identify differentially expressed genes at the
whole population level (e.g. comparison of sample means), are able
to capture genes which are significantly disregulated in only a subset
of samples [80, 89]. These types of methods offer a more patient
specific approach for the identification of markers, and can select
genes exhibiting complex patterns that are missed by metrics that
work under the classical comparison of two prelabeled phenotypic
groups. In addition, we also point out the importance of procedures
for controlling the different types of errors that arise in this complex
multiple testing scenario of thousands of genes [30, 92, 93, 114],
with a special focus on contributions for controlling the false
discovery rate (FDR).
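As a concrete illustration of FDR control, below is a minimal sketch of the widely used Benjamini-Hochberg procedure (our own implementation of the standard procedure, not drawn from the cited works): reject the largest prefix of ordered p-values lying under the line (rank / m) * alpha.

```python
# A minimal sketch of Benjamini-Hochberg FDR control over per-gene p-values.
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()   # largest rank still under the line
        rejected[order[:k + 1]] = True
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(pvals))         # first two hypotheses are rejected
```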
3.2.2 Towards more advanced models: the multivariate paradigm
for filter, wrapper and embedded techniques
Univariate selection methods have certain restrictions and may lead
to less accurate classifiers by, for example, not taking into account
gene-gene interactions. Thus, researchers have proposed techniques
that try to capture these correlations between genes.
The application of multivariate filter methods ranges from simple
bivariate interactions [10] towards more advanced solutions
exploring higher order interactions, such as correlation based feature
selection (CFS) [124, 131] and several variants of the Markov
blanket filter method [38, 82, 128]. The Minimum Redundancy - Maximum Relevance (MRMR) [26] and Uncorrelated Shrunken
Centroid (USC) [132] algorithms are two other solid multivariate
filter procedures, highlighting the advantage of using multivariate
methods over univariate procedures in the gene expression domain.
Feature selection using wrapper or embedded methods offers an
alternative way to perform a multivariate gene subset selection, incorporating the classifier’s bias into the search and thus offering an
opportunity to construct more accurate classifiers. In the context of
microarray analysis, most wrapper methods use population based,
randomized search heuristics [9, 56, 71, 86], although a few examples use sequential search techniques [51, 129]. An interesting
hybrid filter-wrapper approach is introduced in [98], crossing
a univariately pre-ordered gene ranking with an incrementally
augmenting wrapper method.
Another characteristic of any wrapper procedure concerns the
scoring function used to evaluate each gene subset found. As the
0-1 accuracy measure allows for comparison with previous works,
the vast majority of papers use this measure. However, recent
proposals advocate the use of methods for the approximation of the
area under the ROC curve [81], or the optimization of the LASSO
(Least Absolute Shrinkage and Selection Operator) model [39].
ROC curves certainly provide an interesting evaluation measure,
especially suited to the demand for screening different types of
errors in many biomedical scenarios.
The embedded capacity of several classifiers to discard input
features and thus propose a subset of discriminative genes, has
been exploited by several authors. Examples include the use of
random forests (a classifier that combines many single decision
trees) in an embedded way to calculate the importance of each gene
[25, 55]. Another line of embedded FS techniques uses the weights
of each feature in linear classifiers such as SVMs [44] and logistic
regression [81]. These weights are used to reflect the relevance of
each gene in a multivariate way, and thus allow for the removal of
genes with very small weights.
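A minimal sketch of this weight-based strategy, in the recursive feature elimination (RFE) style of [44], is given below; the scikit-learn estimators and synthetic data are our illustrative assumptions.

```python
# A minimal sketch of selection via the weight vector of a linear SVM: the SVM
# is refitted repeatedly, dropping the smallest-weight features each round.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)

svm = LinearSVC(C=1.0, max_iter=10000)
rfe = RFE(estimator=svm, n_features_to_select=20, step=0.1).fit(X, y)
print(rfe.get_support(indices=True))   # the 20 surviving features
```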
Partially due to the higher computational complexity of wrapper and
to a lesser degree embedded approaches, these techniques have not
received as much interest as filter proposals. However, an advisable
practice is to pre-reduce the search space using a univariate filter
method, and only then apply wrapper or embedded methods, hence
fitting the computation time to the available resources.
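This advised two-stage practice can be sketched as follows, assuming scikit-learn pipeline components and synthetic data: a cheap univariate filter first shrinks the search space, then a wrapper runs only on the survivors.

```python
# A minimal sketch of the hybrid filter-then-wrapper practice.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                       f_classif)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=5000,
                           n_informative=10, random_state=0)

pipe = Pipeline([
    ("filter", SelectKBest(f_classif, k=100)),           # stage 1: filter
    ("wrapper", SequentialFeatureSelector(               # stage 2: wrapper
        KNeighborsClassifier(n_neighbors=3),
        n_features_to_select=10, direction="forward", cv=5)),
    ("clf", KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```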
3.3 Mass spectra analysis
Mass spectrometry technology (MS) is emerging as a new and
attractive framework for disease diagnosis and protein-based
biomarker profiling [91]. A mass spectrum sample is characterized
by thousands of different mass/charge (m/z) ratios on the x-axis,
each with their corresponding signal intensity value on the y-axis. A
typical MALDI-TOF low-resolution proteomic profile can contain
up to 15,500 data points in the spectrum between 500 and 20,000 m/z, and the number of points grows even further with higher resolution instruments.

Table 3. Key references for each type of feature selection technique in the domain of mass spectrometry.

Filter, univariate, parametric: t-test [77, 127]; F-test [8]
Filter, univariate, model-free: peak probability contrast [118]; Kolmogorov-Smirnov test [135]
Filter, multivariate: CFS [77]; Relief-F [94]
Wrapper: genetic algorithms [70, 90]; nature inspired [95, 96]
Embedded: random forest/decision tree [37, 127]; weight vector of SVM [57, 94, 138]; neural network [5]
For data mining and bioinformatics purposes, it can initially be
assumed that each m/z ratio represents a distinct variable whose
value is the intensity. As Somorjai et al. [112] explain, the data
analysis step is severely constrained by both high dimensional input
spaces and their inherent sparseness, just as it is the case with gene
expression datasets. Although the amount of publications on mass
spectrometry based data mining is not comparable to the level of
maturity reached in the microarray analysis domain, an interesting
collection of methods has been presented in the last 4-5 years (see
[49, 105] for recent reviews) since the pioneering work of Petricoin
et al. [90].
Starting from the raw data, and after an initial step to reduce noise
and normalize the spectra from different samples [19], the following
crucial step is to extract the variables that will constitute the initial
pool of candidate discriminative features. Some studies employ
the simplest approach of considering every measured value as a
predictive feature, thus applying FS techniques over initial huge
pools of about 15, 000 variables [70, 90], up to around 100, 000
variables [5]. On the other hand, a great deal of the current studies
performs aggressive feature extraction procedures using elaborated
peak detection and alignment techniques (see [19, 49, 105] for a
detailed description of these techniques). These procedures tend
to seed the dimensionality from which supervised FS techniques
will start their work in less than 500 variables [8, 96, 118]. A
feature extraction step is thus advisable to set the computational
costs of many FS techniques to a feasible size in these MS
scenarios. Table 3 presents an overview of FS techniques used
in the domain of mass spectrometry. Similar to the domain of
microarray analysis, univariate filter techniques seem to be the most
common techniques used, although the use of embedded techniques
is certainly emerging as an alternative. Although the t-test maintains
a high level of popularity [77, 127], other parametric measures (such as the F-test [8]), and a notable variety of non-parametric
scores [118, 135] have also been used in several MS studies.
Multivariate filter techniques, on the other hand, are still somewhat
underrepresented [77, 94].
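A minimal end-to-end sketch for this setting follows, with fully synthetic spectra of our own invention: raw (m/z, intensity) points are binned into coarse features, after which a model-free univariate filter (the Kolmogorov-Smirnov test, as in [135]) ranks the bins.

```python
# A minimal sketch of MS feature extraction by binning, then KS filtering.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_points = 40, 15000
mz = np.linspace(500, 20000, n_points)            # the m/z axis
spectra = rng.gamma(2.0, 1.0, size=(n_samples, n_points))
spectra[20:, (mz > 5000) & (mz < 5050)] += 3.0    # a disease-linked peak
y = np.array([0] * 20 + [1] * 20)

edges = np.linspace(500, 20000, 301)              # 300 coarse bins
idx = np.clip(np.digitize(mz, edges) - 1, 0, 299)
X = np.array([np.bincount(idx, weights=s, minlength=300)
              for s in spectra])                  # summed intensity per bin

pvals = np.array([stats.ks_2samp(X[y == 0, j], X[y == 1, j]).pvalue
                  for j in range(X.shape[1])])
print(np.argsort(pvals)[:5])                      # most discriminative bins
```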
Wrapper approaches have demonstrated their usefulness in MS
studies by a group of influential works. Different types of population
based randomized heuristics are used as search engines in the
major part of these papers: genetic algorithms [70, 90], particle
swarm optimization [95] and ant colony procedures [96]. It is worth
noting that while the first two references start the search procedure
in 15,000 dimensions by considering each m/z ratio as an
initial predictive feature, aggressive peak detection and alignment
processes reduce the initial dimension to about 300 variables in the
last two references [95, 96].
An increasing number of papers use the embedded capacity of
several classifiers to discard input features. Variations of the popular
method originally proposed for gene expression domains by Guyon
et al. [44], using the weights of the variables in the SVM-
formulation to discard features with small weights, have been
broadly and successfully applied in the MS domain [57, 94, 138].
Based on a similar framework, the weights of the input masses in
a neural network classifier have been used to rank the features’
importance in Ball et al. [5]. The embedded capacity of random
forests [127] and other types of decision tree based algorithms [37]
constitutes an alternative embedded FS strategy.
4 DEALING WITH SMALL SAMPLE DOMAINS
Small sample sizes, and their inherent risk of imprecision and
overfitting, pose a great challenge for many modelling problems
in bioinformatics [11, 84, 108]. In the context of feature selection,
two initiatives have emerged in response to this novel experimental
situation: the use of adequate evaluation criteria, and the use of
stable and robust feature selection models.
4.1 Adequate evaluation criteria
Several papers have warned about the substantial number of
applications not performing an independent and honest validation
of the reported accuracy percentages [3, 112, 113]. In such cases,
authors often select a discriminative subset of features using the
whole dataset. The accuracy of the final classification model is
estimated using this subset, thus testing the discrimination rule
on samples that were already used to propose the final subset of
features. We feel that the practice of re-performing feature selection on only the training samples at each stage of the accuracy estimation procedure is gaining ground in the bioinformatics community.
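The difference between an honest and a biased protocol can be demonstrated on pure noise, where the true accuracy is about 50%; the scikit-learn components below are our illustrative assumptions.

```python
# A minimal sketch contrasting honest and biased accuracy estimation.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))        # pure noise: no gene is informative
y = np.array([0, 1] * 30)

# Honest: the filter is refitted inside every cross-validation fold.
pipe = Pipeline([("select", SelectKBest(f_classif, k=50)),
                 ("clf", LinearSVC(max_iter=10000))])
print(cross_val_score(pipe, X, y, cv=5).mean())    # close to 0.5

# Biased (do NOT do this): selecting on the full dataset first leaks test
# information into the subset and inflates the estimate well above 0.5.
X_leaky = SelectKBest(f_classif, k=50).fit_transform(X, y)
print(cross_val_score(LinearSVC(max_iter=10000), X_leaky, y, cv=5).mean())
```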
Furthermore, novel predictive accuracy estimation methods with
promising characteristics, such as bolstered error estimation [107],
have emerged to deal with the specificities of small sample domains.
4.2 Ensemble feature selection approaches
Instead of choosing one particular FS method, and accepting its
outcome as the final subset, different FS methods can be combined
using ensemble FS approaches. Based on the evidence that there
is often not a single universally optimal feature selection technique
[130], and due to the possible existence of more than one subset
of features that discriminates the data equally well [133], model
combination approaches such as boosting have been adapted to
improve the robustness and stability of final, discriminative methods
[7, 29].
Novel ensemble techniques in the microarray and mass spectrometry
domains include averaging over multiple single feature subsets
[69, 73], integrating a collection of univariate differential gene
expression purpose statistics via a distance synthesis scheme
[130], using different runs of a genetic algorithm to assess the relative importance of each feature [70, 71], computing the Kolmogorov-
Smirnov test in different bootstrap samples to assign a probability
of being selected to each peak [134], and a number of Bayesian
averaging approaches [65, 133]. Furthermore, methods based on
a collection of decision trees (e.g. random forests) can be used
in an ensemble FS way to assess the relevance of each feature
[25, 37, 55, 127].
Although the use of ensemble approaches requires additional
computational resources, we would like to point out that they
offer an advisable framework to deal with small sample domains,
provided the extra computational resources are affordable.
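A minimal sketch of one such ensemble scheme follows (our own illustration, not a specific published method): a univariate ranking is recomputed on many bootstrap samples and the per-feature ranks are averaged, trading computation for a more stable gene list.

```python
# A minimal sketch of ensemble feature selection by bootstrap rank averaging.
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1000))
X[:15, :20] += 1.2                    # 20 truly differential genes
y = np.array([1] * 15 + [0] * 15)

n_boot, m = 50, X.shape[1]
rank_sum = np.zeros(m)
for _ in range(n_boot):
    idx = rng.integers(0, len(y), len(y))          # bootstrap resample
    scores, _ = f_classif(X[idx], y[idx])
    rank_sum += np.argsort(np.argsort(-scores))    # rank 0 = best feature
mean_rank = rank_sum / n_boot

print(np.argsort(mean_rank)[:10])     # genes that rank highly consistently
```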
5 FEATURE SELECTION IN UPCOMING DOMAINS
5.1 Single nucleotide polymorphism analysis
Single nucleotide polymorphisms (SNPs) are mutations at a single
nucleotide position that occurred during evolution and were passed
on through heredity, accounting for most of the genetic variation
among different individuals. SNPs are at the forefront of many
disease-gene association studies, their number being estimated at
about 7 million in the human genome [63]. Thus, selecting a
subset of SNPs that is sufficiently informative but still small enough
to reduce the genotyping overhead is an important step towards
disease-gene association. Typically, the number of SNPs considered
is not higher than tens of thousands with sample sizes of about one
hundred.
Several computational methods for htSNP selection (haplotype
SNPs; a set of SNPs located on one chromosome) have been
proposed in the past few years. One approach is based on the
hypothesis that the human genome can be viewed as a set of discrete
blocks that only share a very small set of common haplotypes [21].
This approach aims to identify a subset of SNPs that can either
distinguish all the common haplotypes [36], or at least explain
a certain percentage of them. Another common htSNP selection
approach is based on pairwise associations of SNPs, and tries to
select a set of htSNPs such that each of the SNPs on a haplotype
is highly associated with one of the htSNPs [15]. A third approach
considers htSNPs as a subset of all SNPs, from which the remaining
SNPs can be reconstructed [46, 66, 75]. The idea is to select htSNPs
based on how well they predict the remaining set of the unselected
SNPs.
When the haplotype structure in the target region is unknown, a
widely used approach is to choose markers at regular intervals
[67], given either the number of SNPs to choose or the desired
interval. In [74] an ensemble approach is successfully applied to the
identification of relevant SNPs for alcoholism, while [41] propose
a robust feature selection technique based on a hybrid between
a genetic algorithm and an SVM. The Relief-F feature selection
algorithm, in conjunction with three classification algorithms (k-
NN, SVM and naive Bayes) has been proposed in [123]. Genetic
algorithms have been applied to the search of the best subset of
SNPs, evaluating them with a multivariate filter (CFS), and also in a
wrapper manner (with a decision tree as supervised classification
paradigm) [103]. The multiple linear regression SNP prediction
algorithm [48] predicts a complete genotype based on the values
of its informative SNPs (selected with a stepwise tag selection
algorithm), their positions among all SNPs, and a sample of
complete genotypes. In [104] the tag SNP selection method allows the user to specify variable tagging thresholds, based on correlations, for
different SNPs.
5.2 Text and literature mining
Text and literature mining is emerging as a promising area for data
mining in biology [17, 54]. One important representation of text
and documents is the so-called bag-of-words (BOW) representation,
where each word in the text represents one variable, and its value
consists of the frequency of the specific word in the text. It goes
without saying that such a representation of the text may lead to
very high dimensional datasets, pointing out the need for feature
selection techniques.
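As a hedged illustration, the sketch below builds a BOW matrix from an invented four-document corpus and ranks words with a Chi-square filter; the corpus, labels and scikit-learn usage are our own assumptions.

```python
# A minimal sketch of feature selection on a bag-of-words representation.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = ["protein binding site prediction",
        "kinase inhibits protein phosphorylation",
        "stock market prices fell sharply",
        "market analysts expect growth"]
y = np.array([1, 1, 0, 0])            # 1 = biomedical, 0 = not

vec = CountVectorizer()
X = vec.fit_transform(docs)           # sparse document-by-word count matrix

scores, _ = chi2(X, y)
vocab = np.array(vec.get_feature_names_out())
print(vocab[np.argsort(scores)[::-1][:5]])   # most discriminative words
```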
Although the application of feature selection techniques is common
in the field of text classification (see e.g. [34] for a review), the
application in the biomedical domain is still in its infancy. Some
examples of FS techniques in the biomedical domain include the
work of Dobrokhotov et al. [27], who use the Kullback-Leibler
divergence as a univariate filter method to find discriminating words
in a medical annotation task, the work of Eom and Zhang [32]
who use symmetrical uncertainty (an entropy based filter method)
for identifying relevant features for protein interaction discovery,
and the work of Han et al. [47], which discusses the use of feature
selection for a document classification task.
It can be expected that, for tasks such as biomedical document
clustering and classification, the large number of feature selection
techniques that were already developed in the text mining
community will be of practical use for researchers in biomedical
literature mining [17].
6 FS SOFTWARE PACKAGES
In order to provide the interested reader with some pointers to
existing software packages, Table 4 shows an overview of existing
software implementing a variety of feature selection methods.
All software packages mentioned are free for academic use, and
the software is organized into four sections: general purpose
FS techniques, techniques tailored to the domain of microarray
analysis, techniques specific to the domain of mass spectra analysis,
and techniques to handle SNP selection. For each software package,
the main reference, implementation language and website is shown.
In addition to these publicly available packages, we also provide a
companion website of this work (see the Abstract for the location).
On this website, the publications are indexed according to the FS technique used, with a number of keywords accompanying each reference to clarify its FS methodological contributions.
7 CONCLUSIONS AND FUTURE PERSPECTIVES
In this paper, we reviewed the main contributions of feature
selection research in a set of well-known bioinformatics applications.
Two main issues emerge as common problems in the bioinformatics
domain: the large input dimensionality, and the small sample sizes.
To deal with these problems, a wealth of FS techniques has been
designed by researchers in bioinformatics, machine learning and
data mining.
A large and fruitful effort has been made during recent years in the adaptation and proposal of univariate filter FS techniques. In
general, we observe that many researchers in the field still think that
filter FS approaches are restricted to univariate approaches. The proposal of multivariate selection algorithms can be considered one of the most promising future lines of work for the bioinformatics community.
Table 4. Software for feature selection.

General purpose FS software
WEKA | Java [126] | http://www.cs.waikato.ac.nz/ml/weka
Fast Correlation Based Filter | Java [136] | http://www.public.asu.edu/~huanliu/FCBF/FCBFsoftware.html
Feature Selection Book | Ansi C [78] | http://www.public.asu.edu/~huanliu/Fsbook
MLC++ | C++ [61] | http://www.sgi.com/tech/mlc
Spider | Matlab | http://www.kyb.tuebingen.mpg.de/bs/people/spider
SVM and Kernel Methods Matlab Toolbox | Matlab [14] | http://asi.insa-rouen.fr/~arakotom/toolbox/index

Microarray analysis FS software
SAM | R, Excel [121] | http://www-stat.stanford.edu/~tibs/SAM/
GALGO | R [119] | http://www.bip.bham.ac.uk/bioinf/galgo.html
PCP | C, C++ [13] | http://pcp.sourceforge.net
GA-KNN | C [71] | http://dir.niehs.nih.gov/microarray/datamining/
Rankgene | C [115] | http://genomics10.bu.edu/yangsu/rankgene/
EDGE | R [68] | http://www.biostat.washington.edu/software/jstorey/edge/
GEPAS-Prophet | Perl, C [83] | http://prophet.bioinfo.cipf.es/
DEDS (Bioconductor) | R [130] | http://www.bioconductor.org/
RankProd (Bioconductor) | R [12] | http://www.bioconductor.org/
Limma (Bioconductor) | R [111] | http://www.bioconductor.org/
Multtest (Bioconductor) | R [30] | http://www.bioconductor.org/
Nudge (Bioconductor) | R [22] | http://www.bioconductor.org/
Qvalue (Bioconductor) | R [114] | http://www.bioconductor.org/
twilight (Bioconductor) | R [102] | http://www.bioconductor.org/
ComparativeMarkerSelection (GenePattern) | Java, R [42] | http://www.broad.mit.edu/genepattern

Mass spectra analysis FS software
GA-KNN | C [70] | http://dir.niehs.nih.gov/microarray/datamining/
R-SVM | R, C, C++ [138] | http://www.hsph.harvard.edu/bioinfocore/RSVMhome/R-SVM.html

SNP analysis FS software
CHOISS | C++, Perl [67] | http://biochem.kaist.ac.kr/choiss.htm
MLR-tagging | C [48] | http://alla.cs.gsu.edu/~software/tagging/tagging.html
WCLUSTAG | Java [104] | http://bioinfo.hku.hk/wclustag
A second line of future research is the development of especially
fitted ensemble FS approaches to enhance the robustness of the
finally selected feature subsets. We feel that, in order to deal with the small sample sizes that characterize the majority of bioinformatics applications, the further development of such techniques, combined
with appropriate evaluation criteria, constitutes an interesting
direction for future FS research.
Other interesting opportunities for future FS research will be the
extension towards upcoming bioinformatics domains such as SNPs,
text and literature mining, and the combination of heterogeneous
data sources. While in these domains, the FS component is not
yet as central as e.g. in gene expression or MS areas, we believe
that its application will become essential in dealing with the high
dimensional character of these applications.
To conclude, we would like to note that, in order to maintain
an appropriate size of the paper, we had to limit the number of
referenced studies. We therefore apologize to the authors of papers
that were not cited in this work.
8 ACKNOWLEDGEMENTS
We would like to thank the anonymous reviewers for their
constructive comments, which significantly improved the quality
of this review. This work was supported by BOF grant 01P10306
from Ghent University to Y.S., and the SAIOTEK and ETORTEK
programs of the Basque Government and project TIN2005-03824 of
the Spanish Ministry of Education and Science to I.I. and P.L.
REFERENCES
[1]A. Al-Shahib, R. Breitling, and D. Gilbert. Feature selection and the class
imbalance problem in predicting protein function from sequence. Applied
Bioinformatics, 4(3):195–203, 2005.
[2]U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine.
Broad patterns of gene expression revealed by clustering analysis of tumor and
normal colon tissues probed by oligonucleotide arrays. In Proceedings of the
National Academy of Sciences, USA, volume 96, pages 6745–6750, 1999.
[3]C. Ambroise and G. McLachlan. Selection bias in gene extraction on the basis
of microarray gene-expression data. In Proceedings of the National Academy of
Sciences, volume 99, pages 6562–6566, 2002.
[4]P. Baldi and A. Long. A Bayesian framework for the analysis of microarray
expression data: regularized t-test and statistical inferences of gene changes.
Bioinformatics, 17(6):509–516, 2001.
[5]G. Ball, S. Mian, F. Holding, R. Allibone, J. Lowe, S. Ali, G. Li, S. McCardle,
I. Ellis, C. Creaser, and R. Rees. An integrated approach utilizing artificial neural
networks and SELDI mass spectrometry for the classification of human tumours
and rapid identification of potential biomarkers. Bioinformatics, 18(3):395–404,
2002.
[6]M. Ben-Bassat. Pattern recognition and reduction of dimensionality. In
P. Krishnaiah and L. Kanal, editors, Handbook of Statistics II, volume 1, pages
773–791. North-Holland, 1982. Amsterdam.
[7]A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini.
Tissue classification with gene expression profiles. Journal of Computational
Biology, 7(3-4):559–584, 2000.
[8]G. Bhanot, G. Alexe, B. Venkataraghavan, and A. Levine. A robust meta-
classification strategy for cancer detection from MS data. Proteomics, 6(2):592–
604, 2006.
[9]R. Blanco, P. Larrañaga, I. Inza, and B. Sierra. Gene selection for cancer
classification using wrapper approaches. International Journal of Pattern
Recognition and Artificial Intelligence, 18(8), 2004.
[10]T. Bø and I. Jonassen. New feature subset selection procedures for classification
of expression profiles. Genome Biology, 3(4):research0017.1–research0017.11,
2002.
[11]U. Braga-Neto and E. Dougherty. Is cross-validation valid for small-sample
microarray classification? Bioinformatics, 20(3):374–380, 2004.
[12]R. Breitling, P. Armengaud, A. Amtmann, and P. Herzyk. Rank products: a
simple, yet powerful, new method to detect differentially regulated genes in
replicated microarray experiments. FEBS Letters, 573:83–92, 2004.
[13]L. Buturovic. PCP: a program for supervised classification of gene expression
profiles. Bioinformatics, 22(2):245–247, 2005.
[14]S. Canu, Y. Grandvalet, and A. Rakotomamonjy. SVM and Kernel Methods
Matlab Toolbox. In Perception Systèmes et Information, INSA de Rouen, Rouen,
France, 2003.
[15]C. Carlson, M. Eberle, M. Rieder, Q. Yi, L. Kruglyak, and D. Nickerson.
Selecting a maximally informative set of single-nucleotide polymorphisms for
association analyses using linkage disequilibrium. American Journal of Human
Genetics, 74:106–120, 2004.
[16]N. Chuzhanova, A. Jones, and S. Margetts. Feature selection for genetic
sequence classification. Bioinformatics, 14(2):139–143, 1998.
[17]A. Cohen and W. Hersch. A survey of current work in biomedical text mining.
Briefings in Bioinformatics, 6(1):57–71, 2005.
[18]P. Conilione and D. Wang. A comparative study on feature selection for E.coli
promoter recognition. International Journal of Information Technology, 11:54–
66, 2005.
[19]K. Coombes, K. Baggerly, and J. Morris. Pre-processing mass spectometry data.
In M. Dubitzky, M. Granzow, and D. Berrar, editors, Fundamentals of Data
Mining in Genomics and Proteomics, pages 79–99. Kluwer, 2007.
[20]W. Daelemans, V. Hoste, F. De Meulder, and B. Naudts. Combined optimization
of feature selection and algorithm parameter interaction in machine learning
of language. In Proceedings of the 14th European Conference on Machine
Learning (ECML-2003), pages 84–95, 2003.
[21]M. Daly, J. Rioux, S. Schaffner, T. Hudson, and E. Lander. High-resolution
haplotype structure in the human genome. Nature Genetics, 29:229–232, 2001.
[22]N. Dean and A. Raftery. Normal uniform mixture differential gene expression
detection in cDNA microarrays. BMC Bioinformatics, 6(173), 2005.
[23]S. Degroeve, B. De Baets, Y. Van de Peer, and P. Rouzé. Feature subset selection
for splice site prediction. Bioinformatics, 18 Supp.2:75–83, 2002.
[24]A. Delcher, D. Harmon, S. Kasif, O. White, and S. Salzberg. Improved microbial
gene identification with GLIMMER. Nucleic Acids Research, 27:4636–4641,
1999.
[25]R. Díaz-Uriarte and S. Alvarez de Andrés. Gene selection and classification of
microarray data using random forest. BMC Bioinformatics, 7(3), 2006.
[26]C. Ding and H. Peng. Minimum redundancy feature selection from microarray
gene expression data. In Proceedings of the IEEE Conference on Computational
Systems Bioinformatics, pages 523–528, 2003.
[27]P. Dobrokhotov, C. Goutte, A. Veuthey, and E. Gaussier. Combining NLP
and probabilistic categorisation for document and term selection for Swiss-Prot
medical annotation. Bioinformatics, 19 Supp.1:91–94, 2003.
[28]R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley, New York, 2001.
[29]S. Dudoit, J. Fridlyand, and T. Speed. Comparison of discriminant methods for
the classification of tumors using gene expression data. Journal of the American
Statistical Association, 97(457):77–87, 2002.
[30]S. Dudoit, J. Shaffer, and J. Boldrick. Multiple hypothesis testing in microarray
experiments. Statistical Science, 18:7–103, 2003.
[31]B. Efron, R. Tibshirani, J. Storey, and V. Tusher. Empirical Bayes analysis
of a microarray experiment. Journal of the American Statistical Association,
96(456):1151–1160, 2001.
[32]J. Eom and B. Zhang. PubMiner: machine learning-based text mining for
biomedical information analysis. In Lecture Notes in Artificial Intelligence,
volume 3192, pages 216–225, 2000.
[33]F. Ferri, P. Pudil, M. Hatef, and J. Kittler. Pattern Recognition in Practice
IV, Multiple Paradigms, Comparative Studies and Hybrid Systems, chapter
Comparative study of techniques for large-scale feature selection, pages 403–
413. Elsevier, 1994.
[34]G. Forman. An extensive empirical study of feature selection metrics for text
classification. Journal of Machine Learning Research, 3:1289–1305, 2003.
[35]R. Fox and M. Dimmic. A two-sample Bayesian t-test for microarray data. BMC
Bioinformatics, 7(1):126, 2006.
[36]S. Gabriel, S. Schaffner, H. Nguyen, J. Moore, J. Roy, B. Blumenstiel,
J. Higgins, M. DeFelice, A. Lochner, M. Faggart, S. Liu-Cordero, C. Rotimi,
A. Adeyemo, R. Cooper, R. Ward, E. Lander, M. Daly, and D. Altshuler. The
structure of haplotype blocks in the human genome. Science, 296:2225–2229,
2002.
[37]P. Geurts, M. Fillet, D. de Seny, M.-A. Meuwis, M. Malaise, M.-P. Merville, and
L. Wehenkel. Proteomic mass spectra classification using decision tree based
ensemble methods. Bioinformatics, 21(15):3138–3145, 2005.
[38]O. Gevaert, F. De Smet, D. Timmerman, Y. Moreau, and B. De Moor. Predicting
the prognosis of breast cancer by integrating clinical and microarray data with
Bayesian networks. Bioinformatics, 22(14):e184–e190, 2006.
[39]D. Ghosh and M. Chinnaiyan. Classification and selection of biomarkers
in genomic data using LASSO. Journal of Biomedicine and Biotechnology,
2005(2):147–154, 2005.
[40]T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov,
H. Coller, M. Loh, J. Downing, M. Caliguri, C. Bloomfield, and E. Lander.
Molecular classification of cancer: class discovery and class prediction by gene
expression monitoring. Science, 286:531–537, 1999.
[41]B. Gong, Z. Guo, J. Li, G. Zhu, S. Lv, S. Rao, and X. Li. Application of genetic
algorithm support vector machine hybrid for prediction of clinical phenotypes
based on genome-wide SNP profiles of sib pairs. In Lecture Notes in Computer
Science 3614, pages 830–835. Springer, 2005.
[42]J. Gould, G. Getz, S. Monti, M. Reich, and J. Mesirov. Comparative gene marker
selection suite. Bioinformatics, 22(15):1924–1925, 2006.
[43]I. Guyon and A. Elisseeff. An introduction to variable and feature selection.
Journal of Machine Learning Research, 3:1157–1182, 2003.
[44]I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer
classification using support vector machines. Machine Learning, 46(1-3):389–
422, 2002.
[45]M. Hall. Correlation-based Feature Selection for Machine Learning. PhD thesis,
Department of Computer Science, Waikato University, New Zealand, 1999.
[46]E. Halperin, G. Kimmel, and R. Shamir. Tag SNP selection in genotype data for
maximizing SNP prediction accuracy. Bioinformatics, 21(suppl. 1):i195–203,
2005.
[47]B. Han, Z. Obradovic, Z. Hu, C. Wu, and S. Vucetic. Substring selection for
biomedical document classification. Bioinformatics, 22(17):2136–2142, 2006.
[48]J. He and A. Zelikovsky. MLR-tagging: informative SNP selection for unphased
genotypes based on multiple linear regression. Bioinformatics, 22(20):2558–
2561, 2006.
[49]M. Hilario, A. Kalousis, C. Pellegrini, and M. Muller. Processing and
classification of protein mass spectra. Mass Spectometry Reviews, 25(3):409–
449, 2006.
[50]J. Holland. Adaptation in Natural and Artificial Systems. University of Michigan
Press, 1975.
[51]I. Inza, P. Larrañaga, R. Blanco, and A. Cerrolaza. Filter versus wrapper gene
selection approaches in DNA microarray domains. Artificial Intelligence in
Medicine, 31(2):91–103, 2004.
[52]I. Inza, P. Larrañaga, R. Etxebarria, and B. Sierra. Feature subset selection by
Bayesian networks based optimization. Artifical Intelligence, 123(1-2):157–184,
2000.
[53]P. Jafari and F. Azuaje. An assessment of recently published gene expression data
analyses: reporting experimental design and statistical factors. BMC Medical
Informatics and Decision Making, 6(1):27, 2006.
[54]L. Jensen, J. Saric, and P. Bork. Literature mining for the biologist:
from information retrieval to biological discovery. Nature Reviews Genetics,
7(2):119–129, 2006.
[55]H. Jiang, Y. Deng, H.-S. Cheng, L. Tao, Q. Sha, J. Chen, C.-J. Tsai, and
S. Zhang. Joint analysis of two microarray gene-expression data sets to select
lung adenocarcinoma marker genes. BMC Bioinformatics, 5(81), 2004.
[56]T. Jirapech-Umpai and S. Aitken. Feature selection and classification for
microarray data analysis: evolutionary methods for identifying predictive genes.
BMC Bioinformatics, 6(148), 2005.
[57]K. Jong, E. Marchiori, M. Sebag, and A. van der Vaart. Feature selection in
proteomic pattern data with support vector machines. In Proceedings of the IEEE
Symposium on Computational Intelligence in Bioinformatics and Computational
Biology, pages 41–48, 2004.
[58]S. Keles, M. van der Laan, and M. Eisen. Identification of regulatory elements
using a feature selection method. Bioinformatics, 18(9):1167–1175, 2002.
[59]S. Kim, J. Nam, J. Rhee, W. Lee, and B. Zhang. miTarget: microRNA target
gene prediction using a support vector machine. BMC Bioinformatics, 7(411),
2006.
[60]J. Kittler. Pattern Recognition and Signal Processing, chapter Feature set
search algorithms, pages 41–60. Sijthoff and Noordhoff, Alphen aan den Rijn,
Netherlands, 1978.
[61]R. Kohavi, D. Sommerfield, and J. Dougherty. Data mining using MLC++: a
machine learning library in C++. In Tools with Artificial Intelligence, pages
234–245. IEEE Computer Society Press, 1996.
[62]D. Koller and M. Sahami. Toward optimal feature selection. In Proceedings of
the Thirteenth International Conference on Machine Learning, pages 284–292,
Bari, Italy, 1996.
[63]L. Kruglyak and D. A. Nickerson. Variation is the spice of life. Nature Genetics,
27:234–236, 2001.
[64]J. Lee, J. Loo, M. Park, and S. Song. An extensive comparison of recent
classification tools applied to microarray data. Computational Statistics and
Data Analysis, 48:869–885, 2005.
[65]K. Lee, N. Sha, E. Dougherty, M. Vannucci, and B. Mallick. Gene selection: a
Bayesian variable selection approach. Bioinformatics, 19(1):90–97, 2003.
[66]P. H. Lee and H. Shatkay. BNTagger: improved tagging SNP selection using
Bayesian networks. Bioinformatics, 22(14):e211–e219, 2006.
[67]S. Lee and C. Kang. CHOISS for selection on single nucleotide polymorphism
markers on interval regularity. Bioinformatics, 20(4):581–582, 2004.
[68]J. Leek, E. Monsen, A. Dabney, and J. Storey. EDGE: extraction and analysis of
differential gene expression. Bioinformatics, 22(4):507–508, 2006.
[69]I. Levner. Feature selection and nearest centroid classification for protein mass
spectrometry. BMC Bioinformatics, 6(68), 2005.
[70]L. Li, D. Umbach, P. Terry, and J. Taylor. Applications of the GA/KNN method
to SELDI proteomics data. Bioinformatics, 20(10):1638–1640, 2004.
[71]L. Li, C. Weinberg, T. Darden, and L. Pedersen. Gene selection for sample
classification based on gene expression data: study of sensitivity to choice of
parameters of the GA/KNN method. Bioinformatics, 17(12):1131–1142, 2001.
[72]T. Li, C. Zhang, and M. Ogihara. A comparative study of feature selection
and multiclass classification methods for tissue classification based on gene
expression. Bioinformatics, 20(15):2429–2437, 2004.
[73]W. Li and Y. Yang. How many genes are needed for a discriminant microarray
data analysis? In S. M. Lin and K. F. Johnson, editors, Methods of Microarray
Data Analysis. First Conference on Critical Assessment of Microarray Data
Analysis, CAMDA2000, pages 137–150, 2002.
[74]X. Li, S. Rao, W. Zhang, G. Zheng, W. Jiang, and L. Du. Large-scale
ensemble decision analysis of sib-pair ibd profiles for identification of the
relevant molecular signatures for alcoholism. In Lecture Notes in Computer
Science 3614, pages 1184–1189. Springer, 2005.
[75]Z. Lin and R. B. Altman. Finding haplotype tagging SNPs by use of principal
components analysis. American Journal of Human Genetics, 73:850–861, 2004.
[76]H. Liu, H. Han, J. Li, and L. Wong. Using amino acid patterns to accurately
predict translation initiation sites. In Silico Biology, 4(3):255–269, 2004.
[77]H. Liu, J. Li, and L. Wong. A comparative study on feature selection and
classification methods using gene expression profiles and proteomic patterns.
Genome Informatics, 13:51–60, 2002.
[78]H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data
Mining. Kluwer Academic Publishers, 1998.
[79]H. Liu and L. Yu. Toward integrating feature selection algorithms for
classification and clustering. IEEE Transactions on Knowledge and Data
Engineering, 17(4):491–502, 2005.
[80]J. Lyons-Weiler, S. Patel, M. Becich, and T. Godfrey. Tests for finding complex
patterns of differential expression in cancers: towards individualized medicine.
BMC Bioinformatics, 5(110), 2004.
[81]S. Ma and J. Huang. Regularized ROC method for disease classification and
biomarker selection with microarray data. Bioinformatics, 21(24):4356–4362,
2005.
[82]H. Mamitsuka. Selecting features in microarray classification using ROC curves.
Pattern Recognition, 39:2393–2404, 2006.
[83]I. Medina, D. Montaner, J. Tárraga, and J. Dopazo. Prophet, a web-based tool for class prediction using microarray data. Bioinformatics, 23(3):390–391, 2007.
[84]A. Molinaro, R. Simon, and R. Pfeiffer. Prediction error estimation: a
comparison of resampling methods. Bioinformatics, 21(15):3301–3307, 2005.
[85]M. Newton, C. Kendziorski, C. Richmond, F. Blattner, and K. Tsui. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology, 8:37–52, 2001.
[86]C. Ooi and P. Tan. Genetic algorithms applied to multi-class prediction for the
analysis of gene expression data. Bioinformatics, 19(1):37–44, 2003.
[87]W. Pan. On the use of permutation in and the performance of a class of
nonparametric methods to detect differential gene expression. Bioinformatics,
19(11):1333–1340, 2003.
[88]P. Park, M. Pagano, and M. Bonetti. A nonparametric scoring algorithm for
identifying informative genes from microarray data. Pacific Symposium on
Biocomputing, 6:52–63, 2001.
[89]P. Pavlidis and P. Poirazi. Individualized markers optimize class prediction of
microarray data. BMC Bioinformatics, 7(1):345, 2006.
[90]E. Petricoin, A. Ardekani, B. Hitt, P. Levine, V. Fusaro, S. Steinberg, G. Mills, C. Simone, D. Fishman, E. Kohn, and L. Liotta. Use of proteomic patterns in serum to identify ovarian cancer. The Lancet, 359(9306):572–577, 2002.
[91]E. Petricoin and L. Liotta. Mass spectrometry-based diagnostics: the upcoming revolution in disease detection. Clinical Chemistry, 49(4):533–534, 2003.
[92]A. Ploner, S. Calza, A. Gusnanto, and Y. Pawitan. Multidimensional local false
discovery rate for microarray studies. Bioinformatics, 22(5):556–565, 2006.
[93]S. Pounds and C. Cheng. Improving false discovery rate estimation.
Bioinformatics, 20(11):1737–1754, 2004.
[94]J. Prados, A. Kalousis, J.-C. Sánchez, L. Allard, O. Carrette, and M. Hilario. Mining mass-spectra for diagnosis and biomarker discovery of cerebral accidents. Proteomics, 4(8):2320–2332, 2004.
[95]H. Ressom, R. Varghese, M. Abdel-Hamid, S. Abdel-Latif Eissa, D. Saha, L. Goldman, E. Petricoin, T. Conrads, T. Veenstra, C. Loffredo, and R. Goldman. Analysis of mass spectral serum profiles for biomarker selection. Bioinformatics, 21(21):4039–4045, 2005.
[96]H. Ressom, R. Varghese, S. Drake, G. Hortin, M. Abdel-Hamid, C. Loffredo, and R. Goldman. Peak selection from MALDI-TOF mass spectra using ant colony optimization. Bioinformatics, 23(5):619–626, 2007.
[97]D. Ross, U. Scherf, M. Eisen, C. Perou, C. Rees, P. Spellman, V. Iyer, S. Jeffrey,
M. Van de Rijn, M. Waltham, A. Pergamenschikov, J. Lee, D. Lashkari,
D. Shalon, T. Myers, J. Weinstein, D. Botstein, and P. Brown. Systematic
variation in gene expression patterns in human cancer cell lines. Nature
Genetics, 24(3):227–234, 2000.
[98]R. Ruiz, J. Riquelme, and J. Aguilar-Ruiz. Incremental wrapper-based gene
selection from microarray data for cancer classification. Pattern Recognition,
39:2383–2392, 2006.
[99]Y. Saeys, S. Degroeve, D. Aeyels, P. Rouzé, and Y. Van de Peer. Feature selection for splice site prediction: a new method using EDA-based feature ranking. BMC Bioinformatics, 5(1):64, 2004.
[100]Y. Saeys, P. Rouzé, and Y. Van de Peer. In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi, and protists. Bioinformatics, 23(4):414–420, 2007.
[101]S. Salzberg, A. Delcher, S. Kasif, and O. White. Microbial gene identification using interpolated Markov models. Nucleic Acids Research, 26:544–548, 1998.
[102]S. Scheid and R. Spang. twilight; a Bioconductor package for estimating the
local false discovery rate. Bioinformatics, 21(12):2921–2922, 2005.
[103]S. Shah and A. Kusiak. Data mining and genetic algorithm based gene/SNP
selection. Artificial Intelligence in Medicine, 31:183–196, 2004.
[104]P. Sham, S. Ao, J. Kwan, P. Kao, F. Cheung, P. Fong, and M. Ng. Combining functional and linkage disequilibrium information in the selection of tag SNPs. Bioinformatics, 23(1):129–131, 2007.
[105]H. Shin and M. Markey. A machine learning perspective on the development
of clinical decision support systems utilizing mass spectra of blood samples.
Journal of Biomedical Informatics, 39:227–248, 2006.
[106]W. Siedlecki and J. Sklansky. On automatic feature selection. International Journal of Pattern Recognition and Artificial Intelligence, 2:197–220, 1988.
[107]C. Sima, U. Braga-Neto, and E. Dougherty. Superior feature-set ranking for
small samples using bolstered error estimation. Bioinformatics, 21(7):1046–
1054, 2005.
[108]C. Sima and E. Dougherty. What should be expected from feature selection in
small-sample settings. Bioinformatics, 22(19):2430–2436, 2006.
[109]S. Sinha. Discriminative motifs. Journal of Computational Biology, 10(3-
4):599–615, 2003.
[110]D. Skalak. Prototype and feature selection by sampling and random mutation hill
climbing algorithms. In Proceedings of the Eleventh International Conference
on Machine Learning, pages 293–301, 1994.
[111]G. Smyth. Linear models and empirical Bayes methods for assessing differential
expression in microarray experiments. Statistical Applications in Genetics and
Molecular Biology, 3(1):Article 3, 2004.
[112]R. Somorjai, B. Dolenko, and R. Baumgartner. Class prediction and discovery
using gene microarray and proteomics mass spectroscopy data: curses, caveats,
cautions. Bioinformatics, 19(12):1484–1491, 2003.
[113]A. Statnikov, C. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5):631–643, 2005.
[114]J. Storey. A direct approach to false discovery rates. Journal of the Royal
Statistical Society. Series B, 64:479–498, 2002.
[115]Y. Su, T. Murali, V. Pavlovic, M. Schaffer, and S. Kasif. RankGene: identification of diagnostic genes based on expression data. Bioinformatics, 19(12):1578–1579, 2003.
[116]M. Tadesse, M. Vannucci, and P. Lio. Identification of DNA regulatory motifs
using Bayesian variable selection. Bioinformatics, 20(16):2553–2561, 2004.
[117]J. Thomas, J. Olson, S. Tapscott, and L. Zhao. An efficient and robust statistical
modeling approach to discover differentially expressed genes using genomic
expression profiles. Genome Research, 11:1227–1236, 2001.
[118]R. Tibshirani, T. Hastie, B. Narasimhan, S. Soltys, G. Shi, A. Koong, and Q.-T.
Le. Sample classification from protein mass spectrometry, by ‘peak probability
contrast’. Bioinformatics, 20(17):3034–3044, 2004.
[119]V. Trevino and F. Falciani. GALGO: an R package for multivariate variable
selection using genetic algorithms. Bioinformatics, 22(9):1154–1156, 2006.
[120]O. Troyanskaya, M. Garber, P. Brown, D. Botstein, and R. Altman. Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics, 18(11):1454–1461, 2002.
[121]V. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays
applied to ionizing radiation response. In Proceedings of the National Academy
of Sciences, volume 98, pages 5116–5121, 2001.
[122]R. Varshavsky, A. Gottlieb, M. Linial, and D. Horn. Novel unsupervised feature filtering of biological data. Bioinformatics, 22(14):e507–e513, 2006.
[123]Y. Wang, F. Makedon, and J. Pearlman. Tumor classification based on DNA copy number aberrations determined using SNP arrays. Oncology Reports, 5:1057–1059, 2006.
[124]Y. Wang, I. Tetko, M. Hall, E. Frank, A. Facius, K. Mayer, and H. Mewes. Gene
selection from microarray data for cancer classification - a machine learning
approach. Computational Biology and Chemistry, 29:37–46, 2005.
[125]J. Weston, A. Elisseeff, B. Schoelkopf, and M. Tipping. Use of the zero-norm
with linear models and kernel methods. Journal of Machine Learning Research,
3:1439–1461, 2003.
[126]I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2nd Edition. Morgan Kaufmann, San Francisco, 2005.
[127]B. Wu, T. Abbott, D. Fishman, W. McMurray, G. Mor, K. Stone, D. Ward, K. Williams, and H. Zhao. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics, 19(13):1636–1643, 2003.
[128]E. P. Xing, M. I. Jordan, and R. M. Karp. Feature selection for high-dimensional
genomic microarray data. In Proceedings of the Eighteenth International
Conference on Machine Learning, pages 601–608, 2001.
[129]M. Xiong, Z. Fang, and J. Zhao. Biomarker identification by feature wrappers.
Genome Research, 11:1878–1887, 2001.
[130]Y. Yang, Y. Xiao, and M. Segal. Identifying differentially expressed genes
from microarray experiments via statistic synthesis. Bioinformatics, 21(7):1084–
1093, 2005.
[131]E. Yeoh, M. Ross, S. Shurtleff, W. Williams, D. Patel, R. Mahfouz, F. Behm, S. Raimondi, M. Relling, A. Patel, C. Cheng, D. Campana, D. Wilkins, X. Zhou, J. Li, H. Liu, C. Pui, W. Evans, C. Naeve, L. Wong, and J. Downing. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1:133–143, 2002.
[132]K. Yeung and R. Bumgarner. Multiclass classification of microarray data with
repeated measurements: application to cancer. Genome Biology, 4(12):R83,
2003.
[133]K. Yeung, R. Bumgarner, and A. Raftery. Bayesian model averaging:
development of an improved multi-class, gene selection and classification tool
for microarray data. Bioinformatics, 21(10):2394–2402, 2005.
[134]J. Yu and X. Chen. Bayesian neural network approaches to ovarian cancer identification from high-resolution mass spectrometry data. Bioinformatics, 21(Suppl. 1):i487–i494, 2005.
[135]J. Yu, S. Ongarello, R. Fiedler, X. Chen, G. Toffolo, C. Cobelli, and Z. Trajanoski. Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data. Bioinformatics, 21(12):2200–2209, 2005.
[136]L. Yu and H. Liu. Efficient feature selection via analysis of relevance and
redundancy. Journal of Machine Learning Research, 5(Oct):1205–1224, 2004.
[137]N. Zavaljevski, F. Stevens, and J. Reifman. Support vector machines with selective kernel scaling for protein classification and identification of key amino acid positions. Bioinformatics, 18(5):689–696, 2002.
[138]X. Zhang, X. Liu, Q. Shi, X.-Q. Xu, H.-C. Leung, L. Harris, J. Iglehart, A. Miron, J. Liu, and W. Wong. Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics, 7(197), 2006.