# Identification of signatures in biomedical spectra using domain knowledge


Erinija Pranckeviciene (a,b), Ray Somorjai (a,*), Richard Baumgartner (a), Moon-Gu Jeon (a)

(a) Institute for Biodiagnostics, National Research Council, 435 Ellice Avenue, Winnipeg, Man., Canada R3B 1Y6
(b) Kaunas University of Technology, Studentu 50, Kaunas, LT 3031, Lithuania

Received 13 May 2004; received in revised form 30 November 2004; accepted 6 December 2004

Artificial Intelligence in Medicine (2005) 35, 215–226
http://www.intl.elsevierhealth.com/journals/aiim

KEYWORDS: Classification of biomedical spectra; Dimensionality reduction; Feature selection; Genetic algorithm; L1-norm SVM; Spectral signature; Consensus feature sets; Domain knowledge

Summary

Objective: Demonstrate that incorporating domain knowledge into feature selection methods helps identify interpretable features with predictive capability comparable to that of a state-of-the-art classifier.

Methods: Two feature selection methods, one using a genetic algorithm (GA), the other an L1-norm support vector machine (SVM), were investigated on three real-world biomedical magnetic resonance (MR) spectral datasets of increasing difficulty. Consensus sets of the feature sets obtained by the two methods were also assessed.

Results and conclusions: Features identified independently by the two methods, and by their consensus, determine class-discriminatory groups or individual features whose predictive power compares favorably with that of a state-of-the-art classifier. Furthermore, the identified feature signatures form stable groupings at definite spectral positions and hence are readily interpretable. This is a useful and important practical result for generating hypotheses for the domain expert.

© 2005 Elsevier B.V. All rights reserved.

* Corresponding author. Tel.: +1 204 984 4538; fax: +1 204 984 5472. E-mail address: ray.somorjai@nrc-cnrc.gc.ca (R. Somorjai).

0933-3657/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.artmed.2004.12.002

1. Introduction

Growing interest in discovering biologically meaningful information in biomedical data has led to the use and development of many techniques, some also applicable to biomedical spectra. A useful summary is in [1]. Among these are: principal component analysis, k-means and hierarchical clustering, support vector machines (SVMs), hidden Markov models, genetic algorithms, neural network techniques, self-organizing maps, and classification and regression trees. Using data mining techniques on microarrays/spectra, one attempts to identify important clues for class separation and to use this information to design a classification rule. Standard methods of dimensionality reduction [2–4] and simple distance-based classifiers applied to the original high-dimensional data are not very effective. For high-dimensional


data, feature selection typically precedes classification [5–9]. For microarray data and biomedical spectra, feature selection is necessary but often not sufficient when additional information about the classes is lacking. Features that are optimal for classification do not necessarily possess biological relevance. For high-dimensional but sparse datasets, many different combinations of attributes may separate the data perfectly [10]. Which of these is plausible? Are the discovered features truly characteristic of the classes as labelled (e.g., cancer versus normal), or do they also reflect other covariates (e.g., gender, age, etc.), or even noise? Sometimes the data labelling by the domain expert contains wrong class assignments, confusing the feature selection/classification process and outcomes.

Incorporating domain knowledge helps deal more efficiently with the problem of uncertain multiple solutions, and aids in identifying the most appropriate data analysis model.

Our goal is to discuss a feature selection strategy that is determined by the domain knowledge available for typical biomedical spectra. Domain knowledge (DK) is additional information about spectra that distinguishes them from other types of data, such as microarrays. Additional DK enables designing a feature selection algorithm that directly reflects the nature/characteristics of the data. Consider a spectrum as a collection of peaks and valleys, whose positions and intensities carry discriminatory information relevant for classification. The physical/chemical basis of class separation is reflected in the peak/valley distribution and peak width. Typically, spectral data have high feature-space dimensionality, although the number of discriminatory features (the intrinsic dimensionality) may be quite low. This happens because of the many correlated features in a spectrum; thus, it is likely that if a single attribute is discriminatory, so are its immediate neighbors. Guaranteeing that the new features correspond directly to the original positions in the spectra is very important for interpretability. We concretize the concept of a spectral signature as a set of related spectral regions or single spectral attributes. It is assumed that the samples of a particular spectral class share common specific spectral signatures. Generally, we do not know the number, position or width of the spectral regions, and/or the relative intensity levels of the spectral signature, that separate the classes. The discovery of the class signature, the discriminatory pattern common to all samples of a particular class, is the goal of the feature selection step.

To discover the signature(s), we combine the outputs of two methods. Both capture and retain the original spectral features. One is a genetic-algorithm-based feature selection method, wrapped around a simple classifier (e.g., linear discriminant analysis (LDA)). The other selects spectral attributes through the sparseness property of an L1-norm SVM classifier. This technique is referred to by several names: sparse classifier [5], 1-norm SVM classifier [11], SVM trained by linear programming [12,13], linear sparse kernel Fisher discriminant [14].

The proposed feature selection methodology narrows the range of useful spectral features for further processing, considering only those signatures identified by both methods. We demonstrate the approach on three real-life datasets. To avoid overoptimistic assessment of the feature selection methods because of selection bias [15], we partition the data into a training and an independent test set. The test set was used only once, at the very end, after the feature selection was completed.

2. Data

Three real-world, two-class datasets were used in this feature selection study. Dataset1 contains MR spectra of pathogenic fungi (Candida albicans versus Candida tropicalis) [16]. Dataset2 comprises MR spectra of biofluids obtained from normal subjects and cancer patients [17,18]. Dataset3 consists of MR spectra of biofluids obtained from patients with a successful renal transplant versus patients with a rejected kidney transplant [19].

The characteristics of the datasets are given in Table 1: D is the dimensionality of the data; N1 and N2 are the total numbers of samples in classes 1 and 2, respectively; Tr1 + Tr2 are the numbers of samples of classes 1 and 2 in the training set, and Te1 + Te2 in the test set, respectively. The partition of the samples into training and test sets remains the same in all experiments. For the feature selection process, the training samples Tr1 + Tr2 are further divided into training and monitoring sets. A single validation is performed on the independent test set Te1 + Te2. To estimate the intrinsic dimensionality of a dataset, we analyse the spectrum of singular values of the pooled covariance matrix [20, Chapter 9]. In the En column of Table 1, the number of eigenvalues accounting for 95% of the variance is presented for each dataset. This suggests the number of components needed to represent fully (the variance of) the dataset, and hence its level of difficulty. The three datasets represent three levels of difficulty typically encountered in real-life biomedical data.

Table 1. Properties of the datasets

| Name | Dimensionality, D | N1 | N2 | Tr1 + Tr2 | Te1 + Te2 | En |
|---|---|---|---|---|---|---|
| Dataset1 | 1500 | 104 | 75 | 50 + 50 | 54 + 25 | 6 |
| Dataset2 | 300 | 61 | 79 | 31 + 40 | 30 + 39 | 12 |
| Dataset3 | 3380 | 91 | 65 | 45 + 33 | 46 + 32 | 49 |
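The En estimate described above can be sketched in a few lines. The following is a minimal illustration with synthetic stand-in data; the function name, the toy dataset, and the handling of the 95% threshold are our assumptions, not code from the paper:

```python
# Sketch of the intrinsic-dimensionality estimate behind the "En" column:
# count the eigenvalues of the pooled covariance matrix that together
# account for 95% of the total variance.
import numpy as np

def intrinsic_dim(X1, X2, var_fraction=0.95):
    """X1, X2: samples-by-features arrays for the two classes."""
    n1, n2 = len(X1), len(X2)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
    evals = np.sort(np.linalg.eigvalsh(pooled))[::-1]   # descending
    cum = np.cumsum(evals) / evals.sum()
    # first index whose cumulative variance reaches the threshold
    return int(np.searchsorted(cum, var_fraction) + 1)

rng = np.random.default_rng(1)
# toy data: 5 latent components embedded in 50 nominal dimensions
latent = rng.normal(size=(80, 5))
mix = rng.normal(size=(5, 50))
X = latent @ mix + 0.01 * rng.normal(size=(80, 50))
print(intrinsic_dim(X[:40], X[40:]))  # small, near the 5 latent components
```

The estimate is deliberately crude: it measures only how many linear components carry the variance, not which of them are discriminatory.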

3. Methods

The specific characteristics of spectral data (domain knowledge) determine the structure of the features separating different spectral classes. They also dictate how these structures are to be identified. In the following, vector variables are denoted by bold letters. Let the vector $\mathbf{x}_i^{(j)}$ denote the ith spectrum of class j: $\mathbf{x}_i^{(j)} = [x_{i1}^{(j)}, \ldots, x_{iD}^{(j)}]$, where $j = \{-1, +1\}$ denotes the class label, $i = 1, \ldots, N_1 + N_2$ indexes the samples of classes 1 and 2, and $D \gg \max(N_1, N_2)$ is the dimensionality of the data. A spectral signature is a collection of R attributes $x_{ir}$, computed from the interval $[d_{r1}^{(i)}, d_{r2}^{(i)}] \subseteq [1, \ldots, D]$, $d_{r1}^{(i)} \le d_{r2}^{(i)}$:

$$x_{ir} = (d_{r2} - d_{r1} + 1)^{-1} \sum_{k=d_{r1}}^{d_{r2}} x_{ik} \qquad (1)$$

It can either be a set of R non-overlapping, averaged spectral regions, or R single attributes. We denote it by $\mathbf{x}_{iR} = [x_{i1}, \ldots, x_{ir}, \ldots, x_{iR}]$, $R \ll D$. The number of regions or attributes in a spectral signature is much smaller than the original dimensionality of the spectrum. If $d_{r1}^{(i)} = d_{r2}^{(i)}$, then $x_{ir}$ is a single original attribute. In the feature selection step, we use a linear discriminant as the wrapper. The linear discriminant function assigns an unknown sample to one of the classes:

$$g(\mathbf{x}_{iR}) = \mathbf{w}^T \mathbf{x}_{iR} + w_0; \qquad g(\mathbf{x}_{iR}) \ge 0 \Rightarrow \mathbf{x}_{iR} \in \text{class 1}, \qquad g(\mathbf{x}_{iR}) < 0 \Rightarrow \mathbf{x}_{iR} \in \text{class 2} \qquad (2)$$

Eq. (2) implements the rule called linear discriminant analysis. Samples of the two classes are labelled $y_i = \{+1, -1\}$. In the feature selection step, we attempt to identify optimal spectral signatures $\mathbf{x}_{iR}$ and the corresponding optimal weight vector $\mathbf{w}^* = [w_1^*, \ldots, w_D^*]$ and bias $w_0^*$ of the discriminant function that maximize the number of correct decisions:

$$\mathbf{w}^*, w_0^* = \underset{\mathbf{w},\, w_0}{\arg\max} \sum_{i=1}^{N_1 + N_2} y_i (\mathbf{w}^T \mathbf{x}_{iR} + w_0) \qquad (3)$$

Additional notation: i is the index of a sample, $\mathbf{w}$ the weight vector of the discriminant function, $\mathbf{w}^*$ the optimal weight vector, and $w_j$ the jth element of the weight vector. For known $\mathbf{x}_{iR}$, one can use any classification method to find the discriminant function. A table of commonly used methods is given in [21, pp. 125–126].
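The region-averaging of Eq. (1) is straightforward to implement. A minimal sketch follows; the function name and the toy spectrum are illustrative assumptions, not the authors' code:

```python
# Sketch of Eq. (1): turning a raw spectrum into a spectral signature by
# averaging over R non-overlapping index regions (inclusive endpoints).
import numpy as np

def signature(x, regions):
    """x: 1-D spectrum; regions: list of (d_r1, d_r2) index pairs, inclusive."""
    return np.array([x[d1:d2 + 1].mean() for d1, d2 in regions])

x = np.arange(10, dtype=float)          # toy "spectrum" 0, 1, ..., 9
regions = [(0, 3), (5, 5), (7, 9)]      # (5, 5) is a single attribute
print(signature(x, regions))            # -> [1.5, 5.0, 8.0]
```

A region with equal endpoints reduces to a single original attribute, exactly as in the text.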

3.1. Feature selection methods

We investigate two methods of finding the spectral signature $\mathbf{x}_{iR}$. The first uses a genetic algorithm to find the regions that optimally separate the spectral classes. This algorithm takes into account multivariate relationships between the components of the spectral signatures. The second is an L1-norm SVM that produces sparse solutions, i.e., after training, the weight vector of the linear discriminant contains only a small number of nonzero components. The spectral positions corresponding to nonzero weight components are important for establishing the separation boundary. This algorithm takes into account linear separation, correlation, and differences in the means of individual features. The two techniques are different, but both use domain knowledge about the data, manifest as the spectral signature. Both techniques implement feature selection methods that retain and/or directly reflect the original features of the data. The following subsections present the details of each feature selection method.

3.1.1. Optimal region selection by a genetic algorithm

The genetic-algorithm-based optimal region selector GA_ORS was developed at the Institute for Biodiagnostics (IBD) [8]. It is wrapper-based, and several classifiers and crossvalidation methods have been implemented. Typically, LDA with leave-one-out (LOO) crossvalidation is used, abbreviated as GA/LDA. The user-supplied initial inputs are the number R of requested regions, and the mutation and crossover rates of the genetic algorithm. The objective function of the GA is designed to simultaneously minimize the empirical classification error and maximize classification "crispness" (i.e., render the class assignment probabilities as unequivocal as possible) [8]. For selected signatures, the weights of the optimal linear discriminant function are calculated using the resubstitution ("plug-in") estimates:

$$\mathbf{w}_R^* = \mathbf{S}_R^{-1} (\mathbf{m}_R^1 - \mathbf{m}_R^2) \qquad (4)$$

$$w_{0R}^* = -\frac{1}{2} \left( \mathbf{m}_R^{1T} \mathbf{S}_R^{-1} \mathbf{m}_R^1 - \mathbf{m}_R^{2T} \mathbf{S}_R^{-1} \mathbf{m}_R^2 \right) + \log(p_1 / p_2) \qquad (5)$$

Eqs. (4) and (5) implement Fisher's linear discriminant function (LDF). The $\mathbf{m}_R^j$ are estimates of the sample means, the $p_j$ of the prior class probabilities, $j = -1, +1$, and $\mathbf{S}_R$ of the standard pooled covariance matrix. The details of the parameters controlling the evolution of the genetic algorithm, and of its implementation, are in [8]. In our study, the mutation rate of the genetic algorithm was 0.001 and the crossover rate was 0.66.

We do not know the number of regions in advance. For each dataset, we search for R = 2, ..., 10 regions composing the spectral signatures $\mathbf{x}_{iR}$. For each R, we perform N runs of the GA, using different random seeds for initialization (N = 9 in our study). The number of iterations in one GA run is fixed at 20. When the sample size is small, bootstrapping [19] increases the confidence about the selected regions. For each GA iteration, the training set is bootstrapped 200 times, using half the samples for training and the other half for monitoring. Each run of the GA produces a set of subregions. From this set we pick balanced spectral signatures, i.e., those signatures that perform comparably well (provide similar classification accuracies) on both the training and monitoring sets. The following is our measure of how well balanced a spectral signature is. Let $A_{\mathrm{training}}^i$ and $A_{\mathrm{monitoring}}^i$ be the classification accuracies on the training and monitoring sets, respectively, using the ith spectral signature $\mathbf{x}_{iR}$. Then, the smaller the score

$$B_i = \frac{|A_{\mathrm{training}}^i - A_{\mathrm{monitoring}}^i|}{A_{\mathrm{training}}^i + A_{\mathrm{monitoring}}^i},$$

the better balanced the classification accuracy of the ith spectral signature. For bootstrapping, the value of $B_i$ is based on the averages of $A_{\mathrm{training}}^i$ and $A_{\mathrm{monitoring}}^i$ obtained during the bootstrap.

After completing the independent feature selection step of the GA, we have 9N candidate signatures. The union of all selected (some overlapping) regions represents the distribution of locations in the spectrum (a histogram), found by the GA, that seem important for separating the classes. This set is used for further processing and the selection of a smaller number of relevant regions. The independent feature selection step of the genetic algorithm is shown in the flowchart in Fig. 1a.
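The balance criterion reduces to a one-liner; a quick sketch with purely illustrative accuracy values (the function name is our choice, not the paper's):

```python
# Sketch of the balance score B_i: the smaller the score, the closer the
# training and monitoring accuracies of a candidate signature.
def balance_score(acc_train, acc_monitor):
    return abs(acc_train - acc_monitor) / (acc_train + acc_monitor)

print(balance_score(0.90, 0.80))   # |0.90 - 0.80| / (0.90 + 0.80) ~ 0.0588
print(balance_score(0.85, 0.85))   # perfectly balanced: 0.0
```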

3.1.2. Spectral attribute selection by an L1-norm support vector machine sparse classifier

Theoretical [22] and case studies [5,11,13] indicate that the L1-norm SVM is effective when dealing with high-dimensional data. SVM training via minimizing (by a quadratic programming procedure) the L2 norm of $\mathbf{w}$, the weight vector of the discriminant function, induces sparseness in the data representation. SVM training via minimizing (by a linear programming procedure) the L1 norm of $\mathbf{w}$ induces sparseness in $\mathbf{w}$. On publicly available datasets, the different-norm SVMs performed comparably [23].

For a two-class classification problem, the linear SVM implements rule (2), with the additional requirement that the minimum separation between samples from the two classes be at a certain distance, given by the fixed margin. If the data are not linearly separable, then the maximum-margin separating hyperplane of the linear L1-norm SVM is obtained through minimization of the objective function

$$J(w_1, \ldots, w_D, \xi_1, \ldots, \xi_{N_1+N_2}) = \sum_{j=1}^{D} |w_j| + C \sum_{i=1}^{N_1+N_2} \xi_i \qquad (6)$$

subject to the constraints

$$y_i (\mathbf{w}^T \mathbf{x}_i + w_0) \ge 1 - \xi_i, \quad i = 1, \ldots, N_1 + N_2; \quad \xi_i \ge 0. \qquad (7)$$

The $\xi_i$ in the constraints (7) are proportional to the distances of the samples from the margin. To present the problem in a form appropriate for a general linear program solver, the variables in the objective function must be positive. Hence, the objective function is changed by modeling the components of $\mathbf{w}$ as a difference of two positive variables:

$$w_j = u_j - v_j. \qquad (8)$$

The absolute value of the weight is modeled as

$$|w_j| = u_j + v_j. \qquad (9)$$

The pair $u_j$, $v_j$ that satisfies Eqs. (8) and (9) simultaneously is unique. Such a transformation is common in linear programming [12]. The changed objective function is

$$J(u_1, \ldots, u_D, v_1, \ldots, v_D, \xi_1, \ldots, \xi_{N_1+N_2}) = \sum_{j=1}^{D} (u_j + v_j) + C \sum_{i=1}^{N_1+N_2} \xi_i \qquad (10)$$

subject to the constraints

$$y_i \left( (\mathbf{u} - \mathbf{v})^T \mathbf{x}_i + (u_0 - v_0) \right) \ge 1 - \xi_i, \quad i = 1, \ldots, N_1 + N_2; \quad \xi_i \ge 0, \qquad (11)$$

and

$$u_j \ge 0, \quad v_j \ge 0, \quad j = 1, \ldots, D. \qquad (12)$$

The optimal solution of the problem stated in Eqs. (10)–(12) for the weight vector is

$$\mathbf{w}^* = \mathbf{u}^* - \mathbf{v}^*. \qquad (13)$$

For this method, the key to feature selection is that the solution vector (13) be sparse. The parameter C


controls the sparseness of the solution. A sparse weight vector means that many of its components are negligibly small or zero. Zero weight-vector components correspond to features that do not affect the classification outcome. Nonzero weights identify those components of the spectral signature that contribute to classification. In this case, the spectral signature consists of single spectral attributes. Using a sparse weight vector in classification helps prevent overfitting, but the trade-off is reduced classification accuracy. Experimentally, as the value of C increases, solution (13) becomes closer to the solutions of the standard SVM and of Fisher's LDF. The similarity of the SVM and Fisher's LDF was discussed theoretically in [24], noting that the SVM is a sparse version of Fisher's linear discriminant function. We illustrate the dependence of sparseness on C in Fig. 2, using the GA3 feature set for Dataset2. We show one C-dependent sparse and one non-sparse solution for the separating boundary. (For comparison, we also display the separation boundaries of three other classification rules.) A drawback is that the value of the constant C, which controls the level of sparseness and the classification accuracy, has to be determined by trial and error or set a priori [25]. In [5] the L1-norm SVM feature selection method was named LIKNON. We also refer to it as LIKNON.

Figure 1. (a) Feature selection process combining the genetic algorithm and LIKNON approaches: (1) independent search for regions by the GA. The output of (1) is a list of 9N candidate signatures, all obtained in N different runs of the GA. The three best subsets, GA1, GA2 and GA3, based on the smallest value of Bi, are chosen as reference subsets. From the list of 9N signatures, a union of regions is obtained. This is processed by LIKNON without frequency count (FC), producing subset G_L, and with FC. From the subset with FC, e.g., with thresholds of 70% (or 90%) for the feature selection frequency, subsets G_L70 (or G_L90) are formed. The subsets obtained after applying LIKNON to the union of regions are consensus features. (b) Feature selection process performed independently by LIKNON, followed by an exhaustive search procedure. LIKNON with frequency count (FC) and without FC is applied to all features. LIKNON without FC produces the subset L. To the subset obtained by counting the attribute selection frequency, different thresholds (e.g., 70% or 90%) are applied, producing subsets (e.g., L70 or L90). As an additional processing step, we select exhaustively (e.g., from subset L70) a small number (e.g., 3 or 5) of the most discriminative attributes that maximizes the classification accuracy of the LDA. This step produces subsets L70_ES3 or L70_ES5, also considered consensus features.

Feature selection by LIKNON was performed twice: (1) in the leave-one-out mode, with


frequency count, taking the most frequently found features, and (2) LIKNON solved just once, without frequency count, selecting the features corresponding to nonzero weights. LIKNON was applied both to the union of regions selected by the GA, and to individual spectral attributes. The frequency count is performed as follows. Having Tr1 + Tr2 training samples in the dataset, we identify a sparse classifier using LOO: excluding one sample from the training set, identifying a sparse classifier on the remainder, and cycling through all samples. We select attributes corresponding to nonzero weights, based on the frequency of their occurrence. Let an attribute be selected M times out of Tr1 + Tr2 runs. The frequency of its occurrence is then 100M/(Tr1 + Tr2)%. There are two outputs of the independent feature selection stage by LIKNON. The first is the set of attributes that were present in all runs above a certain threshold (70%, 75%, 90% and 95%). The second is the set of important attributes identified from just a single run of LIKNON. The independent LIKNON feature selection step is depicted in the flowchart in Fig. 1b.
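The linear program of Eqs. (10)–(12) can be prototyped with a generic LP solver. The sketch below uses scipy.optimize.linprog, which is not the solver used in the paper; the toy dataset, function name, and sparseness threshold are our assumptions:

```python
# Sketch of a LIKNON-style L1-norm SVM via linear programming (Eqs. (6)-(13)).
# Variable vector z = [u (D), v (D), b_pos, b_neg, xi (N)], all >= 0,
# with w = u - v and bias w0 = b_pos - b_neg, as in Eqs. (8) and (13).
import numpy as np
from scipy.optimize import linprog

def l1_svm(X, y, C=1.0):
    """Minimize sum(u + v) + C * sum(xi) s.t. y_i((u-v).x_i + w0) >= 1 - xi_i."""
    N, D = X.shape
    c = np.concatenate([np.ones(2 * D), [0.0, 0.0], C * np.ones(N)])
    # rewrite each margin constraint as A_ub @ z <= -1
    A = np.hstack([-y[:, None] * X, y[:, None] * X,
                   -y[:, None], y[:, None], -np.eye(N)])
    b = -np.ones(N)
    res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None)] * (2 * D + 2 + N))
    z = res.x
    w = z[:D] - z[D:2 * D]
    w0 = z[2 * D] - z[2 * D + 1]
    return w, w0

# toy data: only feature 0 is informative; the L1 penalty should drive the
# weights of the noise features (1 and 2) to (near) zero
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
X[:, 0] += 0.5 * y                       # widen the class gap on feature 0
w, w0 = l1_svm(X, y, C=1.0)
print(np.flatnonzero(np.abs(w) > 1e-6))  # indices of the selected attributes
```

Wrapping this fit in a leave-one-out loop and counting how often each index survives gives the frequency count described above.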

3.1.3. Consensus-based feature selection

GA/LDA typically identifies several near-optimal signatures that give comparable accuracies. Two basic questions arise: (1) which subset is preferable, and (2) how many regions comprise the signature. The solution of LIKNON is fixed for a given data configuration, providing the user with the most likely number and locations of single attributes of interest in the spectrum. It appears that the discriminatory regions found by GA/LDA contain, or are adjacent to, the most frequent discriminatory attributes independently identified by LIKNON. The combination of features selected by the two methods provides good insight into the number and the distribution of discriminatory "hot spots" in the spectrum. The reasons for the agreement between the outputs of the two methods are as follows. GA/LDA finds signatures favoring the Gaussian common covariance matrix (GCCM) data model. GCCM selects averaged spectral regions that maximize the Mahalanobis distance between the classes. LIKNON prefers/picks individual features from the groups of correlated attributes that maximize the weighted Euclidean distance between the classes.

The consensus-based feature selection is implemented in two steps. First, a relatively small number of relevant features are found independently by GA/LDA and by LIKNON. Second, in order to reach an agreement between the two methods: (1) the union of all optimal regions selected by GA/LDA is used as input to LIKNON, and (2) the number of best LIKNON attributes is further reduced by an exhaustive search (ES) procedure to find the best subset(s) of attributes. The classification accuracy of the LDA is the criterion maximized during the ES (abbreviated as ES/LDA).

The output of the second step, the application of LIKNON to the union of GA/LDA regions and the application of ES/LDA to the set of frequent attributes identified by LIKNON, contains the consensus features. For analyzing the distribution of "hot spots" in the spectrum, we consider the signatures found independently by the two methods and by consensus.

The full feature selection process is illustrated in Fig. 1a and b, where each step of the process and the processing order are shown. Each signature has a unique name, in which the letters and numbers denote details of the selection method. The analyzed signatures are described and compiled in Table 2.

We had to vary the selection frequency threshold for Dataset2 (to increase the number of attributes selected by LIKNON) and for Dataset3 (to reduce the number of attributes selected by LIKNON).
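The ES/LDA step can be sketched as a brute-force subset search scored by the plug-in LDF of Eqs. (4) and (5). The helper names and toy data below are illustrative assumptions, and this sketch drops the prior term of Eq. (5) (equal class priors):

```python
# Hypothetical sketch of ES/LDA: exhaustively score every small subset of
# candidate attributes by LDA resubstitution accuracy and keep the best.
from itertools import combinations
import numpy as np

def lda_accuracy(X, y):
    """Fisher plug-in LDF (Eqs. (4)-(5), equal priors), resubstitution accuracy."""
    X1, X2 = X[y == 1], X[y == -1]
    m1, m2 = X1.mean(0), X2.mean(0)
    S = np.atleast_2d((np.cov(X1, rowvar=False) * (len(X1) - 1)
                       + np.cov(X2, rowvar=False) * (len(X2) - 1)) / (len(X) - 2))
    w = np.linalg.solve(S, m1 - m2)
    w0 = -0.5 * (m1 @ np.linalg.solve(S, m1) - m2 @ np.linalg.solve(S, m2))
    pred = np.where(X @ w + w0 >= 0, 1, -1)
    return (pred == y).mean()

def exhaustive_search(X, y, candidates, k):
    best = max(combinations(candidates, k),
               key=lambda s: lda_accuracy(X[:, list(s)], y))
    return list(best)

# toy data: features 0 and 2 carry the class difference, the rest are noise
rng = np.random.default_rng(2)
y = np.repeat([1, -1], 30)
X = rng.normal(size=(60, 5))
X[:, 0] += y
X[:, 2] += y
best = exhaustive_search(X, y, range(5), 2)
print(best)  # the informative pair is expected to score highest
```

For the small subset sizes used in the paper (e.g., 3 or 5 attributes out of a few dozen candidates), this brute-force enumeration is entirely feasible.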

3.2. Classification methods

In order to assess the feature selection methodology, we use several independent classifiers: the single-layer perceptron (SLP), Fisher's linear discriminant function, and generalized singular value decomposition (GSVD) followed by a nearest centroid (NC) classifier. We designate the SVM, trained on the full-dimensional data, as the benchmark. The benchmark serves as a measure of baseline performance. However, the SVM trained on the full-dimensional data does not provide interpretability, which is much desired for the classification of spectra. Feature selection reduces the dimensionality, thus allowing for an easier interpretability of the features and of the classification outcome.

Figure 2. Decision boundaries for Dataset2, feature subset GA3. Squares present the data points of class 1, crosses those of class 2. Decision boundaries of the classification rules: (solid line) linear L2-norm SVM; (dash-dot) least squares; (dashed) L1-norm SVM; (dotted) Fisher's linear discriminant function. When C = 0.1, the L1-norm SVM produces a sparse weight vector, for which feature X2 is irrelevant. When C = 1, the L1-norm SVM gives a non-sparse decision boundary.

3.2.1. Support vector machines

L2-norm SVMs are currently considered state-of-the-art classifiers. SVMs are learning systems that use linear functions in the feature space as the hypothesis space, and are trained with a learning algorithm from a well-established optimization theory. The SVM tackles the overfitting problem by using regularization during the training process, ultimately involving only a small number of support vectors in the classification. For the classification experiments, we used LIBSVM [25]. We set the regularization parameter to 1 and performed scaling. Scaling [25] is an important preprocessing step for SVMs: it prevents attributes with a greater range from dominating those with a much smaller range. Training and test data in our study were scaled simultaneously to the range between −1 and 1.
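As a sketch of this preprocessing step (a plain-numpy stand-in for LIBSVM's scaling utility; the function name is ours, and the pooling of training and test data mirrors the simultaneous scaling described above):

```python
import numpy as np

def scale_to_unit_range(train, test):
    """Linearly map each attribute to [-1, 1], with per-attribute ranges
    estimated from the pooled training and test data (both sets scaled
    simultaneously, as in the study)."""
    pooled = np.vstack([train, test])
    lo, hi = pooled.min(axis=0), pooled.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant attributes
    f = lambda X: 2.0 * (X - lo) / span - 1.0
    return f(train), f(test)
```

Note that scaling train and test together, as done here, leaks test-set ranges into preprocessing; scaling with training-set statistics only is the stricter alternative.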

3.2.2. Single-layer perceptron and Fisher’s linear discriminant function

The SLP is a linear classifier with a nonlinear activation function; it uses a gradient-descent optimization procedure to identify its weights. Prior to perceptron training, a data-whitening transformation was applied. This transformation frequently improves the classification accuracy [17]. Fisher’s linear discriminant function is given by Eqs. (4) and (5), where the multivariate parameters are estimated from the training data using the identified spectral signatures.
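The whitening step applied before perceptron training can be sketched as follows; this is a generic eigendecomposition-based whitening (the function name and the small regularizer `eps` are our choices, not details from the paper):

```python
import numpy as np

def whiten(train, test, eps=1e-8):
    """Data-whitening transformation estimated on the training set:
    center the data and rotate/rescale so that the training covariance
    becomes (approximately) the identity matrix."""
    mu = train.mean(axis=0)
    cov = np.cov(train - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs / np.sqrt(vals + eps)  # whitening matrix, shape (d, d)
    return (train - mu) @ W, (test - mu) @ W
```

The same transformation (mean and whitening matrix estimated on the training data) is applied to the test data, so no test-set information enters the fit.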

3.2.3. Generalized singular value decomposition with nearest centroid classifier

The GSVD, followed by a nearest centroid classifier, is a linear classifier obtained by maximizing the trace of the between-class scatter matrix S_B and minimizing the trace of the within-class scatter matrix S_W. The transpose of a matrix is denoted by the superscript T. If the total-class deviation matrix is V and the pooled within-class deviation matrix is W, then S_B = VV^T and S_W = WW^T. The GSVD is applied to the combined matrix G = [V W] and yields the decomposition G = XDY^T, the product of the orthogonal matrices X and Y and the diagonal matrix D. The nonzero diagonal elements of D measure the number of independent (linear) features that exist in the training data. The corresponding orthogonal singular vectors of X define the transformation matrix that is applied to the original data. The method eliminates the well-known disadvantage of linear discriminant analysis, which requires that at least one of the scatter matrices be nonsingular, restricting LDA’s application to datasets whose number of features does not exceed the number of data vectors. After applying the transformation matrix to the training and test sets, the data dimensionality becomes k − 1, where k is the number of classes in the training data, and one can classify in this reduced space.


Table 2  Origin of selected subsets of the datasets

| Name | Description of the selection method |
| --- | --- |
| GA1 | Best subset independently selected by GA, with smallest Bi |
| GA2 | Second best subset independently selected by GA, with second smallest Bi |
| GA3 | Third best subset independently selected by GA, with third smallest Bi |
| G_L90 | Consensus subset, selected by LIKNON from union of all GA regions, frequency threshold > 90% |
| G_L75 | Consensus subset, selected by LIKNON from union of all GA regions, frequency threshold > 75% |
| G_L70 | Consensus subset, selected by LIKNON from union of all GA regions, frequency threshold > 70% |
| G_L | Consensus subset, single selection by LIKNON from union of all GA regions, no frequency count |
| L | Subset, single selection by LIKNON from union of all GA regions, no frequency count |
| L95 | Subset, single independent selection by LIKNON from all features, frequency threshold > 95% |
| L90 | Subset, single independent selection by LIKNON from all features, frequency threshold > 90% |
| L75 | Subset, single independent selection by LIKNON from all features, frequency threshold > 75% |
| L70 | Subset, single independent selection by LIKNON from all features, frequency threshold > 70% |
| L100 | Subset, single independent selection by LIKNON from all features, selection frequency = 100% |
| L70_ES3 | Consensus subset, best 3 attributes, selected by ES/LDA 3 from subset L70 |
| L70_ES5 | Consensus subset, best 5 attributes, selected by ES/LDA 3 from subset L70 |
| L75_ES3 | Consensus subset, best 3 attributes, selected by ES/LDA 3 from subset L75 |
| L75_ES5 | Consensus subset, best 5 attributes, selected by ES/LDA 3 from subset L75 |
| L95_ES3 | Consensus subset, best 3 attributes, selected by ES/LDA 3 from subset L95 |
| L95_ES5 | Consensus subset, best 5 attributes, selected by ES/LDA 3 from subset L95 |

The details of the algorithm are given in [3]. This feature selection and classification method is useful for providing insight into the complexity of the dataset under analysis.
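The nearest centroid step can be sketched as follows. This is a generic NC classifier operating on data already projected into the reduced (k − 1)-dimensional space; the GSVD projection itself is detailed in [3] and omitted here, and the function names are ours:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Compute one centroid per class from (already projected) training data."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(X, classes, centroids):
    """Assign each sample to the class with the closest (Euclidean) centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]
```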

4. Results

The benchmark classification accuracies for the three datasets are: Dataset1: training 100%, test 97.5%; Dataset2: training 98.6%, test 69.6%; Dataset3: training 100%, test 61.5%. With feature selection, we reduce the dimensionality of the data and obtain a set of contiguous segments in the spectra that separate the two classes. The classification accuracies of the selected subsets for each dataset are comparable to the benchmark accuracies and are presented in Tables 4, 6 and 8. In these tables, bold signifies the best result in a column, bold italics the second best.

The other result, very useful in practice, is that the two independent feature selection methods produce subsets of proximal spectral locations. The identities of the selected subsets are presented in Tables 3, 5 and 7. The signatures independently selected by the genetic algorithm (sets GA1, GA2 and GA3), by LIKNON (sets L and L70, L75, L90 and L95), and by consensus (set G_L) overlap and intermix. Also, the regions identified by LIKNON (set G_L) are present in the best individual sets GA1, GA2 and GA3. For Dataset1 this proximity is illustrated in Fig. 3. The LIKNON-selected features with frequency count (G_L70, G_L90) and without frequency count (G_L) are almost the same (Tables 3, 5 and 7). However, the classification performances differ and favor LIKNON without the frequency count (Tables 4, 6 and 8). Thus, with less computational effort we can get almost the same result in terms of spectral locations. On the other hand, the frequency count gives more confidence about the importance of any particular feature.
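The frequency-count idea behind subsets such as L70, L90 and L95 can be sketched as a simple vote over repeated selection runs. The helper below is our illustration (name and interface are assumptions, not the study's code): each run contributes one set of selected feature indices, and only features chosen in more than a given fraction of runs are kept.

```python
import numpy as np

def frequency_consensus(selected_sets, n_features, threshold=0.9):
    """Keep features selected in more than `threshold` of the runs
    (e.g. LIKNON fitted on resampled data), mirroring frequency-count
    subsets such as L90. `selected_sets` is a list of index sequences,
    one per run."""
    counts = np.zeros(n_features)
    for idx in selected_sets:
        counts[np.asarray(idx)] += 1
    freq = counts / len(selected_sets)
    return np.flatnonzero(freq > threshold)
```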


Table 3  Spectral identity of selected subsets of Dataset1

| Name | Identity of spectral signature |
| --- | --- |
| GA1 | 136—160, 316—340, 398—422, 697—724, 780—793, 817—823, 1075—1099, 1298—1322 |
| GA2 | 276—296, 312—336, 533—558, 720—744, 1000—1024, 1026—1050 |
| GA3 | 48—78, 434—458, 623—649, 707—725, 812—836, 974—998, 1301—1326, 1365—1369 |
| G_L90 | 136—160, 1117—1141 |
| G_L70 | 136—160, 316—340, 1117—1141 |
| G_L | 136—160, 157—174, 199—223, 316—340, 424—444, 1052—1076, 1117—1141 |
| L | 145, 185, 322, 1132, 1166, 1167, 1168, 1181 |
| L90 | 57, 145, 185, 322, 1132, 1168, 1181 |
| L70 | 57, 145, 185, 322, 1132, 1167, 1168, 1181 |
| L70_ES3 | 57, 1168, 1132 |
| L70_ES5 | 57, 185, 322, 1132, 1181 |

LIKNON parameter C = 0.5.

Table 4  Classification accuracies for Dataset1

| Name | No. of features | SLP train (%) | SLP test (%) | Fisher's LDF train (%) | Fisher's LDF test (%) | GSVD NC train (%) | GSVD NC test (%) | SVM scaled train (%) | SVM scaled test (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GA1 | 8 | 99.0 | 98.7 | 99.0 | 98.7 | 99.0 | 98.7 | 99.0 | 97.5 |
| GA2 | 6 | 100.0 | 93.7 | 96.0 | 92.4 | 96.0 | 92.4 | 98.3 | 93.7 |
| GA3 | 8 | 99.0 | 98.7 | 99.0 | 98.7 | 99.0 | 98.7 | 99.0 | 97.5 |
| G_L90 | 2 | 90.0 | 91.1 | 89.0 | 88.6 | 89.0 | 88.6 | 89.0 | 89.9 |
| G_L70 | 3 | 94.0 | 93.7 | 90.0 | 92.4 | 90.0 | 92.4 | 93.0 | 94.9 |
| G_L | 7 | 98.0 | 97.5 | 96.0 | 96.2 | 96.0 | 96.2 | 95.0 | 96.2 |
| L | 8 | 97.0 | 94.9 | 92.0 | 92.4 | 92.0 | 92.4 | 95.0 | 89.9 |
| L90 | 7 | 97.0 | 93.7 | 94.0 | 92.4 | 94.0 | 92.4 | 97.0 | 89.9 |
| L70 | 8 | 97.0 | 93.7 | 94.0 | 92.4 | 94.0 | 92.4 | 97.0 | 89.9 |
| L70_ES3 | 3 | 94.0 | 93.7 | 93.0 | 93.7 | 93.0 | 93.7 | 94.0 | 93.7 |
| L70_ES5 | 5 | 97.0 | 94.9 | 94.0 | 93.7 | 94.0 | 93.7 | 97.0 | 93.7 |
| Mean | | 96.5 | 94.9 | 94.2 | 93.8 | 94.2 | 93.7 | 95.8 | 93.3 |
| S.D. | | 2.9 | 2.4 | 3.2 | 3.0 | 3.2 | 3.0 | 3.0 | 3.2 |
| Median | | 97.0 | 93.7 | 94.0 | 92.4 | 94.0 | 92.4 | 97.0 | 93.7 |
| Interquartile range | | 5.0 | 3.8 | 4.0 | 3.8 | 4.0 | 3.8 | 4.3 | 6.3 |


For each dataset, the final rows of Tables 4, 6 and 8 present the mean and standard deviation, and the robust median and interquartile range, of the classification accuracies across the different classification rules. In terms of the average classification accuracy for each classification rule, the results are comparable to the benchmark accuracies. In all cases, one of the winning signatures is among the subsets identified by GA/LDA.

4.1. Spectral signatures for Dataset1

The results for Dataset1 are presented in Tables 3 and 4. The subsets with the best classification performances were GA1, GA3 and G_L, consisting of seven to eight features. GA1 and GA3 are the obvious winners in terms of classification accuracy. However, they alone do not provide a clear understanding of where the differences in the spectra are located. Region-based features give better classification performances than individual spectral attributes. Analyzing the identities of the selected subsets in Table 3, we note that GA1—GA3 agree very well with the consensus subset G_L. The individual attributes selected by LIKNON (set L70) and the smaller subsets (L70_ES3 and L70_ES5) are present in, or adjacent to, the subsets GA1, GA2, GA3 and G_L. The discriminatory locations given by consensus can be approximately summarized as the segments 40—80, 130—200, 300—450 and 1000—1200.

4.2. Spectral signatures for Dataset2

The results for Dataset2 are presented in Tables 5 and 6. The subsets with the best classification performances were GA1, GA3, G_L, L and L75_ES3. The iden-

Table 5  Spectral identities of selected subsets of Dataset2

| Name | Identity of spectral signature |
| --- | --- |
| GA1 | 4—19, 38—43, 106—111, 196—201, 253—260 |
| GA2 | 7—12, 16—19, 32—36, 105—111, 194—200, 238—243, 289—294 |
| GA3 | 29—36, 108—112 |
| G_L90 | 184—189, 232—234 |
| G_L75 | 32—33, 38—43, 54—57, 59—62, 184—189, 232—234 |
| G_L | 16—19, 26—31, 32—33, 36—41, 38—43, 54—57, 59—62, 82—88, 86—91, 105—110, 109—111, 184—189, 199—204, 218—224, 232—234 |
| L | 18, 36, 40, 41, 51, 61, 90, 91, 92, 93, 110, 139, 141, 149, 184, 185, 216, 219, 236, 241 |
| L90 | 91, 92, 93, 110, 149, 216, 236, 241 |
| L75 | 18, 36, 41, 54, 61, 91, 92, 93, 110, 149, 184, 216, 219, 236, 241 |
| L75_ES3 | 18, 110, 216 |
| L75_ES5 | 41, 54, 184, 216, 236 |

LIKNON parameter C = 0.5.

Table 6  Classification accuracies for Dataset2

| Name | No. of features | SLP train (%) | SLP test (%) | Fisher's LDF train (%) | Fisher's LDF test (%) | GSVD NC train (%) | GSVD NC test (%) | SVM scaled train (%) | SVM scaled test (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GA1 | 5 | 88.7 | 72.5 | 81.7 | 65.2 | 83.1 | 65.2 | 84.5 | 71.0 |
| GA2 | 7 | 95.8 | 63.8 | 87.3 | 68.1 | 91.6 | 65.2 | 91.6 | 65.2 |
| GA3 | 2 | 83.1 | 72.5 | 80.3 | 72.5 | 78.9 | 72.5 | 78.9 | 72.5 |
| G_L90 | 2 | 73.2 | 59.4 | 66.2 | 59.4 | 66.2 | 59.4 | 69.0 | 63.8 |
| G_L75 | 6 | 90.1 | 60.9 | 76.1 | 60.9 | 76.1 | 59.4 | 84.5 | 60.9 |
| G_L | 15 | 95.8 | 65.2 | 87.3 | 73.9 | 87.3 | 72.5 | 100.0 | 69.6 |
| L | 20 | 98.6 | 73.9 | 91.6 | 69.6 | 91.6 | 69.6 | 88.7 | 72.5 |
| L90 | 8 | 85.9 | 68.1 | 77.5 | 69.6 | 77.5 | 71.0 | 60.6 | 60.9 |
| L75 | 15 | 98.6 | 72.5 | 90.1 | 72.5 | 91.6 | 71.0 | 100.0 | 66.7 |
| L75_ES3 | 3 | 87.3 | 73.9 | 76.1 | 73.9 | 77.5 | 73.9 | 80.3 | 72.5 |
| L75_ES5 | 5 | 90.1 | 60.9 | 80.3 | 62.3 | 80.3 | 62.3 | 87.3 | 62.3 |
| Mean | | 89.7 | 67.6 | 81.3 | 68.0 | 81.9 | 67.4 | 84.1 | 67.0 |
| S.D. | | 7.6 | 5.7 | 7.5 | 5.3 | 7.8 | 5.4 | 11.9 | 4.9 |
| Median | | 90.1 | 68.1 | 80.3 | 69.6 | 80.3 | 69.6 | 84.5 | 66.7 |
| Interquartile range | | 9.9 | 11.6 | 11.2 | 10.2 | 14.1 | 10.2 | 12.7 | 10.2 |


tities selected independently by LIKNON (set L), the genetic algorithm (sets GA1, GA2 and GA3) and consensus (set G_L) again agree very well. The better performances are achieved with fewer features (GA2 and L75_ES3). GA1, GA2 and G_L are composed of the subdivisions of several contiguous segments; selections L and L75 indicate the presence of certain discriminative positions 91, 92, 93 and 110 that appear in the other subsets with better classification performances. The following approximate discriminatory segments can be derived from the feature selection results: 4—19, 30—43, 60—62 and 90—112. Most of the regions and attributes found are approximately distributed among several broader bands: 4—60, 90—112 and 180—260.

4.3. Spectral signatures for Dataset3

The results for Dataset3 are presented in Tables 7 and 8. The better classification performances were obtained with the subsets GA1, GA3 and L95_ES3. Feature selection suggests that the discriminatory locations in the spectra are distributed approximately in the segment 1440—2420. Dataset3 is a very difficult real-life dataset. Indications of the difficulty of Dataset3 are its many singular values (Table 1) and the poor benchmark accuracy on the test set (61.5%). LIKNON, applied independently for feature selection in set L, found 52 important attributes, indicating very heterogeneous data. We do not present these attributes in Tables 7 and 8.

5. Discussion and conclusions

The main purpose of our study was to investigate a methodology that effectively reduces the dimensionality of spectral data while preserving the interpretability of the retained features. The methodology consists of using the consensus of the feature selection outputs of two methods, the genetic algorithm and LIKNON. The genetic algorithm performs a heuristic search with adaptation and produces multiple signatures that are, in general, not stable; especially when the sample size is small, there may well be several equally good discriminatory subregions. LIKNON produces a single fixed solution at a time for a fixed configuration of the data. The solutions of LIKNON are stable, but do not immediately provide good classification performance. We try to take advantage of the consensus


Table 7  Spectral identities of selected subsets of Dataset3

| Name | Identity of spectral signature |
| --- | --- |
| GA1 | 444—497, 677—731, 939—997, 1202—1255, 2361—2416, 2550—2612, 2680—2731 |
| GA2 | 47—122, 805—862, 2362—2415 |
| GA3 | 65—124, 1663—1693, 2362—2415 |
| G_L90 | 1583—1650, 1663—1693, 1805—1856, 1847—1900, 1933—1959, 2415—2441 |
| G_L70 | 793—804, 1442—1495, 1583—1650, 1663—1693, 1805—1856, 1847—1900, 1864—1911, 1933—1959, 2415—2441 |
| G_L | 1583—1650, 1663—1693, 1805—1856, 1847—1900, 1859—1910, 1933—1959, 1961—1984, 2362—2415 |
| L100 | 1580, 1623, 1863, 1962, 1966, 2389 |
| L95 | 1440, 1452, 1458, 1467, 1477, 1580, 1623, 1719, 1848, 1861, 1863, 1950, 1956, 1962, 1966, 1969, 1972, 1976, 1977, 2389, 2401 |
| L95_ES3 | 1580, 1848, 1956 |
| L95_ES5 | 1848, 1861, 1956, 1966, 2399 |

LIKNON parameter C = 0.7.

Figure 3  Proximity of important subsets, identified by consensus-based feature selection for Dataset1. The centroids of the two spectral classes are: class (1) solid line, class (2) dashed line. The distribution of spectral signatures is given by shaded vertical bars, vertical lines and black brackets at the top and bottom of the axis. Shaded bars represent subset G_L, vertical lines represent subset L70, and black bars at the top and bottom bracket the union of the subsets GA1, GA2 and GA3.


of both methods. Via the search with the genetic algorithm, we conduct an exploratory analysis of the likely distribution of the discriminatory locations (region subsets) in the spectra. LIKNON then identifies the discriminatory features, producing a stable solution. By applying LIKNON to the union of the GA regions and to the individual spectral attributes, obtained with and without the frequency count, we reduce the uncertainty of the GA solutions, obtaining sets of discriminatory locations both as regions and as individual attributes.

Dataset2 was also analyzed in [17,18]. The classification accuracies obtained in these studies were higher, but the results were not interpreted or interpretable. The particular attributes identified in [18] that led to good classification performances are among the consensus results of the current study.

Agreement between the methods with regard to a particular region or attribute in the spectra suggests that this region/location is responsible for discriminating between the classes. Such information is useful for the domain expert when making decisions based on spectral measurements. This information can also be used in creating a more robust classification rule, or a multiple classification system (MCS). Investigation of the utility of the signatures identified by consensus opens an interesting research opportunity for MCS design. For example, the random subspace method (RSM) [26] builds the MCS using sets of features (in our case spectral signatures) that are obtained randomly. We believe that the incorporation of domain knowledge through consensus-based spectral signatures may be beneficial for the RSM. In general, the design of tools for improved classification of spectral data without sacrificing interpretability, by intelligently combining the consensus features/classifiers, is a topic for further research.
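The RSM combination described above can be sketched as a short skeleton. This is our illustration, not an implementation from [26] or from this study: the base learner here is a toy two-class centroid classifier, and `subspaces` may hold random index sets (as in the RSM proper) or consensus spectral signatures, as proposed in the text.

```python
import numpy as np

def centroid_fit(X, y):
    # toy base learner: one centroid per class; labels are in {-1, +1}
    return X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)

def centroid_predict(model, X):
    pos, neg = model
    return np.where(np.linalg.norm(X - pos, axis=1) <
                    np.linalg.norm(X - neg, axis=1), 1, -1)

def random_subspace_vote(X_train, y_train, X_test, subspaces, fit, predict):
    """RSM skeleton: train one base classifier per feature subset and
    combine the predictions by (signed) majority vote."""
    votes = np.array([predict(fit(X_train[:, idx], y_train), X_test[:, idx])
                      for idx in subspaces])
    return np.sign(votes.sum(axis=0))
```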

When there is uncertainty about the number and identity of the important spectral attributes, domain knowledge helps determine the optimal feature selection model, defines the structure of the features and suggests methods to identify these structures. Here, domain knowledge was concretized through the notion of a spectral signature, which suggested two appropriate methods for feature selection.

We demonstrated the potential of this approach on three real-world biomedical spectral datasets of increasing difficulty. Both methods achieve an effective reduction of dimensionality, and classification performances at least comparable to the benchmark SVM method. When the number of regions of interest is unknown, we use consensus solutions to determine the potential groups or individual features.

Agreement between the outputs of the two different feature selection methods is a useful and important practical result for improved feature selection and classification of biomedical spectra. Signatures identified independently by both methods appear to form stable groupings at definite spectral positions, and we can take advantage of the consensus of the methods for improved feature selection. Although the two methods use different optimization criteria and metrics (one relying on regions, the other on individual spectral measurements), both aim to identify maximally discriminatory features, which tend to be well separated along the spectra because these are less correlated locally.

As our main conclusion, we suggest narrowing the range of important candidate features by considering GA-, LIKNON- and consensus-identified sub-


Table 8  Classification accuracies for Dataset3

| Name | No. of features | SLP train (%) | SLP test (%) | Fisher's LDF train (%) | Fisher's LDF test (%) | GSVD NC train (%) | GSVD NC test (%) | SVM scaled train (%) | SVM scaled test (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GA1 | 7 | 91.0 | 60.3 | 88.5 | 60.3 | 89.7 | 59.0 | 75.6 | 65.4 |
| GA2 | 3 | 88.5 | 57.7 | 74.4 | 56.4 | 76.9 | 59.0 | 71.8 | 56.4 |
| GA3 | 3 | 85.9 | 57.7 | 80.8 | 59.0 | 79.5 | 59.0 | 78.2 | 60.3 |
| G_L90 | 6 | 85.9 | 56.4 | 79.5 | 52.6 | 79.5 | 50.0 | 83.3 | 55.1 |
| G_L70 | 9 | 91.0 | 51.3 | 82.1 | 50.0 | 80.8 | 47.4 | 83.3 | 55.1 |
| G_L | 8 | 85.9 | 57.7 | 80.8 | 46.2 | 79.5 | 46.2 | 80.8 | 51.3 |
| L100 | 6 | 75.6 | 48.7 | 71.8 | 47.4 | 73.1 | 47.4 | 71.8 | 53.8 |
| L95 | 22 | 100.0 | 46.2 | 94.9 | 46.2 | 94.9 | 47.4 | 100.0 | 44.9 |
| L95_ES3 | 3 | 80.8 | 61.5 | 76.9 | 59.0 | 76.9 | 59.0 | 74.4 | 60.3 |
| L95_ES5 | 5 | 84.6 | 56.4 | 80.8 | 56.4 | 79.5 | 56.4 | 79.5 | 55.1 |
| Mean | | 88.4 | 61.2 | 81.1 | 60.8 | 81.1 | 53.1 | 81.9 | 61.2 |
| S.D. | | 6.5 | 5.0 | 6.6 | 5.6 | 6.5 | 5.8 | 8.2 | 5.5 |
| Median | | 85.9 | 57.1 | 80.8 | 54.5 | 79.5 | 53.2 | 78.8 | 55.1 |
| Interquartile range | | 8.3 | 9.0 | 8.6 | 11.6 | 8.4 | 11.6 | 10.2 | 7.7 |


sets, especially when only a small set of samples is available for classifier training and testing.

Acknowledgements

We thank the reviewers for the valuable comments that helped to improve the paper. The support of the Natural Sciences and Engineering Research Council of Canada (NSERC) is gratefully acknowledged.

References

[1] Valafar F. Pattern recognition techniques in microarray data analysis: a survey. Ann NY Acad Sci 2002;980:41—64.

[2] Chen L, Liao HM, Ko M, Lin J, Yu G. A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recogn 2000;33:1713—26.

[3] Howland P, Jeon M, Park H. Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition. SIAM J Matrix Anal Appl 2003;25(1):165—79.

[4] Yu H, Yang J. A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recogn 2001;34:2067—70.

[5] Bhattacharyya C, Grate LR, Rizki A, et al. Simultaneous relevant feature identification and classification in high-dimensional spaces: application to molecular profiling data. Signal Process 2003;83(4):729—43.

[6] Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn 2002;46:389—422.

[7] Li Y, Campbell C, Tipping M. Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics 2002;18:1332—9.

[8] Nikulin AE, Dolenko B, Bezabeh T, Somorjai R. Near-optimal region selection for feature space reduction: novel preprocessing methods for classifying MR spectra. NMR Biomed 1998;11:209—16.

[9] Shevade SK, Keerthi SS. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 2003;19:2246—53.

[10] Somorjai RL, Dolenko B, Baumgartner R. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics 2003;19:1484—91.

[11] Guo GD, Dyer CR. Simultaneous feature selection and classifier training via linear programming: a case study for face expression recognition. In: Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2003, vol. 1. IEEE Computer Society; 2003. p. 346—52.

[12] Kecman V. Learning and soft computing: support vector machines, neural networks, and fuzzy logic models. Cambridge, MA: MIT Press, 2001.

[13] Kecman V, Hadzic I. Support vectors selection by linear programming. In: Proceedings of the IEEE—INNS—ENNS International Joint Conference on Neural Networks, IJCNN 2000, vol. 5. IEEE Computer Society; 2000. p. 193—8.

[14] Mika S, Ratsch G, Weston J, Scholkopf B, Smola A, Muller KR. Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces. IEEE PAMI 2003;25(5):623—8.

[15] Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS 2002;99(10):6562—6.

[16] Himmelreich U, Somorjai RL, Dolenko B, Lee OC, Daniel H-M, Mountford CE, et al. Rapid identification of Candida species using nuclear magnetic resonance spectroscopy and a statistical classification strategy. Appl Environ Microbiol 2003;69:4566—74.

[17] Raudys S, Somorjai R, Baumgartner R. Reducing the overconfidence of base classifiers when combining their decisions. In: Windeatt T, Roli F, editors. Proceedings of the Fourth International Workshop on Multiple Classifier Systems, MCS 2003, Lecture Notes in Computer Science 2709. Berlin, Heidelberg: Springer-Verlag; 2003. p. 65—73.

[18] Somorjai R, Janeliunas A, Baumgartner R, Raudys S. Comparison of two classification methodologies on a real-world biomedical problem. In: Caelli T, Amin A, Duin RPW, Kamel MS, de Ridder D, editors. Proceedings of the Joint IAPR International Workshops on Structural, Syntactic, and Statistical Pattern Recognition, SSPR 2002 and SPR 2002, Lecture Notes in Computer Science 2396. Berlin, Heidelberg: Springer-Verlag; 2003. p. 433—41.

[19] Somorjai RL, Dolenko B, Nikulin A, Nickerson P, Rush D, Shaw A, et al. Distinguishing normal allografts from biopsy-proven rejections: application of a three-stage classification strategy to urine MR and IR spectra. Vib Spectrosc 2002;28(1):97—102.

[20] Fukunaga K. Introduction to statistical pattern recognition. San Diego, CA: Academic Press, 1990.

[21] Duda R, Hart P, Stork DG. Pattern classification. New York: Wiley, 2001.

[22] Bradley PS, Mangasarian OL, Street WN. Feature selection via mathematical programming. INFORMS J Comput 1998;10(2):209—17.

[23] Pedroso JP, Murata N. Support vector machines with different norms: motivation, formulations and results. Pattern Recogn Lett 2001;12(2):1263—72.

[24] Shashua A. On the equivalence between the support vector machine for classification and sparsified Fisher’s linear discriminant. Neural Process Lett 1999;9(2):129—39.

[25] Chang C, Lin CJ. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.

[26] Ho TK. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 1998;20(8):832—44.
