Class-Specific Feature Selection for
One-Against-All Multiclass SVMs
Gaël de Lannoy, Damien François and Michel Verleysen
Université catholique de Louvain
Institute of Information and Communication Technologies, Electronics and Applied Mathematics
Machine Learning Group
Place du Levant 3, Louvain-la-Neuve, Belgium
Abstract. This paper proposes a method to perform class-specific feature selection in multiclass support vector machines addressed with the one-against-all strategy. The main issue arises at the final step of the classification process, where binary classifier outputs must be compared one against another to elect the winning class. This comparison may be biased towards one specific class when the binary classifiers are built on distinct feature subsets. This paper proposes a normalization of the binary classifiers' outputs that allows fair comparisons in such cases.
1 Introduction
Many supervised classification tasks in a wide variety of domains involve multiclass targets. A frequently used and simple method for solving these problems is to train several off-the-shelf binary support vector machine (SVM) classifiers and to extend their decisions to multiclass targets by using the one-against-one (OAO) or the one-against-all (OAA) approach. A vast literature exists on the pros and cons of these two approaches; comprehensive reviews can be found, for example, in [1] and [2].
In the OAA approach, the output value of each competing classifier is used in the decision rule, rather than the thresholded class prediction as in the OAO approach. The problem with this OAA decision rule is that every classifier participating in the decision is assumed to be equally reliable, which is rarely the case. This problem has previously been addressed in [3], where a classifier reliability measure is included in the OAA decision process; experiments show that the performances are improved.
Nevertheless, despite the interesting performance increase, one major drawback of this reliability measure is that the competing classifiers must be trained on the same feature sets to keep their output values comparable. However, the optimal feature subsets might be different for each one-against-all sub-problem, and it is known that spurious features can harm a classifier even if the latter is able to prune out features intrinsically [4]. In such situations, the feature selection step should rather be made where the training of the model actually happens, that is, at the class level rather than at the multiclass level.
In this work, we show how such a reliability measure can be modified to overcome this limitation, and therefore allow the feature selection to be made at,
and optimized for, the binary classifier level where the training actually happens. The remainder of this paper is organized as follows. Section 2 provides a
short overview of the theoretical background on the methods used in this work.
Section 3 introduces the classifier reliability measure and shows how this measure can be included in the OAA decision. Section 4 describes the experiments and the results.
2 One-against-all strategy for multiclass SVMs
SVMs are linear machines that rely on a preprocessing step to represent the feature vectors in a space of higher dimension, typically much higher than the original feature space. With an appropriate non-linear mapping $\varphi(x)$ to a sufficiently high-dimensional space, finite data from two categories can always be separated by a hyperplane. In SVMs, this hyperplane is chosen as the one with the largest margin. SVMs were originally designed for binary classification tasks [5]. This two-class formulation of SVMs, where $y_i \in \{-1, 1\}$, can be extended to solve multiclass problems where $y_i \in \{1, 2, \ldots, M\}$ by constructing $M$ binary classifiers, each classifier being trained with the examples of one class with a positive label and all the other samples with a negative label.
Let $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ be a set of $n$ training samples, where $x_i \in \mathbb{R}^p$ is a $p$-dimensional feature vector and $y_i \in \{-1, 1\}$ is the associated binary class label. In SVMs, the $j$th classifier yields the following decision function:

$$f_j(x) = w_j^T \varphi(x) + b_j \qquad (1)$$

where $w_j$ and $b_j$ are the parameters of the hyperplane obtained during the training of the $j$th classifier. Geometrically, $f_j(x)$ corresponds to the distance between $x$ and the functional margin of classifier $j$. At the classification phase, a new observation is then assigned to the class $j$ which produces the largest output value amongst the $M$ classifiers:

$$j = \arg\max_{j=1 \ldots M} f_j(x) = \arg\max_{j=1 \ldots M} \left( w_j^T \varphi(x) + b_j \right). \qquad (2)$$
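To make the decomposition and the decision rule of Eq. (2) concrete, here is a minimal sketch in Python. The use of scikit-learn and the function names train_oaa and predict_oaa are our own illustrative choices; the paper does not prescribe an implementation.

```python
# Minimal one-against-all (OAA) sketch, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC

def train_oaa(X, y, C=1.0, gamma="scale"):
    """Train one binary SVM per class: class j positive, all others negative."""
    classifiers = {}
    for j in np.unique(y):
        y_bin = np.where(y == j, 1, -1)
        classifiers[j] = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y_bin)
    return classifiers

def predict_oaa(classifiers, X):
    """Eq. (2): pick the class whose classifier yields the largest f_j(x)."""
    labels = sorted(classifiers)
    # decision_function returns f_j(x) = w_j^T phi(x) + b_j for each sample
    F = np.column_stack([classifiers[j].decision_function(X) for j in labels])
    return np.asarray(labels)[np.argmax(F, axis=1)]
```

Note that the raw values $f_j(x)$ are compared directly across classifiers, which is precisely the assumption questioned in the next section.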
3 Improving the OAA decision
One major drawback of the OAA approach for solving multiclass problems is that the classifier generating the highest value from its decision function is selected as the winner, without considering the reliability of each classifier. The two underlying assumptions behind this approach are, first, that the classifiers are equally reliable and, second, that they have been constructed on the same features. This section first recalls Liu and Zheng's reliability measure [3] associated with an SVM classifier, which overcomes the first assumption. Second, we show that this measure can be improved to permit the use of distinct feature sets for each binary classifier. Finally, an improved decision rule for the OAA approach based on the reliability measure is given.
3.1 Reliability measure
To overcome the first assumption, one would obviously consider the output of a classifier more reliable if the true generalization error $R = E[y \neq \mathrm{sign}(f(x))]$ is small. Unfortunately, this value is always unknown and must be estimated from data by the empirical error $\hat{R} = 1 - \frac{1}{n} \sum_{i=1}^{n} [y_i = \mathrm{sign}(f(x_i))]$. However, when the number of training samples is relatively small compared to the number of features, it has been shown that a small empirical $\hat{R}$ does not guarantee a small $R$ [6].
For this reason, a better classifier reliability measure can be based on an upper bound of $R$. Indeed, minimizing the SVM objective function has been shown to also minimize an upper bound on the true generalization error $R$ [6]. Following this idea, the following reliability measure $\lambda$ has been proposed in [3]:

$$\lambda = \exp\left( - \frac{\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (1 - y_i f(x_i))_+}{Cn} \right), \qquad (3)$$

where $(z)_+ = z$ if $z > 0$ and $0$ otherwise. The $Cn$ denominator is included to cancel the effect of different training set sizes and regularization parameter values. In the linearly separable case, the $\lambda$ reliability measure associated with a classifier is large if its geometrical margin $\frac{2}{\|w\|^2}$ is large.
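As an illustration, $\lambda$ can be computed directly from a trained binary classifier. The sketch below assumes a linear SVM via scikit-learn's LinearSVC (so that $\|w\|^2$ is available from the primal weights; for a kernel machine it would have to be recovered from the dual coefficients), and the helper name reliability_lambda is ours.

```python
# Sketch of the reliability measure lambda of Eq. (3) for a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

def reliability_lambda(clf, X, y, C):
    """lambda = exp(-(0.5*||w||^2 + C * sum of hinge losses) / (C*n)),
    for binary labels y in {-1, +1}."""
    w = clf.coef_.ravel()
    f = clf.decision_function(X)              # f(x_i) = w^T x_i + b
    hinge = np.maximum(0.0, 1.0 - y * f)      # (1 - y_i f(x_i))_+
    objective = 0.5 * (w @ w) + C * hinge.sum()
    return np.exp(-objective / (C * len(y)))
```

A large margin and few margin violations yield an objective close to zero and hence a $\lambda$ close to one, while unreliable classifiers are pushed towards zero.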
3.2 Reliability measure with distinct feature sets
By removing most irrelevant and redundant features from the data, feature selection helps improve the performance of learning models by alleviating the effect of the curse of dimensionality, enhancing the generalization capability, speeding up the learning process and improving model interpretability. In the OAA approach, one classifier is built to discriminate each class against all the others. Each feature can however have a different discriminative power for each of the binary classifiers, and useless features can harm a classifier even if it is able to adapt its weights accordingly during the learning process [4]. This situation is known to happen, for example, in the classification of heart beats, where it has been observed that the duration between successive heart beats is discriminative for some cardiac pathologies while the morphology of the heart beats is discriminative for others [7]. In such situations, the selection of features should thus be made at the class level rather than at the global level. Nevertheless, building each classifier in a distinct feature space would make the comparison of the output values unreliable.
To alleviate the effect of dealing with distinct numbers of features, a weighting by the cardinality of $w$ is inserted into Eq. (3):
$$\beta = \exp\left( - \frac{\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (1 - y_i f(x_i))_+}{Cn \, \|w\|_0} \right). \qquad (4)$$
The effect of the cardinality is to normalize the squared Euclidean norm of $w$ with respect to the dimension of the space in which it lives, i.e. the size of the
selected feature subset. This kind of normalization is rather common in tools
aimed at missing data analysis [8].
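A corresponding sketch for $\beta$ only changes the denominator of the exponent: it is further divided by $\|w\|_0$, the number of non-zero weights, which equals the size of the selected subset when the classifier is trained on those features alone. As before, the scikit-learn API and the helper name are illustrative assumptions.

```python
# Sketch of the normalized reliability measure beta of Eq. (4).
import numpy as np

def reliability_beta(clf, X, y, C):
    """beta = exp(-(0.5*||w||^2 + C * sum of hinge losses) / (C * n * ||w||_0))."""
    w = clf.coef_.ravel()
    f = clf.decision_function(X)
    hinge = np.maximum(0.0, 1.0 - y * f)
    objective = 0.5 * (w @ w) + C * hinge.sum()
    card = np.count_nonzero(w)                # ||w||_0: number of features used
    return np.exp(-objective / (C * len(y) * card))
```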
3.3 Improved OAA decision rule
Assume $M$ classifiers have been trained, each on an optimal subset of features. The reliability measure $\beta_j$ is also computed for each of the classifiers. Given a new sample $x$, $f_j$ is evaluated for $1 \leq j \leq M$ according to Eq. (1) and a soft decision $z_j \in [-1, 1]$ is generated:

$$z_j = \mathrm{sign}(f_j(x)) \left( 1 - \exp(-|f_j(x)|) \right). \qquad (5)$$

The output of each classifier is then weighted by the associated reliability measure, and the sample $x$ is assigned to the class $j$ according to:

$$j = \arg\max_{j=1 \ldots M} z_j(x) \, \beta_j. \qquad (6)$$
The weighting of the output of each classifier by its associated $\beta$ measure penalizes classifiers with a small margin and a poor generalization ability, and also allows every competing classifier to have distinct features, distinct meta-parameters and a distinct number of observations.
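Putting the pieces together, an end-to-end sketch of the improved rule could look as follows. It reuses the reliability_beta helper from the previous sketch; the subsets argument, a hypothetical mapping from each class label to the indices of its selected features, would be produced by any class-wise feature selection procedure.

```python
# End-to-end sketch of the beta-weighted OAA decision rule (Eqs. 5-6),
# assuming scikit-learn; reliability_beta is defined in the previous sketch.
import numpy as np
from sklearn.svm import LinearSVC

def train_class_specific_oaa(X, y, subsets, C=1.0):
    """subsets[j] holds the feature indices selected for class j."""
    models = {}
    for j, idx in subsets.items():
        y_bin = np.where(y == j, 1, -1)
        clf = LinearSVC(C=C).fit(X[:, idx], y_bin)
        beta = reliability_beta(clf, X[:, idx], y_bin, C)
        models[j] = (clf, idx, beta)
    return models

def predict_class_specific_oaa(models, X):
    labels = sorted(models)
    scores = []
    for j in labels:
        clf, idx, beta = models[j]
        f = clf.decision_function(X[:, idx])
        z = np.sign(f) * (1.0 - np.exp(-np.abs(f)))   # soft decision, Eq. (5)
        scores.append(z * beta)                        # reliability weighting, Eq. (6)
    return np.asarray(labels)[np.argmax(np.column_stack(scores), axis=1)]
```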
4 Experimental results
The proposed weighted OAA decision rule is evaluated on three multiclass datasets from the UCI repository (http://archive.ics.uci.edu/ml/). The details of the three datasets are shown in Tab. 1. Five methods are compared in the experiments:
1. OAA without feature selection;
2. OAA with global feature selection;
3. λ-weighted OAA without feature selection;
4. λ-weighted OAA with class-wise feature selection;
5. β-normalized OAA with class-wise feature selection.
The selection of features is achieved using a permutation test with the mutual information criterion in a naive ranking approach [9]. The RBF kernel is used in the SVM classifiers. The regularization parameter and the kernel parameter are optimized on the training set using 5-fold cross-validation over a wide range of values, and the performances are evaluated on the test set.
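The paper does not spell out the permutation test, so the following is only a loose sketch in the spirit of [9]: each feature is scored by its mutual information with the binary labels of one OAA sub-problem, and retained when the score exceeds a null threshold obtained by permuting the labels. mutual_info_classif is scikit-learn's MI estimator; n_perm and alpha are illustrative parameters.

```python
# Hedged sketch of permutation-thresholded mutual-information ranking.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_features_permutation(X, y_bin, n_perm=100, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    mi = mutual_info_classif(X, y_bin, random_state=seed)
    # Null distribution: MI of every feature against permuted labels
    null = np.stack([
        mutual_info_classif(X, rng.permutation(y_bin), random_state=seed)
        for _ in range(n_perm)
    ])
    threshold = np.quantile(null, 1.0 - alpha, axis=0)
    return np.flatnonzero(mi > threshold)     # indices of retained features
```

Running this once per one-against-all sub-problem yields the class-wise subsets used by methods 4 and 5 above.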
The classification errors for the five methods are presented in Table 2, together with the percentage of selected features. When the feature selection is achieved at the class level, the average feature selection percentage is reported. The results surprisingly show that the weighting by the $\lambda$ reliability measure does not always improve the classification performance.
Name          Training   Test   Classes   Features
Segmentation       210   2100         7         19
Vehicle            676    170         4         18
Isolet            3119   1559        26        617

Table 1: Number of samples, classes and features of the three datasets used in the experiments. For the Isolet dataset, only a subsample (50%) of the original training data has been considered.
                          Segmentation          Vehicle               Isolet
Weighting   Selection     Error   Features      Error   Features      Error   Features
none        none          10.0%     100%        17.7%     100%        4.5%      100%
none        global         8.4%      78%        17.7%     100%        4.5%      100%
λ           none           9.8%     100%        18.8%     100%        9.5%      100%
λ           class          8.8%      65%        20.0%      93%        7.5%       78%
β           class          6.9%      65%        17.0%      93%        3.9%       78%

Table 2: Comparison of the classification errors for the five methods. The percentage of selected features is also reported.
However, the best results are achieved by the $\beta$ weighting combined with class-wise feature selection. In particular, the results obtained with the class-wise feature selection and the $\beta$ weighting are better than with the $\lambda$ weighting and the same class-wise feature selection. This shows the need to include the so-called zero-norm of $w$ in the computation of the reliability measure when a distinct subset of features is used in each classifier. Furthermore, the results obtained with the class-wise feature selection and the $\beta$ weighting are better than with the global feature selection. This confirms the benefit of class-level feature selection over global feature selection.
5 Conclusion
Most methods for multiclass classification assume that there is an optimal subset of features that is common to all classes, while in many applications this may not be the case. In the one-against-all approach, using distinct feature subsets for each class might however lead to unfair and biased final decision rules. To alleviate this problem, the outputs of the competing classifiers should be normalized before being compared. The normalization proposed in this paper takes into account the number of features used and a measure of the reliability of the classifier. On three standard benchmark datasets, the proposed approach, used in conjunction with support vector machines, yields better results than selecting features across all classes.
The class-dependent feature selection methodology improves performance compared with a feature selection common to all classes. It furthermore
brings insights into the relationships between the features and the specific target classes.
Acknowledgments
Gaël de Lannoy is funded by a F.R.I.A. grant. Computations have been run on the Lemaitre cluster thanks to the "Calcul Intensif et Stockage de Masse" (CISM) of the Université catholique de Louvain.
References
[1] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
[2] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
[3] Yi Liu and Y.F. Zheng. One-against-all multi-class SVM classification using reliability measures. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN '05), volume 2, pages 849–854, 2005.
[4] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[5] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 1st edition, March 2000.
[6] V. N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.
[7] K.S. Park, B.H. Cho, D.H. Lee, S.H. Song, J.S. Lee, Y.J. Chee, I.Y. Kim, and S.I. Kim. Hierarchical support vector machine based heartbeat classification using higher order statistics and Hermite basis functions. In Computers in Cardiology, 2008, pages 229–232, September 2008.
[8] Pedro J. García-Laencina, José-Luis Sancho-Gómez, Aníbal R. Figueiras-Vidal, and Michel Verleysen. K nearest neighbours with mutual information for simultaneous classification and missing data imputation. Neurocomputing, 72(7-9):1483–1493, 2009.
[9] Damien François, Fabrice Rossi, Vincent Wertz, and Michel Verleysen. Resampling methods for parameter-free and robust feature selection with mutual information. Neurocomputing, 70:1276–1288, 2007.