Efficient Two Stage Voting Architecture for
Pairwise Multi-label Classification
Gjorgji Madjarov, Dejan Gjorgjevikj and Tomche Delev
Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and
Methodius University, Rugjer Boshkovikj bb, 1000 Skopje, R. of Macedonia
{madzarovg,dejan,tdelev}@feit.ukim.edu.mk
Abstract. A common approach for solving multi-label classification problems using problem-transformation methods and dichotomizing classifiers is the pair-wise decomposition strategy. One of the problems with this approach is the need to query a quadratic number of binary classifiers for making a prediction, which can be quite time consuming, especially in classification problems with a large number of labels. To tackle this problem we propose a two stage voting architecture (TSVA) that extends efficient pair-wise multiclass voting to the multi-label setting and is closely related to the calibrated label ranking method. Four different real-world datasets (enron, yeast, scene and emotions) were used to evaluate the performance of the TSVA. The performance of this architecture was compared with the calibrated label ranking method with majority voting strategy and the quick weighted voting algorithm (QWeighted) for pair-wise multi-label classification. The results from the experiments suggest that the TSVA significantly outperforms the competing algorithms in terms of testing speed while keeping comparable or offering better prediction performance.
Keywords: Multi-label, classification, calibration, ranking
1 Introduction
Traditional single-label classification is concerned with learning from a set of examples that are associated with a single label $\lambda_i$ from a finite set of disjoint labels $L = \{\lambda_1, \lambda_2, ..., \lambda_Q\}$, $Q > 1$. If $Q = 2$, the learning problem is called a binary classification problem, while if $Q > 2$, it is called a multi-class classification problem. On the other hand, multi-label classification is concerned with learning from a set of examples $S = \{(x_1, Y_1), (x_2, Y_2), ..., (x_p, Y_p)\}$ ($x_i \in X$, where $X$ denotes the domain of examples) in which each example is associated with a set of labels $Y_i \subseteq L$.
Many classifiers were originally developed for solving binary decision problems, and their extensions to multi-class and multi-label problems are not straightforward. Because of that, a common approach to the multi-label classification problem is to use class binarization methods, i.e. to decompose the problem into several binary subproblems that can then be solved using a binary base learner. The simplest strategy in the multi-label setting is the one-against-all strategy, also referred to as the binary relevance method. It addresses the multi-label classification problem by learning one classifier (model) $M_k$ ($1 \le k \le Q$) for each class, using all the examples labeled with that class as positive examples and all other (remaining) examples as negative examples. At query time, each binary classifier predicts whether its class is relevant for the query example or not, resulting in a set of relevant labels.
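To make the decomposition concrete, here is a minimal sketch of binary relevance in Python, assuming the labels are given as a 0/1 indicator matrix; the scikit-learn-style estimator interface is an illustrative assumption, not the MULAN/Weka stack actually used in this paper.

```python
import numpy as np
from sklearn.svm import SVC

class BinaryRelevance:
    """One-against-all decomposition: one binary model M_k per label."""

    def __init__(self, Q):
        self.models = [SVC(kernel="rbf", probability=True) for _ in range(Q)]

    def fit(self, X, Y):
        # Y[i, k] == 1 iff label lambda_k is relevant for example i.
        for k, model in enumerate(self.models):
            model.fit(X, Y[:, k])  # examples of class k vs. all the rest
        return self

    def predict(self, X):
        # Each M_k independently decides whether its label is relevant.
        return np.stack([m.predict(X) for m in self.models], axis=1)
```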
Another approach for solving the multi-label classification problem using binary classifiers is pair-wise classification, or round robin classification [1][2]. Its basic idea is to use $Q(Q-1)/2$ classifiers covering all pairs of labels. Each classifier is trained using the samples of the first label as positive examples and the samples of the second label as negative examples. To combine these classifiers, the pair-wise classification method naturally adopts the majority voting algorithm. Given a test instance, each classifier delivers a prediction for one of the two labels. This prediction is decoded into a vote for one of the labels. After the evaluation of all $Q(Q-1)/2$ classifiers the labels are ordered according to their sums of votes. To predict only the relevant classes for each instance a label ranking algorithm is used. Label ranking studies the problem of learning a mapping from a set of instances to rankings over a finite number of predefined labels. It can be considered a natural generalization of conventional classification, where only a single label (the top label) is requested instead of a ranking of all labels.
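A corresponding sketch of pair-wise decomposition with majority voting, under the same indicator-matrix assumption; skipping examples that carry both or neither of the two labels is one possible convention, not prescribed by the paper.

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def train_pairwise(X, Y, Q):
    """Train Q(Q-1)/2 models M_ij, each on the examples of labels i and j."""
    models = {}
    for i, j in itertools.combinations(range(Q), 2):
        # Examples labelled with lambda_i are positive, lambda_j negative;
        # examples carrying both (or neither) label are left out here.
        mask = Y[:, i] != Y[:, j]
        models[(i, j)] = SVC(kernel="rbf").fit(X[mask], Y[mask, i])
    return models

def majority_vote(x, models, Q):
    """Query all Q(Q-1)/2 models and rank labels by their vote counts."""
    votes = np.zeros(Q)
    for (i, j), model in models.items():
        winner = i if model.predict(x.reshape(1, -1))[0] == 1 else j
        votes[winner] += 1
    return np.argsort(-votes)  # labels ordered by decreasing vote count
```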
Brinker et al. [3] propose a conceptually new technique for extending the common pair-wise learning approach to the multi-label scenario, named calibrated label ranking. The key idea of calibrated label ranking is to introduce an artificial (calibration) label $\lambda_0$ that represents the split-point between relevant and irrelevant labels. The calibration label $\lambda_0$ is assumed to be preferred over all irrelevant labels, while all relevant labels are preferred over it. At prediction time (when the majority voting strategy is usually used), one obtains a ranking over $Q+1$ labels (the $Q$ original labels plus the calibration label). Calibrated label ranking can thus be considered a combination of multi-label classification and ranking.
Besides majority voting, which is the strategy usually used in the prediction phase of the calibrated label ranking algorithm, Park et al. [4] propose another, more efficient voting algorithm named Quick Weighted Voting (QWeighted). QWeighted computes the class with the highest accumulated voting mass while avoiding the evaluation of all possible pair-wise classifiers. It exploits the fact that during a voting procedure some classes can be excluded from the set of possible top rank classes early in the process, as soon as it becomes clear that even if they received the maximal voting mass in all remaining evaluations they could no longer exceed the current maximum. Pair-wise classifiers are selected depending on a voting loss value, which is the number of votes that a class has not received. The voting loss starts at zero and increases monotonically with the number of performed preference evaluations. The class with the current minimal loss is the top candidate for the top rank. If all preferences involving this class have been evaluated (and it still has the lowest loss), it can be concluded that no other class can achieve a better ranking. Thus, the QWeighted algorithm always focuses on classes with low voting loss. The adaptation of QWeighted to multi-label classification (QWeightedML) [5] repeats this process until the returned class is the artificial label $\lambda_0$, at which point all remaining classes are considered irrelevant.
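The control flow just described can be summarized in a short sketch. It assumes a callable evaluate(i, j) that queries the pair-wise model $M_{ij}$ and returns the losing label, with label 0 standing in for the calibration label $\lambda_0$; these interfaces are illustrative, not taken from [4] or [5].

```python
def qweighted_top(labels, evaluate, evaluated, loss):
    """Return the top-ranked label among `labels`, evaluating as few
    pair-wise models as possible (the QWeighted strategy)."""
    while True:
        # Candidate: the label with the fewest lost votes so far.
        top = min(labels, key=lambda k: loss[k])
        opponents = [j for j in labels
                     if j != top and (top, j) not in evaluated]
        if not opponents:
            return top  # all preferences involving `top` evaluated: done
        j = opponents[0]
        loser = evaluate(top, j)  # query one pair-wise model, get the loser
        evaluated.add((top, j)); evaluated.add((j, top))
        loss[loser] += 1

def qweighted_ml(labels_with_calibration, evaluate):
    """QWeightedML: repeat QWeighted until the calibration label wins."""
    labels = set(labels_with_calibration)  # includes the artificial label 0
    loss = {k: 0 for k in labels}
    evaluated, relevant = set(), []
    while True:
        top = qweighted_top(labels, evaluate, evaluated, loss)
        if top == 0:               # calibration label ranked first: stop
            return relevant
        relevant.append(top)
        labels.remove(top)         # earlier votes are reused for the rest
```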
In this paper we propose an efficient Two Stage Voting Architecture (TSVA) that modifies the majority voting algorithm for the calibrated label ranking technique [6]. We have evaluated the performance of this architecture on a selection of multi-label datasets that vary in terms of problem domain and number of labels. The results demonstrate that our modification outperforms the majority voting algorithm for pair-wise multi-label classification and the QWeightedML [5] algorithm in terms of testing speed, while keeping comparable prediction results.
For the readers' convenience, Section 2 briefly introduces the notation and evaluation metrics used in multi-label learning. The Two Stage Voting Architecture is explained in Section 3. The experimental results comparing the performance of the proposed TSVA with the competing methods are presented in Section 4. Section 5 gives a conclusion.
2 Preliminaries
Let $X$ denote the domain of instances and let $L = \{\lambda_1, \lambda_2, ..., \lambda_Q\}$ be the finite set of labels. Given a training set $S = \{(x_1, Y_1), (x_2, Y_2), ..., (x_p, Y_p)\}$ ($x_i \in X$, $Y_i \subseteq L$), the goal of the learning system is to output a multi-label classifier $h: X \to 2^L$ which optimizes some specific evaluation metric. In most cases, however, instead of outputting a multi-label classifier, the learning system will produce a real-valued function of the form $f: X \times L \to \mathbb{R}$. It is supposed that, given an instance $x_i$ and its associated label set $Y_i$, a successful learning system will tend to output larger values for labels in $Y_i$ than for those not in $Y_i$, i.e. $f(x_i, y_1) > f(x_i, y_2)$ for any $y_1 \in Y_i$ and $y_2 \notin Y_i$. The real-valued function $f(\cdot,\cdot)$ can be transformed to a ranking function $rank_f(\cdot,\cdot)$, which maps the outputs of $f(x_i, y)$ for any $y \in L$ to $\{1, 2, ..., Q\}$ such that if $f(x_i, y_1) > f(x_i, y_2)$ then $rank_f(x_i, y_1) < rank_f(x_i, y_2)$. Note that the corresponding multi-label classifier $h(\cdot)$ can also be derived from the function $f(\cdot,\cdot)$: $h(x_i) = \{y \mid f(x_i, y) > t(x_i),\ y \in L\}$, where $t(\cdot)$ is a threshold function which is usually set to the zero constant function. Performance evaluation of a multi-label learning system differs from that of a classical single-label learning system. The following multi-label evaluation metrics proposed in [7] are used in this paper:
(1) Hamming loss: evaluates how many times an instance-label pair is misclassified, i.e. a label not belonging to the instance is predicted or a label belonging to the instance is not predicted. The performance is perfect when $hloss_S(h) = 0$; the smaller the value of $hloss_S(h)$, the better the performance. This metric is given by

\[ hloss_S(h) = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{Q} \left| h(x_i) \,\Delta\, Y_i \right| \tag{1} \]

where $\Delta$ stands for the symmetric difference between two sets and $Q$ is the total number of possible class labels. Note that when $|Y_i| = 1$ for all instances, the multi-label system reduces to a multi-class single-label one and the Hamming loss becomes $2/Q$ times the usual classification error.
While the Hamming loss is based on the multi-label classifier $h(\cdot)$, the other four metrics are defined based on the real-valued function $f(\cdot,\cdot)$ and take into account the ranking quality of the different labels for each instance:
(2) One-error: evaluates how many times the top-ranked label is not in the set of proper labels of the instance. The performance is perfect when $\text{one-error}_S(f) = 0$; the smaller the value of $\text{one-error}_S(f)$, the better the performance. This evaluation metric is given by

\[ \text{one-error}_S(f) = \frac{1}{p} \sum_{i=1}^{p} \left[\!\left[ \arg\max_{y \in L} f(x_i, y) \notin Y_i \right]\!\right] \tag{2} \]

where for any predicate $\pi$, $[\![\pi]\!]$ equals 1 if $\pi$ holds and 0 otherwise. Note that, for single-label classification problems, the one-error is identical to ordinary classification error.
(3) Coverage: evaluates how far, on average, one needs to go down the list of ranked labels in order to cover all the proper labels of the instance. The smaller the value of $coverage_S(f)$, the better the performance.

\[ coverage_S(f) = \frac{1}{p} \sum_{i=1}^{p} \max_{y \in Y_i} rank_f(x_i, y) - 1 \tag{3} \]
(4) Ranking loss: evaluates the average fraction of label pairs that are reversely ordered for the particular instance:

\[ rloss_S(f) = \frac{1}{p} \sum_{i=1}^{p} \frac{|D_i|}{|Y_i|\,|\bar{Y}_i|} \tag{4} \]

where $D_i = \{(y_1, y_2) \mid f(x_i, y_1) \le f(x_i, y_2),\ (y_1, y_2) \in Y_i \times \bar{Y}_i\}$ and $\bar{Y}$ denotes the complementary set of $Y$ in $L$. The smaller the value of $rloss_S(f)$, the better the performance; the performance is perfect when $rloss_S(f) = 0$.
(5) Average precision: evaluates the average fraction of labels ranked above a particular label $y \in Y_i$ that actually are in $Y_i$. The performance is perfect when $avgprec_S(f) = 1$; the bigger the value of $avgprec_S(f)$, the better the performance. This metric is given by

\[ avgprec_S(f) = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{|L_i|}{rank_f(x_i, y)} \tag{5} \]

where $L_i = \{y' \mid rank_f(x_i, y') \le rank_f(x_i, y),\ y' \in Y_i\}$.
Note that in the rest of this paper, the performance of the multi-label learning algorithms is evaluated based on the five metrics explained above.
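As a compact reference, the five metrics can be computed from a score matrix $F$ and a 0/1 relevance matrix $Y$ as in the following NumPy sketch; it assumes every instance has at least one relevant and one irrelevant label, so all denominators are non-zero.

```python
import numpy as np

def ranks(scores):
    """rank_f for one instance: the best-scoring label gets rank 1."""
    order = np.argsort(-scores)
    r = np.empty_like(order)
    r[order] = np.arange(1, len(scores) + 1)
    return r

def evaluate_metrics(F, Y, t=0.0):
    """F: scores f(x_i, y), shape (p, Q); Y: 0/1 relevance matrix."""
    p, Q = F.shape
    H = (F > t).astype(int)                      # h derived by thresholding
    hloss = np.mean(np.sum(H != Y, axis=1) / Q)                        # (1)
    one_err = np.mean([Y[i, np.argmax(F[i])] == 0 for i in range(p)])  # (2)
    cov = rloss = avgprec = 0.0
    for i in range(p):
        r = ranks(F[i])
        rel = np.flatnonzero(Y[i] == 1)
        irr = np.flatnonzero(Y[i] == 0)
        cov += r[rel].max() - 1                                        # (3)
        # fraction of (relevant, irrelevant) pairs that are reversed
        rloss += np.mean([F[i, a] <= F[i, b] for a in rel for b in irr])  # (4)
        avgprec += np.mean([(r[rel] <= r[y]).sum() / r[y] for y in rel])  # (5)
    return hloss, one_err, cov / p, rloss / p, avgprec / p
```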
3 Two Stage Voting Architecture (TSVA)
The conventional pair-wise approach learns a model $M_{ij}$ for each combination of labels $\lambda_i$ and $\lambda_j$ with $1 \le i < j \le Q$. This way $Q(Q-1)/2$ different pair-wise models are learned. Each pair-wise model $M_{ij}$ is learned with the examples labelled with $\lambda_i$ as positive examples and the examples labelled with $\lambda_j$ as negative examples. The main disadvantage of this approach is that in the prediction phase a quadratic number of base classifiers (models) has to be consulted for each test example.
Further, as a result of introducing the artificial calibration label $\lambda_0$ in the calibrated label ranking algorithm, the number of base classifiers is increased by $Q$, i.e. an additional set of $Q$ binary preference models $M_{0k}$ ($1 \le k \le Q$) is learned. The models $M_{0k}$ that are learned by the pair-wise approach to calibrated ranking are equivalent to the models $M_k$ learned by conventional binary relevance. At prediction time (when the standard majority voting algorithm is usually used) each test instance needs to consult all the models (classifiers) in order to rank the labels by their order of preference. This results in slow testing, especially when the number of labels in the problem is large.
In this paper we propose an efficient two stage voting architecture that modifies the majority voting algorithm for the calibrated label ranking technique. It reduces the number of base classifiers that need to be consulted in order to make a final prediction for a given test instance. The number of base classifiers trained by the calibrated label ranking algorithm and by the TSVA in the learning process is the same.
The proposed (TSV) architecture is organized in two layers. The first layer of the architecture contains $Q$ classifiers, while the second layer contains the remaining $Q(Q-1)/2$ classifiers. The classifiers in the first layer are the binary relevance models $M_{0k}$, while the pair-wise models $M_{ij}$ are located in the second layer. Each model $M_{0k}$ from the first layer is connected with the $Q-1$ models $M_{ij}$ from the second layer for which $k = i$ or $k = j$ ($1 \le i \le Q-1$, $i+1 \le j \le Q$). An example of the TSVA for solving a four-class multi-label classification problem is shown in Fig. 1.
At prediction time, each model $M_{0k}$ of the first layer of the architecture tries to determine the relevant labels for the corresponding test example. Each model $M_{0k}$ gives the probability (the output value of the model $M_{0k}$ is converted to a probability) that the test example is associated with the label $\lambda_k$. If that probability is sufficiently small (under some threshold), we can conclude that the artificial calibration label $\lambda_0$ is preferred over the label $\lambda_k$, i.e. that the label $\lambda_k$ belongs to the set of irrelevant labels. In that case, the pair-wise models $M_{ij}$ of the second layer with $i = k$ or $j = k$ need not be consulted for this test example, because the binary relevance model $M_{0k}$ from the first layer has already decided that the label $\lambda_k$ belongs to the set of irrelevant labels. For each test example for which it is known that the label $\lambda_k$ belongs to the set of irrelevant labels, the number of models that have to be consulted decreases by $Q-1$.
Fig. 1. TSV Architecture
In order to decide which labels belong to the set of irrelevant labels, i.e. which pair-wise models $M_{ij}$ from the second layer do not have to be consulted, a threshold $t$ ($0 \le t \le 1$) is introduced.
According to the above, in the TSVA every test instance first consults all binary relevance models $M_{0k}$ of the first layer of the architecture. If a model $M_{0k}$ ($1 \le k \le Q$) responds with a probability above the threshold $t$, the test instance is then forwarded only to the models $M_{ij}$ of the second layer that are associated with the model $M_{0k}$. A pair-wise model $M_{ij}$ of the second layer is connected to the binary relevance models $M_{0i}$ and $M_{0j}$. This does not mean that the model $M_{ij}$ has to be consulted twice if the prediction probabilities of the models $M_{0i}$ and $M_{0j}$ are both above the threshold $t$; instead, the model $M_{ij}$ is consulted only once and its prediction is decoded into a vote for one of the labels $\lambda_i$ or $\lambda_j$. If the prediction of one of the models $M_{0i}$ and $M_{0j}$ results in a probability under the threshold $t$, the corresponding model $M_{ij}$ is not consulted and the vote from this model goes to the label whose binary relevance model predicted a probability above the threshold $t$.
By increasing the value of the threshold, the number of consulted pair-wise models decreases. If $t = 1$, the test instance is not forwarded to the second layer of the architecture at all and the TSVA reduces to the binary relevance method. On the other hand, if $t = 0$, all pair-wise models of the second layer are consulted and the TSVA becomes the calibrated label ranking method with majority voting.
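The two-stage prediction rule can be summarized in a short sketch. The model interfaces (first_layer[k] returning a probability, second_layer[(i, j)] returning the winning label index) are hypothetical stand-ins for the LIBSVM models used in the paper.

```python
import numpy as np

def tsva_predict(x, first_layer, second_layer, Q, t):
    """Two-stage voting for a single test instance x."""
    # Stage 1: consult all Q calibration models M_0k.
    prob = np.array([first_layer[k](x) for k in range(Q)])
    active = prob >= t                # labels still preferred over lambda_0
    # Stage 2: consult a pair-wise model M_ij only when both of its labels
    # survived stage 1; otherwise its vote is awarded without consultation.
    votes = np.zeros(Q)
    for i in range(Q - 1):
        for j in range(i + 1, Q):
            if active[i] and active[j]:
                votes[second_layer[(i, j)](x)] += 1
            elif active[i]:
                votes[i] += 1         # M_ij skipped, vote goes to lambda_i
            elif active[j]:
                votes[j] += 1         # M_ij skipped, vote goes to lambda_j
    # With t = 0 every model is consulted (plain CLR majority voting);
    # with t = 1 stage 2 is skipped entirely (binary relevance).
    return votes, active              # vote counts and stage-1 relevance mask
```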
4 Experimental results
In this section, we present the results of our experiments on several multi-label classification problems. The performance was measured on problems of text, music, image and gene function recognition.
Here, the performance of the TSV architecture is compared with the calibrated label ranking method with majority voting strategy for pair-wise multi-label classification (CLR-S) and the QWeightedML algorithm [5].
The training and testing of the TSVA were performed using a custom developed application that uses the MULAN library [8] for the machine learning framework Weka [9]. The LIBSVM library [10], with SVMs using a radial basis function kernel, was used for solving the partial binary classification problems. Usually, the most important criterion when evaluating a classifier is its prediction performance, but very often the testing time of the classifier can be equally important. In our experiments, four different multi-label classification problems were addressed by each classification method. The recognition performance and the testing time were recorded for every method. The problems considered in the experiments include scene [11] (scene), gene function [12] (yeast), text [13] (enron) and music [14] (emotions) classification.
The complete description of the datasets (domain, number of training and
test instances, number of features, number of labels) is shown in Table 1.
Table 1. Datasets

                     scene   yeast    enron   emotions
Domain               image   biology  text    music
Training instances   1211    1500     1123    391
Test instances       1159    917      579     202
Features             294     103      1001    72
Labels               6       14       53      6
In all classification problems the classifiers were trained using all available training samples of the sets and were evaluated by recognizing all the test samples from the corresponding set. Table 2 gives the performance of each method applied to each of the datasets. The first column of the table names the dataset. The second column shows the value of the threshold $t$ for each dataset for which the presented TSVA results were obtained.
The value of the threshold $t$ for each dataset was determined by 5-fold cross-validation using only the samples of the training set, so as to achieve maximum benefit in testing speed without sacrificing prediction results.
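One possible way to carry out this selection is sketched below; the train_fn and score_fn helpers are hypothetical stand-ins for the actual TSVA training routine and for whatever figure of merit combines prediction quality and testing speed, neither of which is specified in detail in the paper.

```python
from sklearn.model_selection import KFold

def select_threshold(X, Y, train_fn, score_fn,
                     grid=(0.0, 0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3)):
    """Choose t by 5-fold cross-validation on the training set only.
    train_fn fits both TSVA layers on a fold; score_fn returns a single
    scalar to maximize for a given threshold t."""
    scores = {t: 0.0 for t in grid}
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True,
                                    random_state=0).split(X):
        model = train_fn(X[train_idx], Y[train_idx])
        for t in grid:
            scores[t] += score_fn(model, X[val_idx], Y[val_idx], t) / 5
    return max(grid, key=lambda t: scores[t])
```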
Table 2 clearly shows that among the three tested approaches the TSVA offers the best performance in terms of testing speed. The results show that for the four treated classification problems the TSVA is 2 to 4 times faster than the calibrated label ranking algorithm with majority voting and 10% to 15% faster than the QWeightedML method. It can also be noticed that the TSVA offers better performance than the QWeightedML method on all evaluation metrics, while showing comparable performance to the calibrated label ranking algorithm with majority voting.
Table 2. The evaluation of each method for every dataset

Dataset   t     Evaluation Metric   CLR-S    QWeightedML   TSVA
enron     0.03  Hamming Loss        0.0476   0.0481        0.0501
                One-error           0.2297   0.2262        0.2193
                Coverage            11.5198  20.3333       14.4317
                Ranking Loss        0.0756   0.1516        0.0969
                Avg. Precision      0.7018   0.6543        0.6970
                Testing time (s)    605.06   174.31        147.57
emotions  0.25  Hamming Loss        0.2566   0.2623        0.2590
                One-error           0.3812   0.3762        0.3663
                Coverage            2.4059   2.8465        2.3960
                Ranking Loss        0.2646   0.3381        0.2612
                Avg. Precision      0.7215   0.6795        0.7242
                Testing time (s)    2.56     1.67          1.34
yeast     0.15  Hamming Loss        0.1903   0.1909        0.1906
                One-error           0.2334   0.2301        0.2300
                Coverage            6.2758   8.6215        6.7633
                Ranking Loss        0.1632   0.2934        0.1805
                Avg. Precision      0.7685   0.7003        0.7641
                Testing time (s)    104.34   60.39         54.65
scene     0.02  Hamming Loss        0.0963   0.0956        0.0946
                One-error           0.2349   0.2349        0.2366
                Coverage            0.4883   0.7073        0.4974
                Ranking Loss        0.0779   0.1190        0.0799
                Avg. Precision      0.8600   0.8400        0.8598
                Testing time (s)    66.15    40.32         35.73
The dependence of the predictive performance of the TSVA on the threshold $t$ ($0 \le t \le 1$) is shown in Fig. 2 for each dataset, and Fig. 3 shows the testing time for the four classification problems as a function of the selected threshold $t$. It can be noticed that for small values of the threshold $t$ (0.0 - 0.2) the predictive performance of the TSVA changes only moderately, while the testing time decreases by more than 40%. The reduction of the testing time of the TSVA over CLR-S becomes even more notable as the number of labels in the treated classification problem increases. The experiments showed that for the enron dataset, with its rather large number of labels (53), the testing time of the TSVA is four times shorter compared to the calibrated label ranking algorithm.
Fig. 2. Predictive performance of the TSVA as a function of the threshold $t$ ($0 \le t \le 1$) for each dataset
5 Conclusion
A two stage voting architecture (TSVA) that extends efficient pair-wise multiclass voting to the multi-label setting was presented. The performance of this architecture was compared with the calibrated label ranking method with majority voting strategy for pair-wise multi-label classification and with the QWeightedML algorithm on four different real-world datasets (enron, yeast, scene and emotions). The results show that the TSVA significantly outperforms the calibrated label ranking method with majority voting and the QWeightedML algorithm in terms of testing speed while keeping comparable or offering better prediction performance. The TSVA was 2 to 4 times faster than the calibrated label ranking algorithm with majority voting and 10% to 15% faster than the QWeightedML method. The TSVA is expected to show an even bigger advantage when addressing classification problems with a large number of labels.
Fig. 3. Testing time of the TSVA as a function of the threshold $t$ ($0 \le t \le 1$) for each dataset, measured in seconds
References
1. Fürnkranz, J.: Round robin classification. Journal of Machine Learning Research 2(5), 721-747 (2002)
2. Wu, T.-F., Lin, C.-J., Weng, R.C.: Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5(8), 975-1005 (2004)
3. Brinker, K., Fürnkranz, J., Hüllermeier, E.: A unified model for multilabel classification and ranking. In: 17th European Conference on Artificial Intelligence, pp. 489-493. Riva del Garda, Italy (2006)
4. Park, S.-H., Fürnkranz, J.: Efficient pairwise classification. In: 18th European Conference on Machine Learning, pp. 658-665. Warsaw, Poland (2007)
5. Loza Mencía, E., Park, S.-H., Fürnkranz, J.: Efficient voting prediction for pairwise multi-label classification. Neurocomputing 73, 1164-1176 (2010)
6. Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., Brinker, K.: Multi-label classification via calibrated label ranking. Machine Learning 73(2), 133-153 (2008)
7. Schapire, R.E., Singer, Y.: BoosTexter: a boosting-based system for text categorization. Machine Learning 39(2), 135-168 (2000)
8. MULAN library, http://mulan.sourceforge.net/
9. Weka, http://www.cs.waikato.ac.nz/ml/weka/
10. LIBSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/
11. Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognition 37(9), 1757-1771 (2004)
12. Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: Advances in Neural Information Processing Systems 14 (2001)
13. Enron email dataset, http://bailando.sims.berkeley.edu/enron_email.html
14. Trohidis, K., Tsoumakas, G., Vlahavas, I.: Multilabel classification of music into emotions. In: International Conference on Music Information Retrieval, pp. 320-330. Philadelphia, PA, USA (2008)