Efficient Two Stage Voting Architecture for Pairwise Multi-label Classification

Gjorgji Madjarov, Dejan Gjorgjevikj and Tomche Delev

Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University, Rugjer Boshkovikj bb, 1000 Skopje, R. of Macedonia

{madzarovg,dejan,tdelev}@feit.ukim.edu.mk

Abstract. A common approach for solving multi-label classification problems using problem-transformation methods and dichotomizing classifiers is the pair-wise decomposition strategy. One of the problems with this approach is the need to query a quadratic number of binary classifiers to make a prediction, which can be quite time consuming, especially in classification problems with a large number of labels. To tackle this problem we propose a two stage voting architecture (TSVA) that efficiently adapts pair-wise multiclass voting to the multi-label setting and is closely related to the calibrated label ranking method. Four different real-world datasets (enron, yeast, scene and emotions) were used to evaluate the performance of the TSVA. The performance of this architecture was compared with the calibrated label ranking method with majority voting strategy and the quick weighted voting algorithm (QWeighted) for pair-wise multi-label classification. The results from the experiments suggest that the TSVA significantly outperforms the competing algorithms in terms of testing speed while keeping comparable or offering better prediction performance.

Keywords: Multi-label, classification, calibration, ranking

1 Introduction

Traditional single-label classification is concerned with learning from a set of examples that are associated with a single label $\lambda_i$ from a finite set of disjoint labels $L = \{\lambda_1, \lambda_2, \ldots, \lambda_Q\}$, $Q > 1$. If $Q = 2$, the learning problem is called a binary classification problem, while if $Q > 2$, it is called a multi-class classification problem. Multi-label classification, on the other hand, is concerned with learning from a set of examples $S = \{(x_1, Y_1), (x_2, Y_2), \ldots, (x_p, Y_p)\}$ ($x_i \in X$, where $X$ denotes the domain of examples) in which each example is associated with a set of labels $Y_i \subseteq L$.

Many classifiers were originally developed for solving binary decision problems, and their extensions to multi-class and multi-label problems are not straightforward. For this reason, a common approach to the multi-label classification problem is to use class binarization methods, i.e. to decompose the problem into several binary subproblems that can then be solved using a binary base learner. The simplest strategy in the multi-label setting is the one-against-all strategy, also referred to as the binary relevance method. It addresses the multi-label classification problem by learning one classifier (model) $M_k$ ($1 \le k \le Q$) for each class, using all the examples labeled with that class as positive examples and all remaining examples as negative examples. At query time, each binary classifier predicts whether its class is relevant for the query example or not, resulting in a set of relevant labels.

Another approach for solving the multi-label classification problem using binary classifiers is pair-wise classification, or round robin classification [1][2]. Its basic idea is to use $Q(Q-1)/2$ classifiers covering all pairs of labels. Each classifier is trained using the samples of the first label as positive examples and the samples of the second label as negative examples. To combine these classifiers, the pair-wise classification method naturally adopts the majority voting algorithm. Given a test instance, each classifier delivers a prediction for one of the two labels. This prediction is decoded into a vote for one of the labels. After the evaluation of all $Q(Q-1)/2$ classifiers, the labels are ordered according to their sum of votes. To predict only the relevant classes for each instance, a label ranking algorithm is used. Label ranking studies the problem of learning a mapping from a set of instances to rankings over a finite number of predefined labels. It can be considered a natural generalization of conventional classification, where only a single label (the top label) is requested instead of a ranking of all labels.
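The voting step described above can be sketched in a few lines. This is only an illustration under stated assumptions: `predict_pair` is a hypothetical callable standing in for a trained pair-wise model, and ties in the vote count are broken by the original label order, a detail the text does not specify.

```python
from itertools import combinations
from typing import Callable, Dict, Hashable, List


def pairwise_vote_ranking(
    x,
    labels: List[Hashable],
    predict_pair: Callable[[object, Hashable, Hashable], Hashable],
) -> List[Hashable]:
    """Rank labels by the votes collected from all Q(Q-1)/2 pair-wise models."""
    votes: Dict[Hashable, int] = {label: 0 for label in labels}
    for a, b in combinations(labels, 2):
        # Hypothetical model call: returns either a or b for instance x.
        winner = predict_pair(x, a, b)
        votes[winner] += 1
    # Descending vote count; Python's stable sort keeps the input order on ties.
    return sorted(labels, key=lambda label: -votes[label])


if __name__ == "__main__":
    # Toy stand-in for trained models: the label with the larger score wins.
    scores = {"a": 0.9, "b": 0.1, "c": 0.5}
    toy_model = lambda x, p, q: p if scores[p] >= scores[q] else q
    print(pairwise_vote_ranking(None, ["a", "b", "c"], toy_model))
```

Every pair is evaluated exactly once, which is the quadratic prediction cost that motivates the rest of the paper.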

Brinker et al. [3] propose a conceptually new technique for extending the common pair-wise learning approach to the multi-label scenario, named calibrated label ranking. The key idea of calibrated label ranking is to introduce an artificial (calibration) label $\lambda_0$, which represents the split-point between relevant and irrelevant labels. The calibration label $\lambda_0$ is assumed to be preferred over all irrelevant labels, while all relevant labels are preferred over it. At prediction time (when the majority voting strategy is usually used), one obtains a ranking over $Q + 1$ labels (the $Q$ original labels plus the calibration label). Calibrated label ranking is thus considered a combination of multi-label classification and ranking.
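As a small illustration of how the calibration label splits a ranking, the sketch below takes vote counts over the $Q + 1$ labels and returns the relevant and irrelevant sets. The key name `"lambda_0"` and the rule that a label tied with the calibration label counts as irrelevant are assumptions made for this example, not details given in the text.

```python
from typing import Dict, Hashable, List, Tuple


def split_at_calibration(
    votes: Dict[Hashable, int],
    calibration: Hashable = "lambda_0",  # hypothetical key for the artificial label
) -> Tuple[List[Hashable], List[Hashable]]:
    """Split labels into (relevant, irrelevant) around the calibration label."""
    ranking = sorted(votes, key=lambda label: -votes[label])
    real_labels = [label for label in ranking if label != calibration]
    # Assumption: only labels with strictly more votes than lambda_0 are relevant.
    relevant = [l for l in real_labels if votes[l] > votes[calibration]]
    irrelevant = [l for l in real_labels if votes[l] <= votes[calibration]]
    return relevant, irrelevant


if __name__ == "__main__":
    example_votes = {"lambda_0": 1, "a": 3, "b": 0, "c": 2}
    print(split_at_calibration(example_votes))
```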

Besides majority voting, which is the strategy usually used in the prediction phase of the calibrated label ranking algorithm, Park et al. [4] propose another, more effective voting algorithm named the Quick Weighted Voting algorithm (QWeighted). QWeighted computes the class with the highest accumulated voting mass while avoiding the evaluation of all possible pair-wise classifiers. It exploits the fact that during a voting procedure some classes can be excluded from the set of possible top rank classes early in the process, when it becomes clear that even if they reach the maximal voting mass in the remaining evaluations they can no longer exceed the current maximum. Pair-wise classifiers are selected depending on a voting loss value, which is the number of votes that a class has not received. The voting loss starts with a value of zero and increases monotonically with the number of performed preference evaluations. The class with the current minimal loss is the top candidate for the top rank class. If all preferences involving this class have been evaluated (and it still has the lowest loss), it can be concluded that no other class can achieve a better ranking. Thus, the QWeighted algorithm always focuses on classes with low voting loss. An adaptation of QWeighted to multi-label classification (QWeightedML) [5] repeats the process until all relevant labels have been determined, i.e. until the returned class is the artificial label $\lambda_0$, at which point all remaining classes are considered irrelevant.
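The loss-driven selection described above can be sketched as follows. This is a simplified illustration, not the reference implementation from [4]: it uses binary votes (a lost comparison adds 1 to the loser's voting loss) rather than weighted votes, `predict_pair` is a hypothetical callable standing in for a trained pair-wise model, and ties in the loss are broken by label order.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple


def qweighted_top_label(
    x,
    labels: List[Hashable],
    predict_pair: Callable[[object, Hashable, Hashable], Hashable],
) -> Tuple[Hashable, int]:
    """Return the top-ranked label and the number of pair-wise evaluations used."""
    loss: Dict[Hashable, int] = {l: 0 for l in labels}
    compared: Dict[Hashable, Set[Hashable]] = {l: set() for l in labels}
    evaluations = 0
    while True:
        # The class with the current minimal voting loss is the top candidate.
        a = min(labels, key=lambda l: loss[l])
        opponents = [l for l in labels if l != a and l not in compared[a]]
        if not opponents:
            # All preferences involving a are known and it still has the lowest loss.
            return a, evaluations
        # Compare against the lowest-loss class not yet paired with a.
        b = min(opponents, key=lambda l: loss[l])
        winner = predict_pair(x, a, b)
        loser = b if winner == a else a
        loss[loser] += 1
        compared[a].add(b)
        compared[b].add(a)
        evaluations += 1


if __name__ == "__main__":
    scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}
    toy_model = lambda x, p, q: p if scores[p] >= scores[q] else q
    print(qweighted_top_label(None, ["a", "b", "c", "d"], toy_model))
```

In the toy run the winner is found after fewer than the six evaluations that full majority voting over four labels would need. The multi-label variant would call such a routine repeatedly, removing each returned label, until the calibration label is returned.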

In this paper we propose an efficient Two Stage Voting Architecture (TSVA) that modifies the majority voting algorithm for the calibrated label ranking technique [6]. We have evaluated the performance of this architecture on a selection of multi-label datasets that vary in terms of problem domain and number of labels. The results demonstrate that our modification outperforms the majority voting algorithm for pair-wise multi-label classification and the QWeightedML algorithm [5] in terms of testing speed, while keeping comparable prediction results.

For the readers' convenience, Section 2 briefly introduces the notation and evaluation metrics used in multi-label learning. The Two Stage Voting Architecture is explained in Section 3. The experimental results that compare the performance of the proposed TSVA with competing methods are presented in Section 4. Section 5 gives a conclusion.

2 Preliminaries

Let $X$ denote the domain of instances and let $L = \{\lambda_1, \lambda_2, \ldots, \lambda_Q\}$ be the finite set of labels. Given a training set $S = \{(x_1, Y_1), (x_2, Y_2), \ldots, (x_p, Y_p)\}$ ($x_i \in X$, $Y_i \subseteq L$), the goal of the learning system is to output a multi-label classifier $h: X \to 2^L$ which optimizes some specific evaluation metric. In most cases, however, instead of outputting a multi-label classifier, the learning system will produce a real-valued function of the form $f: X \times L \to \mathbb{R}$. It is supposed that, given an instance $x_i$ and its associated label set $Y_i$, a successful learning system will tend to output larger values for labels in $Y_i$ than for those not in $Y_i$, i.e. $f(x_i, y_1) > f(x_i, y_2)$ for any $y_1 \in Y_i$ and $y_2 \notin Y_i$. The real-valued function $f(\cdot,\cdot)$ can be transformed to a ranking function $rank_f(\cdot,\cdot)$, which maps the outputs of $f(x_i, y)$ for any $y \in L$ to $\{1, 2, \ldots, Q\}$ such that if $f(x_i, y_1) > f(x_i, y_2)$ then $rank_f(x_i, y_1) < rank_f(x_i, y_2)$. Note that the corresponding multi-label classifier $h(\cdot)$ can also be derived from the function $f(\cdot,\cdot)$: $h(x_i) = \{y \mid f(x_i, y) > t(x_i),\ y \in L\}$, where $t(\cdot)$ is a threshold function which is usually set to be the zero constant function.

Performance evaluation of a multi-label learning system is different from that of a classical single-label learning system. The following multi-label evaluation metrics proposed in [7] are used in this paper:

(1) Hamming loss: evaluates how many times an instance-label pair is misclassified, i.e. a label not belonging to the instance is predicted or a label belonging to the instance is not predicted. The performance is perfect when $hloss_S(h) = 0$. The smaller the value of $hloss_S(h)$, the better the performance. This metric is given by

$$ hloss_S(h) = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{Q} \left| h(x_i) \,\Delta\, Y_i \right| \qquad (1) $$

where $\Delta$ stands for the symmetric difference between two sets and $Q$ is the total number of possible class labels. Note that when $|Y_i| = 1$ for all instances, a multi-label system reduces to a multi-class single-label one and the Hamming loss becomes $2/Q$ times the usual classification error.
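A minimal sketch of the Hamming loss computation, assuming predictions and ground truth are given as Python sets of labels (the function and argument names are chosen for this example only):

```python
from typing import Hashable, Sequence, Set


def hamming_loss(
    predicted: Sequence[Set[Hashable]],
    actual: Sequence[Set[Hashable]],
    num_labels: int,
) -> float:
    """Average size of the symmetric difference h(x_i) ^ Y_i, normalised by Q."""
    # The ^ operator on Python sets is the symmetric difference.
    total = sum(len(h ^ y) for h, y in zip(predicted, actual))
    return total / (len(actual) * num_labels)


if __name__ == "__main__":
    predicted = [{"a", "b"}, {"c"}]
    actual = [{"a"}, {"c", "d"}]
    print(hamming_loss(predicted, actual, num_labels=4))
```

In the single-label case each misclassified instance contributes a symmetric difference of size 2 (one wrong label predicted, one true label missed), which is where the $2/Q$ factor comes from.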

While Hamming loss is based on the multi-label classifier $h(\cdot)$, the other four metrics are defined based on the real-valued function $f(\cdot,\cdot)$, which takes into account the ranking quality of different labels for each instance:

(2) One-error: evaluates how many times the top-ranked label is not in the set of proper labels of the instance. The performance is perfect when $\text{one-error}_S(f) = 0$. The smaller the value of $\text{one-error}_S(f)$, the better the performance. This evaluation metric is given by:

$$ \text{one-error}_S(f) = \frac{1}{p} \sum_{i=1}^{p} \Big[\!\Big[ \arg\max_{y \in L} f(x_i, y) \notin Y_i \Big]\!\Big] \qquad (2) $$

where for any predicate $\pi$, $[[\pi]]$ equals 1 if $\pi$ holds and 0 otherwise. Note that, for single-label classification problems, the one-error is identical to the ordinary classification error.

(3) Coverage: evaluates how far, on average, we need to go down the list of ranked labels in order to cover all the proper labels of the instance. The smaller the value of $coverage_S(f)$, the better the performance.

$$ coverage_S(f) = \frac{1}{p} \sum_{i=1}^{p} \max_{y \in Y_i} rank_f(x_i, y) - 1 \qquad (3) $$
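One-error and coverage can be sketched as below. The representation is an assumption made for this example: each instance's scores $f(x_i, \cdot)$ are a dictionary from label to real value, rank 1 goes to the highest score, and ties are broken by dictionary order.

```python
from typing import Dict, Hashable, Sequence, Set

Scores = Dict[Hashable, float]


def ranks(scores: Scores) -> Dict[Hashable, int]:
    """Map each label to its rank (1 = highest score)."""
    ordered = sorted(scores, key=lambda y: -scores[y])
    return {y: r for r, y in enumerate(ordered, start=1)}


def one_error(all_scores: Sequence[Scores], all_true: Sequence[Set[Hashable]]) -> float:
    """Fraction of instances whose top-ranked label is not a proper label."""
    misses = sum(
        1 for s, Y in zip(all_scores, all_true) if max(s, key=s.get) not in Y
    )
    return misses / len(all_true)


def coverage(all_scores: Sequence[Scores], all_true: Sequence[Set[Hashable]]) -> float:
    """Average depth in the ranking needed to cover all proper labels, minus 1."""
    total = sum(
        max(ranks(s)[y] for y in Y) - 1 for s, Y in zip(all_scores, all_true)
    )
    return total / len(all_true)


if __name__ == "__main__":
    all_scores = [
        {"a": 0.9, "b": 0.2, "c": 0.6},
        {"a": 0.1, "b": 0.8, "c": 0.3},
    ]
    all_true = [{"a", "c"}, {"c"}]
    print(one_error(all_scores, all_true), coverage(all_scores, all_true))
```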

(4) Ranking loss: evaluates the average fraction of label pairs that are reversely ordered for the particular instance. It is given by:

$$ rloss_S(f) = \frac{1}{p} \sum_{i=1}^{p} \frac{|D_i|}{|Y_i| \, |\bar{Y}_i|} \qquad (4) $$

where $D_i = \{(y_1, y_2) \mid f(x_i, y_1) \le f(x_i, y_2),\ (y_1, y_2) \in Y_i \times \bar{Y}_i\}$, while $\bar{Y}$ denotes the complementary set of $Y$ in $L$. The smaller the value of $rloss_S(f)$, the better the performance, so the performance is perfect when $rloss_S(f) = 0$.

(5) Average precision: evaluates the average fraction of labels ranked above a particular label $y \in Y$ that actually are in $Y$. The performance is perfect when $avgprec_S(f) = 1$; the bigger the value of $avgprec_S(f)$, the better the performance. This metric is given by:

$$ avgprec_S(f) = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{|L_i|}{rank_f(x_i, y)} \qquad (5) $$

where $L_i = \{y' \mid rank_f(x_i, y') \le rank_f(x_i, y),\ y' \in Y_i\}$.

Note that in the rest of this paper, the performance of the multi-label learning algorithms is evaluated based on the five metrics explained above.
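Ranking loss and average precision can be sketched in the same style. The sketch assumes that per-instance scores are dictionaries from label to real value, that every instance has at least one proper label and at least one improper label (otherwise the ranking loss denominator is zero), and that rank ties are broken by dictionary order.

```python
from typing import Dict, Hashable, Sequence, Set

Scores = Dict[Hashable, float]


def ranks(scores: Scores) -> Dict[Hashable, int]:
    """Map each label to its rank (1 = highest score)."""
    ordered = sorted(scores, key=lambda y: -scores[y])
    return {y: r for r, y in enumerate(ordered, start=1)}


def ranking_loss(all_scores: Sequence[Scores], all_true: Sequence[Set[Hashable]]) -> float:
    """Average fraction of (proper, improper) label pairs that are reversely ordered."""
    total = 0.0
    for s, Y in zip(all_scores, all_true):
        Y_bar = set(s) - set(Y)  # complement of Y in L
        reversed_pairs = sum(1 for y1 in Y for y2 in Y_bar if s[y1] <= s[y2])
        total += reversed_pairs / (len(Y) * len(Y_bar))
    return total / len(all_true)


def average_precision(all_scores: Sequence[Scores], all_true: Sequence[Set[Hashable]]) -> float:
    """Average over proper labels y of |L_i| / rank(y), as in Eq. (5)."""
    total = 0.0
    for s, Y in zip(all_scores, all_true):
        r = ranks(s)
        per_label = [
            sum(1 for y2 in Y if r[y2] <= r[y]) / r[y]  # |L_i| / rank_f(x_i, y)
            for y in Y
        ]
        total += sum(per_label) / len(Y)
    return total / len(all_true)


if __name__ == "__main__":
    all_scores = [
        {"a": 0.9, "b": 0.2, "c": 0.6},
        {"a": 0.1, "b": 0.8, "c": 0.3},
    ]
    all_true = [{"a", "c"}, {"c"}]
    print(ranking_loss(all_scores, all_true), average_precision(all_scores, all_true))
```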

3 Two Stage Voting Architecture (TSVA)

The conventional pair-wise approach learns a model $M_{ij}$ for every combination of labels $\lambda_i$ and $\lambda_j$ with $1 \le i < j \le Q$. In this way $Q(Q-1)/2$ different pair-wise models are learned. Each pair-wise model $M_{ij}$ is learned with the examples labelled with $\lambda_i$ as positive examples and the examples labelled with $\lambda_j$ as negative examples. The main disadvantage of this approach is that in the prediction phase a quadratic number of base classifiers (models) have to be consulted for each test example.

Further, as a result of introducing the artificial calibration label $\lambda_0$ in the calibrated label ranking algorithm, the number of base classifiers is increased by $Q$, i.e. an additional set of $Q$ binary preference models $M_{0k}$ ($1 \le k \le Q$) is learned. The models $M_{0k}$ that are learned by the pair-wise approach to calibrated ranking and the models $M_k$ that are learned by conventional binary relevance are equivalent. At prediction time (when the standard majority voting algorithm is usually used) each test instance needs to consult all the models (classifiers) in order to rank the labels by their order of preference. This results in slower testing, especially when the number of labels in the problem is large.
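To make the growth concrete, the total number of models consulted per test instance under standard majority voting is the sum of the pair-wise models and the calibration models:

$$ \underbrace{\frac{Q(Q-1)}{2}}_{\text{models } M_{ij}} + \underbrace{Q}_{\text{models } M_{0k}} = \frac{Q(Q+1)}{2} $$

As a worked example using the label counts reported for the datasets in Table 1: $Q = 6$ gives $15 + 6 = 21$ models, $Q = 14$ gives $91 + 14 = 105$ models, and $Q = 53$ gives $1378 + 53 = 1431$ models.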

In this paper we propose an efficient two stage voting architecture which modifies the majority voting algorithm for the calibrated label ranking technique. It reduces the number of base classifiers that need to be consulted in order to make a final prediction for a given test instance. The number of base classifiers trained in the learning process is the same for the calibrated label ranking algorithm and the TSVA.

The proposed (TSV) architecture is organized in two layers. The first layer contains $Q$ classifiers, while the second layer contains the remaining $Q(Q-1)/2$ classifiers. All of the classifiers in the first layer are the binary relevance models $M_{0k}$, while the second layer holds the pair-wise models $M_{ij}$. Each model $M_{0k}$ from the first layer is connected with $Q - 1$ models $M_{ij}$ from the second layer, where $k = i$ or $k = j$ ($1 \le i \le Q - 1$, $i + 1 \le j \le Q$). An example of TSVA for solving four-class multi-label classification problems is shown in Fig. 1.

At prediction time, each model $M_{0k}$ of the first layer of the architecture tries to determine whether its label is relevant for the corresponding test example. Each model $M_{0k}$ gives the probability (the output value of model $M_{0k}$ is converted to a probability) that the test example is associated with the label $\lambda_k$. If that probability is sufficiently small (under some threshold), we can conclude that the artificial calibration label $\lambda_0$ is preferred over the label $\lambda_k$, i.e. the label $\lambda_k$ belongs to the set of irrelevant labels. In that case, the pair-wise models of the second layer $M_{ij}$ with $i = k$ or $j = k$ need not be consulted for the test example, because the binary relevance model $M_{0k}$ from the first layer has already decided that the label $\lambda_k$ belongs to the set of irrelevant labels. For each test example for which it is known that the label $\lambda_k$ belongs to the set of irrelevant labels, the number of models that should be consulted decreases by $Q - 1$.

Fig. 1. TSV Architecture

In order to decide which labels belong to the set of irrelevant labels, i.e. which pair-wise models $M_{ij}$ from the second layer do not have to be consulted, a threshold $t$ ($0 \le t \le 1$) is introduced.

Accordingly, in TSVA every test instance first consults all binary relevance models $M_{0k}$ of the first layer of the architecture. If a model $M_{0k}$ ($1 \le k \le Q$) responds with a probability that is above the threshold $t$, the test instance is then forwarded only to the models $M_{ij}$ of the second layer that are associated with the model $M_{0k}$. The pair-wise model $M_{ij}$ from the second layer is connected to the binary relevance models $M_{0i}$ and $M_{0j}$. This does not mean that the model $M_{ij}$ has to be consulted twice if the prediction probabilities of the models $M_{0i}$ and $M_{0j}$ are both above the threshold $t$. Instead, the model $M_{ij}$ is consulted only once and its prediction is decoded into a vote for one of the labels $\lambda_i$ or $\lambda_j$. If the prediction of one of the models $M_{0i}$ and $M_{0j}$ results in a probability under the threshold $t$, the corresponding model $M_{ij}$ is not consulted and the vote from this model goes to the label whose binary relevance model predicted a probability above the threshold $t$.

Increasing the value of the threshold decreases the number of consulted pair-wise models. If $t = 1$, the test instance is not forwarded to the second layer of the architecture and the TSVA becomes the binary relevance method. On the other hand, if $t = 0$, all pair-wise models of the second layer are consulted and the TSVA becomes the calibrated label ranking method with majority voting.
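The routing of a test instance through the two layers can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `relevance_prob` and `predict_pair` are hypothetical callables standing in for the trained models $M_{0k}$ and $M_{ij}$, "above the threshold" is read as a strict inequality, probabilities are assumed to lie strictly between 0 and 1, and the sketch covers only the second-layer votes (how first-layer outputs are counted against $\lambda_0$ is left out).

```python
from itertools import combinations
from typing import Callable, Dict, Hashable, List, Tuple


def tsva_pairwise_votes(
    x,
    labels: List[Hashable],
    relevance_prob: Callable[[object, Hashable], float],
    predict_pair: Callable[[object, Hashable, Hashable], Hashable],
    t: float,
) -> Tuple[Dict[Hashable, int], int]:
    """Return second-layer votes per label and the number of consulted M_ij."""
    # First layer: every binary relevance model M_0k is consulted.
    prob = {k: relevance_prob(x, k) for k in labels}
    active = {k for k in labels if prob[k] > t}

    votes = {k: 0 for k in labels}
    consulted = 0
    for i, j in combinations(labels, 2):
        if i in active and j in active:
            # Both labels passed the first layer: M_ij is consulted exactly once.
            votes[predict_pair(x, i, j)] += 1
            consulted += 1
        elif i in active:
            votes[i] += 1  # M_ij skipped; vote goes to the label above t
        elif j in active:
            votes[j] += 1  # M_ij skipped; vote goes to the label above t
        # Both labels below t: M_ij is skipped and no vote is cast.
    return votes, consulted


if __name__ == "__main__":
    probs = {"a": 0.9, "b": 0.05, "c": 0.6, "d": 0.1}
    first_layer = lambda x, k: probs[k]
    second_layer = lambda x, i, j: i if probs[i] >= probs[j] else j
    for threshold in (0.0, 0.2, 1.0):
        print(threshold, tsva_pairwise_votes(None, list(probs), first_layer, second_layer, threshold))
```

With four labels the toy run consults all six pair-wise models at $t = 0$, one at $t = 0.2$, and none at $t = 1$, matching the two limiting cases described above.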

4 Experimental results

In this section, we present the results of our experiments with several multi-label classification problems. The performance was measured on problems of text, music, image and gene function recognition.

Here, the performance of the TSV architecture is compared with the calibrated label ranking method with majority voting strategy for pair-wise multi-label classification (CLR-S) and the QWeightedML algorithm [5].

The training and testing of the TSVA was performed using a custom-developed application that uses the MULAN library [8] for the machine learning framework Weka [9]. The LIBSVM library [10], using SVMs with a radial basis kernel, was used for solving the partial binary classification problems. Usually, the most important criterion when evaluating a classifier is its prediction performance, but very often the testing time of the classifier can be equally important. In our experiments, four different multi-label classification problems were addressed by each classification method. The recognition performance and the testing time were recorded for every method. The problems considered in the experiments include scene [11] (scene), gene function [12] (yeast), text [13] (enron) and music [14] (emotions) classification.

The complete description of the datasets (domain, number of training and

test instances, number of features, number of labels) is shown in Table 1.

Table 1. Datasets

                     scene   yeast     enron   emotions
Domain               image   biology   text    music
Training Instances   1211    1500      1123    391
Test Instances       1159    917       579     202
Features             294     103       1001    72
Labels               6       14        53      6

In all classification problems the classifiers were trained using all available training samples of the sets and were evaluated by recognizing all the test samples from the corresponding set. Table 2 gives the performance of each method applied on each of the datasets. The first column of the table lists the datasets. The second column shows, for each dataset separately, the value of the threshold $t$ for which the presented results of TSVA were obtained.

The value of the threshold $t$ for each dataset was determined by 5-fold cross validation using only the samples of the training set, in order to achieve maximum benefit in terms of prediction results and testing speed.

Table 2. The evaluation of each method for every dataset

Dataset    t      Evaluation Metric   CLR-S     QWeightedML   TSVA
enron      0.03   Hamming Loss        0.0476    0.0481        0.0501
                  One-error           0.2297    0.2262        0.2193
                  Coverage            11.5198   20.3333       14.4317
                  Ranking Loss        0.0756    0.1516        0.0969
                  Avg. Precision      0.7018    0.6543        0.6970
                  Testing time (s)    605.06    174.31        147.57
emotions   0.25   Hamming Loss        0.2566    0.2623        0.2590
                  One-error           0.3812    0.3762        0.3663
                  Coverage            2.4059    2.8465        2.3960
                  Ranking Loss        0.2646    0.3381        0.2612
                  Avg. Precision      0.7215    0.6795        0.7242
                  Testing time (s)    2.56      1.67          1.34
yeast      0.15   Hamming Loss        0.1903    0.1909        0.1906
                  One-error           0.2334    0.2301        0.2300
                  Coverage            6.2758    8.6215        6.7633
                  Ranking Loss        0.1632    0.2934        0.1805
                  Avg. Precision      0.7685    0.7003        0.7641
                  Testing time (s)    104.34    60.39         54.65
scene      0.02   Hamming Loss        0.0963    0.0956        0.0946
                  One-error           0.2349    0.2349        0.2366
                  Coverage            0.4883    0.7073        0.4974
                  Ranking Loss        0.0779    0.1190        0.0799
                  Avg. Precision      0.8600    0.8400        0.8598
                  Testing time (s)    66.15     40.32         35.73

Table 2 shows that, among the three tested approaches, TSVA offers the best performance in terms of testing speed. For the four classification problems considered, TSVA is roughly 2 to 4 times faster than the calibrated label ranking algorithm with majority voting and about 10% to 20% faster than the QWeightedML method. TSVA also outperforms QWeightedML on nearly all evaluation metrics (the exceptions in Table 2 are Hamming loss on enron and one-error on scene), while showing performance comparable to the calibrated label ranking algorithm with majority voting. The dependence of the predictive performance on the value of the threshold $t$ ($0 \le t \le 1$) is shown in Fig. 2. Fig. 3 shows the testing time for the four classification problems as a function of the selected threshold $t$. For small values of the threshold ($t$ between 0.0 and 0.2) the predictive performance of TSVA changes only moderately, but the testing time decreases by more than 40%. The reduction of the testing time of the TSVA over CLR-S becomes even more notable as the number of labels in the classification problem increases. For the enron dataset, which has a fairly large number of labels (53), the testing time of TSVA is four times shorter than that of the calibrated label ranking algorithm.
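The speed comparison can be recomputed directly from the testing times reported in Table 2. The sketch below does only that arithmetic; the dictionary layout and function names are chosen for this example.

```python
from typing import Dict

# Testing times in seconds, copied from Table 2.
testing_time: Dict[str, Dict[str, float]] = {
    "enron":    {"CLR-S": 605.06, "QWeightedML": 174.31, "TSVA": 147.57},
    "emotions": {"CLR-S": 2.56,   "QWeightedML": 1.67,   "TSVA": 1.34},
    "yeast":    {"CLR-S": 104.34, "QWeightedML": 60.39,  "TSVA": 54.65},
    "scene":    {"CLR-S": 66.15,  "QWeightedML": 40.32,  "TSVA": 35.73},
}


def speedup_over_clr(times: Dict[str, float]) -> float:
    """How many times faster TSVA is than CLR-S."""
    return times["CLR-S"] / times["TSVA"]


def reduction_vs_qweighted(times: Dict[str, float]) -> float:
    """Percentage reduction of testing time relative to QWeightedML."""
    return 100.0 * (1.0 - times["TSVA"] / times["QWeightedML"])


if __name__ == "__main__":
    for name, times in testing_time.items():
        print(
            f"{name:9s} speedup over CLR-S: {speedup_over_clr(times):.2f}x, "
            f"reduction vs QWeightedML: {reduction_vs_qweighted(times):.1f}%"
        )
```

On the Table 2 values this gives speedups between about 1.9x and 4.1x over CLR-S and testing-time reductions of about 10% to 20% relative to QWeightedML.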

Fig. 2. Predictive performance of TSVA as a function of the threshold $t$ ($0 \le t \le 1$) for each dataset

5 Conclusion

A two stage voting architecture (TSVA) that efficiently adapts pair-wise multiclass voting to the multi-label setting was presented. The performance of this architecture was compared with the calibrated label ranking method with majority voting strategy for pair-wise multi-label classification and the QWeightedML algorithm on four different real-world datasets (enron, yeast, scene and emotions). The results show that the TSVA significantly outperforms the calibrated label ranking method with majority voting and the QWeightedML algorithm in terms of testing speed while keeping comparable or offering better prediction performance. TSVA was roughly 2 to 4 times faster than the calibrated label ranking algorithm with majority voting and about 10% to 20% faster than the QWeightedML method. TSVA is expected to show an even bigger advantage when addressing classification problems with a large number of labels.

Fig. 3. Testing time of TSVA, measured in seconds, as a function of the threshold $t$ ($0 \le t \le 1$) for each dataset

References

1. Fürnkranz, J.: Round robin classification. Journal of Machine Learning Research 2(5), 721-747 (2002)

2. Wu, T.F., Lin, C.J., Weng, R.C.: Probability estimates for multiclass classification by pairwise coupling. Journal of Machine Learning Research 5(8), 975-1005 (2004)

3. Brinker, K., Fürnkranz, J., Hüllermeier, E.: A unified model for multilabel classification and ranking. In: 17th European Conference on Artificial Intelligence, pp. 489-493. Riva Del Garda, Italy (2006)

4. Park, S.H., Fürnkranz, J.: Efficient pairwise classification. In: 18th European Conference on Machine Learning, pp. 658-665. Warsaw, Poland (2007)

5. Loza Mencía, E., Park, S.H., Fürnkranz, J.: Efficient voting prediction for pairwise multi-label classification. Neurocomputing 73, 1164-1176 (2010)

6. Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., Brinker, K.: Multi-label classification via calibrated label ranking. Machine Learning 73(2), 133-153 (2008)

7. Schapire, R.E., Singer, Y.: Boostexter: a boosting-based system for text categorization. Machine Learning 39(2), 135-168 (2000)

8. http://mulan.sourceforge.net/

9. http://www.cs.waikato.ac.nz/ml/weka/

10. http://www.csie.ntu.edu.tw/~cjlin/libsvm/

11. Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognition 37(9), 1757-1771 (2004)

12. Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. Advances in Neural Information Processing Systems 14 (2001)

13. http://bailando.sims.berkeley.edu/enron_email.html

14. Trohidis, K., Tsoumakas, G., Vlahavas, I.: Multilabel classification of music into emotions. In: International Conference on Music Information Retrieval, pp. 320-330. Philadelphia, PA, USA (2008)