A Self-Organizing Map-based Method for Multi-Label Classification

Gustavo G. Colombini
Department of Computer Science
Federal University of São Carlos
Rodovia Washington Luís, km 235
São Carlos - SP - Brazil
Email: gustavogiordanocolombini@gmail.com

Iuri Bonna M. de Abreu
Department of Computer Science
Federal University of São Carlos
Rodovia Washington Luís, km 235
São Carlos - SP - Brazil
Email: iuri.bonna@gmail.com

Ricardo Cerri
Department of Computer Science
Federal University of São Carlos
Rodovia Washington Luís, km 235
São Carlos - SP - Brazil
Email: cerri@dc.ufscar.br
Abstract—In Machine Learning, multi-label classification is the
task of assigning an instance to two or more categories simulta-
neously. This is a very challenging task, since datasets can have
many instances and become very unbalanced. While most of the
methods in the literature use supervised learning to solve multi-
label problems, in this paper we propose the use of unsupervised
learning through neural networks. More specifically, we explore
the power of Self-Organizing Maps (Kohonen Maps), since they have a self-organization ability and map input instances to a grid of neurons. Because instances assigned to similar sets of labels tend to be similar to each other, after organization the network tends to map similar training instances to nearby neurons. Testing instances
can then be mapped to specific neurons in the network, being
classified in the labels assigned to training instances mapped
to these neurons. Our proposal was experimentally compared
to other literature methods, showing competitive performances.
The evaluation was performed using freely available datasets and
measures specifically designed for multi-label problems.
I. INTRODUCTION
In the Machine Learning literature, conventional classifi-
cation problems are called single-label. In these problems, a
classifier is trained on a set of instances, each associated with a single class l from a set of disjoint classes L, where |L| > 1. However, there are more complex problems in which
instances can be classified in many classes simultaneously.
These problems are called multi-label.
In multi-label classification, instances are associated with a set of classes Y ⊂ L. In the past, multi-label
classification was mainly motivated by tasks such as document classification [1], [2], [3] and medical diagnosis [4]. Many
works can also be found in Bioinformatics [5], [6], [7], [8],
[9] and Image Classification [10], [11]. In document cate-
gorization problems, documents usually belong to more than
one class, e.g., Computer Sciences and Biology. In medical
diagnostic problems, a patient can be suffering from diabetes
and prostate cancer at the same time. An image can contain
mountain and beach characteristics simultaneously [12]. In
addition to these fields, multi-label classification methods are
also being applied to the classification of sentiments, since it
is possible to extract emotions from natural language texts,
such as micro-blogs [13].
Figure 1 illustrates a comparison between a conventional
classification problem, in which instances can be assigned only
to one class, and a multi-label classification problem. Figure
1(a) illustrates a classification problem where a document
belongs to only one class (“Biology” or “Computer Science”),
but never to both classes simultaneously. Figure 1(b) illustrates
a problem where a document can be assigned simultaneously
to the “Biology” and “Computer Science” classes. In Fig-
ure 1(b), instances inside the highlighted region address both
Computer Science and Biology subjects.
Different strategies are proposed in the literature to deal with
multi-label classification. The existing strategies fall into two
approaches: algorithm-independent and algorithm-dependent.
The algorithm-independent approach uses traditional clas-
sification algorithms, transforming the original multi-label
problem into a set of single-label problems. The algorithm-
dependent approach develops specific algorithms to deal with
the multi-label problem. These algorithms can be based on
conventional classifiers, as Support Vector Machines [14] and
Decision Trees [15].
While most of the literature methods use supervised learn-
ing to solve multi-label problems, in this paper we propose
the use of unsupervised learning through neural networks.
Because instances classified in similar sets of labels have an
intrinsic relationship, we explore the power of Self-Organizing
Maps (Kohonen Maps) [16], to map similar input instances
to network neurons. Training instances which are similar to
each other are mapped to closer neurons in the map. This
is performed by the adaptation process of the Kohonen Map.
Testing instances can then be mapped to specific neurons in
the network, being classified in the labels assigned to training
instances mapped to these neurons.
The remainder of this paper is organized as follows. Sec-
tion II reviews literature on multi-label classification; our pro-
posed method using Kohonen Maps is presented in Section III,
while Section IV presents the datasets, algorithms and evalua-
tion measures used; our experiments are reported in Section V,
together with their analysis. Finally, Section VI provides our
final considerations and future research directions.
Fig. 1. Example of classification problems: (a) conventional classification (single-label); (b) multi-label classification. Adapted from [11].
II. RELATED WORK
Many studies have been proposed based on both the algorithm-dependent and algorithm-independent approaches. This section
presents some of them.
A. Algorithm Independent Approach
A very simple method, based on the algorithm independent
approach, uses L classifiers, with L being the number of classes involved in the problem. Each classifier is then associated with one class and trained to solve a binary classification problem, in which its class is considered against all the other involved classes.
This method is called Binary-Relevance (BR) [17]. A draw-
back of this method is that it assumes that the classes assigned
to an instance are independent of each other. This is not always
true, and ignoring all possible correlations between classes
may harm the generalization ability of the classifiers.
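As an illustration only (our experiments use the implementations available in Mulan, not this code), a minimal Binary-Relevance sketch could look as follows, assuming a binary label matrix Y and an arbitrary scikit-learn-style base classifier:

```python
# Minimal Binary-Relevance sketch: one independent binary classifier per label.
# Assumptions: X is a (n_instances, n_attributes) array and Y is a binary
# (n_instances, n_labels) matrix; the base classifier choice is arbitrary.
import numpy as np
from sklearn.linear_model import LogisticRegression

class BinaryRelevance:
    def __init__(self, base_factory=LogisticRegression):
        self.base_factory = base_factory
        self.models = []

    def fit(self, X, Y):
        # Train one binary problem per label, ignoring label correlations.
        self.models = [self.base_factory().fit(X, Y[:, j]) for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        # Stack the per-label binary decisions into a predicted label matrix.
        return np.column_stack([m.predict(X) for m in self.models])
```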
Over the years, BR improvements were proposed. In the
works of Cherman et al. [18], Read et al. [19] and Dembczynski et al. [20], methods based on the Binary-Relevance
transformation were proposed. The idea is to use the instances’
classes to complement their attribute vectors, trying to incor-
porate the dependencies between labels in the learning pro-
cess. In Huang and Zhou [21], instances were clustered, and
similarities calculated within each cluster. These similarities
were used to augment the original feature vectors. Yu et al. [22] used concepts of neighborhood rough sets. The idea was to find out the possibly related labels for a given instance, excluding all unrelated ones. Spolaôr et al. [23] used pairwise label correlations to construct new binary labels to augment
the original feature vectors.
Another example of the algorithm independent approach is
the Label-Powerset (LP) transformation. For each instance,
all the classes assigned to it are combined into a new and
unique class. With this combination, the correlations between
classes are considered, but the number of classes involved
in the problem can be considerably increased, leading some
classes to end up with very few positive instances. This label
combination strategy was used in the works of Tsoumakas et al. [12] and Boutell et al. [10].
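A minimal sketch of the Label-Powerset transformation, again only illustrative and assuming a binary label matrix Y, is shown below; names and base classifier are our own choices:

```python
# Minimal Label-Powerset sketch: every distinct label combination becomes one
# meta-class of a single-label problem. The base classifier is an assumption.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class LabelPowerset:
    def __init__(self, base_factory=DecisionTreeClassifier):
        self.base = base_factory()
        self.labelsets = None  # meta-class index -> original binary label vector

    def fit(self, X, Y):
        # Map each distinct row of Y to an integer meta-class.
        self.labelsets, y_meta = np.unique(Y, axis=0, return_inverse=True)
        self.base.fit(X, y_meta.ravel())
        return self

    def predict(self, X):
        # Decode predicted meta-classes back into label vectors.
        return self.labelsets[self.base.predict(X)]
```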
Still in Tsoumakas et al. [12], a method called Random k-Labelsets (RAKEL) was proposed, based on the Label-Powerset strategy. This method iteratively builds a combination of m Label-Powerset classifiers. Being L the set of labels of the problem, a k-labelset is a subset Y ⊆ L with k = |Y|. The term L^k represents the set of all the k-labelsets of L. At every iteration 1...m, a k-labelset Y_i is randomly selected from L^k, without replacement. A classifier H_i is then trained for Y_i. For the classification of a new instance, each classifier H_i takes a binary decision for each label λ_j from the k-labelset Y_i. An average decision is calculated for each label λ_j in L, and the final decision is positive for a given label if the average decision is greater than a given threshold t. The RAKEL method was proposed to take into account the correlations between the classes and, at the same time, avoid the disadvantages of the Label-Powerset method, where some classes might end up with few positive instances.
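A rough sketch of the RAKEL idea, reusing the LabelPowerset sketch above, is given below; for simplicity it does not prevent the same k-labelset from being drawn twice, which differs from the original proposal:

```python
# Rough RAkEL sketch: m Label-Powerset classifiers, each trained on a random
# k-labelset; per-label votes are averaged and thresholded.
import numpy as np

class Rakel:
    def __init__(self, m=10, k=3, threshold=0.5, seed=0):
        self.m, self.k, self.threshold = m, k, threshold
        self.rng = np.random.default_rng(seed)
        self.members = []  # list of (label indices, LabelPowerset model)

    def fit(self, X, Y):
        self.n_labels = Y.shape[1]
        for _ in range(self.m):
            subset = self.rng.choice(self.n_labels, size=self.k, replace=False)
            self.members.append((subset, LabelPowerset().fit(X, Y[:, subset])))
        return self

    def predict(self, X):
        votes = np.zeros((X.shape[0], self.n_labels))
        counts = np.zeros(self.n_labels)
        for subset, model in self.members:
            votes[:, subset] += model.predict(X)  # binary decision per covered label
            counts[subset] += 1
        avg = votes / np.maximum(counts, 1)       # average decision per label
        return (avg >= self.threshold).astype(int)
```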
B. Algorithm Dependent Approach
A decision tree-based method was proposed by Clare and
King [5]. In this work, the authors modified the C4.5 [15]
algorithm to deal with protein classification according to its
functions. The C4.5 algorithm defines the decision tree nodes
through a measure called entropy. The authors modified the en-
tropy formula, originally elaborated for single-label problems,
in a way to allow its use in multi-label problems. Another
modification made by the authors was the use of the tree leaf-
nodes to represent a set of labels. A leaf node contains a set
of labels, and when reached, a separate rule is produced for
each class.
In Zhang and Zhou [6] a method based on the KNN
algorithm was proposed, called ML-kNN. In this method, for each instance, the classes associated with its K closest neighbors are retrieved, and the number of neighbors associated with each class is counted. The maximum a posteri-
ori principle is then used to define the set of classes of a new
instance.
Also in Zhang and Zhou [24] a multi-label error mea-
sure was proposed for the training of neural networks with
the Back-propagation algorithm. The measure considers the
multiple classes of the instances in the calculation of the
classification error.
In Schapire and Singer [25], [26], two extensions of the
Adaboost algorithm [27] were proposed, allowing its use in
multi-label problems. In the first one, a modification was
proposed to measure the predictive performance of the induced
model, verifying its ability to predict a correct set of classes.
In the second, a change in the algorithm makes it predict a
ranking of classes for each instance.
Thabtah et al. [28] proposed a multi-label algorithm based on class association rules. The algorithm was called multi-class
multi-label associative classification (MMAC). Initially, a set
of rules is created, and all instances associated to this set are
removed. The remaining instances are then utilized to create
a new set of rules. This procedure is executed until there are
no instances left.
A classification algorithm based on entropy was proposed
by Zhu et al. [29] for information retrieval tasks. The
authors used the model to explore correlations between classes
in multi-label documents.
Madjarov et al. [30] published a work in which multiple classification methods, based on both the algorithm-dependent and algorithm-independent approaches, were compared. Several evalu-
ation measures were also used in the experiments. The best
performances were obtained by methods which try to consider
the dependencies of the classes during the training process.
III. MULTI-LABEL CLASSIFICATION WITH SELF-ORGANIZING MAPS
In this section, we present our method for multi-label
classification using Kohonen Maps. Knowing the existence
of instance correlations in multi-label datasets, we used the
adaptive power of the self-organizing maps in order to group
correlated instances in the same region of the Kohonen Map.
We call our method Self Organizing Maps Multi-label Learn-
ing (SOM-MLL).
A. Kohonen Maps
Kohonen Maps [16] are self-organizing neural networks
capable of mapping similar instances to neurons next to each
other, belonging to a two-dimensional map of neurons. This
mapping is done through a competition between the map
neurons. The winner neuron is the one whose weight vector
is the closest to the instance vector being mapped.
When an instance is mapped to a neuron, the weights of that neuron's connections are adjusted in a way that strengthens this mapping. The neighbouring neurons around the winner also have their weights adjusted, in order to form a neighbourhood of similar neurons around the winner neuron. Figure 2 illustrates a neural network connected to a
training instance.
With a Kohonen Map, it is possible to analyse the attributes
which led the instances to be mapped to the same region. The mapping is performed by calculating the Euclidean distance between the instances' attribute vectors and the neurons' weight vectors.

Fig. 2. Mapping an input instance to a Kohonen Map: the instance is connected through synaptic connections to a matrix of neurons, and the winning neuron is highlighted.
The smaller the distance between these vectors, the closer the instance is to a neuron.
Figure 3 illustrates different views of a neuron map obtained
after training a Kohonen Map. Figure 3(a) illustrates, using a
color palette, the number of instances mapped to each neuron.
The black color represents neurons that did not have instances
mapped to them. Figure 3(b) illustrates the distance between
instances mapped to each neuron.
The idea behind the Kohonen Map is that similar instances
are mapped to the same neighborhood of neurons. Thus,
groups of similar instances are obtained. After obtaining these
groups, the map can be used for classification. For this, we can
map test instances to the already trained map. This process is
explained in the next sections.
B. Mapping Procedure
During the network training, and when mapping a test
instance, the winner neuron is obtained using the Euclidean
distance. This measure is used to calculate the distance between the attribute vector of an instance (x) and the weight vector (w_j) of a neuron j. This calculation is presented in Equation 1, where A represents the number of attributes of an instance. The winning neuron is the one closest to the
input instance. In the case of categorical attributes, other
distance measures can be used.
$$ d_j(x) = \sqrt{\sum_{i=1}^{A} (x_i - w_{ji})^2} \qquad (1) $$
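For illustration, the winner-neuron selection of Equation 1 can be sketched as follows (NumPy-based; the function name is ours):

```python
# Sketch of winner-neuron selection using the Euclidean distance of Equation 1.
import numpy as np

def winner_neuron(x, W):
    """x: attribute vector with A entries; W: (n_neurons, A) weight matrix.
    Returns the index of the neuron whose weight vector is closest to x."""
    distances = np.sqrt(((W - x) ** 2).sum(axis=1))  # d_j(x) for every neuron j
    return int(np.argmin(distances))
```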
After obtaining the winning neuron, its weights should be
adjusted to approximate it to the instance. Also, a topological
neighbourhood should be defined, and the weights of the
neurons in this neighbourhood are also adjusted, approximat-
ing them to the winning neuron. These adjustments are
necessary to guarantee that a neighbourhood of close neurons
is created around the winning neuron.
A good choice for the neighbourhood function is the Gaussian function. This function is presented in Equation 2, where h_{j,i} represents the neighbourhood around the winning neuron i, formed by other excited neurons j. Also, d_{j,i} defines a lateral distance (in the neuron grid) between a winning neuron i and an excited neuron j. The σ parameter defines how broad the neighbourhood is, influencing how much the excited neighbouring neurons participate in the learning process.

Fig. 3. Kohonen Maps: (a) number of instances mapped to each neuron; (b) distances between instances mapped to each neuron.
$$ h_{j,i} = \exp\left(-\frac{d_{j,i}^2}{2\sigma^2}\right) \qquad (2) $$
For self-organization, the weight vector of a neuron j should be adjusted according to the input instance x and a learning rate η. This adjustment is given by Equation 3.
$$ \Delta w_j = \eta\, h_{j,i}\, (x - w_j) \qquad (3) $$
Given a weight vector w_j at iteration t, the updated weight vector at iteration t+1 is given by Equation 4.
$$ w_j(t+1) = w_j(t) + \eta\, h_{j,i}\, (x - w_j(t)) \qquad (4) $$
The training process continues for a given number of
iterations. With repetitive presentations of the instances, the
network tends to converge, and the weights in the map tend
to follow the distribution of the input vectors. Algorithm III.1
shows the SOM-MLL training procedure.
C. Classification Procedure
To classify an instance x_i, the classes of all instances are represented by a binary vector v_i. In this vector, the jth position corresponds to the jth class of the problem. If an instance x_i belongs to class c_j, then the position v_{i,j} receives the value 1, and 0 otherwise. With that representation, it is possible to classify a test instance using a prototype vector v. After mapping a test instance to its closest neuron, the prototype vector is obtained by averaging the class vectors of the training instances mapped to this neuron. The formula to obtain the vector v for a neuron n is presented in Equation 5. In this equation, S_{n,j} is the set of training instances mapped to neuron n which are classified in class c_j, and S_n is the full set of instances mapped to neuron n.
$$ v_{n,j} = \frac{|S_{n,j}|}{|S_n|} \qquad (5) $$
From Equation 5, we see that each position v_{n,j} contains the proportion of instances mapped to neuron n which are classified in class c_j. This can be interpreted as the probability of an instance belonging to class c_j. To obtain a deterministic prediction, a threshold is used. Thus, if a threshold value of 0.5 is used, all positions whose values are greater than or equal to 0.5 receive the value 1, and 0 otherwise. With this, it is possible to compare the vectors of predicted classes with the vectors of true classes. Figure 4 illustrates a prototype vector, where a threshold value of 0.5 was used. Algorithm III.2 presents the procedure to classify a new instance.
Algorithm III.1: SOM-MLL training procedure
Function: train-SOM-MLL(X, e)
input : X = [q, (a+l)]: dataset with q instances, a attributes and l labels
        e: number of epochs
output: W = [n, a]: weight matrix with n neurons and a weights

Randomly initialize weight matrix W;
for i ← 1 to e do
    Randomize dataset X;
    for j ← 1 to q do
        // Select winner neuron from neuron grid Ω
        o(x_j) = argmin_k ||x_j − w_k||, k ∈ Ω;
        // Adjust weights of all excited neurons
        w_k(i+1) = w_k(i) + η(i) h_{k,o(x_j)}(i) (x_j − w_k(i));
return W;
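A compact, illustrative sketch of this training loop is given below. It assumes a square grid, and the learning-rate and radius schedules are simplified assumptions that do not exactly reproduce the values of Table II or the Kohonen R package defaults:

```python
# Illustrative sketch of the training loop of Algorithm III.1 on a square grid.
import numpy as np

def train_som(X, grid_side=5, epochs=100, eta0=0.05, eta1=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n_neurons = grid_side * grid_side
    W = rng.random((n_neurons, X.shape[1]))                       # random initial weights
    # Grid coordinates of every neuron, used for the lateral distance d_{j,i}.
    coords = np.array([(r, c) for r in range(grid_side)
                              for c in range(grid_side)], dtype=float)
    for epoch in range(epochs):
        eta = eta0 + (eta1 - eta0) * epoch / max(epochs - 1, 1)   # learning rate decay
        sigma = max(grid_side / 2.0 * (1 - epoch / epochs), 0.5)  # radius decay (assumption)
        for x in rng.permutation(X):                              # randomize the dataset
            winner = np.argmin(((W - x) ** 2).sum(axis=1))        # Equation 1
            d2 = ((coords - coords[winner]) ** 2).sum(axis=1)     # lateral distances squared
            h = np.exp(-d2 / (2 * sigma ** 2))                    # Gaussian neighbourhood (Eq. 2)
            W += eta * h[:, None] * (x - W)                       # weight update (Eq. 4)
    return W
```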
IV. MATERIALS AND METHODS
In this section, we present the datasets, algorithms and
evaluation measures used in our experiments.
A. Datasets
All the datasets used in this work are freely available at http://mulan.sourceforge.net/datasets-mlc.html. We chose seven representative ones from different application domains: audio, images, music and biology. The main characteristics of the datasets are shown in Table I.
Fig. 4. Predictions for SOM-MLL: (a) prototype vector; (b) final predictions after threshold application. Adapted from Cerri et al. [31].
Algorithm III.2: SOM-MLL classification procedure
Function: classify-SOM-MLL(Xtrain, Xtest, W)
input : Xtrain = [q, (a+l)]: dataset with q instances, a attributes and l labels
        Xtest = [m, a]: dataset with m instances and a attributes
        W = [n, a]: weight matrix with n neurons and a weights
output: P = [m, l]: prediction matrix with m rows and l columns

for j ← 1 to m do
    // Select winner neuron from neuron grid Ω
    o(xtest_j) = argmin_k ||xtest_j − w_k||, k ∈ Ω;
    // Get training instances mapped to winner neuron
    T ← instances mapped to o(xtest_j);
    // Get prototype vector
    v_j ← average of the label vectors from T;
    // Associate prototype to instance
    xtest*_j ← xtest_j + v_j;
    p_j ← v_j;
return P;
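The classification step of Algorithm III.2 above can be sketched as follows. When no training instance was mapped to the winning neuron, this sketch simply predicts an empty label set, which is an assumption on our part:

```python
# Sketch of the classification step of Algorithm III.2: map each test instance
# to its winning neuron, average the label vectors of the training instances
# mapped to that neuron (Equation 5), and apply a threshold.
import numpy as np

def classify_som_mll(W, X_train, Y_train, X_test, threshold=0.5):
    # Winning neuron of every training instance.
    train_winners = np.array([np.argmin(((W - x) ** 2).sum(axis=1)) for x in X_train])
    P = np.zeros((X_test.shape[0], Y_train.shape[1]), dtype=int)
    for i, x in enumerate(X_test):
        winner = np.argmin(((W - x) ** 2).sum(axis=1))  # closest neuron to the test instance
        mapped = Y_train[train_winners == winner]       # S_n: training instances mapped to it
        if len(mapped) > 0:
            prototype = mapped.mean(axis=0)             # v_{n,j} = |S_{n,j}| / |S_n|
            P[i] = prototype >= threshold               # deterministic prediction
    return P
```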
Each column of Table I gives, respectively, the dataset identification, application domain, number of instances, number of nominal and numeric attributes, number of labels, label cardinality, label density, and number of distinct labelsets. Label Cardinality (LC) is the average number of labels per instance, while Label Density (LD) is LC divided by the total number of labels. In Equations 6 and 7, m gives the total number of instances and q gives the number of labels.
$$ LC = \frac{1}{m} \sum_{i=1}^{m} |Y_i| \qquad (6) $$

$$ LD = \frac{1}{m} \sum_{i=1}^{m} \frac{|Y_i|}{q} \qquad (7) $$
Analysing Label Cardinality (LC) and Label Density (LD)
is important to understand the behaviour of the classification
algorithms. While LC does not consider the number of labels,
LD does. LC can be used to quantify the number of alternative
labels assigned to an instance. Two datasets can have the same
LC, but different LD, causing the same classifier to behave
differently. The number of distinct label sets is also important,
strongly influencing methods which operate on subsets of
labels [17].
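For illustration, LC and LD (Equations 6 and 7) can be computed directly from a binary label matrix, as in the following sketch:

```python
# Sketch of Label Cardinality (Eq. 6) and Label Density (Eq. 7) computed from a
# binary label matrix Y of shape (m instances, q labels).
import numpy as np

def label_cardinality(Y):
    return Y.sum(axis=1).mean()               # average number of labels per instance

def label_density(Y):
    return label_cardinality(Y) / Y.shape[1]  # LC divided by the number of labels q
```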
B. Classification Algorithms
We compared our proposal with several algorithm-dependent and algorithm-independent methods from the literature. The
algorithms used are listed below.
•Support Vector Machine (SVM) [14], J48 decision tree induction [15] and k-Nearest Neighbours (kNN) [32]. All these algorithms were used with the Binary-Relevance and Label-Powerset transformations;
•Back-Propagation Multi-Label Learning (BPMLL) [24], an algorithm-dependent method based on neural networks;
•Multi-Label k-Nearest Neighbours (MLkNN) [33], an algorithm-dependent method based on kNN.
All the classification algorithms are implemented within
Mulan [34], a Java library for multi-label learning. For all
algorithms, the default parameter values were adopted.
Our proposed method was implemented using the Kohonen R package [35] within the R programming language. Table II presents the SOM-MLL parameter values.
C. Evaluation Measures
Unlike single-label classification, wherein an instance is
classified either correctly or wrongly, in multi-label classifi-
cation, a classification can be considered partially correct or
partially wrong, requiring specific evaluation measures.
Let H be a multi-label classifier, with Z_i = H(x_i) the set of predicted labels by H for a given instance x_i, Y_i the set of true labels, L the total set of labels and S the set of instances.
Two commonly used evaluation measures are Precision and
Recall. They were used in the work of Godbole and Sarawagi [36], and are
presented in Equations 8 and 9.
$$ Precision(H, S) = \frac{1}{|S|} \sum_{i=1}^{|S|} \frac{|Y_i \cap Z_i|}{|Z_i|} \qquad (8) $$

$$ Recall(H, S) = \frac{1}{|S|} \sum_{i=1}^{|S|} \frac{|Y_i \cap Z_i|}{|Y_i|} \qquad (9) $$
TABLE I
DATASETS' STATISTICS
Name Domain # Instances # Nominal # Numeric # Labels Cardinality Density # Distinct
cal500 music 502 0 68 174 26.044 0.150 502
birds audio 645 2 258 19 1.014 0.053 133
emotions music 593 0 72 6 1.869 0.311 27
flags image 194 9 10 7 3.392 0.485 54
genbase biology 662 1186 0 27 1.252 0.046 32
scene image 2407 0 294 6 1.074 0.179 15
yeast biology 2417 0 103 14 4.237 0.303 198
TABLE II
SOM-MLL PARAMETER VALUES
Parameter Value
Grid topology Hexagonal
Number of neurons 5 ×5 = 25
Neighbourhood function Gaussian
Learning rate (η) Linearly decreases from 0.05 to 0.01 at each epoch
Neighbourhood radius (σ) Starts with a value that covers 2/3 of all neighbouring neurons and linearly decreases at each epoch until it reaches the negative of that value
As Precision and Recall alone are not adequate for the
evaluation of classifiers, we also used the harmonic mean
of these two measures, called F-measure. Its calculation is
presented in Equation 10.
$$ Fmeasure(H, S) = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (10) $$
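A sketch of these example-based measures (Equations 8 to 10) over binary label matrices is shown below; the small eps guard against empty prediction or label sets is our addition, not part of the original definitions:

```python
# Sketch of example-based Precision, Recall and F-measure (Eqs. 8-10) for
# binary true (Y) and predicted (Z) label matrices.
import numpy as np

def precision_recall_fmeasure(Y, Z, eps=1e-12):
    inter = np.logical_and(Y, Z).sum(axis=1)                          # |Y_i ∩ Z_i| per instance
    precision = (inter / np.maximum(Z.sum(axis=1), eps)).mean()       # Equation 8
    recall = (inter / np.maximum(Y.sum(axis=1), eps)).mean()          # Equation 9
    fmeasure = 2 * precision * recall / max(precision + recall, eps)  # Equation 10
    return precision, recall, fmeasure
```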
In order to validate our analysis, all the experiments were
performed using the 10-fold cross validation strategy. To split
the data, we used the iterative stratification strategy proposed
by [37]. In this strategy, the desired number of instances in
each subset is calculated. Then, each instance is examined
iteratively so that the algorithm can select an appropriate sub-
set for distribution. The stratification strategy is implemented within the utiml R package [38].
V. EXPERIMENTS AND DISCUSSION
Tables III, IV and V show the mean precision, recall and F-measure results obtained in our experiments. We refer to our method as Self Organizing Maps Multi-label Learning (SOM-MLL). For the algorithm-independent methods, we indicate the transformation used as Binary-Relevance (BR) or Label-Powerset (LP).
As can be seen considering the average and individual
dataset results, the performance of SOM-MLL can be consid-
ered competitive with the performances of the other literature
methods. Regarding the precision values, SOM-MLL was
able to obtain better results than some algorithm-dependent and algorithm-independent methods, especially in datasets cal500, emotions, flags and yeast. If we look at the recall values,
Table IV shows that our method obtained smaller values than
most of the methods. This may be explained by the use of
only the winner neuron when calculating the prototype vector
used to classify a new instance. The use of a neighbourhood
of neurons can lead to a better coverage.
Considering the F-measure results, SOM-MLL could not obtain better results than the SVM, KNN and MLkNN algorithms. However, the results obtained can be considered very promising, especially if we compare our method with J48 and
BPMLL, where we obtained competitive or better results.
We consider the results obtained so far very promising,
especially considering that there is still room for improvements
in the algorithm. As already mentioned, one modification that
could improve the results is related to the number of neurons
used to construct the average label vector of a test instance.
Instead of considering only the training instances mapped to
the winning neuron, we can also consider the training instances
mapped to the neighbourhood of the winning neuron. A thresh-
old can be used to vary the size of this neighbourhood. Such
modification can considerably improve the results, especially considering that, in our current version, there are winning neurons with only one or two training instances mapped to them, which can lower the obtained recall values.
Another improvement that can be implemented is related to
the number of neurons used to build the grid of neurons for
training. Currently, we are using the default values provided
by the Kohonen R package. However, we could also tune
this value specifically for each dataset. The other parameters,
such as learning rate or size of the neighbourhood for weight
update, could also be tuned for each dataset.
To verify whether statistically significant results were obtained, we applied the Friedman statistical test [39] considering the F-measure results. The p-value obtained was 0.035, which does not provide strong evidence of statistically significant differences. The Nemenyi post-hoc test was then applied to identify which pairwise comparisons presented statistically significant differences. The critical difference diagram in Figure 5 shows the results of the Nemenyi test. Methods connected in the diagram presented no statistically significant differences.
According to Figure 5, the only method which obtained
statistically better results than SOM-MLL was the SVM
with the Label Powerset transformation. We would like to
emphasize, however, that this conclusion is based on the F-measure values averaged over all datasets. Also, the Friedman p-value of 0.035
TABLE III
PRECISION RESULTS
Dataset SOM-MLL SVM-BR J48-BR KNN-BR SVM-LP J48-LP KNN-LP BPMLL MLkNN
cal500 0.60 ±0.02 0.62 ±0.07 0.45 ±0.08 0.35 ±0.04 0.34 ±0.05 0.34 ±0.04 0.35 ±0.04 0.35 ±0.04 0.60 ±0.06
birds 0.53 ±0.03 0.70 ±0.12 0.63 ±0.09 0.66 ±0.10 0.70 ±0.12 0.63 ±0.13 0.66 ±0.07 0.45 ±0.11 0.62 ±0.10
emotions 0.63 ±0.07 0.68 ±0.10 0.59 ±0.13 0.63 ±0.11 0.68 ±0.16 0.58 ±0.15 0.63 ±0.11 0.64 ±0.12 0.70 ±0.16
flags 0.68 ±0.05 0.72 ±0.06 0.69 ±0.14 0.68 ±0.16 0.69 ±0.12 0.66 ±0.14 0.68 ±0.16 0.69 ±0.08 0.72 ±0.10
genbase 0.93 ±0.03 0.99 ±0.02 0.99 ±0.03 0.99 ±0.02 0.99 ±0.02 0.99 ±0.03 0.99 ±0.02 0.04 ±0.04 0.98 ±0.05
scene 0.53 ±0.04 0.62 ±0.08 0.56 ±0.06 0.71 ±0.05 0.76 ±0.05 0.60 ±0.07 0.71 ±0.05 0.37 ±0.08 0.70 ±0.07
yeast 0.71 ±0.01 0.72 ±0.06 0.60 ±0.06 0.60 ±0.07 0.66 ±0.05 0.54 ±0.06 0.60 ±0.07 0.62 ±0.05 0.72 ±0.04
Average 0.65 0.72 0.64 0.66 0.68 0.62 0.66 0.45 0.72
TABLE IV
RECALL RESULTS
Dataset SOM-MLL SVM-BR J48-BR KNN-BR SVM-LP J48-LP KNN-LP BPMLL MLkNN
cal500 0.23 ±0.01 0.23 ±0.04 0.29 ±0.07 0.35 ±0.06 0.35 ±0.06 0.34 ±0.05 0.35 ±0.06 0.72 ±0.05 0.22 ±0.05
birds 0.73 ±0.02 0.66 ±0.11 0.61 ±0.10 0.67 ±0.10 0.68 ±0.09 0.62 ±0.09 0.67 ±0.10 0.52 ±0.20 0.56 ±0.09
emotions 0.60 ±0.05 0.66 ±0.11 0.57 ±0.10 0.63 ±0.08 0.71 ±0.09 0.58 ±0.17 0.63 ±0.08 0.73 ±0.11 0.63 ±0.18
flags 0.65 ±0.06 0.76 ±0.16 0.74 ±0.12 0.65 ±0.14 0.68 ±0.10 0.66 ±0.15 0.65 ±0.18 0.76 ±0.12 0.76 ±0.17
genbase 0.92 ±0.03 0.99 ±0.02 0.99 ±0.02 0.99 ±0.02 0.99 ±0.03 0.98 ±0.04 0.99 ±0.02 0.66 ±0.03 0.95 ±0.05
scene 0.51 ±0.04 0.65 ±0.07 0.64 ±0.08 0.70 ±0.05 0.75 ±0.06 0.60 ±0.06 0.70 ±0.05 0.83 ±0.16 0.69 ±0.06
yeast 0.54 ±0.01 0.58 ±0.03 0.58 ±0.07 0.60 ±0.06 0.62 ±0.04 0.54 ±0.07 0.60 ±0.06 0.69 ±0.05 0.59 ±0.07
Average 0.59 0.64 0.63 0.65 0.68 0.61 0.65 0.70 0.62
TABLE V
FMEASURE RESULTS
Dataset SOM-MLL SVM-BR J48-BR KNN-BR SVM-LP J48-LP KNN-LP BPMLL MLkNN
cal500 0.32 ±0.01 0.34 ±0.07 0.34 ±0.07 0.34 ±0.05 0.34 ±0.05 0.33 ±0.05 0.34 ±0.05 0.45 ±0.03 0.32 ±0.06
birds 0.56 ±0.03 0.66 ±0.10 0.61 ±0.09 0.65 ±0.10 0.68 ±0.09 0.61 ±0.09 0.65 ±0.10 0.44 ±0.12 0.58 ±0.09
emotions 0.60 ±0.06 0.60 ±0.11 0.55 ±0.08 0.60 ±0.06 0.67 ±0.12 0.55 ±0.14 0.60 ±0.08 0.66 ±0.10 0.63 ±0.16
flags 0.64 ±0.05 0.73 ±0.11 0.70 ±0.13 0.65 ±0.16 0.67 ±0.09 0.66 ±0.15 0.65 ±0.15 0.70 ±0.10 0.73 ±0.11
genbase 0.92 ±0.04 0.99 ±0.02 0.99 ±0.03 0.99 ±0.02 0.99 ±0.03 0.99 ±0.04 0.99 ±0.02 0.06 ±0.06 0.96 ±0.05
scene 0.52 ±0.04 0.62 ±0.07 0.56 ±0.04 0.70 ±0.05 0.75 ±0.06 0.51 ±0.06 0.70 ±0.05 0.49 ±0.11 0.69 ±0.06
yeast 0.59 ±0.01 0.61 ±0.03 0.56 ±0.06 0.57 ±0.07 0.62 ±0.04 0.51 ±0.06 0.57 ±0.07 0.63 ±0.06 0.62 ±0.05
Average 0.59 0.65 0.61 0.64 0.67 0.59 0.64 0.49 0.64
Fig. 5. Critical Diagram for the Nemenyi post-hoc Statistical Test.
does not provide strong evidence of statistically significant
differences. Considering the individual datasets, SOM-MLL
obtained better precision and recall results than SVM-LP in
some datasets, resulting in very competitive F-measure values in
some cases. See for example datasets cal500, flags and yeast.
Again, considering a neighbourhood of neurons, better results
can be obtained.
VI. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a method called Self Organizing
Maps Multi-label Learning (SOM-MLL), which uses Kohonen
Maps for multi-label classification. The training instances are
mapped to the neurons of a grid, which organizes itself so that similar instances are mapped to the same region of the grid. The idea is that instances classified into similar sets of classes are mapped to the same region of the
grid. To classify a new test instance, it is mapped to a neuron
of the grid, and the classes of the training instances mapped
to this neuron are used to label the test instance.
The experiments showed that SOM-MLL presented compet-
itive and promising results compared to the literature methods
investigated, especially considering there is still a lot of room
for improving the algorithm.
As future work, we plan to extend our method, allowing
a neighbourhood of neurons to be used to classify a new
instance, instead of only the winning neuron. Also, different
neighbourhood sizes and topologies can be experimented with,
together with different algorithm parameters. More multi-label
classification algorithms and datasets should also be used in
the experimental comparisons.
ACKNOWLEDGMENT
The authors would like to thank CAPES, CNPq and
FAPESP for their financial support, especially grant #2015/14300-1, São Paulo Research Foundation (FAPESP).
REFERENCES
[1] T. Gonçalves and P. Quaresma, “A preliminary approach to the multilabel classification problem of Portuguese juridical documents,” in EPIA,
2003, pp. 435–444.
[2] B. Lauser and A. Hotho, “Automatic multi-label subject indexing in a
multilingual environment,” in Proc. of the 7th European Conference in
Research and Advanced Technology for Digital Libraries, ECDL 2003,
vol. 2769. Springer, 2003, pp. 140–151.
[3] X. Luo and N. A. Zincir-Heywood, “Evaluation of two systems on multi-
class multi-label document classification,” in International Syposium on
Methodologies for Intelligent Systems, 2005, pp. 161–169.
[4] A. Karalic and V. Pirnat, “Significance level based multiple tree classi-
fication,” in Informatica, vol. 15, no. 5, 1991, p. 12.
[5] A. Clare and R. D. King, “Knowledge discovery in multi-label phe-
notype data,” in 5th European Conference on Principles of Data
Mining and Knowledge Discovery (PKDD2001), ser. LNAI, vol. 2168.
Springer, 2001, pp. 42–53.
[6] M.-L. Zhang and Z.-H. Zhou, “A k-Nearest Neighbor Based Algorithm
for Multi-label Classification,” vol. 2. The IEEE Computational
Intelligence Society, 2005, pp. 718–721 Vol. 2.
[7] A. Elisseeff and J. Weston, “Kernel Methods for Multi-labelled Classi-
fication and Categorical Regression Problems,” in Advances in Neural
Information Processing Systems. MIT Press, 2001, pp. 681–687.
[8] C. Vens, J. Struyf, L. Schietgat, S. Džeroski, and H. Blockeel, “Decision
trees for hierarchical multi-label classification,” Machine Learning,
vol. 73, no. 2, pp. 185–214, 2008.
[9] R. Cerri, R. C. Barros, A. C. P. L. F. de Carvalho, and Y. Jin, “Reduction
strategies for hierarchical multi-label classification in protein function
prediction,” BMC Bioinformatics, vol. 17, no. 1, p. 373, 2016.
[10] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label
scene classification,” Pattern Recognition, vol. 37, no. 9, pp. 1757–1771,
2004.
[11] X. Shen, M. Boutell, J. Luo, and C. Brown, “Multilabel machine learning
and its application to semantic scene classification,” in Society of Photo-
Optical Instrumentation Engineers (SPIE) Conference Series, vol. 5307,
Dec. 2003, pp. 188–199.
[12] G. Tsoumakas and I. Katakis, “Multi Label Classification: An
Overview,” International Journal of Data Warehousing and Mining,
vol. 3, no. 3, pp. 1–13, 2007.
[13] S. M. Liu and J.-H. Chen, “A multi-label classification based approach
for sentiment classification,” Expert Systems with Application, pp. 1083–
1093, 2015.
[14] V. N. Vapnik, The Nature of Statistical Learning Theory (Information
Science and Statistics). Springer-Verlag New York, Inc., 1999.
[15] J. R. Quinlan, C4.5: programs for machine learning. San Francisco,
CA, USA: Morgan Kaufmann Publishers Inc., 1993.
[16] T. K. Kohonen, “The self-organizing map,” Proceedings of the IEEE,
vol. 78, no. 9, pp. 1464–1480, Sept 1990. [Online]. Available:
http://dx.doi.org/10.1109/5.58325
[17] G. Tsoumakas, I. Katakis, and I. P. Vlahavas, “Mining Multi-label
Data,” in Data Mining and Knowledge Discovery Handbook, 2nd ed.,
O. Maimon and L. Rokach, Eds. Springer, 2010, pp. 667–685.
[18] E. A. Cherman, J. Metz, and M. C. Monard, “Incorporating label depen-
dency into the binary relevance framework for multi-label classification,”
Expert Systems with Applications, vol. 39, no. 2, pp. 1647–1655, Feb.
2012.
[19] J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains for
multi-label classification,” in Proceedings of the European Conference
on Machine Learning and Knowledge Discovery in Databases: Part II,
ser. ECML PKDD ’09. Berlin, Heidelberg: Springer-Verlag, 2009, pp.
254–269.
[20] K. Dembczynski, W. Cheng, and E. Hüllermeier, “Bayes optimal multi-
label classification via probabilistic classifier chains,” in Proceedings of
the 27th International Conference on Machine Learning, J. Fürnkranz
and T. Joachims, Eds. Omnipress, 2010, pp. 279–286.
[21] S.-J. Huang and Z.-H. Zhou, “Multi-label learning by exploiting label
correlations locally,” in AAAI Conference on Artificial Intelligence, 2012,
pp. 949–955.
[22] Y. Yu, W. Pedrycz, and D. Miao, “Multi-label classification by exploiting
label correlations,” Expert Syst. Appl., vol. 41, no. 6, pp. 2989–3004,
2014.
[23] N. Spolaôr, M. C. Monard, G. Tsoumakas, and H. D. Lee, “A systematic
review of multi-label feature selection and a new method based on label
construction,” Neurocomputing, vol. 180, no. C, pp. 3–15, 2016.
[24] M.-L. Zhang and Z.-H. Zhou, “Multilabel Neural Networks with Ap-
plications to Functional Genomics and Text Categorization,” IEEE
Transactions on Knowledge and Data Engineering, vol. 18, pp. 1338–
1351, 2006.
[25] R. E. Schapire and Y. Singer, “Improved Boosting Algorithms Using
Confidence-rated Predictions,” in Machine Learning, vol. 37. Hingham,
MA, USA: Kluwer Academic Publishers, 1999, pp. 297–336.
[26] ——, “BoosTexter: a boosting-based system for text categorization,” in
Machine Learning, vol. 39. Hingham, MA, USA: Kluwer Academic
Publishers, 2000, pp. 135–168.
[27] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of
on-line learning and an application to boosting,” in European Conference
on Computational Learning Theory, 1995, pp. 23–37.
[28] F. A. Thabtah, P. Cowling, Y. Peng, R. Rastogi, K. Morik, M. Bramer,
and X. Wu, “Mmac: A new multi-class, multi-label associative classi-
fication approach,” in Fourth IEEE International Conference on Data
Mining, 2004, pp. 217–224.
[29] S. Zhu, X. Ji, W. Xu, and Y. Gong, “Multi-labelled classification using
maximum entropy method,” in International conference on research and
development in information retrieval. New York, NY, USA: ACM,
2005, pp. 274–281.
[30] G. Madjarov, D. Kocev, D. Gjorgjevikj, and S. Dzeroski, “An extensive
experimental comparison of methods for multi-label learning,” Pattern
Recognition, vol. 45, no. 9, pp. 3084–3104, 2012.
[31] R. Cerri, G. L. Pappa, A. C. P. Carvalho, and A. A. Freitas,
“An extensive evaluation of decision tree-based hierarchical multilabel classification methods and performance measures,” Computational Intelligence, 2013, accepted for publication. [Online]. Available:
http://dx.doi.org/10.1111/coin.12011
[32] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning
algorithms,” Machine Learning, vol. 6, no. 1, pp. 37–66, 1991.
[33] M.-L. Zhang and Z.-H. Zhou, “ML-KNN: A lazy learning approach
to multi-label learning,” Pattern Recognition, vol. 40, pp. 2038–2048,
July 2007. [Online]. Available: http://portal.acm.org/citation.cfm?id=
1234417.1234635
[34] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas,
“Mulan: A java library for multi-label learning,” Journal of Machine
Learning Research, vol. 12, pp. 2411–2414, 2011.
[35] R. Wehrens and L. Buydens, “Self- and super-organising maps in r:
the kohonen package,” J. Stat. Softw., vol. 21, no. 5, 2007. [Online].
Available: http://www.jstatsoft.org/v21/i05
[36] S. Godbole and S. Sarawagi, “Discriminative methods for multi-labeled
classification,” in 8th Pacific-Asia Conference on Knowledge Discovery
and Data Mining. Springer, 2004, pp. 22–30. [Online]. Available:
http://www.springerlink.com/content/maa4ag38jd3pwrc0
[37] K. Sechidis, G. Tsoumakas, and I. Vlahavas, On the Stratification
of Multi-label Data. Berlin, Heidelberg: Springer Berlin Heidelberg,
2011, pp. 145–158. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-23808-6_10
[38] A. Rivolli, utiml: Utilities for Multi-Label Learning, 2016, r package
version 0.1.0. [Online]. Available: http://CRAN.R-project.org/package=
utiml
[39] J. Demšar, “Statistical Comparisons of Classifiers over Multiple Data
Sets,” Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.