Article · Publisher preview available

The use of data-derived label hierarchies in multi-label classification

Authors: Gjorgji Madjarov, Dejan Gjorgjevikj, Ivica Dimitrovski, Sašo Džeroski

Abstract and Figures

Unlike traditional (multi-class) learning approaches, which assume label independence, multi-label learning approaches must deal with the existing label dependencies and relations. Many approaches try to model these dependencies in the process of learning and integrate them in the final predictive model, without making a clear distinction between the learning process and the process of modeling the label dependencies. Also, the label relations incorporated in the learned model are not directly visible and cannot be (re)used in conjunction with other learning approaches. In this paper, we investigate the use of label hierarchies in multi-label classification, constructed in a data-driven manner. We first consider flat label sets and construct label hierarchies from the label sets that appear in the annotations of the training data by using a hierarchical clustering approach. The obtained hierarchies are then used in conjunction with hierarchical multi-label classification (HMC) approaches (two local model approaches for HMC, based on SVMs and PCTs, and two global model approaches, based on PCTs for HMC and ensembles thereof). The experimental results reveal that the use of the data-derived label hierarchy can significantly improve the performance of single predictive models in multi-label classification as compared to the use of a flat label set, while this is not preserved for the ensemble models.
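As a concrete illustration of the hierarchy-construction step described in the abstract, the following minimal Python sketch clusters labels by their co-occurrence in the training annotations using agglomerative clustering. It is only an illustration of the general idea under simple assumptions (a binary label matrix and a Jaccard-style label distance), not the authors' implementation; all function and variable names are made up.

# A minimal, illustrative sketch of data-driven label-hierarchy construction
# (not the authors' code): cluster labels by their co-occurrence patterns in
# the training annotations using agglomerative (hierarchical) clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def build_label_hierarchy(Y, method="complete"):
    """Y: binary label matrix of shape (n_examples, n_labels).
    Returns the root of a binary tree over the label indices."""
    # Co-occurrence matrix: how often each pair of labels appears together.
    co = Y.T @ Y                      # (n_labels, n_labels)
    counts = np.diag(co).astype(float)
    # Jaccard-style similarity between labels, turned into a distance.
    union = counts[:, None] + counts[None, :] - co
    sim = np.where(union > 0, co / np.maximum(union, 1), 0.0)
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    # Condensed distance vector for scipy's agglomerative clustering.
    iu = np.triu_indices_from(dist, k=1)
    Z = linkage(dist[iu], method=method)
    return to_tree(Z)                 # binary tree; leaves are label indices

# Toy usage: 5 examples, 4 labels.
Y = np.array([[1, 1, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 0],
              [1, 0, 0, 0]])
root = build_label_hierarchy(Y)
print(root.pre_order(lambda leaf: leaf.id))  # leaf (label) order in the hierarchy

The resulting tree over label indices can then be handed to an HMC method in place of the original flat label set.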
J Intell Inf Syst (2016) 47:57–90
DOI 10.1007/s10844-016-0405-8
The use of data-derived label hierarchies in multi-label classification
Gjorgji Madjarov¹ · Dejan Gjorgjevikj¹ · Ivica Dimitrovski¹ · Sašo Džeroski²
Received: 31 July 2015 / Revised: 22 March 2016 / Accepted: 29 March 2016 /
Published online: 18 April 2016
© Springer Science+Business Media New York 2016
Gjorgji Madjarov
gjorgji.madjarov@finki.ukim.mk
Dejan Gjorgjevikj
dejan.gjorgjevikj@finki.ukim.mk
Ivica Dimitrovski
ivica.dimitrovski@finki.ukim.mk
Sašo Džeroski
saso.dzeroski@ijs.si
¹ Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovikj 16, 1000 Skopje, Macedonia
² Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
... Lastly, the probabilities of the metalabels are mapped to the original labels and used to obtain the final predictions. Similar approaches have been proposed in the literature (Madjarov et al, 2015; Madjarov et al, 2016; Nikoloski et al, 2017; Papanikolaou et al, 2018; Madjarov et al, 2019), which transform a flat label space into a hierarchical one and use MLC or HMC methods to deal with MLC tasks. These approaches are presented in the following sections. ...
... Extended research was done by Madjarov et al (2016), in which two main modifications were introduced. First, in the creation of the HMC model, they evaluated different approaches using four different types of single predictive models that correspond to binary classification, hierarchical single-label classification, MLC and HMC. ...
... In this section, related works that use the strategy of creating a label hierarchy to transform an MLC task into an HMC task were presented. The approaches presented by Tsoumakas et al (2008), Papanikolaou et al (2018) and Madjarov et al (2016), which use only the label space to define the label hierarchy and were evaluated over datasets frequently used in the literature, will be compared to the method proposed in this paper. ...
Preprint
Full-text available
Multi-label classification (MLC) has been an extensively explored field in recent years. The most common approaches that deal with MLC problems are classified into two groups: (i) problem transformation, which aims to adapt the multi-label data, making the use of traditional binary or multiclass classification algorithms feasible, and (ii) algorithm adaptation, which focuses on modifying algorithms used in binary or multiclass classification, enabling them to make multi-label predictions. Several approaches have been proposed that aim to explore the relationships among the labels, some of them by transforming a flat multi-label label space into a hierarchical one, creating a tree-structured label taxonomy and inducing a hierarchical multi-label classifier to solve the classification problem. This paper presents a novel method in which a label hierarchy structured as a directed acyclic graph (DAG) is created from the multi-label label space, taking into account the label co-occurrences using the notion of closed frequent labelsets. With this, it is possible to solve an MLC task as if it were a hierarchical multi-label classification (HMC) task. Global and local HMC approaches were tested with the obtained label hierarchies and compared with approaches using tree-structured label hierarchies, showing very competitive results. The main advantage of the proposed approach is the better exploration and representation of the relationships between labels through the use of DAG-structured taxonomies, improving the results. Experimental results over 32 multi-label datasets from different domains showed that the proposed approach is better than related approaches in most of the multi-label evaluation measures. Moreover, we found that both tree-structured and especially DAG-structured label hierarchies combined with a local hierarchical classifier are more suitable for dealing with imbalanced multi-label datasets.
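The closed-frequent-labelset notion used in this preprint can be illustrated with a small brute-force sketch (an illustration of the idea only, not the paper's algorithm or its DAG-construction step): a labelset is frequent if it appears in at least min_support training examples, and closed if no strict superset has the same support. The data and the helper name below are hypothetical.

# Illustrative brute-force extraction of closed frequent labelsets from a
# multi-label dataset (a sketch of the idea, not the paper's algorithm).
from itertools import combinations

def closed_frequent_labelsets(annotations, min_support=2):
    """annotations: list of sets of labels, one set per example."""
    all_labels = sorted(set().union(*annotations))
    # Count the support of every candidate labelset (fine for small label spaces).
    support = {}
    for size in range(1, len(all_labels) + 1):
        for cand in combinations(all_labels, size):
            s = sum(1 for ann in annotations if set(cand) <= ann)
            if s >= min_support:
                support[frozenset(cand)] = s
        if not any(len(k) == size for k in support):
            break  # no frequent set of this size -> none larger (anti-monotonicity)
    # Keep only closed sets: no frequent strict superset with equal support.
    closed = [ls for ls, s in support.items()
              if not any(ls < other and support[other] == s for other in support)]
    return {ls: support[ls] for ls in closed}

annotations = [{"beach", "sea"}, {"beach", "sea", "sunset"},
               {"beach", "sea"}, {"mountain", "sunset"}]
for ls, s in closed_frequent_labelsets(annotations).items():
    print(sorted(ls), s)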
... The results from the evaluation reveal that better predictive performance can be achieved by using data-driven approaches to construct the hierarchies rather than considering either the flat multi-target regression task or the pre-defined hierarchy created by a domain expert. Moreover, for large datasets, the results are in line with the results for MLC [16], [17]: divisive hierarchy creation algorithms (balanced k-means and PCTs for clustering) are the best methods for clustering large output spaces. All in all, constructing a hierarchy of the target variables improves the predictive performance of the predictive models. ...
... To the best of our knowledge, structuring of the target space for MTR has not been explored yet. Hence, we overview the methods for structuring the output space for the related multi-label classification (MLC) task, where learning hierarchies in the output space has been studied to a wider extent [16], [17], [52]–[55]. Joly et al. (2014) [52] propose a method for dimensionality reduction of the output space by random projections of it, mainly focused on the MLC task. ...
... Similarly, Joly et al. (2017) [56] propose a gradient boosting method for MTR which automatically adapts to the target correlations by random projections of the output space. Madjarov et al. (2016) [16] present a comprehensive study of different data-derived methods for structuring the label space in the form of hierarchies for MLC. Namely, they use the label co-occurrence matrix to obtain a hierarchy of labels by using several clustering algorithms, such as agglomerative clustering with single and complete linkage, balanced k-means and PCTs. ...
Article
Full-text available
The task of multi-target regression (MTR) is concerned with learning predictive models capable of predicting multiple target variables simultaneously. MTR has attracted increasing attention within the research community in recent years, yielding a variety of methods. The methods can be divided into two main groups: problem transformation and problem adaptation. The former transforms an MTR problem into simpler (typically single-target) problems and applies known approaches, while the latter adapts the learning methods to directly handle the multiple target variables and learn better models which simultaneously predict all of the targets. Studies have identified the latter group of methods as having a competitive advantage over the former, probably due to the fact that it exploits the interrelations of the multiple targets. In the related task of multi-label classification, it has recently been shown that organizing the multiple labels into a hierarchical structure can improve predictive performance. In this paper, we investigate whether organizing the targets into a hierarchical structure can improve the performance for MTR problems. More precisely, we propose to structure the multiple target variables into a hierarchy of variables, thus translating the task of MTR into a task of hierarchical multi-target regression (HMTR). We use four data-driven methods for devising the hierarchical structure that cluster the real values of the targets or the feature importance scores with respect to the targets. The evaluation of the proposed methodology on 16 benchmark MTR datasets reveals that structuring the multiple target variables into a hierarchy improves the predictive performance of the corresponding MTR models. The results also show that data-driven methods produce hierarchies that can improve the predictive performance even more than expert-constructed hierarchies. Finally, the improvement in predictive performance is more pronounced for the datasets with very large numbers (more than a hundred) of targets.
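As a hedged sketch of what structuring the target space can look like, the snippet below clusters MTR targets by the similarity of their observed value vectors (one of the representations mentioned above). The paper also uses feature-importance scores, balanced k-means and PCTs, none of which are reproduced here, and the data is synthetic.

# A hedged sketch of structuring MTR targets into a hierarchy: cluster the
# target variables by the similarity of their observed values (illustrative
# only; not the paper's balanced k-means or PCT-based procedures).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 6))                     # 200 examples, 6 numeric targets
Y[:, 1] = Y[:, 0] + 0.1 * rng.normal(size=200)    # make two targets correlated

# Each target is represented by its column of values; correlation distance
# groups targets that behave similarly across examples.
target_vectors = Y.T                               # shape (n_targets, n_examples)
Z = linkage(pdist(target_vectors, metric="correlation"), method="average")
print(dendrogram(Z, no_plot=True)["ivl"])          # leaf order = grouped target indices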
... Although there is no direct association between confusion-matrix-derived and probabilistic error/loss instruments, the former can be expressed (e.g., E(Acc), E(TPR), or E(F1)) in terms of expected values formulated from the expected values of the confusion-matrix elements (E(TP), E(FP), E(FN), and E(TN)) [33]. Note that although multi-class/multi-label performance evaluation [34][35][36] is out of scope, this study also provides a baseline for the performance evaluation of classifications with a higher number of classes, because binary-classification evaluation metrics can be used for multi-class or multi-label classification by micro- or macro-averaging binary metrics over time [37,38] or by making specific adaptations such as the one-versus-all approach [35,36,39]. , and c4 = 0 ("negative") with p4 = 0.2 (true negative). ...
Preprint
Full-text available
Performance evaluation is key to building, training, validating, testing, comparing, and publishing classifier models for machine-learning-based classification problems. Two categories of performance instruments are confusion-matrix-derived metrics, such as accuracy, true positive rate, and F1, and graphical metrics, such as the area under the receiver operating characteristic curve. Probabilistic performance instruments that were originally used for regression and time-series forecasting are also applied to some binary-class or multi-class classifiers, such as artificial neural networks. Besides widely known probabilistic instruments such as Mean Squared Error (MSE), Root Mean Square Error (RMSE), and LogLoss, there are many other instruments. However, it has not been established which of them is appropriate to use specifically for binary-classification performance evaluation. This study proposes BenchMetrics Prob, a qualitative and quantitative benchmarking method, to systematically evaluate probabilistic instruments via five criteria and fourteen simulation cases based on hypothetical classifiers on synthetic datasets. These criteria and cases give more insight into selecting a proper instrument for binary-classification performance evaluation. The method was tested on over 31 instruments/instrument variants, and the results identified three instruments as the most robust for binary-classification performance evaluation, namely Sum Squared Error (SSE), MSE with its RMSE variant, and Mean Absolute Error (MAE). The results also showed that instrument variants with summarization functions other than the mean (e.g., median and geometric mean) and the instrument subtypes proposed later to improve performance evaluation in regression, such as relative/percentage/symmetric-percentage error instruments, are not robust. Researchers should be aware of this when selecting instruments or reporting performance in binary classification problems.
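For reference, the probabilistic instruments highlighted in this study (SSE, MSE, RMSE, MAE) have standard definitions; the snippet below computes them from predicted class-1 probabilities against 0/1 labels as a minimal illustration, not as the BenchMetrics Prob benchmarking procedure itself.

# Standard definitions of the probabilistic instruments named above, computed
# from predicted class-1 probabilities against 0/1 ground truth.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.6, 0.4, 0.1])

err = y_true - p_pred
sse = np.sum(err ** 2)            # Sum Squared Error
mse = np.mean(err ** 2)           # Mean Squared Error
rmse = np.sqrt(mse)               # Root Mean Square Error
mae = np.mean(np.abs(err))        # Mean Absolute Error
print(f"SSE={sse:.3f} MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f}")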
... We consider two spaces, i.e., representations, to cluster the targets: the original target space (TS), i.e., the values of a given target for each example, and the feature ranking space (FR), i.e., the importance scores of each feature with respect to a given target, for transforming the original MTR task into an HMTR task. Madjarov et al. (2016) [16] present a comprehensive study of different data-derived methods for structuring the label space in the form of hierarchies for MLC. Namely, they use the label co-occurrence matrix to obtain a hierarchy of labels by using several clustering algorithms, such as agglomerative clustering with single and complete linkage, balanced k-means and PCTs. ...
Thesis
Full-text available
The proposed dissertation belongs primarily to the field of machine learning on the one hand, but also to the field of soil science on the other. In terms of machine learning, it is concerned with the improvement of existing machine learning algorithms for predicting structured outputs, more specifically for multi-target prediction. In terms of soil science, it addresses two case studies of applying machine learning methods for multi-target prediction to two practical problems of modeling two different soil functions from data in the context of Irish agriculture. The majority of approaches for multi-target prediction (MTP) do not explicitly take into account the dependencies among the multiple targets. In order to address this drawback, in the proposed dissertation, we propose approaches that find dependencies in the target space by explicitly structuring the different targets in a hierarchical manner. Using different representations of the targets (based on the feature importance scores of the input attributes for predicting each target), we use hierarchical clustering of the targets. Having discovered a hierarchy on the target space, we obtain a reformulation of the original task of multi-target prediction into a task of hierarchical multi-target prediction. We then employ approaches for hierarchical multi-target prediction on the transformed task, expecting improved predictive performance. We address two tasks of MTP, namely multi-label classification (MLC) and multi-target regression (MTR). In both cases, we use feature importance estimation based on tree ensembles, for classification and regression, respectively, based on the GENIE3 approach. We use different clustering approaches for structuring the target space, including balanced k-means, agglomerative clustering, and predictive clustering trees (PCTs); of these, balanced k-means gives the best results. On the hierarchical versions of the problems, we use PCT ensembles for hierarchical MLC (HMLC) and hierarchical MTR (HMTR), respectively. We conduct an extensive experimental evaluation on various benchmark datasets for MTP (MLC and MTR) tasks, showing the advantage of using our proposed method for structuring the output space. Using ensembles of PCTs for HMLC and HMTR on the structured output spaces performs clearly better than using PCT ensembles for MLC and MTR on the original spaces. The differences in performance are largest for large output spaces (with more than 100 targets). We also address two case studies of applying machine learning methods for multi-target prediction to two practical problems of modeling two different soil functions from data in the context of Irish agriculture. The data were provided by TEAGASC, Environment Soils and Land-use Department, from Ireland. TEAGASC was also the source of expertise about the tasks. First, we apply PCTs for MTR, as well as ensembles (random forests) thereof, to the task of estimating the total herbage production and nutrient uptake, i.e., the task of modeling the soil function of primary productivity on Irish dairy farms. We then apply PCTs (and ensembles) for semi-supervised MTR to model a combination of two other soil functions, i.e., water regulation and purification, and provision and cycling of nutrients. More specifically, we learn models for assessing the chemical quality (nitrogen and phosphorus loss from soils through runoff and leaching) and the biological quality of water in Irish agricultural lands.
In the latter case, we used incompletely (partially) labeled data, which has missing values for the target variables we want to predict. This is an innovative use of semi-supervised PCTs for MTR, as only fully labeled (all target values present) or fully unlabeled (no target values) examples had been used so far, whereas the real-world data from this study contains partially labeled examples (with some but not all target values). In both case studies, the models are learnt in the form of PCTs and PCT ensembles. They are both accurate (especially the ensembles) and understandable (the individual PCTs). They reveal knowledge about the studied domains that is consistent with the existing knowledge of domain experts and also provides new insights, important for practical use in the context of achieving better soil function outcomes for given fields/agricultural lands.
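A hedged sketch of the feature-ranking representation used for clustering the targets: each target is described by the importance of every input feature for predicting it. The snippet uses scikit-learn random-forest importances as a stand-in for the GENIE3-style ensemble scoring mentioned above, and the helper name and data are hypothetical.

# Represent each target by a vector of feature-importance scores (a stand-in
# for GENIE3-style scoring); similar rows can then be clustered into a hierarchy.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def feature_importance_profiles(X, Y, n_trees=100, seed=0):
    """Return a (n_targets, n_features) matrix of importance scores."""
    profiles = []
    for j in range(Y.shape[1]):
        rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        rf.fit(X, Y[:, j])
        profiles.append(rf.feature_importances_)
    return np.array(profiles)

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
Y = np.column_stack([X[:, 0] + 0.1 * rng.normal(size=150),
                     X[:, 0] - X[:, 3],
                     X[:, 5] ** 2])
profiles = feature_importance_profiles(X, Y)
print(profiles.round(2))   # rows = targets; similar rows -> cluster together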
... In [18] it is shown that the use of a hierarchy helps obtain better single-tree models. Moreover, [20] shows that MLC can be approached as HMLC by constructing hierarchies of the labels by clustering the label co-occurrences. However, it is still possible to approach HMLC problems by ignoring the hierarchy in the learning phase and using any of the MLC methods, such as binary relevance or the power-set approach [28]. ...
... Thus, machine learning approaches have been used to solve this problem by considering gene function prediction as a classification task. As a single gene may have many functions that are structured according to a predefined hierarchy [2], the gene function prediction problem naturally belongs to the class of hierarchical multi-label classification (HMC) problems [3]. In an HMC problem, classes are organized in a predefined hierarchical structure [4], and an instance can be assigned a set of classes [5]. ...
Article
Full-text available
Gene function prediction is used to assign biological or biochemical functions to genes, which continues to be a challenging problem in modern biology. Genes may exhibit many functions simultaneously, and these functions are organized into a hierarchy, such as the directed acyclic graph (DAG) of the Gene Ontology (GO). Because of these characteristics, gene function prediction can be seen as a typical hierarchical multi-label classification (HMC) task. A novel HMC method based on neural networks is proposed in this article for predicting gene function based on GO. The proposed method belongs to the local approaches, transferring the HMC task to a set of subtasks. Three strategies are implemented in this method to improve its performance. First, to tackle the imbalanced data set problem when building the training data set for each class, a negative-instance selection policy and the SMOTE approach are used to preprocess each imbalanced training data set. Second, a particular multi-layer perceptron (MLP) is designed for each node in GO. Third, a post-processing method based on a Bayesian network is used to guarantee that the results are consistent with the hierarchy constraint. The experimental results indicate that the proposed HMC-MLPN method is a promising method for gene function prediction, based on a comparison with two other state-of-the-art methods.
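The article enforces the hierarchy constraint with a Bayesian-network post-processing step; the sketch below is a deliberately simpler stand-in for the same constraint (a class is never predicted more confidently than its ancestors), with hypothetical GO identifiers and probabilities.

# Simple hierarchy-consistency correction (a stand-in illustration, not the
# article's Bayesian-network method): cap each node's probability by its parent's.
def enforce_hierarchy(probs, parent):
    """probs: dict node -> predicted probability.
    parent: dict node -> parent node (None for roots)."""
    def capped(node):
        p = probs[node]
        if parent.get(node) is not None:
            p = min(p, capped(parent[node]))
        return p
    return {node: capped(node) for node in probs}

parent = {"GO:root": None, "GO:A": "GO:root", "GO:A1": "GO:A"}   # hypothetical IDs
probs = {"GO:root": 0.9, "GO:A": 0.4, "GO:A1": 0.7}
print(enforce_hierarchy(probs, parent))  # GO:A1 is lowered to 0.4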
Article
In multi-target prediction, an instance has to be classified along multiple target variables at the same time, where each target represents a category or numerical value. There are several strategies to tackle multi-target prediction problems: the local strategy learns a separate model for each target variable independently, while the global strategy learns a single model for all target variables together. Previous studies suggested that the global strategy should be preferred because (1) learning is more efficient, (2) the learned models are more compact, and (3) it overfits much less than the local strategy, as it is harder to overfit on several targets at the same time than on one target. However, it is not clear whether the global strategy exploits correlations between the targets optimally. In this paper, we investigate whether better results can be obtained by learning multiple multi-target models on several partitions of the targets. To answer this question, we first determined alternative partitions using an exhaustive search strategy and a strategy based on a genetic algorithm, and then compared the results of the global and local strategies against these. We used decision trees and random forests as base models. The results show that it is possible to outperform the global and local approaches, but finding a good partition without incurring overfitting remains a challenging task.
Chapter
In the recent literature on multi-label classification, a lot of attention is given to methods that exploit label dependencies. Most of these methods assume that the dependencies are static over the entire instance space. In contrast, here we present an approach that dynamically adapts the label partitions in a multi-label decision tree learning context. In particular, we adapt the recently introduced predictive bi-clustering tree (PBCT) method towards multi-label classification tasks. This way, tree nodes can split the instance-label matrix both horizontally and vertically. We focus on hierarchical multi-label classification (HMC) tasks and map the label hierarchy to a feature set over the label space. This feature set is exploited to infer vertical splits, which are regulated by a lookahead strategy in the tree-building procedure. We evaluate our proposed method using benchmark datasets. Experiments demonstrate that our proposal (PBCT-HMC) obtained better or competitive results in comparison to its direct competitors, both in terms of predictive performance and model size. Compared to an HMC method that does not produce label partitions, though, our method results in larger models on average, while still producing equally large or smaller models in one third of the datasets by creating suitable label partitions.
Article
The increase in the number of web pages calls for improvements to search engines. One such improvement is specifying the desired web genre of the resulting web pages. The prediction of web genres triggers expectations about the type of information contained in a given web page. More specifically, web genres can be seen as textual categories such as scientific papers, home pages or e-shops. Arguably, in the context of web search, specifying a genre besides topical keywords enables a user to easily find a scientific paper (genre) about text mining (topic). Typically, web genre prediction is treated as a predictive modelling task of multi-class classification, with some recent studies advocating the introduction of structure in the output space: either by considering multiple web genres per web page or by exploiting a hierarchy of web genres. We investigate the structuring of the output space by constructing hierarchies using data-driven methods, experts, or even randomly. We also use 10 different representations of the web pages. We use predictive clustering trees and ensembles thereof to properly assess the influence of the different information sources. The experimental evaluation is performed on two benchmark corpora: 20-genre and SANTINIS-ML. The results reveal that exploiting a hierarchy of web genres yields the best predictive performance across both datasets, all predictive models, all feature sets and all hierarchies. Next, data-driven hierarchy construction is at least as good as an expert-constructed hierarchy, with the added value that the hierarchy construction is automatic and fast. Furthermore, ensembles offer state-of-the-art predictive performance and perform better than single-tree models.
Article
Full-text available
Multi-label classification methods are increasingly required by modern applications, such as protein function classification, music categorization, and semantic scene classification. This article introduces the task of multi-label classification, organizes the sparse related literature into a structured presentation, and presents comparative experimental results for certain multi-label classification methods. It also contributes the definition of concepts for the quantification of the multi-label nature of a data set.
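Among the concepts this article defines for quantifying the multi-label nature of a data set are label cardinality and label density; the short sketch below computes such statistics on a toy label matrix, as an illustration only.

# Quantifying how "multi-label" a dataset is: average labels per example
# (label cardinality), its normalization by the number of labels (label
# density), and the number of distinct labelsets.
import numpy as np

Y = np.array([[1, 0, 1, 0],     # binary label matrix: examples x labels
              [0, 1, 0, 0],
              [1, 1, 1, 0]])

label_cardinality = Y.sum(axis=1).mean()        # avg. labels per example
label_density = label_cardinality / Y.shape[1]  # normalized by |labels|
distinct_labelsets = len({tuple(row) for row in Y})
print(label_cardinality, label_density, distinct_labelsets)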
Article
Full-text available
Multi-label learning has become a relevant learning paradigm in recent years due to the increasing number of fields where it can be applied and also to the growing number of techniques that are being developed. This paper presents an up-to-date tutorial on multi-label learning that introduces the paradigm and describes the main contributions developed. Evaluation measures, fields of application, trending topics and resources are also presented.
Book
The methodology used to construct tree-structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method and, in a more mathematical framework, proving some of their fundamental properties.
Conference Paper
Motivated by an increasing number of new applications, the research community is devoting an increasing amount of attention to the task of multi-label classification (MLC). Many different approaches to solving multi-label classification problems have been developed recently. Recent empirical studies have comprehensively evaluated many of these approaches on many datasets using different evaluation measures. The studies have indicated that the predictive performance and efficiency of the approaches could be improved by using data-derived (artificial) hierarchies in the learning and prediction phases. In this paper, we compare different clustering algorithms for constructing label hierarchies (in a data-driven manner) in multi-label classification. We consider flat label sets and construct the label hierarchies from the label sets that appear in the annotations of the training data by using four different clustering algorithms (balanced k-means, agglomerative clustering with single and complete linkage, and predictive clustering trees). The hierarchies are then used in conjunction with global hierarchical multi-label classification (HMC) approaches. The results of the statistical and experimental evaluation reveal that data-derived label hierarchies used in conjunction with global HMC methods greatly improve the performance of MLC methods. Additionally, multi-branch hierarchies appear to be much more suitable for the global HMC approaches than binary hierarchies.
Article
LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multi-class classification, probability estimates, and parameter selection are discussed in detail.
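Since scikit-learn's SVC is built on LIBSVM, a minimal usage example can go through that wrapper (shown as one accessible route, not the only way to use the library); the dataset and parameters below are arbitrary.

# Minimal usage example via scikit-learn's SVC, which is built on LIBSVM.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# RBF-kernel SVM with probability estimates, as discussed in the article.
clf = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True)
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
print("P(class 1) for first test example:", clf.predict_proba(X_te)[:1, 1])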
Article
We address the task of hierarchical multi-label classification (HMC). HMC is a task of structured output prediction where the classes are organized into a hierarchy and an instance may belong to multiple classes. In many problems, such as gene function prediction or prediction of ecological community structure, classes inherently follow these constraints. The potential for application of HMC was recognized by many researchers, and several such methods were proposed and demonstrated to achieve good predictive performance in the past. However, there is no clear understanding of when it is favorable to consider such relationships (hierarchical and multi-label) among classes, and when this presents an unnecessary burden for classification methods. To this end, we perform a detailed comparative study over 8 datasets that have HMC properties. We investigate two important influences in HMC: the multiple labels per example and the information about the hierarchy. More specifically, we consider four machine learning tasks: multi-label classification, hierarchical multi-label classification, single-label classification and hierarchical single-label classification. To construct the predictive models, we use predictive clustering trees (a generalized form of decision trees), which are able to tackle each of the modelling tasks listed. Moreover, we investigate whether the influence of the hierarchy and the multiple labels carries over to ensemble models. For each of the tasks, we construct a single tree and two ensembles (random forest and bagging). The results reveal that the hierarchy and the multiple labels do help to obtain a better single tree model, while this is not preserved for the ensemble models.