A first approach for cost-sensitive classification with linguistic Genetic Fuzzy Systems in imbalanced data-sets
ABSTRACT Classification in imbalanced domains has become one of the most relevant problems in Machine Learning at present. The problem has risen in significance because it appears in many real applications; it occurs when the distribution of the examples available for learning differs greatly between the classes (often in binary-class data-sets). Usually, the underrepresented class is the concept of most interest for the problem, and the cost of misclassifying its examples is much higher than that of the remaining examples. In this work we analyze the behaviour of a cost-sensitive learning method for Fuzzy Rule Based Classification Systems in the scenario of highly imbalanced data-sets. Specifically, we focus on one representative rule learning approach for Genetic Fuzzy Systems, the Fuzzy Hybrid Genetics-Based Machine Learning algorithm. The experimental results show that our cost-sensitive approach obtains very accurate solutions in this type of domain, with shorter training times and lower complexity than other approaches proposed for classification with imbalanced problems, such as preprocessing to rebalance the class distribution.
Full text available from: Francisco Herrera, May 30, 2015
Source available from: Jose G Moreno-Torres
ABSTRACT: Class imbalance is among the most persistent complications confronting the traditional supervised learning task in real-world applications. In the binary case, the problem occurs when the instances of one class significantly outnumber the instances of the other. This situation is a handicap when trying to identify the minority class, as learning algorithms are not usually adapted to such characteristics. The approaches for dealing with imbalanced datasets fall into two major categories: data sampling and algorithmic modification. Cost-sensitive learning solutions, which incorporate both the data level and algorithm level approaches, assume higher misclassification costs for samples in the minority class and seek to minimize high cost errors. Nevertheless, there is no exhaustive comparison between those models that can help us determine the most appropriate one under different scenarios. The main objective of this work is to analyze the performance of data level proposals against algorithm level proposals, focusing on cost-sensitive models, and against a hybrid procedure that combines the two approaches. We will show, by means of a statistical comparative analysis, that no single approach stands out from the rest. This leads to a discussion of the intrinsic data characteristics of the imbalanced classification problem, which can open new paths for improving current models, mainly focusing on class overlap and dataset shift in imbalanced classification.
Expert Systems with Applications 06/2012; 39(7):6585–6608. DOI:10.1016/j.eswa.2011.12.043 · 1.97 Impact Factor
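The cost-sensitive idea described in the abstracts above, where a minority-class error is charged more than a majority-class error, can be illustrated with a minimal sketch. This is not the method of either paper: the function names, the class labels, and the cost values (9.0 for a missed minority example, 1.0 for a false alarm) are hypothetical and chosen only for illustration. The sketch assumes a classifier that already outputs a probability for the minority class and then picks the label with the lower expected cost.

```python
# Minimal sketch of a minimum-expected-cost decision for a binary
# imbalanced problem. All names and cost values are illustrative
# assumptions, not taken from the papers summarised above.

def expected_costs(p_minority, cost_fn, cost_fp):
    """Return (expected cost of predicting 'majority',
               expected cost of predicting 'minority').

    cost_fn: cost of missing a minority example (false negative).
    cost_fp: cost of flagging a majority example (false positive).
    """
    cost_if_majority = p_minority * cost_fn
    cost_if_minority = (1.0 - p_minority) * cost_fp
    return cost_if_majority, cost_if_minority


def predict_min_cost(p_minority, cost_fn=1.0, cost_fp=1.0):
    """Choose the label whose expected misclassification cost is lower."""
    cost_maj, cost_min = expected_costs(p_minority, cost_fn, cost_fp)
    return "minority" if cost_min < cost_maj else "majority"


def total_cost(y_true, y_pred, cost_fn, cost_fp):
    """Sum the misclassification costs over a set of predictions."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == "minority" and pred == "majority":
            total += cost_fn
        elif truth == "majority" and pred == "minority":
            total += cost_fp
    return total


if __name__ == "__main__":
    # With equal costs, a 20% minority probability yields 'majority';
    # with a hypothetical 9:1 cost ratio, the same probability flips.
    print(predict_min_cost(0.2))
    print(predict_min_cost(0.2, cost_fn=9.0, cost_fp=1.0))
```

One common heuristic, assumed here rather than drawn from the text, is to set the minority-class cost proportional to the imbalance ratio of the data-set, so that the rarer the class, the more its errors weigh in the decision.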