Figure 5 demonstrates the ξ correlation coefficients between the AUCs of the different algorithms and R_aug and ImR_aug. The results are similar to those obtained with the Pearson correlation coefficient: the ξ correlation between the AUC of the RF algorithm and ImR_aug is the highest, and the correlation for the NB algorithm, which is weak with R_aug, improves markedly with ImR_aug. In addition, the ξ correlation coefficients of the AUCs of all algorithms with ImR_aug are higher than those with R_aug. The comparison of ξ correlation coefficients therefore also demonstrates the superiority of ImR_aug.
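For reference, the ξ statistic referred to here is Chatterjee's rank correlation coefficient. The sketch below implements its no-ties form and compares it with Pearson's coefficient on placeholder values; the numbers stand in for AUC and overlap-measure pairs and are illustrative assumptions, not results from the study.

```python
import numpy as np

def xi_correlation(x, y):
    """Chatterjee's xi correlation coefficient (no-ties formula).

    Sort the pairs by x, rank the corresponding y values, and measure how
    smoothly the ranks vary: xi approaches 1 when y is a (possibly
    non-monotonic) function of x and 0 under independence.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    order = np.argsort(x, kind="stable")            # sort pairs by x
    y_sorted = y[order]
    # r_i = rank of y_i among all y values (1-based), in x-sorted order
    r = np.argsort(np.argsort(y_sorted, kind="stable"), kind="stable") + 1
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

# Placeholder pairs: a hypothetical overlap/imbalance measure vs. AUC values.
overlap_measure = np.array([0.12, 0.25, 0.31, 0.44, 0.58, 0.73])
auc = np.array([0.91, 0.88, 0.86, 0.82, 0.77, 0.70])

print("xi     :", xi_correlation(overlap_measure, auc))
print("pearson:", np.corrcoef(overlap_measure, auc)[0, 1])
```

Unlike Pearson's coefficient, ξ is not signed: it measures how close y is to being a (possibly non-monotonic) function of x, which is why the two statistics can rank algorithm-measure pairs differently.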
Source publication
Class imbalance, as a phenomenon of asymmetry, has an adverse effect on the performance of most machine learning algorithms, and class overlap is another important factor that affects their classification performance. This paper deals with both factors simultaneously, addressing class overlap under an imbalanced distribution. In this...
Citations
... Various works attempt to exploit domain knowledge to address the class imbalance problem, but not in the meteorological domain. Ref. [10] addresses the problem of noisy and borderline examples when using oversampling methods, while [11] deals simultaneously with the problems of class imbalance and class overlap. Ref. [12] uses domain-specific knowledge to address class imbalance in text sentiment classification. ...
We deal with the problem of class imbalance in data mining and machine learning classification algorithms. This is the case where some class labels are represented by a small number of examples in the training dataset compared to the rest. Usually, those minority class labels are the most important ones, implying that classifiers should primarily perform well on predicting them. This is a well-studied problem, and various strategies that use sampling methods are employed to balance the representation of the labels in the training dataset and improve classifier performance. We explore whether expert knowledge in the field of Meteorology can enhance the quality of the training dataset when treated by pre-processing sampling strategies. We propose four new sampling strategies based on our expertise in the data domain and compare their effectiveness against the established sampling strategies used in the literature. It turns out that our sampling strategies, which take advantage of expert knowledge from the data domain, achieve class balancing that improves the performance of most classifiers.
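As a rough companion to this discussion, the sketch below contrasts a random-forest classifier trained on an imbalanced dataset with one trained after naive random oversampling of the minority class, judged by AUC. It uses synthetic scikit-learn data and a generic oversampling step, not the four domain-informed strategies proposed by the authors; the dataset parameters are illustrative assumptions only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced dataset (placeholder for the meteorological data).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def auc_of(clf, X_train, y_train):
    """Fit a classifier and report its AUC on the held-out test split."""
    clf.fit(X_train, y_train)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Baseline: train on the imbalanced data as-is.
baseline = auc_of(RandomForestClassifier(random_state=0), X_tr, y_tr)

# Naive random oversampling: replicate minority examples until classes match.
minority = X_tr[y_tr == 1]
extra = resample(minority, replace=True,
                 n_samples=(y_tr == 0).sum() - (y_tr == 1).sum(),
                 random_state=0)
X_bal = np.vstack([X_tr, extra])
y_bal = np.concatenate([y_tr, np.ones(len(extra), dtype=int)])

balanced = auc_of(RandomForestClassifier(random_state=0), X_bal, y_bal)
print(f"AUC baseline: {baseline:.3f}  AUC oversampled: {balanced:.3f}")
```

Domain-informed strategies of the kind described above would replace the generic oversampling step with selection or generation rules driven by meteorological knowledge, while the evaluation scaffold stays the same.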