Table 3 - uploaded by Alberto Fernández
List of SMOTE-based approaches for other learning paradigms

Source publication
Article
Full-text available
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to the simplicity of its design as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven success...

Contexts in source publication

Context 1
... of SMOTE have been applied to other learning paradigms: (1) streaming data (see Section 4.1); (2) semi-supervised and active learning (Section 4.2); (3) multi-instance and multi-label classification (Section 4.3); (4) regression (Section 4.4); and (5) other, more complex prediction problems, such as text classification, low-quality data classification, and so on (see Section 4.5). Table 3 presents a summary of the SMOTE extensions in chronological order, indicating their references, algorithm names, and the learning paradigms they tackle. In the following, we give a brief description of each learning paradigm and the techniques developed for it. ...

Similar publications

Article
Full-text available
Due to its plug-and-play functionality and wide device support, the universal serial bus (USB) protocol has become one of the most widely used protocols. However, this widespread adoption has introduced a significant security concern: the implicit trust provided to USB devices, which has created a vast array of attack vectors. Malicious USB devices...
Article
Full-text available
Magnetic field errors and misalignments cause optics perturbations, which can lead to machine safety issues and performance degradation. The correlation between magnetic errors and deviations of the measured optics functions from design can be used in order to build supervised learning models able to predict magnetic errors directly from a selectio...
Article
Full-text available
The gustatory, olfactory, and trigeminal systems are anatomically separated. However, they interact cognitively to give rise to oral perception, which can significantly affect health and quality of life. We built a Supervised Learning (SL) regression model that, exploiting participants’ features, was capable of automatically analyzing with high pre...
Conference Paper
Full-text available
Sarcasm is a common feature of user interaction on social networking sites. Sarcasm differs from typical communication in the alignment of literal meaning with intended meaning. Humans can recognize sarcasm given sufficient context information, including the various content available on SNS. Existing literature mainly uses text data to detect sarca...
Article
Full-text available
Traditional supervised learning classifiers need many labeled samples to achieve good performance; however, many biological datasets contain only a small number of labeled samples, with the remainder unlabeled. Labeling these unlabeled samples manually is difficult or expensive. Technologies such as active learning and semi-supervise...

Citations

... Notably, there was a significant discrepancy in sample sizes across behaviors, with cooking behavior having a substantially larger amount of data (ranging from 65% to 75%) than bathing and laundry behaviors. To address the potential bias towards the larger dataset and ensure balanced learning, the Synthetic Minority Over-sampling Technique (SMOTE) method was employed to oversample the bathing and laundry behaviors in the training set (Fernández et al., 2018). Detailed information is provided in Supplementary Material S1. ...
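The excerpt describes oversampling only the minority behaviours (bathing and laundry) in the training split. A minimal sketch of that idea is shown below, assuming imbalanced-learn as the implementation; the labels, counts, and data are toy placeholders, not the study's dataset.

```python
# Toy sketch (not the study's data): oversample only the minority behaviour
# classes in the training split, leaving the test split untouched.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Three-class toy problem standing in for cooking / bathing / laundry labels.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print("before:", Counter(y_train))
majority = max(Counter(y_train).values())
# A per-class dictionary restricts oversampling to the named minority classes.
smote = SMOTE(sampling_strategy={1: majority, 2: majority}, random_state=0)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
print("after: ", Counter(y_train_bal))
```

Passing a per-class dictionary to sampling_strategy is what keeps the majority class untouched while the two minority classes are raised to the chosen counts.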
Article
Full-text available
Classifying household water-consumption behaviors is crucial for providing targeted suggestions for water-saving behaviors and enabling effective resource management and conservation. Although it is common knowledge that energy consumption is closely coupled with household water consumption, the effectiveness of energy consumption information in classifying household water behaviors remains unexplored. This study proposes a hybrid model of long short-term memory (LSTM) and random forest (RF) using water and electricity consumption as inputs to classify household water-consumption behaviors. Data from three households in Beijing collected from January to March 2020 were used for the case studies. The hybrid model achieved a macro F1 score of 0.89 at a 5-min resolution, outperforming the standalone LSTM and RF models. Additionally, the inclusion of time-series electricity consumption improves the accuracy (F1 scores) of classifying bathing and laundry behaviors by 0.12 and 0.20, respectively. These findings underscore the scientific value of integrating electricity consumption as a proxy variable in water-consumption behavior classification models, demonstrating its potential to enhance accuracy while simplifying data acquisition processes. This study establishes a framework for demand-side water management aimed at empowering residents to understand their own water-energy consumption behavior patterns and engage in personalized water conservation efforts.
... Conversely, over-sampling aims to augment the size of the minority class in order to prevent the loss of knowledge. The synthetic minority over-sampling technique (SMOTE) constructs synthetic examples of the minority class, thus minimizing the possibility of overfitting (Torgo et al. 2013; Fernández et al. 2018). In this study, our training set contains 16,364 positive and 22,338 negative instances, and thus suffers from class imbalance. ...
Article
Full-text available
MicroRNAs (miRNA) are categorized as short endogenous non-coding RNAs, which have a significant role in post-transcriptional gene regulation. Identifying new animal precursor miRNA (pre-miRNA) and miRNA is crucial to understanding the role of miRNAs in various biological processes, including the development of diseases. The present study focuses on the development of a Light Gradient Boost (LGB) based method for the classification of animal pre-miRNAs using various sequence and secondary structural features. In various pre-miRNA families, distinct k-mer repeat signatures with a length of three nucleotides have been identified. Out of nine different classifiers trained and tested in the present study, LGB has an overall better performance with an AUROC of 0.959. In comparison with existing methods, our method 'pmiRScan' has an overall better performance with an accuracy of 0.93, sensitivity of 0.86, specificity of 0.95 and F-score of 0.82. Moreover, pmiRScan effectively classifies pre-miRNAs from four distinct taxonomic groups: mammals, nematodes, molluscs and arthropods. We have used our classifier to predict genome-wide pre-miRNAs in human. We find a total of 313 pre-miRNA candidates using pmiRScan. A total of 180 potential mature miRNAs belonging to 60 distinct miRNA families are extracted from the predicted pre-miRNAs, of which 128 are novel and not reported in miRBase. These discoveries may enhance our current understanding of miRNAs and their targets in human. pmiRScan is freely available at http://www.csb.iitkgp.ac.in/applications/pmiRScan/index.php.
... SMOTE generates synthetic samples by interpolating between existing minority class examples, effectively mitigating overfitting without discarding important data [70]. Although more sophisticated algorithms like Borderline-SMOTE [71], ADASYN [72], and SMOTEENN [73] offer refined resampling processes, they introduce additional complexity and computational overhead [74]. Additionally, its proven efficacy in SFP scenarios further supports our choice. ...
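All of the variants named in this excerpt ship with imbalanced-learn, so swapping them in is mostly a one-line change. The hedged comparison below uses toy data rather than the software-fault datasets from the paper.

```python
# Illustrative only: plain SMOTE versus the variants named in the excerpt,
# applied to a toy imbalanced dataset (not the software-fault data).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN
from imblearn.combine import SMOTEENN

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)

samplers = {
    "SMOTE": SMOTE(random_state=0),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=0),
    "ADASYN": ADASYN(random_state=0),
    "SMOTE+ENN": SMOTEENN(random_state=0),  # oversampling followed by ENN cleaning
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(f"{name:18s} -> {Counter(y_res)}")
```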
Article
Full-text available
In software development, Software Fault Prediction (SFP) is essential for optimising resource allocation and improving testing efficiency. Traditional SFP methods typically use binary-class models, which can provide a limited perspective on the varying risk levels associated with individual software modules. This study explores the impacts of Error-type Metrics on the fault-proneness of software modules in domain-specific software projects. Also, it aims to enhance SFP methods by introducing a risk-based approach using Error-type Metrics. This method categorises software modules into High, Medium, and Low-Risk categories, offering a more granular and informative fault prediction framework. This approach aims to refine the fault prediction process and contribute to more effective resource allocation and project management in software development. We explore the domain-specific impact of Error-type Metrics through Principal Component Analysis (PCA), aiming to fill a gap in the existing literature by offering insights into how these metrics affect machine learning models across different software domains. We employ three machine learning models - Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGB) - to test our approach. The Synthetic Minority Over-sampling Technique (SMOTE) is used to address class imbalance. Our methodology is validated on fault data from four open-source software projects, aiming to confirm the robustness and generalisability of our approach. The PCA findings provide evidence of the varied impacts of Error-type Metrics in different software environments. Comparative analysis indicates a strong performance by the XGB model, achieving an accuracy of 97.4%, a Matthews Correlation Coefficient of 96.1%, and an F1-score of 97.4% across the datasets. These results suggest the potential of the proposed method to contribute to software testing and quality assurance practices. Our risk-based SFP approach introduces a new perspective to risk assessment in software development. The study’s findings contribute insights into the domain-specific applicability of Error-type Metrics, expanding their potential utility in SFP. Future research directions include refining our fault-counting methodology and exploring broader applications of Error-type Metrics and our proposed risk-based approach.
... SMOTE is an oversampling technique proposed by Chawla et al. in 2002; it creates synthetic data samples to increase the number of minority-class examples in imbalanced datasets [39]. The basic working principle of SMOTE is to generate new examples through linear interpolation between samples of the minority class. ...
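The interpolation principle described in the excerpt reduces to a one-line formula; the NumPy fragment below illustrates it on two made-up minority points.

```python
# Core SMOTE interpolation rule, shown on two made-up minority points:
# x_new = x_i + lambda * (x_neighbour - x_i), with lambda drawn from U(0, 1).
import numpy as np

rng = np.random.default_rng(0)
x_i = np.array([1.0, 2.0])          # a minority-class sample
x_nn = np.array([3.0, 1.0])         # one of its minority-class neighbours

lam = rng.uniform(0.0, 1.0)         # random position along the segment
x_new = x_i + lam * (x_nn - x_i)    # synthetic sample lies on the line segment
print(lam, x_new)
```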
Article
Full-text available
Autism Spectrum Disorder (ASD) is a complex neurodevelopmental condition characterized by difficulties in social interaction and communication, repetitive behaviors, and emotional problems. It manifests as difficulties with social interaction, the development of communication, and certain patterns of behavior. Autism usually begins in early childhood and becomes prominent in this period. Early diagnosis is important in autism because it allows treatment to start early. In addition to traditional methods for diagnosing ASD, machine learning methods, which have stable inference and applications in many different fields, are now used to increase diagnostic success. In this study, feature selection (SelectKBest) and class-balancing methods were applied to a dataset consisting of 18 attributes (17 inputs and 1 target), and classification was then performed with four machine learning algorithms (K-Nearest Neighbor, Logistic Regression, Naive Bayes, Support Vector Machines). Classification performance was evaluated with metrics such as accuracy, sensitivity, specificity and F1 score. After feature selection, the Support Vector Machine and Logistic Regression algorithms achieved 100% accuracy, while K-Nearest Neighbor and Naive Bayes achieved 94.7% and 96.7% accuracy, respectively. Without feature selection, the highest accuracy was 96.2%.
The results show that feature selection significantly improves the classification performance of machine learning algorithms. These results demonstrate the applicability and accuracy of machine learning methods in the diagnosis of ASD and provide an important contribution to improving the diagnostic process.
... The SMOTE technique works by first setting the oversampling amount N, then randomly selecting a sample from the minority class and finding its K nearest neighbours. N of these neighbours are then selected at random, and new samples are generated by interpolating between the sample and each selected neighbour, thus balancing the class distribution of the dataset [26]. In this experiment, given that the AI4I dataset is highly imbalanced (the ratio of 0-labeled to 1-labeled samples is greater than 10:1), using SMOTE to balance the dataset is an appropriate choice. ...
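A simplified from-scratch sketch of the procedure described in the excerpt (neighbourhood search plus random interpolation) might look as follows; it is an illustration of the steps, not a reference implementation, and the helper name smote_oversample is hypothetical.

```python
# Simplified from-scratch SMOTE, following the steps in the excerpt:
# for each synthetic point, pick a random minority sample and one of its
# k nearest minority neighbours, then interpolate between them.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from minority-class matrix X_min."""
    rng = np.random.default_rng(seed)
    # k+1 because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))             # random minority sample
        nb = X_min[rng.choice(neigh_idx[j])]     # one of its k neighbours
        lam = rng.uniform()                      # interpolation factor in [0, 1)
        synthetic[i] = X_min[j] + lam * (nb - X_min[j])
    return synthetic

# Example: 20 minority points in 2-D, generate 40 synthetic ones.
X_min = np.random.default_rng(1).normal(size=(20, 2))
print(smote_oversample(X_min, n_new=40).shape)   # (40, 2)
```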
Article
Full-text available
Fault diagnosis plays an integral role in machine health monitoring. However, in practical applications there are clear differences in class distribution within the data, leading to poor performance in identifying minority classes. Meanwhile, overfitting and computational resource requirements have become a challenge. Recently, stacking models have been promoted in the field of fault diagnosis, but their performance evaluation in much of the literature is not comprehensive enough. In this paper, an Advanced Ensemble Trees (AET) model is proposed. The SMOTE (Synthetic Minority Oversampling Technique) resampling technique is used to optimise the dataset balance. Then, the advantages of Support Vector Machines (SVM) and multi-tree models are combined to form a robust base model using hyper-parameter tuning. Simple Logistic Regression (LR) is used as a meta-model to construct the new stacking model. Through extensive experimental validation, the AET model reaches close to 99% on several key performance metrics, outperforming existing machine learning methods with a relatively short model training time.
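The abstract outlines the overall architecture (SMOTE resampling, SVM and tree-based base learners, logistic regression as meta-model) without implementation detail. The sketch below is a generic scikit-learn stand-in under those assumptions, not the authors' AET code.

```python
# Rough stand-in for the described stacking setup: SMOTE on the training data,
# SVM + tree models as base learners, logistic regression as the meta-model.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=3000, n_features=15, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_tr_bal, y_tr_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

stack = StackingClassifier(
    estimators=[
        ("svm", SVC(probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr_bal, y_tr_bal)
print(classification_report(y_te, stack.predict(X_te)))
```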
... SMOTE helps build a broader decision boundary by creating new samples through interpolation between existing ones, rather than replicating current minority samples [30], [31]. ...
Article
Full-text available
Wireless Sensor Networks (WSN) play a pivotal role in various domains, including monitoring, security, and data transmission. However, their susceptibility to intrusions poses a significant challenge. This paper proposes a novel Intrusion Detection System (IDS) leveraging Particle Swarm Optimization (PSO) and an ensemble machine learning approach combining Random Forest (RF), Decision Tree (DT), and K-Nearest Neighbors (KNN) models to enhance the accuracy and reliability of intrusion detection in WSNs. The system addresses key challenges such as the imbalanced nature of datasets and the evolving complexity of network attacks. By incorporating Synthetic Minority Oversampling Technique Tomek (SMOTE-Tomek) techniques to balance the dataset and employing explainable AI methods such as Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP), the proposed model achieves significant improvements in detection accuracy, precision, recall, and F1 score while providing clear, interpretable results. Extensive experimentation on WSN-DS dataset demonstrates the system’s efficacy, achieving an accuracy of 99.73%, with precision, recall, and F1 score values of 99.72% each, outperforming existing approaches. This work offers a robust, scalable solution for securing WSNs, contributing to both academic research and practical applications.
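The preprocessing and ensemble parts of this pipeline (SMOTE-Tomek balancing, then an RF/DT/KNN ensemble) have off-the-shelf counterparts; the sketch below is a generic approximation on toy data that omits the PSO, LIME, and SHAP components and does not reproduce the authors' system.

```python
# Generic approximation of the described pipeline: SMOTE-Tomek balancing
# followed by a soft-voting ensemble of RF, DT and KNN (toy data only;
# PSO feature selection and LIME/SHAP explanations are omitted).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=4000, n_features=18, weights=[0.92, 0.08],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_tr_bal, y_tr_bal = SMOTETomek(random_state=0).fit_resample(X_tr, y_tr)

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier(n_neighbors=5))],
    voting="soft",
)
ensemble.fit(X_tr_bal, y_tr_bal)
print(classification_report(y_te, ensemble.predict(X_te)))
```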
... Recent research focuses on two key areas, one of which aims to improve the sampling performance of SMOTE. To this end, several variants have been developed to accomplish these key objectives: enhancing clustering methods, optimizing dataset-specific parameter values, selecting optimal candidate sample points for oversampling prior to generating synthetic examples, filtering artificial instances or selecting precise samples from minority classes, and managing dimensionality changes (e.g., K-Means SMOTE, MWMOTE, Borderline-SMOTE1, TRIM, TSMOTE) [11,28,57,84,96]. The other dimension aims to ensure that synthetic data production fairly represents vulnerable groups, such as those defined by ethnicity, gender, income and other subgroups when data size is small and heterogeneous [105]. ...
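Of the variants listed, K-Means SMOTE is a convenient example of the clustering-based direction. A brief hedged illustration with imbalanced-learn's KMeansSMOTE follows; it uses toy data, and parameters such as cluster_balance_threshold may need tuning for a given dataset.

```python
# Illustration of one listed variant, K-Means SMOTE, which clusters the data
# before oversampling so synthetic points stay inside dense minority regions.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import KMeansSMOTE

X, y = make_classification(n_samples=2000, n_features=10, weights=[0.85, 0.15],
                           random_state=0)
# cluster_balance_threshold controls which clusters receive synthetic samples;
# it typically needs adjustment per dataset.
sampler = KMeansSMOTE(cluster_balance_threshold=0.1, random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```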
Article
Full-text available
Substance use poses a significant public health challenge worldwide, including in Finland. This study seeks to predict patterns of substance use, aiming to identify the driving factors behind these trends using artificial intelligence techniques. This research utilizes data from the 2022 Finnish National Drug Survey, comprising 3,857 participants, to develop predictive models targeting the use of cannabis, ecstasy, amphetamine, cocaine, and non-prescribed medications. Analysis of 23 questionnaire items yielded 76 features across four substance use dimensions: demographic attributes, experience and preferences of drug use, health-related aspects of drug use, and social attributes of drug use. In addition to traditional machine learning (ML) approaches previously applied in this field, three sophisticated deep learning models—standard LSTM, BiLSTM, and Recursive LSTM—were employed to evaluate their predictive performance. These LSTM models were further augmented with SHAP analysis to identify the primary influences on substance use patterns. While all these artificial intelligence models demonstrated superior predictive performance, our focus was specifically on the outcomes of the LSTM models due to their novel application in this field. The results underscore the exceptional performance of both LSTM and ML models in unraveling complex substance use behaviors, underlining their applicability in diverse public health contexts. This study not only sheds light on the predictors of substance uses but also furthers methodological innovation in drug research, charting new directions for crafting targeted intervention strategies and policies. The observed variability in predictor significance across different substances indicates the necessity for tailored prevention programs catering to particular user groups. Integrating machine learning with social science and public health policy, our research deepens the understanding of the factors influencing substance use and promotes effective strategies for its mitigation. Despite some limitations, this investigation establishes a foundation for future studies and accentuates the critical role of advanced computational techniques in addressing intricate social issues.
... Generating synthetic samples rather than duplicating minority samples reduces overfitting (He and Garcia 2009) and improves ML models' recall and F1-score on imbalanced datasets (Fernández et al. 2018). Despite employing feature selection methods different from those of earlier studies, the deductible, incident severity, umbrella insurance coverage limit, and policy type remain common relevant features. ...
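The recall/F1 claim is straightforward to probe empirically; the toy comparison below (synthetic data, a single random-forest classifier, no tuning) is only indicative and is not the study's insurance experiment.

```python
# Toy before/after comparison of recall and F1 with and without SMOTE
# (illustrative only; results depend heavily on the dataset and classifier).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, f1_score
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, (Xf, yf) in {
    "raw":   (X_tr, y_tr),
    "smote": SMOTE(random_state=0).fit_resample(X_tr, y_tr),
}.items():
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xf, yf)
    pred = clf.predict(X_te)
    print(f"{name:6s} recall={recall_score(y_te, pred):.3f} "
          f"f1={f1_score(y_te, pred):.3f}")
```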
Article
Full-text available
Like other industries, insurance companies processed large volumes of data during the industrial revolution. The industry's major concern is the increasing number of fraudulent claims. These claims cause not only financial losses but also harm to the entire industry, honest policyholders, and society. Machine learning (ML) approaches have recently been utilized in insurance fraud detection to reduce such losses. To further improve on this, this article introduces a novel prediction framework for fraudulent claims called the Two-step models. The anonymous US auto insurance dataset was used to demonstrate and evaluate the framework. Under-sampling and the synthetic minority over-sampling technique (SMOTE) were used to balance the data. Mutual information was employed as a feature selection tool. Five proposed models were built in two steps. In the first step, eight basic ML models were implemented. The top three most effective models were chosen based on their F-measure scores. Then, their predicted values were used as components to construct the two-step models using ensemble techniques. Statistical tests were utilized to appraise all models. Numerical results indicated that the proposed models yielded significant enhancements. Moreover, the most effective model is a combination of SMOTE and an improved multilayer perceptron (IMLP). This research could help insurance firms improve their fraud detection systems to prevent insurance abuse.
... SMOTE) proposed by Chawla et al. (2002). In this method, instead of simply replicating minority samples, synthetic samples are generated by interpolating along the line connecting minority-class instances within a defined neighborhood (Fernández et al., 2018). In our study, following a trial-and-error approach, we set the neighborhood to the 5 closest instances around the sample of interest. ...
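In imbalanced-learn's SMOTE, this neighbourhood size corresponds to the k_neighbors parameter (5 also happens to be the library default); the brief example below assumes that implementation rather than the authors' own code.

```python
# The 5-nearest-neighbour setting mentioned in the excerpt maps to k_neighbors=5
# in imbalanced-learn (which is also the default value).
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(X_res.shape, y_res.shape)
```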
Article
Full-text available
Prediction of the rapid intensification (RI) of tropical cyclones (TCs) is crucial for improving disaster preparedness against storm hazards. These events can cause extensive damage to coastal areas if occurring close to landfall. Available models struggle to provide accurate RI estimates due to the complexity of underlying physical mechanisms. This study provides new insights into the prediction of a subset of rapidly intensifying TCs influenced by prolonged ocean warming events known as marine heatwaves (MHWs). MHWs could provide sufficient energy to supercharge TCs. Preconditioning by MHW led to RI of recent destructive TCs, Otis (2023), Doksuri (2023), and Ian (2022), with economic losses exceeding $150 billion. Here, we analyze the TC best track and sea surface temperature data from 1981 to 2023 to identify hotspot regions for compound events, where MHWs and RI of tropical cyclones occur concurrently or in succession. Building upon this, we propose an ensemble machine learning model for RI forecasting based on storm and MHW characteristics. This approach is particularly valuable as RI forecast errors are typically largest in favorable environments, such as those created by MHWs. Our study offers insight into predicting MHW TCs, which have been shown to be stronger TCs with potentially higher destructive power. Here, we show that using MHW predictors instead of the conventional method of using sea surface temperature reduces the false alarm rate by 30%. Overall, our findings contribute to coastal hazard risk awareness amidst unprecedented climate warming causing more frequent MHWs.
... GP-based imbalance handling: we use SMOTE on R to generate synthetic positive samples. However, SMOTE often generates many false-positive samples [6]. To address this, we assign an uncertainty score to each synthetic positive sample via the variance function of a GP. ...
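The excerpt only sketches the idea of scoring synthetic positives by a GP's variance. The fragment below is a loose, purely illustrative reading that fits scikit-learn's GaussianProcessRegressor on 0/1 labels; this choice is an assumption and not the authors' method.

```python
# Loose illustration of the idea in the excerpt: fit a GP on the real labelled
# data, then use its predictive standard deviation as an uncertainty score for
# SMOTE-generated positives (a guess at the setup, not the paper's code).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=400, n_features=5, weights=[0.85, 0.15],
                           random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
# imbalanced-learn currently returns the original rows first, then the
# synthetic ones, so the slice below isolates the generated samples.
synthetic = X_res[len(X):]

gp = GaussianProcessRegressor(kernel=RBF(), random_state=0)
gp.fit(X, y.astype(float))                     # regress on 0/1 labels
_, std = gp.predict(synthetic, return_std=True)
print("mean uncertainty of synthetic positives:", np.round(std.mean(), 3))
```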
Preprint
Full-text available
In image classification tasks, deep learning models are vulnerable to image distortions, i.e., their accuracy drops significantly if the input images are distorted. An image-classifier is considered "reliable" if its accuracy on distorted images is above a user-specified threshold. For quality-control purposes, it is important to predict whether the image-classifier is unreliable or reliable under a given distortion level. In other words, we want to predict whether a distortion level makes the image-classifier "non-reliable" or "reliable". Our solution is to construct a training set consisting of distortion levels along with their "non-reliable" or "reliable" labels, and to train a machine learning predictive model (called the distortion-classifier) to classify unseen distortion levels. However, learning an effective distortion-classifier is a challenging problem, as the training set is highly imbalanced. To address this problem, we propose two Gaussian process based methods to rebalance the training set. We conduct extensive experiments to show that our method significantly outperforms several baselines on six popular image datasets.