Table 1 - uploaded by Alberto Fernández
Source publication
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven success...
Contexts in source publication
Context 1
... section is devoted to enumerating and categorizing them according to the properties studied before. Table 1 presents an enumeration of the methods reviewed in this paper. In this field, authors usually provide a name for their proposal, with a few exceptions. ...
Context 2
... we can see in Table 1, the most frequent properties exploited by the techniques are the initial selection and adaptive generation of synthetic examples. Filtering is becoming more common in recent years, as well as the use of kernel functions. ...
Context 3
... to space limitations, it is not possible to describe all the reviewed techniques. Nevertheless, we will provide brief explanations for the most well-known techniques from Table 1: • Borderline-SMOTE (Han et al., 2005): This algorithm draws from the premise that examples far from the borderline may contribute little to classification success. Thus, the technique identifies those examples which belong to the borderline by using the ratio between the majority and minority examples within the neighborhood of each instance to be oversampled. ...
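For readers who want to see the mechanics, the borderline-identification step can be sketched in a few lines of Python. This is a hedged reconstruction of the rule described above, not the authors' code; the function name find_danger_examples and the parameter m_neighbors are illustrative choices.

```python
# Sketch of Borderline-SMOTE's "danger" identification (after Han et al.,
# 2005): a minority example is treated as borderline when at least half,
# but not all, of its m nearest neighbors belong to the majority class.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def find_danger_examples(X, y, minority_label=1, m_neighbors=5):
    """Return minority examples lying on the class borderline."""
    X_min = X[y == minority_label]
    # m + 1 neighbors because each minority point is its own neighbor.
    nn = NearestNeighbors(n_neighbors=m_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    # Count majority-class neighbors, skipping the point itself (column 0).
    n_maj = (y[idx[:, 1:]] != minority_label).sum(axis=1)
    # Borderline ("DANGER"): half or more majority neighbors, but not all
    # (examples whose neighbors are all majority are treated as noise).
    danger = (n_maj >= m_neighbors / 2) & (n_maj < m_neighbors)
    return X_min[danger]
```

Only the examples returned here would then be oversampled, which is what concentrates the synthetic samples near the decision boundary.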
Context 4
... Table 2 shows a list of ensemble-based techniques that incorporate SMOTE itself, or a derivative of SMOTE, as a major step to achieve diversity in the set of classifiers learned to form the ensemble. Note that this table only contains the methods concerned with ensembles. The structure of Table 2 is very similar to the previous one, Table 1. Dimensionality change and filtering are two properties not used in ensembles. ...
Similar publications
Due to its plug-and-play functionality and wide device support, the universal serial bus (USB) protocol has become one of the most widely used protocols. However, this widespread adoption has introduced a significant security concern: the implicit trust provided to USB devices, which has created a vast array of attack vectors. Malicious USB devices...
Magnetic field errors and misalignments cause optics perturbations, which can lead to machine safety issues and performance degradation. The correlation between magnetic errors and deviations of the measured optics functions from design can be used in order to build supervised learning models able to predict magnetic errors directly from a selectio...
The gustatory, olfactory, and trigeminal systems are anatomically separated. However, they interact cognitively to give rise to oral perception, which can significantly affect health and quality of life. We built a Supervised Learning (SL) regression model that, exploiting participants’ features, was capable of automatically analyzing with high pre...
Sarcasm is a common feature of user interaction on social networking sites. Sarcasm differs from typical communication in the alignment of literal meaning with intended meaning. Humans can recognize sarcasm given sufficient context information, including the various content available on SNS. Existing literature mainly uses text data to detect sarca...
Traditional supervised learning classifiers need a large number of labeled samples to achieve good performance; however, many biological datasets contain only a small set of labeled samples, with the remaining samples unlabeled. Labeling these unlabeled samples manually is difficult or expensive. Technologies such as active learning and semi-supervise...
Citations
... There are two key problems existing during the procedure. One is how to determine an appropriate k value; the other is how to select the nearest neighbors as the candidate assistance reference sample set (CAR) for PR [2, 12-14]. More details are discussed below. ...
... (a) To verify the validity of HNaNE, several SMOTE methods with different k-parameters are used as a comparison. The k-parameters are set according to the recommendations in [14], ranging from 5 to 10. (b) To validate HNaNSMOTE, SMOTE with k = λ (i.e., SMNaNE and SMHNaNE) is also utilized for comparison. ...
... In high-dimensional datasets, interpolation-based oversampling approaches are frequently paired with sparse strategies such as dimensionality reduction to mitigate the detrimental impacts of the curse of dimensionality [7,14]. In Experiment C, sparse PCA is applied to reduce the dimensions of 13 high-dimensional datasets. ...
In recent years, researchers have developed numerous interpolation-based oversampling techniques to tackle class imbalance in classification tasks. However, most existing techniques encounter the challenge of choosing the k parameter due to the involvement of the k nearest neighbor (kNN) rule. Furthermore, they adopt a single neighborhood rule, disregarding the positional characteristics of minority samples. This often leads to the generation of synthetic noise or overlapping samples. This paper proposes a parameter-free oversampling framework called the hybrid natural neighbor synthetic minority oversampling technique (HNaNSMOTE). HNaNSMOTE effectively determines an appropriate k value through iterative search and adopts a hybrid neighborhood rule for each minority sample to generate more representative and diverse synthetic samples. Specifically, 1) a hybrid natural neighbor search procedure is conducted on the entire dataset to obtain a data-related k value, which eliminates the need for manually preset parameters. Different natural neighbors are formed for each sample to better identify the positional characteristics of minority samples during the procedure. 2) To improve the quality of the generated samples, the hybrid natural neighbor (HNaN) concept is proposed. HNaN utilizes kNN and reverse kNN to find neighbors adaptively based on the distribution of minority samples. It is beneficial for mitigating the generation of synthetic noise or overlapping samples since it takes into account the existence of majority samples. Experimental results on 32 benchmark binary datasets with three classifiers demonstrate that HNaNSMOTE outperforms numerous state-of-the-art oversampling techniques for imbalanced classification in terms of Sensitivity and G-mean.
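As a rough illustration of the data-dependent k search the abstract alludes to, the sketch below implements a generic natural-neighbor stopping rule: grow k until every sample appears in at least one other sample's k-neighborhood. This is an assumed simplification for intuition only, not the HNaNSMOTE procedure itself; natural_neighbor_k is a hypothetical name.

```python
# Generic natural-neighbor-style search for a data-dependent k (assumed
# simplification; the actual HNaNSMOTE search is more elaborate).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def natural_neighbor_k(X, k_max=50):
    n = len(X)
    nn = NearestNeighbors(n_neighbors=min(k_max + 1, n)).fit(X)
    _, idx = nn.kneighbors(X)  # column 0 is each point itself
    for k in range(1, k_max + 1):
        # Reverse-neighbor count: how often each point appears inside
        # another point's current k-neighborhood.
        counts = np.bincount(idx[:, 1:k + 1].ravel(), minlength=n)
        if (counts > 0).all():  # every point has at least one reverse kNN
            return k            # k emerges from the data, no preset value
    return k_max
```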
... One of the key strengths of this current research lies in the ability to scale the predictive model across different markets. Emerging markets, characterized by limited historical data and increased market volatility, could benefit significantly from AI techniques that do not rely solely on past performance [43]. In these markets, where traditional models often fail to capture local economic factors or market sentiment, machine learning approaches can account for various influencing variables, offering more accurate and context-sensitive predictions. ...
Addressing resource scarcity and climate change necessitates a transition to sustainable consumption and circular economy models, fostering environmental, social, and economic resilience. This study introduces a deep learning-based ensemble framework to optimize initial public offering (IPO) performance prediction while extending its application to circular economy processes, such as resource recovery and waste reduction. The framework incorporates advanced techniques, including hyperparameter optimization, dynamic metric adaptation (DMA), and the synthetic minority oversampling technique (SMOTE), to address challenges such as class imbalance, risk-adjusted metric enhancement, and robust forecasting. Experimental results demonstrate high predictive performance, achieving an accuracy of 76%, precision of 83%, recall of 75%, and an AUC of 0.9038. Among ensemble methods, Bagging achieved the highest AUC (0.90), outperforming XGBoost (0.88) and random forest (0.75). Cross-validation confirmed the framework’s reliability with a median AUC of 0.85 across ten folds. When applied to circular economy scenarios, the model effectively predicted sustainability metrics, achieving R² values of 0.76 for both resource recovery and waste reduction with a low mean absolute error (MAE = 0.11). These results highlight the potential to align financial forecasting with environmental sustainability objectives. This study underscores the transformative potential of deep learning in addressing financial and sustainability challenges, demonstrating how AI-driven models can integrate economic and environmental goals. By enabling robust IPO predictions and enhancing circular economy outcomes, the proposed framework aligns with Industry 5.0’s vision for human-centric, data-driven, and sustainable industrial innovation, contributing to resilient economic growth and long-term environmental stewardship.
... The fundamental concept of SMOTE involves the generation of artificial samples in the feature space. It randomly chooses a minority class instance and calculates the k-nearest neighbors for this instance [38]. A synthetic sample is then created by selecting one of the k-nearest neighbors and forming a random linear combination of the features from the chosen neighbor and the original instance. ...
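The generation step described in this excerpt reduces to a one-line interpolation. Below is a minimal NumPy sketch under the assumption that X_min holds the minority-class rows; the function name is illustrative.

```python
# SMOTE-style synthesis: interpolate between a minority instance and one
# of its k nearest minority neighbors at a random point on the segment.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, rng=None):
    """Generate one synthetic minority sample."""
    rng = rng if rng is not None else np.random.default_rng()
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: self included
    _, idx = nn.kneighbors(X_min)
    i = rng.integers(len(X_min))   # randomly chosen minority instance
    j = rng.choice(idx[i, 1:])     # one of its k nearest neighbors
    gap = rng.random()             # random position along the segment
    # Random linear combination of the instance and its neighbor.
    return X_min[i] + gap * (X_min[j] - X_min[i])
```

Calling this repeatedly until the desired number of synthetic samples is produced yields the oversampled minority set.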
... This study employed another performance metric, namely the AUC-ROC curve, to analyze the effectiveness of the proposed experiment. The ROC [38] curve is a vital tool in assessing the performance of disease prediction models. By analyzing ...
Early detection and characterization are crucial for treating and managing Parkinson's disease (PD). The increasing prevalence of PD and its significant impact on the motor neurons of the brain impose a substantial burden on the healthcare system. Early‐stage detection is vital for improving patient outcomes and reducing healthcare costs. This study introduces an ensemble boosting machine, termed PD_EBM, for the detection of PD. PD_EBM leverages machine learning (ML) algorithms and a hybrid feature selection approach to enhance diagnostic accuracy. While ML has shown promise in medical applications for PD detection, the interpretability of these models remains a significant challenge. Explainable machine learning (XML) addresses this by providing transparency and clarity in model predictions. Techniques such as Local Interpretable Model‐agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) have become popular for interpreting these models. Our experiment used a dataset of 195 clinical records of PD patients from the University of California Irvine (UCI) Machine Learning repository. Comprehensive data preparation included encoding categorical features, imputing missing values, removing outliers, addressing data imbalance, scaling data, selecting relevant features, and so on. We propose a hybrid boosting framework that focuses on the most important features for prediction. Our boosting model employs a Decision Tree (DT) classifier with AdaBoost, followed by a linear discriminant analysis (LDA) optimizer, achieving an impressive accuracy of 99.44%, outperforming other boosting models.
... (3) Distinguish between shallow and deep rockburst based on depth of burial, and for the first time model rockburst from this perspective; the resulting model accuracy exceeds that of some mainstream full-depth rockburst case models. Based on the above issues, the framework study and model development in this study considered six oversampling techniques (SMOTE (Fernández et al., 2018); ADASYN (He et al., 2008); KMeansSMOTE (Douzas et al., 2018); SMOTENC (Fonseca and Bacao, 2023); BorderlineSMOTE (Han et al., 2005); SVMSMOTE (Wang et al., 2021)) and 12 classifiers (Decision Tree, DT (Song and Ying, 2015); Extra Trees, ET (Geurts et al., 2006); Gradient Boosting, GBD (Natekin and Knoll, 2013); Gaussian Process Regression, GPR (Schulz et al., 2018); K-Nearest Neighbor, KNN (Peterson, 2009); Light Gradient Boosting Machine, LGB (Fan et al., 2019); Multilayer Perceptron, MLP; Naive Bayes model, NBM (Murphy, 2006); Quadratic Discriminant Analysis Algorithm, QDA (Kim et al., 2011); Random Forest, RF (Biau and Scornet, 2016); Support Vector Classification, SVC (Hsu et al., 2003); eXtreme Gradient Boosting, XGB (Chen et al., 2015)). Eighty-four algorithm combinations were systematically evaluated, leading to the selection of the top-performing two. ...
The occurrence of class-imbalanced datasets is a frequent observation in natural science research, emphasizing the paramount importance of effectively harnessing them to construct highly accurate models for rockburst prediction. Initially, genuine rockburst incidents within a burial depth of 500 m were sourced from the literature, revealing a small, imbalanced dataset. Utilizing various mainstream oversampling techniques, the dataset was expanded to generate six new datasets, subsequently subjected to 12 classifiers across 84 classification processes. Combining the highest-scoring model from the original dataset with the top two models from the expanded datasets yielded a high-performance model. Findings indicate that the KMeansSMOTE oversampling technique exhibits the most substantial enhancement across the combined 12 classifiers, whereas individual classifiers favor ET+SVMSMOTE and RF+SMOTENC. Following multiple rounds of hyperparameter adjustment via random cross-validation, the ET+SVMSMOTE combination attained the highest accuracy rate of 93.75%, surpassing mainstream models for rockburst prediction. Moreover, the SVMSMOTE technique, which augments the samples of the less frequent categories, demonstrated notable benefits in mitigating overfitting, enhancing generalization, and improving Recall and F1 score within RF classifiers. The resulting model was validated for its high generalization performance, accuracy, and reliability. This process also provides an efficient framework for model development.
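The study's oversampler-by-classifier sweep can be reproduced in outline with imbalanced-learn and scikit-learn. The sketch below is illustrative only: it uses a toy dataset and a small subset of the six oversamplers and twelve classifiers listed above, and the F1 scoring choice is an assumption.

```python
# Illustrative sweep over oversampler-classifier combinations, in the
# spirit of the 84-combination evaluation described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE, SVMSMOTE, BorderlineSMOTE
from imblearn.pipeline import make_pipeline

# Toy imbalanced dataset standing in for the rockburst cases.
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

samplers = {"SMOTE": SMOTE(random_state=0),
            "SVMSMOTE": SVMSMOTE(random_state=0),
            "BorderlineSMOTE": BorderlineSMOTE(random_state=0)}
classifiers = {"RF": RandomForestClassifier(random_state=0),
               "ET": ExtraTreesClassifier(random_state=0)}

results = {}
for s_name, sampler in samplers.items():
    for c_name, clf in classifiers.items():
        # The imblearn pipeline resamples inside each CV fold only.
        pipe = make_pipeline(sampler, clf)
        results[(s_name, c_name)] = cross_val_score(
            pipe, X, y, cv=5, scoring="f1").mean()

best = max(results, key=results.get)
print("best combination:", best, "F1 =", round(results[best], 3))
```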
... Examples include its use in education [102], data mining [68], healthcare [50], finance [120], and environmental studies [51]. The application of SMOTE for addressing class imbalance problems has expanded across various fields, resulting in the development of several SMOTE variants over the years [31]. Notable variants include SMOTE-ENN, K-Means SMOTE, SMOTE-SVM, Borderline SMOTE, Geometric SMOTE, and Weighted SMOTE [55]. ...
... This process generates random points along the "line segments" connecting the selected minority instance and its neighbors. The procedure is repeated for various samples in the minority class until the desired number of synthetic samples is produced [31]. Studies have demonstrated that combining SMOTE with under-sampling techniques results in more robust performance compared to using SMOTE alone [20,106,51]. ...
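As a concrete illustration of the SMOTE-plus-undersampling combination these studies recommend, here is a hedged sketch using the imbalanced-learn library; the sampling ratios and the choice of a decision tree are illustrative assumptions, not values from the cited papers.

```python
# Combine SMOTE oversampling with random undersampling of the majority
# class, the pattern reported as more robust than SMOTE alone.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.tree import DecisionTreeClassifier

model = Pipeline(steps=[
    # Oversample the minority class up to half the majority size...
    ("smote", SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42)),
    # ...then undersample the majority until the ratio is about 0.8.
    ("under", RandomUnderSampler(sampling_strategy=0.8, random_state=42)),
    ("clf", DecisionTreeClassifier(random_state=42)),
])
# model.fit(X_train, y_train): resampling happens only inside fit, so the
# test data are never touched by SMOTE and evaluation stays honest.
```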
Bluebottles (Physalia spp.) are marine stingers resembling jellyfish, whose presence on Australian beaches poses a significant public risk due to their venomous nature. Understanding the environmental factors driving bluebottles ashore is crucial for mitigating their impact, and machine learning tools are to date relatively unexplored. We use bluebottle marine stinger presence/absence data from beaches in Eastern Sydney, Australia, and compare machine learning models (Multilayer Perceptron, Random Forest, and XGBoost) to identify factors influencing their presence. We address challenges such as class imbalance, class overlap, and unreliable absence data by employing data augmentation techniques, including the Synthetic Minority Oversampling Technique (SMOTE), Random Undersampling, and a Synthetic Negative Approach that excludes the negative class. Our results show that SMOTE failed to resolve class overlap, but the presence-focused approach effectively handled imbalance, class overlap, and ambiguous absence data. Data attributes such as wind direction, a circular variable, emerged as key factors influencing bluebottle presence, confirming previous inference studies. However, in the absence of population dynamics, biological behaviours, and life cycles, the best predictive model appears to be Random Forest combined with the Synthetic Negative Approach. This research contributes to mitigating the risks posed by bluebottles to beachgoers and provides insights into handling class overlap and unreliable negative classes in environmental modelling.
... The data were prepared through balancing and normalization techniques to ensure data quality during model training. For data balancing, a combination of undersampling and SMOTE (synthetic minority over-sampling technique) [33] was applied. Initially, undersampling was used to reduce the number of incorrect samples. ...
This work presents an approach based on signal processing and artificial intelligence (AI) to identify the pre-insertion resistor (PIR) and main contact instants during the operation of high-voltage SF6 circuit breakers, to help improve the settings of controlled switching and attenuate transients. For this, the current and voltage signals of a real Brazilian substation are used as AI inputs, considering the noise and interference common in this type of environment. Thus, the proposed modeling considers the signal preprocessing steps for feature extraction, the generation of the dataset for model training, the use of different machine learning techniques to automatically find the desired points, and, finally, the identification of the best moments for controlled switching of the circuit breakers. As a result, the evaluated models achieved good performance in identifying operation points, with precision and accuracy above 93%. In addition, valuable statistical notes related to the controlled switching condition are obtained from the circuit breakers evaluated in this research.
... The Synthetic Minority Oversampling Technique (SMOTE) is a widely-used ML method that addresses this issue [21]. SMOTE creates a balanced dataset by oversampling minority instances, enhancing machine learning models' performance on imbalanced data using a simple and efficient augmentation technique [22]. This work examines the use of ML algorithms to create predictive models for ICU admission, using an extensive array of clinical and laboratory characteristics in COVID-19 patients, including variables that differentiate between those with and without diabetes. ...
Intensive Care Units (ICUs) have been in great demand worldwide since the COVID-19 pandemic, necessitating organized allocation. The spike in critical care patients has overloaded ICUs, which, along with prolonged hospitalizations, has increased the workload for medical personnel and led to a significant shortage of resources. The study aimed to improve resource management by quickly and accurately identifying patients who need ICU admission. We designed an intelligent decision support system that employs machine learning (ML) to anticipate COVID-19 ICU admissions in Kuwait. Our algorithm examines several clinical and demographic characteristics to identify high-risk individuals early in illness diagnosis. We used 4399 patients to identify ICU admission with predictors such as shortness of breath, high D-dimer values, and abnormal chest X-rays. Any data imbalance was addressed by employing cross-validation along with the Synthetic Minority Oversampling Technique (SMOTE), the feature selection was refined using backward elimination, and the model interpretability was improved using Shapley Additive Explanations (SHAP). We employed various ML classifiers, including support vector machines (SVM). The SVM model surpasses all other models in terms of precision (0.99) and area under curve (AUC, 0.91). This study investigated the healthcare process during a pandemic, facilitating ML-based decision-making solutions to confront healthcare problems.
... Notably, there was a significant discrepancy in sample sizes across behaviors, with cooking behavior having a substantially larger amount of data (ranging from 65% to 75%) than bathing and laundry behaviors. To address the potential bias towards the larger dataset and ensure balanced learning, the Synthetic Minority Over-sampling Technique (SMOTE) method was employed to oversample the bathing and laundry behaviors in the training set (Fernández et al., 2018). Detailed information is provided in Supplementary Material S1. ...
Classifying household water-consumption behaviors is crucial for providing targeted suggestions for water-saving behaviors and enabling effective resource management and conservation. Although it is common knowledge that energy consumption is closely coupled with household water consumption, the effectiveness of energy consumption information in classifying household water behaviors remains unexplored. This study proposes a hybrid model of long short-term memory (LSTM) and random forest (RF) using water and electricity consumption as inputs to classify household water-consumption behaviors. Data from three households in Beijing collected from January to March 2020 were used for the case studies. The hybrid model achieved a macro F1 score of 0.89 at a 5-min resolution, outperforming the standalone LSTM and RF models. Additionally, the inclusivity of time-series electricity consumption improves the accuracy (F1 scores) of classifying bathing and laundry behaviors by 0.12 and 0.20, respectively. These findings underscore the scientific value of integrating electricity consumption as a proxy variable in water-consumption behavior classification models, demonstrating its potential to enhance accuracy while simplifying data acquisition processes. This study establishes a framework for demand-side water management aimed at empowering residents to understand their own water-energy consumption behavior patterns and engage in personalized water conservation efforts.
... Conversely, over-sampling aims to augment the class size of minority groups in order to prevent the loss of knowledge. The synthetic minority over-sampling technique (SMOTE) constructs synthetic examples of the minority class, thus minimizing the possibility of overfitting (Torgo et al., 2013; Fernandez et al., 2018). In this study, our training set contains 16,364 positive and 22,338 negative instances, thus suffering from class imbalance. ...
MicroRNAs (miRNA) are categorized as short endogenous non-coding RNAs, which have a significant role in post-transcriptional gene regulation. Identifying new animal precursor miRNA (pre-miRNA) and miRNA is crucial to understand the role of miRNAs in various biological processes, including the development of diseases. The present study focuses on the development of a Light Gradient Boost (LGB) based method for the classification of animal pre-miRNAs using various sequence and secondary structural features. In various pre-miRNA families, distinct k-mer repeat signatures with a length of three nucleotides have been identified. Out of nine different classifiers that have been trained and tested in the present study, LGB has an overall better performance with an AUROC of 0.959. In comparison with the existing methods, our method 'pmiRScan' has an overall better performance with accuracy of 0.93, sensitivity of 0.86, specificity of 0.95 and F-score of 0.82. Moreover, pmiRScan effectively classifies pre-miRNAs from four distinct taxonomic groups: mammals, nematodes, molluscs and arthropods. We have used our classifier to predict genome-wide pre-miRNAs in human. We find a total of 313 pre-miRNA candidates using pmiRScan. A total of 180 potential mature miRNAs belonging to 60 distinct miRNA families are extracted from predicted pre-miRNAs, of which 128 were novel and not reported in miRBase. These discoveries may enhance our current understanding of miRNAs and their targets in human. pmiRScan is freely available at http://www.csb.iitkgp.ac.in/applications/pmiRScan/index.php.
... SMOTE generates synthetic samples by interpolating between existing minority class examples, effectively mitigating overfitting without discarding important data [70]. Although more sophisticated algorithms like Borderline-SMOTE [71], ADASYN [72], and SMOTEENN [73] offer refined resampling processes, they introduce additional complexity and computational overhead [74]. Additionally, its proven efficacy in SFP scenarios further supports our choice. ...
In software development, Software Fault Prediction (SFP) is essential for optimising resource allocation and improving testing efficiency. Traditional SFP methods typically use binary-class models, which can provide a limited perspective on the varying risk levels associated with individual software modules. This study explores the impacts of Error-type Metrics on the fault-proneness of software modules in domain-specific software projects. Also, it aims to enhance SFP methods by introducing a risk-based approach using Error-type Metrics. This method categorises software modules into High, Medium, and Low-Risk categories, offering a more granular and informative fault prediction framework. This approach aims to refine the fault prediction process and contribute to more effective resource allocation and project management in software development. We explore the domain-specific impact of Error-type Metrics through Principal Component Analysis (PCA), aiming to fill a gap in the existing literature by offering insights into how these metrics affect machine learning models across different software domains. We employ three machine learning models - Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGB) - to test our approach. The Synthetic Minority Over-sampling Technique (SMOTE) is used to address class imbalance. Our methodology is validated on fault data from four open-source software projects, aiming to confirm the robustness and generalisability of our approach. The PCA findings provide evidence of the varied impacts of Error-type Metrics in different software environments. Comparative analysis indicates a strong performance by the XGB model, achieving an accuracy of 97.4%, a Matthews Correlation Coefficient of 96.1%, and an F1-score of 97.4% across the datasets. These results suggest the potential of the proposed method to contribute to software testing and quality assurance practices. Our risk-based SFP approach introduces a new perspective to risk assessment in software development. The study’s findings contribute insights into the domain-specific applicability of Error-type Metrics, expanding their potential utility in SFP. Future research directions include refining our fault-counting methodology and exploring broader applications of Error-type Metrics and our proposed risk-based approach.