Learning from Imbalanced Data Sets
Abstract
This book provides a general and comprehensible overview of imbalanced learning. It contains a formal description of the problem and focuses on its main features and the most relevant proposed solutions. Additionally, it considers the different Data Science scenarios in which imbalanced classification poses a real challenge.
This book stresses the gap with standard classification tasks by reviewing the case studies and ad-hoc performance metrics applied in this area. It also covers the approaches that have traditionally been applied to address binary skewed class distributions; specifically, it reviews cost-sensitive learning, data-level preprocessing methods, and algorithm-level solutions, also taking into account the ensemble-learning solutions that embed any of the former alternatives. Furthermore, it addresses the extension of the problem to multi-class settings, where the classical methods can no longer be applied in a straightforward way.
This book also focuses on the intrinsic data characteristics that, added to the uneven class distribution, truly hinder the performance of classification algorithms in this scenario. Some notes on data reduction are then provided to explain the advantages of using this type of approach.
Finally, this book introduces some novel areas of study that are attracting growing attention within the imbalanced data issue. Specifically, it considers the classification of data streams, non-classical classification problems, and scalability for Big Data. Examples of software libraries and modules for addressing imbalanced classification are provided.
This book is highly suitable for technical professionals and for senior undergraduate and graduate students in the areas of data science, computer science, and engineering. It will also help scientists and researchers gain insight into the current developments in this area of study, as well as future research directions.
... In cybersecurity, false negatives mean undetected intrusions that can compromise critical infrastructure, leak sensitive data, or initiate cascading failures across digital ecosystems. Such events often remain undiscovered for weeks or months, amplifying the scale and complexity of incident response efforts [24]. Even in transportation and predictive maintenance, failure to detect early signs of mechanical failure can result in catastrophic accidents, leading to loss of life, lawsuits, and reputational damage. ...
... Many implementations in scikit-learn, TensorFlow, and XGBoost allow automatic or manual weighting of classes during model training. This technique is computationally efficient and can improve recall without altering the dataset itself [24]. One advantage of algorithm-level strategies is their ability to integrate class imbalance awareness directly into the optimization process, making them more efficient than external resampling in large datasets. ...
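As an illustration of the algorithm-level strategy described above, the following minimal sketch shows the usual way class weights are passed to scikit-learn and XGBoost; the synthetic dataset and the weight value are placeholders, not settings from the cited work.

```python
# Hedged sketch: algorithm-level class weighting on a synthetic imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# scikit-learn: re-weight the loss so minority-class errors count more.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# XGBoost: scale_pos_weight ~ (negative count / positive count) for binary tasks.
ratio = (y == 0).sum() / (y == 1).sum()
bst = XGBClassifier(scale_pos_weight=ratio, eval_metric="logloss").fit(X, y)
```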
... Unlike ROC curves, which plot the true positive rate against the false positive rate, PR curves focus on the trade-off between precision (how many predicted positives are true positives) and recall (how many actual positives are correctly identified). This makes PR curves particularly useful when the positive class is rare and false negatives carry a high cost [24]. The area under the PR curve (AUC-PR) offers a summary measure of model effectiveness on the minority class. ...
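A minimal sketch of how a PR curve and AUC-PR are typically computed with scikit-learn on a synthetic rare-positive dataset (all names and numbers are illustrative, not from the cited work):

```python
# Minimal sketch: precision-recall curve and AUC-PR for a rare positive class.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score, auc

X, y = make_classification(n_samples=10000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, _ = precision_recall_curve(y_te, scores)
print("AUC-PR (trapezoidal):", auc(recall, precision))
print("Average precision   :", average_precision_score(y_te, scores))
```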
Machine learning (ML) has become central to data-driven decision-making in critical sectors such as healthcare, finance, and national security. However, a persistent challenge across these domains is the problem of imbalanced datasets, where instances of the minority class, often representing the most critical outcomes, are significantly underrepresented. In healthcare, these include rare diseases or adverse drug events; in finance, fraudulent transactions; and in security, cyberattacks or insider threats. Standard classification algorithms tend to be biased toward the majority class, resulting in poor detection of high-impact but rare occurrences. This paper presents an optimized ML framework for imbalanced classification, combining advanced resampling strategies (SMOTE, ADASYN), cost-sensitive learning, and ensemble methods such as Balanced Random Forests and XGBoost. We evaluate the framework using publicly available and proprietary datasets from U.S. healthcare institutions, financial platforms, and cyberthreat monitoring systems. Performance is measured through precision-recall curves, F1-scores, and area under the precision-recall curve (AUPRC), which are more informative than traditional accuracy metrics in imbalanced scenarios. Case studies demonstrate how the framework significantly improves minority class detection: identifying rare cancers with higher precision, flagging financial fraud in real time, and enhancing intrusion detection systems in zero-day attack scenarios. Furthermore, the solution incorporates explainable AI techniques (e.g., SHAP values) to ensure model transparency and regulatory compliance in sensitive sectors. The proposed system provides a scalable, interpretable, and domain-adaptable approach for deploying ML in high-stakes imbalanced environments, supporting U.S. priorities in public health, economic integrity, and national security.
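As a hedged illustration of one ensemble component named in this abstract, the sketch below trains imbalanced-learn's BalancedRandomForestClassifier on synthetic data and reports AUPRC; it is not the authors' framework, datasets, or tuned configuration.

```python
# Illustrative sketch only: a Balanced Random Forest on a toy imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score
from imblearn.ensemble import BalancedRandomForestClassifier

X, y = make_classification(n_samples=20000, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Each bootstrap sample is balanced by undersampling the majority class.
brf = BalancedRandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUPRC:", average_precision_score(y_te, brf.predict_proba(X_te)[:, 1]))
```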
... The minority class is hard to predict because it has fewer samples and therefore offers less learning opportunity than the majority class. Most learning methods are biased towards the majority class, so most minority samples are not modelled well [31]. Although intense work has been devoted to resolving the issues of imbalanced learning, many shortcomings remain [32]. ...
... Tomek Links is an under-sampling technique that removes majority-class instances lying close to the minority class, while Borderline-SMOTE is an oversampling technique in which misclassified minority samples are selected for oversampling instead of oversampling blindly. Random oversampling increases the likelihood of overfitting because it makes exact copies of minority-class samples [31]. To address this overfitting problem, only selected samples need to be oversampled: artificially creating additional minority-class samples at the borderline can effectively augment the data, as in the sketch below. ...
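A hedged sketch of the two resampling techniques named above, using their imbalanced-learn implementations on synthetic data (parameters are illustrative):

```python
# Hedged sketch of Tomek Links undersampling and Borderline-SMOTE oversampling.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

# Tomek Links: drop majority-class points that form cross-class nearest-neighbour pairs.
X_tl, y_tl = TomekLinks().fit_resample(X, y)

# Borderline-SMOTE: synthesize minority samples only near the class boundary.
X_bs, y_bs = BorderlineSMOTE(random_state=0).fit_resample(X, y)

print(Counter(y), Counter(y_tl), Counter(y_bs))
```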
Objective
The present study explores the classification of Alzheimer’s disease (AD) stages, encompassing cognitive normalcy, Mild Cognitive Impairment (MCI), and AD/Dementia, through the application of Machine Learning (ML) multiclassification algorithms. This investigation utilizes blood gene expression datasets obtained from participants in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the National Center for Biotechnology Information (NCBI). Three blood gene expression datasets of high dimensionality and low sample size (HDLSS) have been utilized in this study, with one dataset exhibiting significant class imbalance. This study integrates clinical data from electronic health records (EHRs) with gene expression datasets, which has been found to significantly enhance the accuracy of stage diagnosis.
Methods
A combination of XGBoost and SFBS (“sequential floating backward selection”) methods is utilized to select features. Our research identified a subset of 95 gene transcripts exhibiting optimal efficacy from an extensive collection of over 49,000 transcripts within the ADNI gene expression dataset. Furthermore, our analysis of two integrated NCBI datasets revealed 125 gene transcripts demonstrating superior effectiveness among more than 30,000 potential candidates. These findings resulted in the development of two distinct model categories: one derived from the ADNI dataset and the other from the integrated NCBI dataset. A DL classifier is used to develop models of both categories, while GB (Gradient Boosting) and SVM (Support Vector Machine) based models are built to identify AD stages for NCBI participants. Because of the high class imbalance in the genomic data, borderline oversampling is explored for model training, with the original data reserved for validation. We have conducted a multimodal analysis and stage classification by integrating the ADNI gene expression and clinical datasets using ‘Feature-Level Fusion’.
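The exact feature-selection pipeline is not specified beyond "XGBoost and SFBS", so the following is only a plausible sketch under assumptions: features are pre-ranked by XGBoost importance and then refined with mlxtend's sequential floating backward selection. The synthetic dataset, cut-offs, and scoring are placeholders, not the study's settings.

```python
# Illustrative sketch only (not the authors' exact pipeline): XGBoost pre-ranking + SFBS.
import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

X, y = make_classification(n_samples=300, n_features=500, n_informative=20,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Step 1: keep the features XGBoost considers most important.
xgb = XGBClassifier(n_estimators=200, eval_metric="mlogloss").fit(X, y)
top = np.argsort(xgb.feature_importances_)[::-1][:50]

# Step 2: SFBS (backward, floating) over the reduced candidate set.
sfbs = SequentialFeatureSelector(XGBClassifier(eval_metric="mlogloss"),
                                 k_features=20, forward=False, floating=True,
                                 scoring="f1_macro", cv=3)
sfbs = sfbs.fit(X[:, top], y)
print("Selected feature indices:", [top[i] for i in sfbs.k_feature_idx_])
```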
Result
In the case of ADNI study participants, we obtained the best multi-classification performance with ROC AUC scores of 0.76, 0.76, and 0.71 for the CN, MCI, and Dementia stages, respectively. We achieved F1 scores of 0.71, 0.77, and 0.53 for these same categories. For the NCBI-based model, the best AUC scores of 0.82, 0.74, and 0.79 (for CN, MCI, and AD, respectively) and F1 scores of 0.75, 0.60, and 0.77 were attained when evaluated using GSE3060 test data. When assessed with GSE3061 test data, the model achieved optimal AUC scores of 0.81, 0.75, and 0.78, and F1 scores of 0.74, 0.67, and 0.73. This research identified MAPK14, MID1, TEP1, PLG, DRAXIN, and USP47 as genes associated with AD. In the context of ADNI data, the integration of clinical data with gene expression data led to an enhancement of the best F1 scores to 0.85, 0.86, and 0.83 for CN, MCI, and AD, respectively. Additionally, the ROC AUC scores were improved to 0.90, 0.85, and 0.89.
Conclusion
Using machine learning multiclassification techniques on blood gene expression profile data from ADNI and NCBI, we achieved the most promising results to date for diagnosing multiple stages of Alzheimer’s disease. This demonstrates the efficacy of our feature selection techniques in finding essential genes associated with AD. Highly accurate diagnosis of stages, including MCI, from genetic data can potentially provide a timely alert for individuals susceptible or predisposed to AD.
... In the second step, in contrast, we had to rebalance the classes as these were still unequally represented. Depending on the method used for rebalancing, useful information was removed (e.g., when using random undersampling of the majority class [44]), samples were duplicated (e.g., when using random oversampling of the minority class [44]) or synthetic data was added, potentially containing unrealistic or even wrong information (e.g., when using the Synthetic Minority Over-sampling Technique [45]). Consequently, the application of methods for class rebalancing comes with uncertainty. ...
Background: Antibiotic therapies are the principal treatment against bacterial infections. However, increasing antibiotic resistances pose a major threat to global health care systems, and sepsis patients are particularly affected. Those patients urgently need to be treated with the most effective antibiotic therapy to maximize their chances of survival while simultaneously preventing the development of both individual and global resistances. Consequently, in order to select a proper empiric antibiotic therapy, the treating physicians need to account for many different factors. A clinical decision support system (CDSS) aims to support physicians in deciding on a fast and targeted antibiotic therapy.
Objective: The purpose of this work is to explore the extent to which the realization of a CDSS is possible based on the data available to us, and to document our insights gained during the development of a foundational model designed to assist physicians in determining empiric treatment options for sepsis patients. In this regard, we aim to highlight the importance of close interprofessional collaboration between scientists from various disciplines and to analyze the effects of data quality and quantity on the performance of our statistical models.
Methods: Empirical scientists regularly conducted interviews with medical practitioners in order to acquire medical knowledge required to develop sound statistical models. We developed and applied two-step cross-sectional as well as time series classification models to carefully preprocessed data of sepsis patients admitted to the intensive care unit of a German hospital.
Results: We identified several factors as crucial information for valid decisions on empiric therapy for treating sepsis patients. These include the patients' core data, especially the infection focus. To prevent further resistances, individual risk factors such as travel history and professional background should be considered. The evaluation of a therapy's effectiveness is mainly based on the patient's general condition and blood values such as procalcitonin and interleukin 6. One key factor in the acceptance of a CDSS is the explainability of the results produced by the applied methods. Our models offer mostly moderate but comprehensive predictive ability for all considered empiric antibiotic therapies.
Conclusion: This work highlights the importance of interprofessional collaboration between medical experts and model developers, ensuring that data quality and clinical relevance are central to the process. It emphasizes the urgent need for high-quality, comprehensive data to overcome challenges such as data discontinuity and improve model performance, particularly through enhanced digitization in healthcare. This foundational work will facilitate future efforts to develop a CDSS for treating sepsis patients and to translate it to clinical use.
... To address this, SMOTE was employed. SMOTE generates new synthetic examples of the minority class rather than duplicating existing ones, effectively improving classifier performance on the underrepresented class [26]. ...
... To mitigate the class imbalance, SMOTE was applied exclusively to the training set after subject-wise splitting, effectively balancing the dataset and enhancing the classification accuracy of the developed system [29]. SMOTE's effectiveness in handling class imbalances aligns with findings from healthcare applications, though its application must be carefully managed to prevent data leakage [30,31]. ...
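A minimal sketch of the leakage-safe pattern described above, in which oversampling is confined to the training folds by placing SMOTE inside an imbalanced-learn Pipeline; the data and classifier are placeholders, not the cited study's setup.

```python
# Minimal sketch: SMOTE applied only to training folds via an imbalanced-learn Pipeline.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, weights=[0.85, 0.15], random_state=0)

pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", RandomForestClassifier(random_state=0))])
# cross_val_score refits the pipeline per fold, so SMOTE never sees validation data.
print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())
```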
Background: Parkinson’s disease (PD) is a progressive neurodegenerative condition that impairs motor and non-motor functions. Early and accurate diagnosis is critical for effective management and care. Leveraging machine learning (ML) techniques, this study aimed to develop a robust prediction system for PD using a stacked ensemble learning approach, addressing challenges such as imbalanced datasets and feature optimization. Methods: An open-access PD dataset comprising 22 vocal attributes and 195 instances from 31 subjects was utilized. To prevent data leakage, subjects were divided into training (22 subjects) and testing (9 subjects) groups, ensuring no subject appeared in both sets. Preprocessing included data cleaning and normalization via min–max scaling. The synthetic minority oversampling technique (SMOTE) was applied exclusively to the training set to address class imbalance. Feature selection techniques—forward search, gain ratio, and Kruskal–Wallis test—were employed using subject-wise cross-validation to identify significant attributes. The developed system combined support vector machine (SVM), random forest (RF), K-nearest neighbor (KNN), and decision tree (DT) as base classifiers, with logistic regression (LR) as the meta-classifier in a stacked ensemble learning framework. Performance was evaluated using both recording-wise and subject-wise metrics to ensure clinical relevance. Results: The stacked ensemble learning model achieved realistic performance with a recording-wise accuracy of 84.7% and subject-wise accuracy of 77.8% on completely unseen subjects, outperforming individual classifiers including KNN (81.4%), RF (79.7%), and SVM (76.3%). Cross-validation within the training set showed 89.2% accuracy, with the performance difference highlighting the importance of proper validation methodology. Feature selection results showed that using the top 10 features ranked by gain ratio provided optimal balance between performance and clinical interpretability. The system’s methodological robustness was validated through rigorous subject-wise evaluation, demonstrating the critical impact of validation methodology on reported performance. Conclusions: By implementing subject-wise validation and preventing data leakage, this study demonstrates that proper validation yields substantially different (and more realistic) results compared to flawed recording-wise approaches. The findings underscore the critical importance of validation methodology in healthcare ML applications and provide a template for methodologically sound PD classification research. Future research should focus on validating the model with larger, multi-center datasets and implementing standardized validation protocols to enhance clinical applicability.
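As a hedged illustration of the stacked-ensemble design summarized here (not the study's tuned configuration or vocal dataset), the sketch below combines SVM, RF, KNN, and DT base learners with a logistic-regression meta-learner using scikit-learn's StackingClassifier.

```python
# Hedged sketch of a stacked ensemble with a logistic-regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("knn", KNeighborsClassifier()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000), cv=5)
print("Held-out accuracy:", stack.fit(X_tr, y_tr).score(X_te, y_te))
```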
... Given the severe label imbalance in these tasks, the primary evaluation metric was mean average precision (mAP), specifically the "macro-averaged" AP across classes. While the area under the receiver operating characteristic curve (AUROC) is often used for similar datasets [8,9], it can be heavily inflated in the presence of class imbalance [10,11]. In contrast, mAP is more suitable for long-tailed, multi-label settings as it measures performance across decision thresholds without degrading under class imbalance [12]. ...
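A minimal sketch contrasting macro-averaged AP (mAP) with macro AUROC on a toy multi-label problem with rare labels; the scores are synthetic and purely illustrative.

```python
# Minimal sketch: macro mAP vs. macro AUROC on a toy multi-label problem with rare labels.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random((1000, 5)) < 0.03).astype(int)            # 5 rare labels
y_score = np.clip(y_true * 0.3 + rng.random((1000, 5)), 0, 1)  # weakly informative scores

print("macro mAP  :", average_precision_score(y_true, y_score, average="macro"))
print("macro AUROC:", roc_auc_score(y_true, y_score, average="macro"))
```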
The CXR-LT series is a community-driven initiative designed to enhance lung disease classification using chest X-rays (CXR). It tackles challenges in open long-tailed lung disease classification and enhances the measurability of state-of-the-art techniques. The first event, CXR-LT 2023, aimed to achieve these goals by providing high-quality benchmark CXR data for model development and conducting comprehensive evaluations to identify ongoing issues impacting lung disease classification performance. Building on the success of CXR-LT 2023, the CXR-LT 2024 expands the dataset to 377,110 chest X-rays (CXRs) and 45 disease labels, including 19 new rare disease findings. It also introduces a new focus on zero-shot learning to address limitations identified in the previous event. Specifically, CXR-LT 2024 features three tasks: (i) long-tailed classification on a large, noisy test set, (ii) long-tailed classification on a manually annotated "gold standard" subset, and (iii) zero-shot generalization to five previously unseen disease findings. This paper provides an overview of CXR-LT 2024, detailing the data curation process and consolidating state-of-the-art solutions, including the use of multimodal models for rare disease detection, advanced generative approaches to handle noisy labels, and zero-shot learning strategies for unseen diseases. Additionally, the expanded dataset enhances disease coverage to better represent real-world clinical settings, offering a valuable resource for future research. By synthesizing the insights and innovations of participating teams, we aim to advance the development of clinically realistic and generalizable diagnostic models for chest radiography.
... Over the past two decades, research in imbalanced domain learning has progressed to encompass various topics [25,4,10,17,15,12]. The primary areas of investigation include formalizing the problem, identifying limitations in conventional learning algorithms, developing strategies to address these limitations, and pursuing robust evaluation metrics. ...
Handling imbalanced target distributions in regression tasks remains a significant challenge in tabular data settings where underrepresented regions can hinder model performance. Among data-level solutions, some proposals, such as random sampling and SMOTE-based approaches, adapt classification techniques to regression tasks. However, these methods typically rely on crisp, artificial thresholds over the target variable, a limitation inherited from classification settings that can introduce arbitrariness, often leading to non-intuitive and potentially misleading problem formulations. While recent generative models, such as GANs and VAEs, provide flexible sample synthesis, they come with high computational costs and limited interpretability. In this study, we propose adapting an existing CART-based synthetic data generation method, tailoring it for imbalanced regression. The new method integrates relevance and density-based mechanisms to guide sampling in sparse regions of the target space and employs a threshold-free, feature-driven generation process. Our experimental study focuses on the prediction of extreme target values across benchmark datasets. The results indicate that the proposed method is competitive with other resampling and generative strategies in terms of performance, while offering faster execution and greater transparency. These results highlight the method's potential as a transparent, scalable data-level strategy for improving regression models in imbalanced domains.
... Undersampling approaches address this by removing instances of the majority class to achieve balance. A commonly used undersampling technique is Random Undersampling, a non-heuristic method that serves as a baseline in many studies (40). In Random Undersampling, instances from the majority (negative) class are randomly selected and removed until the number of majority-class instances matches that of the minority (positive) class. ...
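A short sketch of random undersampling as described above, using imbalanced-learn's RandomUnderSampler on synthetic data (illustrative only):

```python
# Hedged sketch: random undersampling of the majority class with imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# Randomly discard majority-class instances until both classes have equal counts.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("after :", Counter(y_rus))
```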
The prediction of protein-protein interactions (PPIs) is essential for understanding biological functions and disease mechanisms. Traditional methods for PPI prediction often focus on physical interactions, overlooking the complex indirect relationships mediated by intermediate proteins. The recent advances in Large Language Models (LLMs) present a novel opportunity to address these challenges. By treating protein sequences as natural language, LLMs can capture both direct and indirect interactions, enhancing prediction capabilities. In this paper, we propose a new framework that leverages a BERT-based LLM fine-tuned specifically for PPI prediction. Our model encodes protein sequences into high-dimensional embeddings, capturing long-range dependencies between amino acids, which are critical for identifying PPIs. By fine-tuning the LLM on a novel PPI dataset, we achieve an accuracy of 93% and an F1-score of 83%. The proposed framework shows improved performance in terms of both prediction accuracy and generalization, demonstrating the potential of LLMs to revolutionize PPI prediction and provide valuable insights for fields such as drug discovery and molecular biology.
... Suppose € is the monetary cost of operations for an LSTID alert, while €̄ identifies the cost of a missed alert. The total cost can then be expressed as a combination of these two terms (Fernández et al. 2018). ... In the calibration phase one can output not only calibrated probability estimates, but also a measure of confidence in these estimates, leading to conceivably safer outcomes. ...
Large-Scale Travelling Ionospheric Disturbances (LSTIDs) are wave-like ionospheric fluctuations, generally triggered by geomagnetic storms, which play a critical role in space weather dynamics. In this work, we present a machine learning model able to forecast the occurrence of LSTIDs over the European continent up to three hours in advance. The model is based on CatBoost, a gradient boosting framework. It is trained on a human-validated LSTID catalogue with various physical drivers, including ionogram information and geomagnetic and solar activity indices. Three forecasting modes are provided for different operational scenarios with varying relative costs of false positives and false negatives. To make the model predictions explainable, the contribution of each physical input factor to the output is visualised through the game-theoretic SHapley Additive exPlanation (SHAP) formalism. The validation procedure consists first of a global-level evaluation and interpretation step, followed by an event-level validation against independent detection methods, which highlights the model’s predictive robustness and suggests its potential for real-time space weather forecasting. Depending on the operating mode, we report an improvement ranging from +72% to +93% over the performance of a rule-based benchmark. Our study concludes with a comprehensive analysis of future research directions and actions to be taken towards full operability. We discuss probabilistic forecasting approaches from a cost-sensitive learning perspective, along with performance-centric model monitoring. Finally, through the lens of the conformal prediction framework, we further comment on uncertainty quantification for end-user risk management and mitigation.
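As a hedged illustration of the cost-sensitive idea behind the forecasting modes (not the paper's trained model, catalogue, or features), the sketch below fits a CatBoost classifier whose class weights penalize missed events more heavily and computes SHAP attributions; all data and weight values are placeholders.

```python
# Illustrative sketch only: cost-sensitive CatBoost via class weights, plus SHAP values.
import numpy as np
import shap
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)

# A higher weight on the positive (event) class penalizes false negatives more heavily.
model = CatBoostClassifier(iterations=300, class_weights=[1.0, 8.0], verbose=False)
model.fit(X, y)

shap_values = shap.TreeExplainer(model).shap_values(X)
print("mean |SHAP| for first five features:", np.abs(shap_values).mean(axis=0)[:5])
```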
... Differences in staining protocols, imaging equipment, and sample preparation techniques across laboratories can impact performance. The identified problems can lead to models that favor the over-represented classes, resulting in poor predictive performance for minority classes [102]. Further, the variations in staining intensity can lead to inconsistent cell morphology, making it harder for models to distinguish WBC subtypes. ...
Data science (DS) methods and artificial intelligence (AI) are critical in today's healthcare services operations. This study focuses on evaluating the effectiveness of AI and DS in biomedical diagnostics, including the automatic detection and counting of white blood cells (WBCs) and their types, which provide valuable information for diagnosing and treating blood diseases such as leukemia. Automating these tasks using AI and DS saves time and avoids or minimizes errors compared to manual processes, which can be complex and error-prone. The study utilizes bibliographic data from SCOPUS to evaluate research on applying AI algorithms and DS methods to map and classify WBC images for the treatment of blood diseases such as leukemia, using a literature survey and science mapping methodology. The results show the potency of different DS methods and AI algorithms, such as machine learning, deep learning, and classification algorithms, that enable the automatic detection of WBCs in images. AI and DS algorithms offer critical benefits in effectively and efficiently analyzing microscopic images of blood cells. The automatic identification, localization, and classification of WBCs speed up the patient diagnosis process, allowing hematologists to focus on interpreting results. Automatic processes identify specific abnormalities and patterns, enhancing accuracy and enabling timely diagnoses. Future work will examine the application of generative AI in blood cell diagnostics.
... A DTC is made up of nodes and edges, where a decision node holds the splitting variable and an edge specifies the value taken by that input variable [46]. At each iteration, the best attribute is selected based on normalised information gain and added as a child node. ...
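A minimal sketch of the split criterion mentioned above, computing the normalised information gain (gain ratio) of a toy categorical attribute from the class and split entropies:

```python
# Minimal sketch: normalised information gain (gain ratio) for a candidate split attribute.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(feature, labels):
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    cond_entropy = sum(w * entropy(labels[feature == v]) for v, w in zip(values, weights))
    info_gain = entropy(labels) - cond_entropy
    split_info = -(weights * np.log2(weights)).sum()   # penalizes many-valued splits
    return info_gain / split_info if split_info > 0 else 0.0

y = np.array([0, 0, 1, 1, 1, 0, 1, 0])
x = np.array(["a", "a", "b", "b", "b", "a", "b", "a"])  # toy categorical attribute
print("gain ratio:", gain_ratio(x, y))
```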
Type 2 Gestational Diabetes Mellitus (GDM) is a chronic, non-communicable condition influenced by genetics as well as lifestyle choices such as an unhealthy diet, being overweight, smoking, and insufficient exercise. Early detection of diabetes mellitus helps people live longer, healthier lives and prevents further health issues. In the modern healthcare system, artificial intelligence tools are utilized to automate disease identification. Machine learning approaches assist clinicians in detecting diabetes mellitus early once patient data have been gathered. In order to identify GDM early, the classifier in this study is trained using the PIMA Indian dataset from the UCI library. This research proposes an improved machine learning technique for GDM identification based on hybrid sampling and the slime mould bio-inspired algorithm. It is divided into three stages. The first stage involves pre-processing the dataset and training the machine learning classifier without sampling. In the second stage, sampling strategies are used to create a balanced dataset and the machine learning classifier is trained for prediction. In the third stage, the optimal sampling strategy is chosen, and the performance of the machine learning methodology is fine-tuned using slime mould metaheuristic optimization. For this research, the popular sampling techniques SMOTE, SMOTE + EN, SMOTE + ENC, SMOTE + TOMEK, and SMOTE + ENN are utilized to address the challenges associated with binary class imbalance learning. For model training, seven popular classifiers are used: Naive Bayes (NBC), Logistic Regression (LRC), Linear SVC (LSVC), Random Forest (RFC), K-Nearest Neighbour (KNN), Decision Tree (DTC), and Extra Trees (ETC). According to the results of this experiment, the hybrid SMOTE + ENN with slime mould metaheuristic tuning of the ETC classifier achieves greater accuracy (98.3%) than the other models.
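As a hedged sketch of the hybrid sampling step named in this abstract, SMOTE followed by Edited Nearest Neighbours cleaning paired with an Extra Trees classifier, the example below uses imbalanced-learn's SMOTEENN on placeholder data rather than the PIMA dataset or the study's tuned settings.

```python
# Hedged sketch: SMOTE+ENN hybrid resampling of the training split, then Extra Trees.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTEENN

X, y = make_classification(n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_res, y_res = SMOTEENN(random_state=0).fit_resample(X_tr, y_tr)
print("resampled class counts:", Counter(y_res))

etc = ExtraTreesClassifier(n_estimators=300, random_state=0).fit(X_res, y_res)
print("held-out accuracy:", etc.score(X_te, y_te))
```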
... However, even in large national datasets, the infrequency of asthma attacks relative to the prevalence of asthma means that it can be challenging to identify potentially causal relationships that may lead to an increased probability of an asthma attack. Furthermore, the imbalance between positive (attack) and negative (no attack) samples in the data poses a practical challenge for developing predictive models [16][17][18]. Therefore, many prediction models report either low (such as below 50%) sensitivity (predicted risk in people who had asthma attacks) or positive predictive value (incidence of asthma attacks in people with high predicted risk) 19. ...
Primary care consultations provide an opportunity for patients and clinicians to assess asthma attack risk. Using a data-driven risk prediction tool with routinely collected health records may be an efficient way to aid the promotion of effective self-management and support clinical decision making. Longitudinal Scottish primary care data for 21,250 asthma patients were used to predict the risk of asthma attacks in the following year. A selection of machine learning algorithms (i.e., Naïve Bayes Classifier, Logistic Regression, Random Forests, and Extreme Gradient Boosting), hyperparameters, and training data enrichment methods was explored and validated on a random unseen data partition. Our final Logistic Regression model achieved the best performance when no training data enrichment was applied. Around 1 in 3 (36.2%) predicted high-risk patients had an attack within one year of consultation, compared to approximately 1 in 16 in the predicted low-risk group (6.7%). The model was well calibrated, with a calibration slope of 1.02 and an intercept of 0.004, and the Area under the Curve was 0.75. This model has the potential to increase the efficiency of routine asthma care by creating new personalized care pathways mapped to the predicted risk of asthma attacks, such as priority ranking patients for scheduled consultations and interventions. Furthermore, it could be used to educate patients about their individual risk and risk factors, and to promote healthier lifestyle changes, the use of self-management plans, and early emergency care seeking following rapid symptom deterioration.
... Machine learning models, including neural networks and gradient boosted trees, often suffer from calibration issues (Zadrozny and Elkan 2001; Guo et al. 2017; Kuleshov, Fenner, and Ermon 2018; Fernández et al. 2018). Calibration refers to the alignment of a model's predicted probabilities with the actual likelihood of outcomes. ...
Probability calibration transforms the raw output of a classification model into an empirically interpretable probability. When the model is intended to detect rare events and only a small, expensive data source has clean labels, it becomes extraordinarily challenging to obtain accurate probability calibration. Utilizing an additional large, cheap data source is very helpful; however, such data sources often suffer from biased labels. To this end, we introduce an approximate expectation-maximization (EM) algorithm to extract useful information from the large data sources. For a family of calibration methods based on the logistic likelihood, we derive closed-form updates and call the resulting iterative algorithm CalEM. We show that CalEM inherits convergence guarantees from the approximate EM algorithm. We test the proposed model in simulation and on real marketing datasets, where it shows significant performance increases.
... In situations of class imbalance, models tend to produce more accurate predictions for the majority class, while performance drops drastically for the minority class [6]. This becomes a crucial issue when certain bean varieties that may be under-represented actually have high economic or nutritional value [7], [8]. ...
Imbalanced data is one of the main challenges in classification problems concerning quality or disease in agriculture. This study aims to implement the XGBoost algorithm for classifying dry bean varieties, with a focus on handling class imbalance through class weighting. The dataset consists of seven types of dry beans with various physical characteristics measured in pixels, covering dimensional and shape features. Normalization was performed using min-max normalization to ensure a consistent data scale. To handle class imbalance, class weighting was applied in XGBoost, assigning greater weight to the minority classes. Grid Search with 5-fold cross-validation was used to find the best hyperparameter combination, yielding a cross-validation accuracy of 92.5% and a best score of 92.8%. Model evaluation on the test data showed an accuracy of 93%, with balanced precision, recall, and F1-scores across all classes. These results show that XGBoost with class weighting can address class imbalance and deliver high accuracy in dry bean classification.
... This strategy is especially useful for our dataset, where one class is disproportionately over-represented (Rezvani and Wang, 2023). Although oversampling can be advantageous in certain scenarios, it carries the risk of overfitting by duplicating examples from minority classes, which can reduce the generalizability of the model (Fernández et al., 2018; Estabrooks et al., 2004). ...
Introduction
Brain tumors are a leading cause of mortality worldwide, with early and accurate diagnosis being essential for effective treatment. Although Deep Learning (DL) models offer strong performance in tumor detection and segmentation using MRI, their black-box nature hinders clinical adoption due to a lack of interpretability.
Methods
We present a hybrid AI framework that integrates a 3D U-Net Convolutional Neural Network for MRI-based tumor segmentation with radiomic feature extraction. Dimensionality reduction is performed using machine learning, and an Adaptive Neuro-Fuzzy Inference System (ANFIS) is employed to produce interpretable decision rules. Each experiment is constrained to a small set of high-impact radiomic features to enhance clarity and reduce complexity.
Results
The framework was validated on the BraTS2020 dataset, achieving an average DICE Score of 82.94% for tumor core segmentation and 76.06% for edema segmentation. Classification tasks yielded accuracies of 95.43% for binary (healthy vs. tumor) and 92.14% for multi-class (healthy vs. tumor core vs. edema) problems. A concise set of 18 fuzzy rules was generated to provide clinically interpretable outputs.
Discussion
Our approach balances high diagnostic accuracy with enhanced interpretability, addressing a critical barrier to applying DL models in clinical settings. The integration of ANFIS and radiomics supports transparent decision-making, facilitating greater trust and applicability in real-world medical diagnostic assistance.
... Class imbalance is widely faced in various fields, particularly in the medical domain. In such datasets, one class often contains far fewer samples than the other [1]. Such training examples can bias traditional learning classifiers, as they will most likely favor the majority class and fail to correctly identify rare instances [2,3]. ...
... The well-known ensemble learning method based on bagged trees was considered for its good performance in various applications and its ability to handle imbalanced classes. The associated performance was evaluated according to the following metrics [46][47][48]: True Positive Rate (or recall, or sensitivity), False Negative Rate, Positive Predictive Value, False Discovery Rate, Area Under the Curve, Accuracy Rate, and Total Cost. The execution time is given for comparison purposes based on standard PC usage. ...
Despite tremendous efforts devoted to the area, image texture analysis is still an open research field. This paper presents an algorithm and experimental results demonstrating the feasibility of developing automated tools to detect abnormal X-ray images based on tissue attenuation. Specifically, this work proposes using the variability characterised by singular values and conditional indices extracted from the singular value decomposition (SVD) as image texture features. In addition, the paper introduces a “tuning weight” parameter to consider the variability of the X-ray attenuation in tissues affected by pathologies. This weight is estimated using the coefficient of variation of the minimum covariance determinant from the bandwidth yielded by the non-parametric distribution of variance-decomposition proportions of the SVD. When multiplied by the two features (singular values and conditional indices), this single parameter acts as a tuning weight, reducing misclassification and improving the classic performance metrics, such as true positive rate, false negative rate, positive predictive values, false discovery rate, area-under-curve, accuracy rate, and total cost. The proposed method implements an ensemble bagged trees classification model to classify X-ray chest images as COVID-19, viral pneumonia, lung opacity, or normal. It was tested using a challenging, imbalanced chest X-ray public dataset. The results show an accuracy of 88% without applying the tuning weight and 99% with its application. The proposed method outperforms state-of-the-art methods, as attested by all performance metrics.
... Despite this imbalance, our model achieved good performance on metrics that are unaffected by class imbalance, such as the F1 score. Furthermore, the chosen algorithm, XGBoost, demonstrates robustness to class imbalance and may not require additional techniques, such as SMOTE, to artificially rectify the imbalanced distribution [47,48]. Third, we analyzed a relatively small sample size (n = 147). ...
Diabetic foot infections (DFIs) are a prevalent diabetes-related complication. Managing DFIs requires timely antibiotic treatment but identifying the best antibiotic often depends on microbiological cultures, which can take days and may be unavailable or prohibitively expensive in resource-limited settings.
We aimed to develop a classification model that uses readily available clinical and laboratory data to differentiate between DFIs that are Gram+ resistant, Gram- resistant, or neither.
We used retrospective data from patients treated for DFIs at a hospital in Lima, Peru. Gram+ multidrug-resistant bacteria (MDRB) included MDR species of Staphylococcus aureus, other Staphylococcus, and Enterococcus, whereas Gram- MDRB included MDR species of Enterobacteriaceae, Pseudomonas, and Acinetobacter. Twenty clinical (e.g., Wagner classification) and laboratory (e.g., HbA1c) variables were used as predictors in an XGBoost model, which was internally validated.
One hundred forty-seven patients were included, predominantly male (75.1%), with a mean age of 59.7 years. Of these, 19.7% had no MDRB, 34.0% had Gram+ MDRB, and 46.3% had Gram- MDRB. The model achieved an overall F1 score of 83.9%. The highest precision (91.8%) was observed for the Gram- class; the highest recall (93.3%) was observed for the Gram+ class. The Gram+ class was correctly classified 75% of the time; the Gram- class had a correct classification rate of 90%.
Our work suggests it is possible to distinguish between DFIs that are non-MDR, Gram+ MDR, or Gram- MDR using readily available information. Although further validation is required, this model offers promising evidence for a digital bedside tool to guide empirical antibiotic treatment for DFIs.
... By optimizing the threshold value using a validation set, we can learn the cost matrix from the training data. Overall, this allows us to train an effective cost-sensitive classifier even when the cost matrix is not initially known [11]. ...
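A minimal sketch of validation-set threshold tuning in the spirit of the passage above: candidate thresholds are swept on held-out predictions and the best one is kept (F1 is used here as the objective; a cost-weighted criterion could be substituted once costs are known). The data and model are placeholders.

```python
# Minimal sketch: choose a decision threshold on a validation set instead of using 0.5.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=5000, weights=[0.92, 0.08], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds, key=lambda t: f1_score(y_val, probs >= t))
print("best threshold:", best, "validation F1:", f1_score(y_val, probs >= best))
```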
Supervised learning is challenging when dealing with imbalanced data. There are various ways to tackle this dilemma, such as generating synthetic data and modifying classification algorithms. Over-sampling techniques allow us to obtain additional data for training and improving efficacy, but they may introduce some fuzzy noise. Our research focuses on the sensitivity costing technique, and we compare several algorithms that include cost balancing to effectively overcome the data imbalance across layers. We conducted extensive experiments with the dataset and found that applying the sensitivity cost technique improved classification results compared to training on the imbalanced dataset directly. Grid-search class weights consistently outperformed other methods, producing better ROC AUC scores for XGBoost, LightGBM, and a Keras neural network (Keras-NN).
... However, the dataset we studied exhibits an imbalanced distribution, which negatively impacts the model's predictive ability. To address such datasets, techniques such as ensemble learning, cost-sensitive learning, category weights, and resampling methods, among others, can substantially improve the model's performance 31,32. In this study, we utilized ensemble learning techniques to augment the model's generalization and stability. ...
Lymph node metastasis is a critical factor for determining therapeutic strategies and assessing the prognosis of early gastric cancer. This work aimed to establish a more dependable predictive model for identifying lymph node metastasis in early gastric cancer. The study utilized both univariate and multivariate logistic regression analyses to identify independent risk factors for lymph node metastasis of early gastric cancer, while employing five distinct algorithms to calculate feature weights. The optimal feature combination for each algorithm was determined by combining the six highest-weight features from all five models along with the independent risk factors. An ensemble learning model was subsequently constructed by integrating these five models. The model’s performance was evaluated by the AUC, accuracy, and F1 score. Following this, a threshold was determined based on the F1 score, and the model’s performance was assessed using an external validation set. The lymph node metastasis rate of early gastric cancer in our study was 16.4%. The ensemble learning model achieved an AUC value of 0.860 in the test set, with an accuracy of 82.35% and an F1 score of 0.611. Based on the F1 score, the model’s threshold was set at 0.18. Additionally, the model demonstrated an AUC of 0.892 in the external validation set, along with an accuracy of 78.30% and an F1 score of 0.60. We constructed an ensemble learning model for predicting lymph node metastasis of early gastric cancer. Gastric surgery should be considered as the preferred treatment when the risk of lymph node metastasis exceeds 18%.
... A subclass of data-level strategies, named synthetic procedures, generates new samples in the minority class, with many variants introduced in the literature [5,12,10]. One key characteristic is that most of them are primarily designed to handle numerical features and thus do not handle categorical features [7,26]. In practice, categorical features are very common in tabular data (e.g. ...
This study investigates rare event detection on tabular data within binary classification. Standard techniques to handle class imbalance include SMOTE, which generates synthetic samples from the minority class. However, SMOTE is intrinsically designed for continuous input variables. In fact, despite SMOTE-NC, its default extension to handle mixed features (continuous and categorical variables), very few works propose procedures to synthesize mixed features. On the other hand, many real-world classification tasks, such as those in the banking sector, deal with mixed features, which have a significant impact on predictive performance. To this end, we introduce MGS-GRF, an oversampling strategy designed for mixed features. This method uses a kernel density estimator with locally estimated full-rank covariances to generate continuous features, while categorical ones are drawn from the original samples through a generalized random forest. Empirically, contrary to SMOTE-NC, we show that MGS-GRF exhibits two important properties: (i) coherence, i.e. the ability to only generate combinations of categorical features that are already present in the original dataset, and (ii) association, i.e. the ability to preserve the dependence between continuous and categorical features. We also evaluate the predictive performance of LightGBM classifiers trained on datasets augmented with synthetic samples from various strategies. Our comparison is performed on simulated and public real-world datasets, as well as on a private dataset from a leading financial institution. We observe that synthetic procedures that have the properties of coherence and association display better predictive performance in terms of various predictive metrics (PR and ROC AUC...), with MGS-GRF being the best one. Furthermore, our method exhibits promising results for the private banking application, with a development pipeline compliant with regulatory constraints.
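As a hedged illustration of the SMOTE-NC baseline mentioned in this abstract (not the proposed MGS-GRF method), the sketch below oversamples a toy mixed continuous/categorical dataset with imbalanced-learn's SMOTENC; the data and column layout are illustrative.

```python
# Hedged sketch: SMOTE-NC oversampling of a mixed continuous/categorical dataset.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTENC

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([rng.normal(size=n),              # continuous feature
                     rng.normal(size=n),              # continuous feature
                     rng.integers(0, 3, size=n)])     # categorical feature (integer codes)
y = (rng.random(n) < 0.08).astype(int)                # rare positive class

# categorical_features lists the column indices holding categorical codes.
X_res, y_res = SMOTENC(categorical_features=[2], random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```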
... The SMOTE technique has been extensively validated in various medical prediction tasks and is recognized for its effectiveness in improving model performance in unbalanced datasets 36. Using this approach, we ensured that the predictive model could better identify patterns in groups of underrepresented patients, ultimately improving the overall accuracy of the prediction. ...
Background: Heart Failure (HF) and Hypertension (HTN) are common yet severe cardiovascular conditions, both of which significantly increase the risk of adverse outcomes. Patients with comorbid HF and HTN face an elevated risk of mortality. Despite the importance of early risk assessment, current ICU management strategies struggle to accurately predict mortality, limiting effective clinical interventions. To address this gap, we developed a machine learning model to predict 28-day mortality in ICU patients with HF and HTN.
Methods: We extracted data from the MIMIC-IV database, identifying a cohort of 10,010 patients with both HF and HTN, among whom the 28-day mortality rate was 9.82%. The dataset was randomly split into training (70%) and testing (30%) cohorts. Feature selection was performed using a combination of the SelectKBest and Recursive Feature Elimination (RFE) wrapper methods. To address the class imbalance, we employed the Synthetic Minority Over-sampling Technique (SMOTE). Six machine learning models were developed and evaluated: Random Forest (RF), XGBoost, LightGBM, AdaBoost, Logistic Regression (LR), and a Neural Network (NN).
Results: A total of 18 features were selected. The best-performing model was LightGBM, achieving an AUROC of 0.8921 (95% CI: 0.8694-0.9118) with a sensitivity of 0.7941 and a specificity of 0.8391 at an optimal threshold of 0.2067. External validation further demonstrated strong performance with an AUROC of 0.7404 (95% CI: 0.7130-0.7664).
Conclusion: Our proposed model achieved an AUROC improvement of 16.8% compared to the best existing study on the same topic, while reducing the number of predictive features by 18.2%. This enhanced model underscores the potential of leveraging these selected features and our LightGBM model as a valuable tool in enhancing resource allocation and providing more personalized interventions for HF patients with HTN in ICU settings.
... The different methods of oversampling were applied to avoid possible model bias towards more represented taxa. The following four oversampling techniques used in previous studies (e.g., Nguyen, Cooper & Kamei, 2011; Douzas, Bacao & Last, 2018; Wills, Underwood & Barrett, 2020) were evaluated: (1) Random Oversampling (RandomOverSampler), which repeats values (Fernández et al., 2018) and is a non-heuristic algorithm; (2) SMOTE, which generates synthetic samples for the minority class by interpolating between existing minority samples (Chawla et al., 2002); (3) Borderline Synthetic Minority Oversampling Technique (BorderlineSMOTE), which is a variant of SMOTE but focuses on generating synthetic samples near the borderline of the classes (Han, Wang & Mao, 2005); and (4) K-Means Synthetic Minority Oversampling Technique (KMeansSMOTE), which uses K-Means clustering to generate synthetic samples by clustering the minority class and then applying SMOTE within each cluster (Douzas, Bacao & Last, 2018). We then compared three scaling methods, chosen following Ahsan et al. (2021), who analyze the effect of different standardization techniques on the performance of diverse ML models. ...
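A hedged sketch comparing the four oversamplers listed above on a toy imbalanced dataset, using their imbalanced-learn implementations; the estimator and threshold passed to KMeansSMOTE are illustrative choices, not the study's settings.

```python
# Hedged sketch: the four oversampling techniques applied to the same toy dataset.
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE, KMeansSMOTE

X, y = make_classification(n_samples=2000, weights=[0.85, 0.15], random_state=0)

samplers = {
    "RandomOverSampler": RandomOverSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
    "BorderlineSMOTE": BorderlineSMOTE(random_state=0),
    "KMeansSMOTE": KMeansSMOTE(kmeans_estimator=KMeans(n_clusters=4, random_state=0),
                               cluster_balance_threshold=0.05, random_state=0),
}
for name, sampler in samplers.items():
    _, y_res = sampler.fit_resample(X, y)
    print(f"{name:18s} -> {Counter(y_res)}")
```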
Classifying objects, such as the taxonomic identification of fossils based on morphometric variables, is a time-consuming process. This task is further complicated by intra-class variability, which makes it ideal for automation via machine learning (ML) techniques. In this study, we compared six different ML techniques based on datasets with morphometric features used to classify isolated theropod teeth at both genus and higher taxonomic levels. Our model also intends to differentiate teeth from different positions on the tooth row (e.g., lateral, mesial). These datasets present different challenges, such as over-representation of certain classes and missing measurements. Given the class imbalance, we evaluate the effect of different standardization and oversampling techniques on the classification process for different classification models. The obtained results show that some classification models are more sensitive to class imbalance than others. This study presents a novel comparative analysis of multi-class classification methods for theropod teeth, evaluating their performance across varying taxonomic levels and dataset balancing techniques. The aim of this study is to evaluate which ML methods are more suitable for the classification of isolated theropod teeth, providing recommendations on how to deal with imbalanced datasets using different standardization, oversampling, and classification tools. The trained models and applied standardizations are made publicly available, providing a resource for future studies to classify isolated theropod teeth. This open-access methodology will enable more reliable cross-study comparisons of fossil records.
This study develops a predictive model for loan default in Indonesia’s multifinance industry by implementing and comparing three machine learning methods: Logistic Regression, Support Vector Machine (SVM) with RBF kernel, and XGBoost. Using imbalanced datasets from three multifinance companies representing different portfolio characteristics (vehicle, multipurpose, and electronics financing), the research applies the SMOTE technique to address class imbalance and enhance model sensitivity. Results show that XGBoost outperforms both Logistic Regression and SVM in accuracy (0.9970), precision (0.9482), recall (0.9987), and AUC (0.9996), while also being the most computationally efficient. Feature importance analysis highlights late payment history, financial ratios, credit scores, and demographic variables as key predictors, with XGBoost capturing complex non-linear interactions. The study introduces a novel multi-layered framework for credit risk management, including scoring engines, early warning systems, and segment-based risk strategies. Segment analysis reveals higher default risks among younger, divorced, and less-educated borrowers, as well as for unsecured loans and high debt-to-income ratios. The model’s adaptability across varying institutional datasets demonstrates the need for company-specific calibration. Compared to previous single-model or single-company approaches, this research provides a comprehensive, scalable, and high-performing solution for predictive credit risk modeling in the Indonesian context. Simulation results suggest that the implementation of this framework could reduce NPF by up to 2.3 percentage points and enhance risk-adjusted returns by 3.8–4.2%, offering substantial practical value to multifinance companies.
Background
Southeast Asia regularly experiences severe haze events driven by transboundary pollution, significantly impacting public health. Accurate short-term forecasting of particulate matter concentrations, especially PM10, is crucial for timely interventions.
Objective
To improve the prediction of hourly PM10 pollution levels by integrating topological data analysis (TDA) with attention-based convolutional neural networks (ABCNNs), focusing on classifying air quality into eight severity levels.
Methods
The proposed framework combines CNNs, self-attention mechanisms, and persistent homology-derived topological features from three key environmental variables. PM10 category labels were predicted 6, 12, and 24 hours ahead. Data from 15 stations in Malaysia (2019–2020) were used, with feature selection based on correlation analysis. Performance was benchmarked against standard models including Random Forest, Support Vector Classifier, and traditional ABCNNs.
Results
Topological ABCNNs outperformed all baseline models across all prediction horizons. For 6-hour predictions, the model achieved an average accuracy of 0.9677 and F1 score of 0.9770. For 12- and 24-hour predictions, average accuracies were 0.9512 and 0.9086, respectively. The model also maintained robust performance across regions and better predicted rare high-pollution events.
Conclusion
Incorporating topological features into ABCNNs significantly enhances predictive performance for air pollution classification. This hybrid model offers a scalable and accurate tool for environmental monitoring and public health planning, particularly in regions vulnerable to haze pollution.
Early hospital admission prediction at the triage stage is an important and challenging task for emergency departments (EDs), aimed at effectively managing and utilizing limited medical resources for critical patients. A retrospective study was conducted at MacKay Memorial Hospital (MMH) from 2011 to 2018, including 1,061,760 records of valid patients, using logistic regression (LR), eXtreme Gradient Boosting (XGBoost), Word2Vec, and bidirectional encoder representations from transformers (BERT). The chief complaints (CCs) and limited structured variables collected at triage are considered predictor variables. The results show that XGBoost achieves better prediction than LR with patient structured variables and better prediction than Word2Vec with patient CCs in terms of AUC and F-measure. We further propose the novel concept of generating expanded CCs as BERT input by integrating the original CCs with selected structured variables using XGBoost to predict the probability of patient admission. Among the structured variables, triage category, mode of arrival, age, arrival time, and fever status are the most important. This study demonstrates BERT's (in particular, BERT-ROS with 5 variables) superior prediction capability compared to other models considering only patient CCs or expanded CCs, in terms of AUC and F-measure. Moreover, given the low admission rates in Taiwan's EDs, this study employs imbalanced data processing to show that the proposed method enhances the predictive capability for hospitalization. These experimental results provide a reference model with associated variables for developing a hospital admission tool at triage, supporting the risk stratification of critical patients.
This study identifies drought events using the Standardized Precipitation Index (SPI) and applies four machine learning models for drought prediction: Support Vector Machines (SVM), Random Forest (RF), Gradient Boosting (GB), and Logistic Regression. Meteorological data from 12 stations across Punjab, Pakistan, covering the northern, eastern, and central regions, were utilised. Independent variables included average temperature, specific humidity, soil moisture, and dew point. To address model-specific challenges, ridge regression was applied to mitigate multicollinearity in Logistic Regression, while SVM incorporated the Radial Basis Function (RBF) kernel and an isolation forest to manage non-linearity and outliers. RF and GB were implemented without additional modifications. The Synthetic Minority Oversampling Technique (SMOTE) was used to handle class imbalance. Model performance was assessed using accuracy, precision, recall, F1-score, specificity, and the area under the receiver operating characteristic curve (ROC-AUC). SVM demonstrated the highest predictive capability with an ROC-AUC of 0.8166, followed by Logistic Regression (0.8024), while RF and GB recorded values of 0.742 and 0.7422, respectively. These findings highlight the superior performance of SVM in drought prediction and emphasise the importance of model selection and data preprocessing in enhancing drought monitoring for improved water resource management and climate adaptation strategies.
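A rough sketch of the preprocessing-plus-SVM recipe described above, under the assumption that the isolation forest is used to filter training outliers before SMOTE; the data below are synthetic placeholders for the station records and the drought label, so the printed score is only illustrative.

```python
# Sketch only: Isolation Forest outlier filtering, SMOTE balancing, RBF-SVM, ROC-AUC.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
n = 4000
y = (rng.random(n) < 0.15).astype(int)                      # synthetic SPI-derived drought flag
X = np.column_stack([rng.normal(30, 5, n) + 4 * y,          # average temperature
                     rng.normal(60, 10, n) - 12 * y,        # specific humidity
                     rng.normal(0.30, 0.08, n) - 0.10 * y,  # soil moisture
                     rng.normal(18, 4, n) - 3 * y])         # dew point

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

# Filter training outliers with Isolation Forest, then balance classes with SMOTE.
keep = IsolationForest(random_state=0).fit_predict(X_tr) == 1
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr[keep], y_tr[keep])

scaler = StandardScaler().fit(X_res)
svm = SVC(kernel="rbf", probability=True).fit(scaler.transform(X_res), y_res)
print("ROC-AUC:", roc_auc_score(y_te, svm.predict_proba(scaler.transform(X_te))[:, 1]))
```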
Program Keluarga Harapan (PKH) is a form of social protection provided by the government to overcome poverty in Indonesia. However, challenges remain in accurately predicting eligible households, so a data-based classification method is needed to identify PKH recipients based on their characteristics. This research was conducted in West Sumatra Province using variables from the Data Terpadu Kesejahteraan Sosial (DTKS) variable group contained in SUSENAS 2024. Based on data from Badan Pusat Statistik (BPS) of West Sumatera Province, there are 1,790 PKH recipient households and 9,810 non-recipient households, indicating a class imbalance. Given the large amount of data and complex variables, PKH recipiency can be analyzed using the Extreme Gradient Boosting (XGBoost) algorithm because of its ability to handle large-scale data and produce high classification performance. Applying XGBoost with only the scale_pos_weight parameter yielded low classification performance, with a sensitivity of 12.3% and a balanced accuracy of 55.2%. To overcome this, the class imbalance was handled with the Adaptive Synthetic (ADASYN) oversampling method before analysis; XGBoost trained on the ADASYN-balanced data showed a significant improvement, with a sensitivity of 80.4% and a balanced accuracy of 88.1%. In classifying PKH recipient households, the variables making an important contribution are the age of the head of household, floor area, diploma of the head of household, floor material, and number of household members. This research shows that the combination of XGBoost and ADASYN is effective in overcoming data imbalance and improving PKH recipient classification performance.
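The two strategies compared above can be sketched as follows, using synthetic stand-in data at roughly the same imbalance ratio; the scale_pos_weight value is derived from the class ratio and ADASYN comes from imbalanced-learn. This is an illustration, not the study's code.

```python
# Sketch: XGBoost with scale_pos_weight vs. XGBoost after ADASYN oversampling.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, balanced_accuracy_score
from imblearn.over_sampling import ADASYN
from xgboost import XGBClassifier

# Synthetic stand-in for the household data (~15% recipients, as in the DTKS counts).
X, y = make_classification(n_samples=11600, n_features=12, weights=[0.85], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=1)

# Strategy 1: cost-sensitive weighting via scale_pos_weight (majority/minority ratio).
w = (y_tr == 0).sum() / (y_tr == 1).sum()
xgb_weighted = XGBClassifier(scale_pos_weight=w, eval_metric="logloss").fit(X_tr, y_tr)

# Strategy 2: ADASYN oversampling of the minority class before an unweighted XGBoost.
X_res, y_res = ADASYN(random_state=1).fit_resample(X_tr, y_tr)
xgb_adasyn = XGBClassifier(eval_metric="logloss").fit(X_res, y_res)

for name, model in [("scale_pos_weight", xgb_weighted), ("ADASYN", xgb_adasyn)]:
    pred = model.predict(X_te)
    print(f"{name}: sensitivity={recall_score(y_te, pred):.3f} "
          f"balanced accuracy={balanced_accuracy_score(y_te, pred):.3f}")
```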
The early detection of dementia, a condition affecting both individuals and society, is essential for its effective management. However, reliance on advanced laboratory tests and specialized expertise limits accessibility, hindering timely diagnosis. To address this challenge, this study proposes a novel approach in which readily available biochemical and physiological features from electronic health records are employed to develop a machine learning-based binary classification model, improving accessibility and early detection. A dataset of 14,763 records from Phachanukroh Hospital, Chiang Rai, Thailand, was used for model construction. A hybrid data enrichment framework involving feature augmentation and data balancing was proposed to increase the dimensionality of the data. Medical domain knowledge was used to generate inter-relation-based features (IRFs), which improve data diversity and promote explainability by making the features more informative. For data balancing, the K-Means Synthetic Minority Oversampling Technique (K-Means SMOTE) was applied to generate synthetic samples in under-represented regions of the feature space, addressing class imbalance. Extra Trees (ET) was used for model construction due to its noise resilience and ability to manage multicollinearity. The performance of the proposed method was compared with that of Support Vector Machine, K-Nearest Neighbors, Artificial Neural Networks, Random Forest, and Gradient Boosting. The results reveal that the ET model significantly outperformed the other models on the combined dataset with four IRFs and K-Means SMOTE across key metrics, including accuracy (96.47%), precision (94.79%), recall (97.86%), F1 score (96.30%), and area under the receiver operating characteristic curve (99.51%).
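A hedged sketch of the enrichment idea follows, with an invented inter-relation feature (a BUN-to-creatinine ratio), K-Means SMOTE for balancing, and Extra Trees; the synthetic data and the loosened cluster-balance threshold are assumptions made only so the example runs, and the IRF choice is not the study's.

```python
# Sketch only: one ratio-style IRF, K-Means SMOTE balancing, Extra Trees classifier.
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.over_sampling import KMeansSMOTE

# Synthetic stand-in for routine EHR measurements; positives are shifted so the
# minority class forms clusters, which K-Means SMOTE needs to find candidate regions.
rng = np.random.default_rng(7)
n = 6000
y = (rng.random(n) < 0.06).astype(int)
df = pd.DataFrame({"bun": rng.normal(15, 4, n) + 10 * y,
                   "creatinine": rng.normal(1.0, 0.2, n) + 0.6 * y,
                   "glucose": rng.normal(100, 20, n) + 40 * y})
df["bun_creatinine_ratio"] = df["bun"] / df["creatinine"]   # example inter-relation feature (IRF)

X_tr, X_te, y_tr, y_te = train_test_split(df, y, stratify=y, test_size=0.2, random_state=7)

# K-Means SMOTE synthesizes minority samples inside minority-dense clusters;
# the threshold is loosened here only so the toy example runs reliably.
sampler = KMeansSMOTE(cluster_balance_threshold=0.1, random_state=7)
X_res, y_res = sampler.fit_resample(X_tr, y_tr)

et = ExtraTreesClassifier(n_estimators=300, random_state=7).fit(X_res, y_res)
print(classification_report(y_te, et.predict(X_te), digits=3))
```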
This study introduces a machine learning framework to predict the suitability of ionic liquids with unknown physical properties as propellants for electrospray thrusters based on their molecular structure. We construct a training dataset by labeling ionic liquids as suitable (+1) or unsuitable (-1) for electrospray thrusters based on their density, viscosity, and surface tension. The ionic liquids are represented by their molecular descriptors calculated using the Mordred package. The dataset is extremely imbalanced due to the scarcity of suitable candidates; to mitigate this, we apply a combination of oversampling the minority class and undersampling the majority class. We evaluate four machine learning algorithms (Logistic Regression, Support Vector Machine (SVM), Random Forest, and Extreme Gradient Boosting (XGBoost)), with SVM demonstrating superior predictive performance. The SVM predicts 193 candidate propellants from a dataset of ionic liquids with unknown physical properties. Further, we employ Shapley Additive Explanations (SHAP) to assess and rank the impact of individual molecular descriptors on model decisions.
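The combined over- and undersampling step, the SVM, and the SHAP attribution can be sketched as below; the descriptor matrix is a synthetic placeholder rather than actual Mordred output, and the sampling ratios are illustrative assumptions rather than the authors' settings.

```python
# Sketch: SMOTE oversampling + random undersampling, RBF-SVM, and KernelSHAP attribution.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Stand-in for the Mordred descriptor matrix (extremely imbalanced, ~3% suitable).
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.97], random_state=0)

# Combined resampling: oversample the minority class, then trim the majority class.
X_over, y_over = SMOTE(sampling_strategy=0.3, random_state=0).fit_resample(X, y)
X_res, y_res = RandomUnderSampler(sampling_strategy=0.6, random_state=0).fit_resample(X_over, y_over)

svm = SVC(kernel="rbf", probability=True).fit(X_res, y_res)

# Model-agnostic SHAP attribution over a small background sample.
background = X_res[np.random.default_rng(0).choice(len(X_res), size=50, replace=False)]
explainer = shap.KernelExplainer(svm.predict_proba, background)
shap_values = explainer.shap_values(X[:10])   # per-descriptor contributions for 10 liquids
print(np.array(shap_values).shape)
```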
A fine, up-to-date DEM of the riparian zone is urgently required for hydrological and environmental modeling. Because of water-level fluctuation and submersion, acquiring a DEM of the riparian zone is difficult. In this study, a self-adaptive ROC (Receiver Operating Characteristic) method was proposed to generate the DEM of the riparian zone with the aid of water fluctuation (referred to as the ROC DEM). The water extent acts as an altimeter, measuring altitude through the corresponding water level. The elevation of each pixel was taken as the optimal threshold for determining whether it was covered by water. As the water fluctuated, the time-series land/water labels and the corresponding water level of each pixel were recorded. Finally, the elevation of each pixel was obtained by finding the optimal classification threshold with the ROC algorithm. The proposed method thus transforms the problem of acquiring the elevation of the riparian zone into a two-class classification problem. The riparian zone of Dongting Lake was used as the study area to test the proposed method with time-series Sentinel-1 SAR images and the corresponding water levels. The proposed method is feasible for obtaining the DEM, and its results are more consistent with the actual topography than other DEM products; the R² values between the ROC DEM and the GLAS and field-measured altitudes reach 0.7 and 0.9, respectively. This paper presents an alternative method for acquiring the topography of riparian zones and tidal flats.
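The per-pixel thresholding idea translates naturally into a short ROC computation. The sketch below is a simplification (not the authors' code): for one pixel it picks the water level that maximizes Youden's J over the pixel's wet/dry time series and takes that level as the pixel elevation; the toy series and the true elevation are invented.

```python
# Sketch: per-pixel elevation as the ROC-optimal water-level threshold (Youden's J).
import numpy as np
from sklearn.metrics import roc_curve

def pixel_elevation(is_water, water_level):
    """is_water: 0/1 labels from SAR classification; water_level: gauge readings (m)."""
    fpr, tpr, thresholds = roc_curve(is_water, water_level)
    j = tpr - fpr                        # Youden's J statistic
    return thresholds[np.argmax(j)]

# Toy example: the pixel floods whenever the lake level exceeds roughly 31 m.
levels = np.array([28.0, 29.5, 30.2, 30.9, 31.3, 32.0, 33.1, 29.0])
labels = (levels > 31.0).astype(int)
print("Estimated elevation (m):", pixel_elevation(labels, levels))
```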
This systematic review comprehensively examines the application and impacts of Educational Data Mining (EDM) over the past decade. It explores the use of various data mining tools and techniques, statistics, and machine learning algorithms in education. The review discusses how EDM helps understand and improve the learning experience, educational strategies, and institutional efficiency. It highlights the iterative process of EDM, its applications, and the benefits it offers to different stakeholders, including students, teachers, and educational institutions. The paper also discusses the challenges related to data ethics, privacy, and security in EDM. Key sections include a methodology for conducting the systematic review, exploring different data mining techniques and learning styles, and using Artificial Intelligence in EDM. The review concludes with a discussion of findings, future research directions, and a summary of the study’s contributions and limitations.
Anomaly detection in medical imaging is pivotal for early diagnosis and treatment planning. However, the inherent class imbalance in medical datasets poses significant challenges, often leading to biased models that underperform on minority classes. This study investigates the integration of the Synthetic Minority Over-sampling Technique (SMOTE) with various machine learning and deep learning models to enhance anomaly detection in medical images. By applying SMOTE to balance datasets and evaluating its impact across multiple models, we demonstrate improved detection accuracy, sensitivity, and specificity. The findings underscore the efficacy of SMOTE in addressing class imbalance, thereby enhancing the reliability of anomaly detection systems in medical imaging.
Introduction: The structural disambiguation of English multiword terms (MWTs) of three or more constituents (e.g., coastal sediment transport), often known as bracketing, involves the grouping of the dependent components so that the MWT is reduced to its basic form of modifier+head, as in coastal [sediment transport], which is a right-bracketed ternary compound. This work presents a study that explored whether the bracketing of a ternary compound, when used as an argument in a sentence, can be predicted from the semantic information encoded in that sentence. Methodology: A set of 1,694 sentences was analyzed semantically and annotated with the lexical domain of the verbs, the semantic role and category of the arguments, and the semantic relation between the arguments. These semantic variables were then analyzed statistically to determine whether they are able to predict the bracketing of a ternary compound. Results: A random forest model, with the lexical domain of the verb and the semantic role and category of the MWT, was able to predict the bracketing of the ternary compounds used as arguments in a sample of 380 MWTs (100% F1-score). A decision tree, with solely the semantic relation of the MWT to another argument in the same sentence, was also able to predict the bracketing of the ternary compounds in the sample (94.12% F1-score). Discussion: Only a subset of three variables was necessary for error-free bracketing prediction, whereas previous research employed a minimum of 12 variables. Conclusion: The semantic information in a sentence contributes substantially to compound parsing. This suggests a novel research direction in the integration of semantic variables into syntactic parsers and machine-translation applications.
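The simpler of the two models reported above, a decision tree over a single categorical predictor, can be sketched as follows; the relation values and bracketing labels are invented for illustration and do not reflect the study's annotation scheme.

```python
# Sketch: one-hot encoded semantic relation feeding a decision tree that predicts bracketing.
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

relations = [["result"], ["location"], ["agent"], ["result"], ["location"], ["patient"]]
bracketing = ["right", "left", "right", "right", "left", "right"]   # toy labels

model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                      DecisionTreeClassifier(random_state=0))
model.fit(relations, bracketing)
print(model.predict([["location"]]))   # -> ['left'] for this toy data
```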
Business failure prediction models are crucial in high-stakes domains like banking, insurance, and investing. In this paper, we propose an interpretable model that combines numerical and sentence-level textual features through a well-known attention mechanism. Our model demonstrates competitive performance across various metrics, and the attention weights help identify sentences intuitively linked to business failure, offering a form of interpretability. Furthermore, our findings highlight the strength of traditional financial ratios for business failure prediction while textual data—particularly when represented as keywords—is mainly useful to correctly classify corporate disclosures where the possibility of failure is explicitly mentioned.
Imbalanced classification problems frequently arise in critical domains such as fraud detection, medical diagnosis, cybersecurity, and anomaly detection, where the minority class often carries disproportionate importance despite its scarcity. Traditional machine learning algorithms tend to favour the majority class, leading to suboptimal performance and costly misclassifications in minority class detection. This study evaluates ensemble learning techniques—including Bagging, Boosting, Random Forest, EasyEnsemble, and BalancedRandomForest—for their effectiveness in managing class imbalance. Using several real-world benchmark datasets with varying imbalance ratios and feature complexities, the methods are rigorously assessed using metrics tailored to imbalanced scenarios, including F1-score, precision-recall area under the curve (PR-AUC), and geometric mean (G-mean). Results indicate that boosting-based methods, particularly Gradient Boosting Machines (GBM), consistently excel across most datasets, especially in terms of PR-AUC and G-mean. However, certain datasets with extreme imbalance or high feature dimensionality saw stronger performance from BalancedRandomForest. These findings underscore that the optimal ensemble method is highly dependent on specific dataset attributes and operational constraints. This analysis offers practical insights into aligning ensemble strategies with real-world requirements, guiding researchers and practitioners toward more robust and accurate models in imbalanced classification contexts.
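The comparison described above can be reproduced in outline with imbalanced-learn's ensemble implementations and metrics; the data below are synthetic and the hyperparameters are defaults, so the printed numbers are only illustrative, not the study's results.

```python
# Sketch: GBM vs. Balanced Random Forest vs. EasyEnsemble, scored with PR-AUC and G-mean.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier
from imblearn.metrics import geometric_mean_score

X, y = make_classification(n_samples=5000, weights=[0.95], flip_y=0.01, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.25, random_state=3)

models = {
    "GBM": GradientBoostingClassifier(random_state=3),
    "BalancedRF": BalancedRandomForestClassifier(random_state=3),
    "EasyEnsemble": EasyEnsembleClassifier(random_state=3),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]
    print(f"{name}: PR-AUC={average_precision_score(y_te, proba):.3f} "
          f"G-mean={geometric_mean_score(y_te, clf.predict(X_te)):.3f}")
```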
Wireless Sensor Networks (WSNs) play a critical role in environmental monitoring and early forest fire detection. However, they are susceptible to sensor malfunctions and network intrusions, which can compromise data integrity and lead to false alarms or missed detections. This study presents a hybrid anomaly detection framework that integrates a Transformer-based Autoencoder, Isolation Forest, and XGBoost to effectively classify normal sensor behavior, malfunctions, and intrusions. The Transformer Autoencoder models spatiotemporal dependencies in sensor data, while adaptive thresholding dynamically adjusts sensitivity to anomalies. Isolation Forest provides unsupervised anomaly validation, and XGBoost further refines classification, enhancing detection precision. Experimental evaluation using real-world sensor data demonstrates that our model achieves 95% accuracy, with high recall for intrusion detection, minimizing false negatives. The proposed approach improves the reliability of WSN-based fire monitoring by reducing false alarms, adapting to dynamic environmental conditions, and distinguishing between hardware failures and security threats.
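A highly simplified sketch of the fusion logic follows; the Transformer autoencoder is not reproduced (its reconstruction error is simulated), and the adaptive-threshold rule, the flag features, and the labels are assumptions made only for illustration.

```python
# Sketch: adaptive thresholding on (simulated) reconstruction error, Isolation Forest
# validation on raw readings, and an XGBoost classifier over features plus both flags.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 2000
sensor = pd.DataFrame({"temperature": rng.normal(25, 3, n),
                       "humidity": rng.normal(55, 8, n),
                       "smoke": rng.normal(0.10, 0.05, n)})
recon_error = pd.Series(rng.gamma(2.0, 0.05, n))   # placeholder for Transformer-AE reconstruction error

# Adaptive thresholding: flag windows whose error exceeds a rolling mean + 3 rolling std.
roll = recon_error.rolling(window=100, min_periods=20)
ae_flag = recon_error > (roll.mean() + 3 * roll.std())

# Unsupervised validation of anomalies with Isolation Forest on the raw readings.
iso_flag = IsolationForest(contamination=0.02, random_state=0).fit_predict(sensor) == -1

# Supervised refinement: XGBoost over sensor features plus both flags; the labels
# (0=normal, 1=malfunction, 2=intrusion) are random placeholders for illustration.
features = sensor.assign(ae_flag=ae_flag.astype(int), iso_flag=iso_flag.astype(int))
labels = rng.integers(0, 3, n)
clf = XGBClassifier().fit(features, labels)
print(clf.predict(features.iloc[:5]))
```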
Class distribution disparities in datasets often result in imbalanced data issues, which can significantly impact model performance. This study investigates the effects of such imbalances on the performance of XGBoost and Support Vector Machines (SVM), specifically in the context of a five-class classification problem using the financial freedom index as the target variable. Initially, both models were applied to the imbalanced dataset, highlighting the performance degradation caused by the data imbalance. To mitigate this issue, the Synthetic Minority Oversampling Technique (SMOTE) was employed to generate a balanced dataset, after which the models were re-evaluated. Comparative analysis revealed that the XGBoost algorithm demonstrated superior performance relative to the SVM method once the data imbalance was addressed. Moreover, the improvement in classification accuracy for XGBoost was notably higher compared to SVM following the application of the SMOTE technique, underscoring the robustness of XGBoost in handling imbalanced data.
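A brief sketch of the five-class setting, assuming synthetic stand-in data rather than the financial freedom index: SMOTE resamples every minority class, and both models are compared with macro-averaged F1 after balancing.

```python
# Sketch: multi-class SMOTE followed by XGBoost and RBF-SVM, scored with macro-F1.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

X, y = make_classification(n_samples=3000, n_classes=5, n_informative=8,
                           weights=[0.5, 0.25, 0.15, 0.06, 0.04], random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.25, random_state=9)
X_res, y_res = SMOTE(random_state=9).fit_resample(X_tr, y_tr)   # balances all five classes

scaler = StandardScaler().fit(X_res)
for name, clf in [("XGBoost", XGBClassifier()), ("SVM", SVC(kernel="rbf"))]:
    clf.fit(scaler.transform(X_res), y_res)
    pred = clf.predict(scaler.transform(X_te))
    print(f"{name}: macro-F1={f1_score(y_te, pred, average='macro'):.3f}")
```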