Research

Imbalanced Data in Machine Learning: A Comprehensive Review

Authors:

Abstract

Imbalanced data is a common challenge in machine learning where the distribution of classes in a dataset is skewed, with one class significantly outnumbering the others. This phenomenon can lead to biased models and reduced performance, posing a substantial problem in various real-world applications. This paper provides a comprehensive review of imbalanced data in machine learning, addressing the causes, consequences, and strategies for tackling this issue. We delve into various techniques, algorithms, and best practices that have been proposed to mitigate the challenges posed by imbalanced data, with a focus on their advantages, limitations, and practical considerations. This review aims to offer a comprehensive resource for researchers and practitioners working with imbalanced datasets, fostering a deeper understanding of the subject and guiding the selection of appropriate methodologies to address this pervasive problem.


... It is critical to identify the underlying causes of this widespread issue before delving into solutions for unbalanced data [13][14][15][16][17][18]. Understanding these origins offers significant insight into the characteristics of unbalanced datasets and guides our strategy for reducing their impact. ...
... The main cause of unbalanced data is bias in the data gathering process [13]. Bias can be introduced unintentionally while a dataset is being created, whether for image categorization, medical research, or any other field. ...
... Class inequality [13][14][15][16][17][18] is an inherent feature of some datasets that has to be addressed in particular fields. Consider the case of detecting spam emails. ...
Article
A business organisation's ability to grow and flourish relies largely on how successfully it understands and utilises the data it has collected; data has become more vital in today's society. Every company or organisation now accumulates massive volumes of data across a range of areas, such as finance, trade, business, and healthcare. Medical data may be provided by clinics, doctors, healthcare providers, and insurance establishments. Once the necessary medical datasets are located, the next phases are to investigate and apply appropriate modeling algorithms to mine substantial information for prediction. Biased data is a significant challenge in machine learning, where the distribution of data elements in a dataset is uneven, with one class considerably outnumbering the others. This leads to biased models and reduced performance, which affects the quality and reliability of machine learning algorithms. This paper presents a detailed review of the reasons for imbalanced data, its impact, and algorithmic procedures for handling unevenly distributed data. We explore various techniques and algorithms that address the problem, their advantages and demerits, and the evaluation metrics used to assess the performance of procedures for handling imbalanced datasets.
... Techniques addressing data imbalance include synthesizing new training examples for the minority class through data augmentation or by oversampling the minority class instances through techniques like SMOTE (Kennedy et al., 2024). Other methods involve modifying the learning algorithms to assign higher misclassification costs to minority class examples, such that the model parameters are affected more by rare class examples, and ensemble methods trained on resampled versions of the data (Johnson & Khoshgoftaar, 2019;Shah & Ali, 2023). ...
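Two of the ideas in the excerpt above, higher misclassification costs for rare classes and oversampling of the minority class, can be illustrated with a minimal sketch. This is an illustration under stated assumptions, not the method of any cited paper: the function names `inverse_frequency_weights` and `random_oversample` are hypothetical, the 90/10 toy labels are made up, and the oversampling shown is plain random duplication rather than SMOTE (which synthesizes new points by interpolation).

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Hypothetical helper: weight each class by n_samples / (n_classes * class_count).

    Rare classes get larger weights, so their errors cost more in a weighted loss.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Hypothetical helper: duplicate randomly chosen examples of each smaller
    class until every class has as many examples as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Toy data (assumed for illustration): 90 majority examples, 10 minority examples.
labels = [0] * 90 + [1] * 10
samples = list(range(100))

weights = inverse_frequency_weights(labels)
xs, ys = random_oversample(samples, labels)
print(weights)
print(Counter(ys))
```

On this toy split the minority class receives a weight nine times that of the majority class, and the resampled label list is balanced. Resampling should be applied to the training split only, so that evaluation data keeps the original class distribution.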
Article
This paper assesses the potential for the large language models (LLMs) GPT-4 and GPT-3.5 to aid in deriving insight from education feedback surveys. Exploration of LLM use cases in education has focused on teaching and learning, with less exploration of capabilities in education feedback analysis. Survey analysis in education involves goals such as finding gaps in curricula or evaluating teachers, often requiring time-consuming manual processing of textual responses. LLMs have the potential to provide a flexible means of achieving these goals without specialized machine learning models or fine-tuning. We demonstrate a versatile approach to such goals by treating them as sequences of natural language processing (NLP) tasks including classification (multi-label, multi-class, and binary), extraction, thematic analysis, and sentiment analysis, each performed by an LLM. We apply these workflows to a real-world dataset of 2500 end-of-course survey comments from biomedical science courses, and evaluate a zero-shot approach (i.e., requiring no examples or labeled training data) across all tasks, reflecting education settings, where labeled data is often scarce. By applying effective prompting practices, we achieve human-level performance on multiple tasks with GPT-4, enabling the workflows necessary to achieve typical goals. We also show the potential of inspecting LLMs’ chain-of-thought (CoT) reasoning for providing insight that may foster confidence in practice. Moreover, this study features the development of a versatile set of classification categories, suitable for various course types (online, hybrid, or in-person) and amenable to customization. Our results suggest that LLMs can be used to derive a range of insights from survey text.