Taghi Khoshgoftaar

Florida Atlantic University | FAU · Department of Computer and Electrical Engineering and Computer Science

Publications: 568
Reads: 221,881
Citations: 44,154

Publications (568)
Article
Feature selection is an effective data reduction technique. SHapley Additive exPlanations (SHAP) can be used to provide a feature importance ranking for models built with labeled or unlabeled data. Thus, one may use the SHAP feature importance ranking in a feature selection technique by selecting the k highest ranking features. Furthermore, this SH...
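The top-k selection step mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_top_k`, the feature names, and the scores are hypothetical, and the scores stand in for mean absolute SHAP values that would be computed separately.

```python
def select_top_k(importance, k):
    """Return the k feature names with the highest importance scores.

    `importance` maps feature name -> score (assumed here to be a mean
    absolute SHAP value computed elsewhere). Ties are broken by feature
    name so the result is deterministic.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # Sort by descending score, then ascending name for tie-breaking.
    ranked = sorted(importance.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical scores, for illustration only.
scores = {"amount": 0.42, "hour": 0.07, "merchant_id": 0.19, "age": 0.19}
top_features = select_top_k(scores, 2)
print(top_features)
```

A model would then be retrained on only the columns named in `top_features`; the choice of k is left to the experimenter.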
Chapter
Credit card fraud is a pervasive issue that causes significant financial loss, thus underscoring the urgent need for effective detection techniques. In this book chapter on One-Class Classification (OCC), critical issues are thoroughly examined. The first deals with the training of OCC classifiers on majority versus minority class data. Our results...
Article
OCR2SEQ represents an innovative advancement in Optical Character Recognition (OCR) technology, leveraging a multi-modal generative augmentation strategy to overcome traditional limitations in OCR systems. This paper introduces OCR2SEQ’s unique approach, tailored to enhance data quality for sequence-to-sequence models, especially in scenarios chara...
Article
In the context of high-dimensional credit card fraud data, researchers and practitioners commonly utilize feature selection techniques to enhance the performance of fraud detection models. This study presents a comparison in model performance using the most important features selected by SHAP (SHapley Additive exPlanations) values and the model’s b...
Article
Acquiring labeled datasets often incurs substantial costs primarily due to the requirement of expert human intervention to produce accurate and reliable class labels. In the modern data landscape, an overwhelming proportion of newly generated data is unlabeled. This paradigm is especially evident in domains such as fraud detection and datasets for...
Article
Detecting fraudulent activities in credit card transactions can be challenging due to issues like high dimensionality and class imbalance that are often present in the datasets. To address these challenges, data reduction techniques such as data sampling and feature selection have become essential. In this study, we compare four approaches for data...
Conference Paper
In a typical classification problem, the choice of machine learning strategy has a significant impact on performance results. When the target dataset has a class imbalance, the selection of the right strategy becomes even more important with respect to performance outcome. This survey paper focuses on the choice between a binary classification appr...
Article
In the domain of Medicare insurance fraud detection, handling imbalanced Big Data and high dimensionality remains a significant challenge. This study assesses the combined efficacy of two data reduction techniques: Random Undersampling (RUS), and a novel ensemble supervised feature selection method. The techniques are applied to optimize Machine Le...
Article
The tasks of few-shot, one-shot, and zero-shot learning—or collectively “low-shot learning” (LSL)—at first glance are quite similar to the long-standing task of class imbalanced learning; specifically, they aim to learn classes for which there is little labeled data available. Motivated by this similarity, we conduct a survey to review the recent l...
Article
In this paper, we investigate the impact of Random Undersampling (RUS) on a supervised Machine Learning task involving highly imbalanced Big Data. We present the results of experiments in Medicare Fraud detection. To the best of our knowledge, these experiments are conducted with the largest insurance claims datasets ever used for Medicare Fraud de...
Article
Research into machine learning methods for fraud detection is of paramount importance, largely due to the substantial financial implications associated with fraudulent activities. Our investigation is centered around the Credit Card Fraud Dataset and the Medicare Part D dataset, both of which are highly imbalanced. The Credit Card Fraud Detection D...
Article
As a means of building explainable machine learning models for Big Data, we apply a novel ensemble supervised feature selection technique. The technique is applied to publicly available insurance claims data from the United States public health insurance program, Medicare. We approach Medicare insurance fraud detection as a supervised machine learn...
Article
I witnessed a death march in the USA in 2020. Schools, colleges, offices, and markets were shut down. I finished my Ph.D. during that crucial moment. I started this research last year—during my tenure as a postdoctoral researcher; I got this messy, not formatted, and unstructured dataset—without any direction on what I was supposed to do. Before va...
Article
The yearly increase in incidents of credit card fraud can be attributed to the rapid growth of e-commerce. To address this issue, effective fraud detection methods are essential. Our research focuses on the Credit Card Fraud Detection Dataset, which is a widely used dataset that contains real-world transaction data and is characterized by high clas...
Article
We present findings from experiments in Medicare fraud detection that are the result of research on two new, publicly available datasets. In this research, we employ popular, open-source Machine Learning algorithms to identify fraudulent healthcare providers in Medicare insurance claims data. As far as we know, we are the first to publish a study...
Article
Fraud datasets often lack consistent and accurate labels and are characterized by high class imbalance, where fraudulent examples are far fewer than normal ones. Machine learning designed for effectively detecting fraud is an important task since fraudulent behavior can have significant financial or health conseq...
Article
Automated methods for detecting fraudulent healthcare providers have the potential to save billions of dollars in healthcare costs and improve the overall quality of patient care. This study presents a data-centric approach to improve healthcare fraud classification performance and reliability using Medicare claims data. Publicly available data fro...
Article
Output thresholding is well-suited for addressing class imbalance, since the technique does not increase dataset size, run the risk of discarding important instances, or modify an existing learner. Through the use of the Credit Card Fraud Detection Dataset, this study proposes a threshold optimization approach that factors in the constraint True Po...
Article
One consequence of the widespread use of Internet of Things (IoT) devices is an increase in the volume of attacks on IoT networks. In this study, we focus on the Bot-IoT dataset, with the aim of classifying its four types of attacks: Denial-of-Service (DoS), Distributed Denial-of-Service (DDoS), Reconnaissance, and Information Theft. Our contributi...
Article
Using the wrong metrics to gauge classification of highly imbalanced Big Data may hide important information in experimental results. However, we find that analysis of metrics for performance evaluation and what they can hide or reveal is rarely covered in related works. Therefore, we address that gap by analyzing multiple popular performance metri...
Article
With the massive resources and strategies accessible to attackers, countering Denial of Service (DoS) attacks is becoming increasingly difficult. One such strategy is application-layer DoS. As a result, network security has become increasingly challenging to ensure. Hypertext Transfer Protocol (HTTP), Domain Name Service (DNS)...
Article
Training a machine learning algorithm on a class-imbalanced dataset can be a difficult task, a process that could prove even more challenging under conditions of high dimensionality. Feature extraction and data sampling are among the most popular preprocessing techniques. Feature extraction is used to derive a richer set of reduced dataset features...
Article
We propose a novel feature popularity framework, and introduce this new framework to the cybersecurity domain. Feature popularity has not yet been used in machine learning or data mining, and we implement it with three web attacks from the CSE-CIC-IDS2018 dataset: Brute Force, SQL Injection, and XSS web attacks. Feature popularity is based upon ens...
Article
The existence of class imbalance in a dataset can greatly bias the classifier towards majority classification. This discrepancy can pose a serious problem for deep learning models, which require copious and diverse amounts of data to learn patterns and output classifications. Traditionally, data-level and algorithm-level techniques have been instru...
Article
Hyperparameter tuning is the collection of techniques to discover optimal values for settings we supply to machine learning algorithms. Put another way, hyperparameters are not optimized by the algorithm. When researching Big Data, we face the dilemma of whether it will be useful to do hyperparameter tuning with the maximum possible amount of data....
Article
Machine learning applications for healthcare are reshaping the industry with new tools and services designed to improve the quality of patient care. A challenge common to many of these applications is encoding healthcare procedure codes, a high-cardinality categorical variable containing thousands of unique values. Traditional one-hot encoding tech...
Article
When analyzing cybersecurity datasets with machine learning, researchers commonly need to consider whether or not to include Destination Port as an input feature. We assess the impact of Destination Port as a predictive feature by building predictive models with three different input feature sets and four combinations of web attacks from the CSE-CI...
Article
Deep Learning has achieved remarkable success with Supervised Learning. Nearly all of these successes require very large manually annotated datasets. Data augmentation has enabled Supervised Learning with less labeled data, while avoiding the pitfalls of overfitting. However, Supervised Learning still fails to be Robust, making different prediction...
Article
Purchasing a home is one of the largest investments most people make. House price prediction allows individuals to be informed about their asset wealth. Transparent pricing on homes allows for a more efficient market and economy. We report the performance of machine learning models trained with structured tabular representations and unstructured te...
Article
Patient care in emergency rooms can utilize urgency labeling to facilitate resource allocation. With COVID-19 care, one of the most important indicators of care urgency is the severity of respiratory illness. We present an early analysis of 5,584 patient records, of whom 5,371 (96.2%) have returned a positive COVID-19 test, to understand how well w...
Article
Class label noise is a critical component of data quality that directly inhibits the predictive performance of machine learning algorithms. While many data-level and algorithm-level methods exist for treating label noise, the challenges associated with big data call for new and improved methods. This survey addresses these concerns by providing an...
Article
Recent years have seen a proliferation of Internet of Things (IoT) devices and an associated security risk from an increasing volume of malicious traffic worldwide. For this reason, datasets such as Bot-IoT were created to train machine learning classifiers to identify attack traffic in IoT networks. In this study, we build predictive models wi...
Article
In severely imbalanced datasets, using traditional binary or multi-class classification typically leads to bias towards the class(es) with the much larger number of instances. Under such conditions, modeling and detecting instances of the minority class is very difficult. One-class classification (OCC) is an approach to detect abnormal data points...
Article
Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfa...
Article
Employing Machine Learning algorithms to identify health insurance fraud is an application of Artificial Intelligence in Healthcare. Insurance fraud spuriously inflates the cost of Healthcare. Therefore, it could limit or even deny patients necessary care and treatment. We use Medicare claims data as input to various algorithms to gauge their perfo...
Article
Advances in data mining and machine learning continue to transform the healthcare industry and provide value to medical professionals and patients. In this study, we address the problem of encoding medical provider types and present four techniques for learning dense, semantic embeddings that capture provider specialty similarities. The first two m...
Article
Class imbalance is an important consideration for cybersecurity and machine learning. We explore classification performance in detecting web attacks in the recent CSE-CIC-IDS2018 dataset. This study considers a total of eight random undersampling (RUS) ratios: no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. Additionally, seven different c...
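Random undersampling to a target majority-to-minority ratio, as in the ratios listed in this abstract, can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the function name `random_undersample`, the 0/1 label encoding, the seed, and the toy data are all hypothetical.

```python
import random


def random_undersample(labels, majority_to_minority, seed=0):
    """Return sorted indices kept after random undersampling (RUS).

    `labels` is a sequence of 0 (majority) and 1 (minority) values.
    `majority_to_minority` is the target ratio, e.g. 99.0 for 99:1 or
    65 / 35 for 65:35. All minority instances are kept; majority
    instances are dropped at random until the target ratio is reached.
    If the data is already at or below the ratio, nothing is dropped.
    """
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    target = min(len(majority), round(len(minority) * majority_to_minority))
    rng = random.Random(seed)  # fixed seed for reproducibility
    kept_majority = rng.sample(majority, target)
    return sorted(minority + kept_majority)


# Hypothetical toy data: 1,000 normal instances and 10 attacks.
labels = [0] * 1000 + [1] * 10
kept = random_undersample(labels, majority_to_minority=9.0)  # 9:1
n_minority = sum(labels[i] for i in kept)
print(len(kept) - n_minority, n_minority)
```

Only the training split would normally be undersampled; the test split keeps its original class distribution so that reported metrics reflect the real imbalance.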
Article
Class rarity is a frequent challenge in cybersecurity. Rarity occurs when the positive (attack) class only has a small number of instances for machine learning classifiers to train upon, thus making it difficult for the classifiers to discriminate and learn from the positive class. To investigate rarity, we examine three individual web attacks in b...
Article
Label noise is an important data quality issue that negatively impacts machine learning algorithms. For example, label noise has been shown to increase the number of instances required to train effective predictive models. It has also been shown to increase model complexity and decrease model interpretability. In addition, label noise can cause the...
Article
Machine learning algorithms efficiently trained on intrusion detection datasets can detect network traffic capable of jeopardizing an information system. In this study, we use the CSE-CIC-IDS2018 dataset to investigate ensemble feature selection on the performance of seven classifiers. CSE-CIC-IDS2018 is big data (about 16,000,000 instances), publi...
Article
This survey explores how Deep Learning has battled the COVID-19 pandemic and provides directions for future research on COVID-19. We cover Deep Learning applications in Natural Language Processing, Computer Vision, Life Sciences, and Epidemiology. We describe how each of these applications varies with the availability of big data and how learning tas...
Chapter
A variety of data-level, algorithm-level, and hybrid methods have been used to address the challenges associated with training predictive models with class-imbalanced data. While many of these techniques have been extended to deep neural network (DNN) models, there are relatively few studies that emphasize the significance of output thresholding....
Article
The era of big data has produced vast amounts of information that can be used to build machine learning models. In many cases, however, there is a point where adding more data only marginally increases model performance. This is especially important for scenarios of limited labeled data, as annotation can be expensive and time consuming. If the req...