Taghi Khoshgoftaar

Taghi Khoshgoftaar
Florida Atlantic University | FAU · Department of Computer and Electrical Engineering and Computer Science

About

520
Publications
162,966
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
24,997
Citations

Publications

Publications (520)
Article
Full-text available
The existence of class imbalance in a dataset can greatly bias the classifier towards majority classification. This discrepancy can pose a serious problem for deep learning models, which require copious and diverse amounts of data to learn patterns and output classifications. Traditionally, data-level and algorithm-level techniques have been instru...
Article
Full-text available
Hyperparameter tuning is the collection of techniques to discover optimal values for settings we supply to machine learning algorithms. Put another way, hyperparameters are not optimized by the algorithm. When researching Big Data, we face the dilemma of whether it will be useful to do hyperparameter tuning with the maximum possible amount of data....
Article
Full-text available
Machine learning applications for healthcare are reshaping the industry with new tools and services designed to improve the quality of patient care. A challenge common to many of these applications is encoding healthcare procedure codes, a high-cardinality categorical variable containing thousands of unique values. Traditional one-hot encoding tech...
Article
When analyzing cybersecurity datasets with machine learning, researchers commonly need to consider whether or not to include Destination Port as an input feature. We assess the impact of Destination Port as a predictive feature by building predictive models with three different input feature sets and four combinations of web attacks from the CSE-CI...
Article
Deep Learning has achieved remarkable success with Supervised Learning. Nearly all of these successes require very large manually annotated datasets. Data augmentation has enabled Supervised Learning with less labeled data, while avoiding the pitfalls of overfitting. However, Supervised Learning still fails to be Robust, making different prediction...
Article
Purchasing a home is one of the largest investments most people make. House price prediction allows individuals to be informed about their asset wealth. Transparent pricing on homes allows for a more efficient market and economy. We report the performance of machine learning models trained with structured tabular representations and unstructured te...
Article
Full-text available
Patient care in emergency rooms can utilize urgency labeling to facilitate resource allocation. With COVID-19 care, one of the most important indicators of care urgency is the severity of respiratory illness. We present an early analysis of 5,584 patient records, of whom 5,371 (96.2%) have returned a positive COVID-19 test, to understand how well w...
Article
Class label noise is a critical component of data quality that directly inhibits the predictive performance of machine learning algorithms. While many data-level and algorithm-level methods exist for treating label noise, the challenges associated with big data call for new and improved methods. This survey addresses these concerns by providing an...
Article
Full-text available
The recent years have seen a proliferation of Internet of Things (IoT) devices and an associated security risk from an increasing volume of malicious traffic worldwide. For this reason, datasets such as Bot-IoT were created to train machine learning classifiers to identify attack traffic in IoT networks. In this study, we build predictive models wi...
Article
Full-text available
In severely imbalanced datasets, using traditional binary or multi-class classification typically leads to bias towards the class(es) with the much larger number of instances. Under such conditions, modeling and detecting instances of the minority class is very difficult. One-class classification (OCC) is an approach to detect abnormal data points...
Article
Full-text available
Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfa...
Article
Full-text available
Employing Machine Learning algorithms to identify health insurance fraud is an application of Artificial Intelligence in Healthcare. Insurance fraud spuriously inflates the cost of Healthcare. Therefore, it could limit or even deny patients necessary care and treatment. We use Medicare claims data as input to various algorithms to gauge their perfo...
Article
Full-text available
Advances in data mining and machine learning continue to transform the healthcare industry and provide value to medical professionals and patients. In this study, we address the problem of encoding medical provider types and present four techniques for learning dense, semantic embeddings that capture provider specialty similarities. The first two m...
Preprint
Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfa...
Article
Full-text available
Class imbalance is an important consideration for cybersecurity and machine learning. We explore classification performance in detecting web attacks in the recent CSE-CIC-IDS2018 dataset. This study considers a total of eight random undersampling (RUS) ratios: no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. Additionally, seven different c...
Article
Full-text available
Class rarity is a frequent challenge in cybersecurity. Rarity occurs when the positive (attack) class only has a small number of instances for machine learning classifiers to train upon, thus making it difficult for the classifiers to discriminate and learn from the positive class. To investigate rarity, we examine three individual web attacks in b...
Article
Full-text available
Label noise is an important data quality issue that negatively impacts machine learning algorithms. For example, label noise has been shown to increase the number of instances required to train effective predictive models. It has also been shown to increase model complexity and decrease model interpretability. In addition, label noise can cause the...
Article
Full-text available
Machine learning algorithms efficiently trained on intrusion detection datasets can detect network traffic capable of jeopardizing an information system. In this study, we use the CSE-CIC-IDS2018 dataset to investigate ensemble feature selection on the performance of seven classifiers. CSE-CIC-IDS2018 is big data (about 16,000,000 instances), publi...
Article
Full-text available
This survey explores how Deep Learning has battled the COVID-19 pandemic and provides directions for future research on COVID-19. We cover Deep Learning applications in Natural Language Processing, Computer Vision, Life Sciences, and Epidemiology. We describe how each of these applications vary with the availability of big data and how learning tas...
Chapter
A variety of data-level, algorithm-level, and hybrid methods have been used to address the challenges associated with training predictive models with class-imbalanced data. While many of these techniques have been extended to deep neural network (DNN) models, there are relatively fewer studies that emphasize the significance of output thresholding....
Article
Full-text available
The era of big data has produced vast amounts of information that can be used to build machine learning models. In many cases, however, there is a point where adding more data only marginally increases model performance. This is especially important for scenarios of limited labeled data, as annotation can be expensive and time consuming. If the req...
Article
Full-text available
The exponential growth in computer networks and network applications worldwide has been matched by a surge in cyberattacks. For this reason, datasets such as CSE-CIC-IDS2018 were created to train predictive models on network-based intrusion detection. These datasets are not meant to serve as repositories for signature-based detection systems, but r...
Article
Full-text available
Gradient Boosted Decision Trees (GBDT’s) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDT’s in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemb...
Article
Full-text available
Training predictive models with class-imbalanced data has proven to be a difficult task. This problem is well studied, but the era of big data is producing more extreme levels of imbalance that are increasingly difficult to model. We use three data sets of varying complexity to evaluate data sampling strategies for treating high class imbalance wit...
Article
Full-text available
The increasing reliance on electronic health record (EHR) in areas such as medical research should be addressed by using ample safeguards for patient privacy. These records often tend to be big data, and given that a significant portion is stored as free (unstructured) text, we decided to examine relevant work on automated free text de-identificati...
Article
Full-text available
Background: The widespread incidence and prevalence of Alzheimer's disease and mild cognitive impairment (MCI) has prompted an urgent call for research to validate early detection cognitive screening and assessment. Objective: Our primary research aim was to determine if selected MemTrax performance metrics and relevant demographics and health p...
Preprint
Full-text available
Gradient Boosted Decision Trees (GBDT's) are a powerful tool for classification and regression tasks in Big Data, Researchers should be familiar with the strengths and weaknesses of current implementations of GBDT's in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemb...
Preprint
Full-text available
Gradient Boosted Decision Trees (GBDT’s) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDT’s in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble...
Article
Full-text available
Abstract A majority of predictive models should be updated regularly, since the most recent data associated with the model may have a different distribution from that of the original training data. This difference may be critical enough to impact the effectiveness of the machine learning model. In our paper, we investigate the relationship between...
Article
Full-text available
Abstract In Machine Learning, if one class has a significantly larger number of instances (majority) than the other (minority), this condition is defined as class imbalance. With regard to datasets, class imbalance can bias the predictive capabilities of Machine Learning algorithms towards the majority (negative) class, and in situations where fals...
Article
Full-text available
Quality and affordable healthcare is an important aspect in people’s lives, particularly as they age. The rising elderly population in the United States (U.S.), with increasing number of chronic diseases, implies continuing healthcare later in life and the need for programs, such as U.S. Medicare, to help with associated medical expenses. Unfortuna...
Conference Paper
Class imbalance is a regularly occurring problem in machine learning that has been studied extensively over the last two decades. Various methods for addressing class imbalance have been introduced, including algorithm-level methods, data-level methods, and hybrid methods. While these methods are well studied using traditional machine learning algo...
Article
Full-text available
Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority (negative) class and the occurrence of false negatives incurs a greater penalty than false positiv...
Article
Full-text available
Abstract This study investigates the effectiveness of multiple maxout activation function variants on 18 datasets using Convolutional Neural Networks. A network with maxout activation has a higher number of trainable parameters compared to networks with traditional activation functions. However, it is not clear if the activation function itself or...
Article
Full-text available
Abstract High class imbalance between majority and minority classes in datasets can skew the performance of Machine Learning algorithms and bias predictions in favor of the majority (negative) class. This bias, for cases where the minority (positive) class is of greater interest and the occurrence of false negatives is costlier than false positives...
Article
Full-text available
Abstract The integrity of modern network communications is constantly being challenged by more sophisticated intrusion techniques. Attackers are consistently shifting to stealthier and more complex forms of attacks in an attempt to bypass known mitigation strategies. In recent years, attackers have begun to focus their attack efforts on the applica...
Article
Full-text available
Abstract In this paper, we comprehensively explain how we built a novel implementation of the Random Forest algorithm on the High Performance Computing Cluster (HPCC) Systems Platform from LexisNexis. The algorithm was previously unavailable on that platform. Random Forest’s learning process is based on the principle of recursive partitioning and a...