Enhanced Credit Card Fraud Detection Using Regularized GLMs: A
Comparative Study of Down-Sampling Techniques
Authors: Leon Deon, Alexander Noah
Date: November 2024
Abstract
Credit card fraud poses a significant threat to financial institutions, resulting in substantial financial
losses and eroding consumer trust. Effective detection of fraudulent transactions is crucial for
mitigating these risks. This study investigates the performance of regularized Generalized Linear
Models (GLMs) in detecting credit card fraud, focusing on the impact of various down-sampling
techniques on model accuracy and efficiency. Given the highly imbalanced nature of credit card
transaction data, traditional classification methods often struggle to identify fraudulent
transactions due to the overwhelming majority of legitimate cases. To address this challenge, we
explore several down-sampling strategies, including random down-sampling, Tomek links, and
Edited Nearest Neighbors (ENN). Each technique aims to reduce the dataset's size while retaining
essential characteristics, thereby enhancing the performance of the regularized GLMs. The
effectiveness of these methods is evaluated based on metrics such as precision, recall, F1 score,
and area under the ROC curve (AUC). We conduct a comparative analysis of the GLM
performance with and without the application of down-sampling techniques, examining how these
methods influence the model's ability to detect fraudulent transactions. The findings demonstrate
that employing down-sampling techniques significantly improves the performance of regularized
GLMs in fraud detection. The study concludes that a strategic combination of regularization
methods and down-sampling techniques can enhance the identification of credit card fraud, thereby
contributing to the development of more robust and efficient detection systems. This research
offers valuable insights for financial institutions seeking to implement effective fraud detection
mechanisms while ensuring minimal disruption to legitimate transactions.
Keywords: Credit card fraud, GLM, down-sampling techniques, fraud detection, regularization,
machine learning, imbalanced data, precision, recall, financial institutions.
Introduction
Credit card fraud is a pervasive issue that affects financial institutions and consumers alike, leading
to significant economic losses and undermining trust in electronic payment systems. With the
increasing reliance on credit card transactions in today’s digital economy, the need for effective
fraud detection mechanisms has become more critical than ever. As the volume of transactions
grows, so does the sophistication of fraudulent schemes, making traditional detection methods
inadequate. Therefore, leveraging advanced statistical and machine learning techniques is essential
to develop robust systems that can identify fraudulent activities promptly and accurately. The
nature of credit card transaction data is inherently imbalanced, with a small percentage of
transactions being fraudulent compared to the vast number of legitimate transactions. This
imbalance poses a challenge for conventional classification algorithms, which tend to be biased
towards the majority class. Consequently, fraudulent transactions may go undetected, resulting in
substantial financial losses. To combat this issue, researchers have explored various techniques to
improve the detection of fraud, including the use of Generalized Linear Models (GLMs), which
offer flexibility and interpretability while allowing for regularization to prevent overfitting.
Regularized GLMs incorporate penalty terms into the model's cost function, effectively managing
complexity and improving generalization on unseen data. By applying regularization techniques
such as Lasso and Ridge regression, these models can yield better performance in predicting fraud
compared to standard GLMs. However, to enhance their effectiveness further, it is essential to
address the data imbalance through down-sampling techniques. Down-sampling methods aim to
reduce the number of legitimate transactions in the dataset, creating a more balanced representation
of classes. This study investigates the impact of various down-sampling techniques, including
random down-sampling, Tomek links, and Edited Nearest Neighbors (ENN), on the performance
of regularized GLMs in credit card fraud detection. By examining how these techniques influence
the model's ability to accurately classify fraudulent transactions, this research aims to identify the
most effective strategies for enhancing fraud detection capabilities. The findings from this
comparative study will provide valuable insights for financial institutions seeking to implement
more efficient fraud detection mechanisms while minimizing the potential disruption to legitimate
transactions. Ultimately, this research contributes to the ongoing efforts to safeguard consumers
and institutions from the detrimental effects of credit card fraud, highlighting the importance of
innovative approaches in the ever-evolving landscape of financial crime.
Understanding Credit Card Fraud and Its Detection
Credit card fraud represents a significant challenge in the financial sector, characterized by
unauthorized transactions made with stolen credit card information. The impact of such fraudulent
activities extends beyond immediate financial losses, eroding consumer trust and potentially
leading to long-term reputational damage for financial institutions. Understanding the
complexities of credit card fraud is vital for developing effective detection systems.
Nature of Credit Card Fraud
Fraudulent activities can take many forms, including card-not-
present (CNP) fraud, where transactions are conducted online without the physical card, and card-
present fraud, which occurs during in-person transactions. The advent of technology has made it
easier for criminals to execute sophisticated schemes, such as phishing, data breaches, and
skimming. As a result, the frequency and variety of credit card fraud cases have escalated,
necessitating a more proactive and advanced approach to fraud detection.
Regularized Generalized Linear Models (GLMs)
In response to the growing threat of credit
card fraud, statistical methods such as Generalized Linear Models (GLMs) have gained traction
due to their flexibility and interpretability. GLMs allow for the modeling of relationships between
dependent and independent variables, making them suitable for binary classification problems,
such as identifying fraudulent transactions. By incorporating regularization techniques like Lasso
(L1) and Ridge (L2) regression, GLMs can effectively manage complexity, reduce overfitting, and
enhance predictive accuracy. Regularization helps in penalizing large coefficients in the model,
leading to simpler and more robust models that perform better on unseen data.
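To make the idea concrete, the following is a minimal sketch of L1- (Lasso) and L2- (Ridge) penalized logistic regression, the binary-response GLM, written in Python with scikit-learn. The synthetic data, the penalty strength C, and the solver choices are illustrative assumptions for demonstration, not the settings used in this study.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for transaction data: roughly 1% of cases labeled "fraud".
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99, 0.01],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Lasso (L1) penalty: shrinks some coefficients exactly to zero (implicit feature selection).
lasso_glm = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
# Ridge (L2) penalty: shrinks all coefficients toward zero without eliminating them.
ridge_glm = LogisticRegression(penalty="l2", solver="lbfgs", C=0.1, max_iter=1000)

lasso_glm.fit(X_train, y_train)
ridge_glm.fit(X_train, y_train)

Here a smaller C corresponds to a stronger penalty; in practice it would be tuned, for example by cross-validation.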
Down-Sampling Techniques
Given the inherent imbalance in credit card transaction data, where
fraudulent transactions represent only a small fraction of total transactions, down-sampling
techniques play a crucial role in improving model performance. These methods aim to create a
more balanced dataset by reducing the number of legitimate transactions. Common down-
sampling techniques include random down-sampling, which involves randomly selecting a subset
of legitimate transactions, and more sophisticated methods like Tomek links and Edited Nearest
Neighbors (ENN), which focus on refining the data by removing ambiguous or redundant
instances. By applying these techniques, the training dataset becomes less skewed, enabling the
GLMs to learn more effectively from both classes.
Impact on Model Performance
The integration of regularized GLMs with down-sampling techniques is expected to enhance the overall performance of fraud detection systems. Performance metrics such as precision, recall, and F1 score are vital for evaluating model efficacy.
Precision measures the proportion of true positive predictions among all positive predictions,
directly impacting the ability to detect fraud accurately without raising false alarms.
Comparative Analysis of Down-Sampling Techniques
The effectiveness of credit card fraud detection models significantly hinges on the approach taken
to handle the imbalanced nature of transaction datasets. Given the rarity of fraudulent transactions
compared to legitimate ones, employing appropriate down-sampling techniques is crucial. This
section explores various down-sampling strategies, including random down-sampling, Tomek
links, and Edited Nearest Neighbors (ENN), highlighting their respective methodologies and
impacts on model performance.
Random Down-Sampling
Random down-sampling is one of the simplest methods to address
class imbalance. This technique involves randomly selecting a subset of legitimate transactions to
match the number of fraudulent transactions in the dataset. While this approach can effectively
reduce the dataset size and balance the classes, it has some drawbacks. One major concern is the
potential loss of valuable information, as many legitimate transactions are discarded. This loss can
lead to underfitting, where the model fails to capture the underlying patterns due to insufficient
data representation. Nevertheless, random down-sampling provides a baseline for comparison with
more sophisticated techniques.
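Continuing the illustrative setup above, random down-sampling of the training split can be sketched with the imbalanced-learn library; the 1:1 target ratio and the random seed are assumptions, and resampling is applied to the training data only.

from imblearn.under_sampling import RandomUnderSampler

# Randomly discard legitimate (majority-class) training transactions until the
# two classes are the same size; the test split is left untouched.
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_rand, y_rand = rus.fit_resample(X_train, y_train)
lasso_glm.fit(X_rand, y_rand)  # refit the regularized GLM on the balanced subset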
Tomek Links
Tomek links offer a more refined method for handling class imbalance. This
technique identifies pairs of instances from different classes that are nearest neighbors. If one
instance belongs to the majority class (legitimate transactions) and the other to the minority class
(fraudulent transactions), the instance from the majority class is removed. This approach not only
helps in balancing the classes but also improves the decision boundary by eliminating noisy
examples that may lead to misclassification. By retaining valuable information while removing
redundant majority class instances, Tomek links can enhance the predictive performance of the
model.
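A corresponding sketch with imbalanced-learn's TomekLinks, under the same illustrative assumptions; note that Tomek-link removal cleans the class boundary rather than forcing an exact 1:1 balance.

from imblearn.under_sampling import TomekLinks

# Remove majority-class instances that form Tomek links (nearest-neighbour pairs
# from opposite classes) with minority-class instances.
tl = TomekLinks(sampling_strategy="majority")
X_tl, y_tl = tl.fit_resample(X_train, y_train)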
Edited Nearest Neighbors (ENN)
ENN is another advanced down-sampling technique that
combines aspects of the k-nearest neighbors algorithm with down-sampling. In this method, each
instance of the majority class is evaluated against its nearest neighbors, and those whose labels disagree with the majority of their neighbors are removed. This process helps in refining the dataset by retaining
representative instances while discarding outliers. ENN is particularly effective in maintaining the
integrity of the dataset, reducing the risk of losing critical information while balancing class
distribution. By focusing on the local structure of the data, ENN can lead to improved model
accuracy and generalization.
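Likewise, an illustrative Edited Nearest Neighbours sketch under the same assumptions; n_neighbors=3 is the library default and purely illustrative here.

from imblearn.under_sampling import EditedNearestNeighbours

# Drop majority-class instances whose label disagrees with the majority vote of
# their k nearest neighbours, removing noisy or ambiguous examples.
enn = EditedNearestNeighbours(n_neighbors=3, sampling_strategy="majority")
X_enn, y_enn = enn.fit_resample(X_train, y_train)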
Evaluating Down-Sampling Techniques
To determine the most effective down-sampling
technique, it is essential to evaluate their impact on the performance of regularized GLMs in
detecting credit card fraud. Metrics such as precision, recall, F1 score, and area under the ROC
curve (AUC) provide insights into how well the model identifies fraudulent transactions while
minimizing false positives. A comparative analysis of these metrics across different down-
sampling methods will highlight the strengths and weaknesses of each approach, guiding
practitioners in selecting the most appropriate technique for their specific needs. Random down-
sampling, Tomek links, and Edited Nearest Neighbors each offer distinct advantages and
challenges. By systematically comparing these techniques, researchers and practitioners can
optimize their models, leading to more effective and reliable credit card fraud detection systems.
Understanding the nuances of these methods is crucial for developing robust strategies that not
only identify fraudulent transactions but also maintain the integrity and trustworthiness of the
financial system.
Performance Evaluation of Regularized GLMs in Fraud Detection
The effectiveness of fraud detection models relies heavily on the evaluation of their performance
across various metrics. Regularized Generalized Linear Models (GLMs) have gained prominence
in credit card fraud detection due to their adaptability and ability to handle imbalanced datasets
when combined with down-sampling techniques. This section discusses the performance
evaluation of these models, focusing on key metrics used to assess their effectiveness in identifying
fraudulent transactions.
Key Performance Metrics
When evaluating the performance of fraud detection models, several
metrics provide insights into their accuracy and reliability. The most commonly used metrics
include precision, recall, F1 score, and area under the receiver operating characteristic curve
(AUC-ROC). Each of these metrics serves a unique purpose:
Precision measures the proportion of true positive predictions (correctly identified fraudulent
transactions) to the total predicted positives (both true positives and false positives). A high
precision indicates that the model is effective at minimizing false positives, which is crucial in
fraud detection to avoid unnecessary disruptions for legitimate customers.
Recall, also known as sensitivity, assesses the proportion of true positive predictions to the
total actual positives (true positives and false negatives). High recall is essential in fraud
detection, as it reflects the model's ability to identify as many fraudulent transactions as
possible, reducing the risk of financial losses for institutions.
F1 Score is the harmonic mean of precision and recall, providing a single metric that balances
both aspects. It is particularly useful in imbalanced datasets, where focusing solely on accuracy
can be misleading. A high F1 score indicates a good balance between precision and recall,
demonstrating the model's effectiveness in detecting fraud without compromising too much on
false positives.
AUC-ROC measures the model's ability to distinguish between the positive and negative
classes across different thresholds. AUC values range from 0 to 1, with a higher value
indicating better model performance. This metric is especially useful for evaluating how well
the model can differentiate fraudulent transactions from legitimate ones, providing a
comprehensive view of the model's capabilities.
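As a small worked example of these definitions, the four metrics can be computed from a fitted model's predictions with scikit-learn; the model, the test split, and the 0.5 decision threshold are carried over from the illustrative sketches above and are assumptions rather than the study's settings.

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_prob = lasso_glm.predict_proba(X_test)[:, 1]   # predicted fraud probability
y_pred = (y_prob >= 0.5).astype(int)             # default 0.5 decision threshold

precision = precision_score(y_test, y_pred)   # TP / (TP + FP)
recall = recall_score(y_test, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_test, y_pred)                 # 2 * precision * recall / (precision + recall)
auc = roc_auc_score(y_test, y_prob)           # threshold-free ranking quality
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}, auc={auc:.3f}")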
Impact of Down-Sampling Techniques
The integration of down-sampling techniques with
regularized GLMs significantly influences these performance metrics. For instance, random down-
sampling may yield quick results but risks losing valuable information, leading to a decrease in
recall and potentially affecting the F1 score. In contrast, more advanced methods like Tomek links
and ENN can improve precision and recall by refining the dataset and enhancing the model's ability
to learn from both classes.
Comparative Analysis of Results
Conducting a comparative analysis of the performance metrics
across different down-sampling techniques will provide insights into their respective impacts on
the performance of regularized GLMs. By assessing the models under varying conditions and
sampling strategies, researchers can identify the most effective approaches for improving fraud
detection. This analysis not only informs the selection of down-sampling techniques but also
contributes to the development of more sophisticated and efficient fraud detection systems.
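One way to organize such a comparison, sketched under the same illustrative assumptions as the earlier snippets (synthetic data, a single L1-penalized GLM, default sampler settings), is to loop over the samplers and tabulate the metrics on a held-out test split.

from imblearn.under_sampling import (RandomUnderSampler, TomekLinks,
                                     EditedNearestNeighbours)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

samplers = {
    "none": None,
    "random": RandomUnderSampler(random_state=42),
    "tomek": TomekLinks(),
    "enn": EditedNearestNeighbours(),
}

for name, sampler in samplers.items():
    # Resample the training split only; evaluation always uses the untouched test split.
    X_fit, y_fit = (X_train, y_train) if sampler is None \
        else sampler.fit_resample(X_train, y_train)
    glm = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    glm.fit(X_fit, y_fit)
    prob = glm.predict_proba(X_test)[:, 1]
    pred = (prob >= 0.5).astype(int)
    print(f"{name:>6}: precision={precision_score(y_test, pred, zero_division=0):.3f} "
          f"recall={recall_score(y_test, pred):.3f} "
          f"f1={f1_score(y_test, pred):.3f} "
          f"auc={roc_auc_score(y_test, prob):.3f}")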
Conclusion
In the ever-evolving landscape of financial transactions, credit card fraud detection remains a
critical concern for institutions seeking to protect their customers and preserve their reputations.
This study underscores the significance of employing Regularized Generalized Linear Models
(GLMs) as an effective statistical approach to combatting this issue. By incorporating down-
sampling techniques, such as random down-sampling, Tomek links, and Edited Nearest Neighbors
(ENN), the study highlights the importance of addressing class imbalance in transaction datasets,
which is essential for developing robust fraud detection systems. The analysis reveals that each
down-sampling technique possesses distinct advantages and challenges. While random down-
sampling provides a straightforward approach to balance the dataset, it risks losing valuable
information that could enhance model performance. Conversely, more sophisticated methods like
Tomek links and ENN refine the dataset while maintaining essential information, leading to
improved precision and recall in detecting fraudulent transactions. Performance evaluation
metrics, namely precision, recall, F1 score, and AUC-ROC, serve as critical indicators of model
effectiveness. The results indicate that GLMs, when combined with appropriate down-sampling
techniques, can significantly enhance the detection capabilities of fraud detection systems. High
precision and recall values reflect the models' ability to accurately identify fraudulent transactions
while minimizing false positives, an essential consideration in maintaining customer trust and
operational integrity. The findings of this study provide valuable insights for financial institutions
striving to improve their fraud detection mechanisms. By adopting a strategic approach that
leverages regularized GLMs and sophisticated down-sampling techniques, organizations can
create more resilient systems that not only safeguard against fraudulent activities but also enhance
overall customer experience. The comparative analysis of various down-sampling techniques
paves the way for future research aimed at refining fraud detection methodologies and fostering
innovation in the field. Ultimately, as fraudsters continually adapt their strategies, the development
of advanced statistical models and machine learning techniques becomes increasingly vital. This
study emphasizes the need for ongoing research and adaptation in fraud detection practices,
ensuring that financial institutions remain one step ahead in the battle against credit card fraud. By
investing in these innovative approaches, organizations can not only mitigate financial losses but
also cultivate a more secure and trustworthy environment for their customers.
Article
Full-text available
Grapevine yield prediction during phenostage and particularly, before harvest is highly significant as advanced forecasting could be a great value for superior grapevine management. The main contribution of the current study is to develop predictive model for each phenology that predicts yield during growing stages of grapevine and to identify highly relevant predictive variables. Current study uses climatic conditions, grapevine yield, phenological dates, fertilizer information, soil analysis and maturation index data to construct the relational dataset. After words, we use several approaches to pre-process the data to put it into tabular format. For instance, generalization of climatic variables using phenological dates. Random Forest, LASSO and Elasticnet in generalized linear models, and Spikeslab are feature selection embedded methods which are used to overcome dataset dimensionality issue. We used 10-fold cross validation to evaluate predictive model by partitioning the dataset into training set to train the model and test set to evaluate it by calculating Root Mean Squared Error (RMSE) and Relative Root Mean Squared Error (RRMSE). Results of the study show that rf_PF, rf_PC and rf_MH are optimal models for flowering (PF), colouring (PC) and harvest (MH) phenology respectively which estimate 1484.5, 1504.2 and 1459.4 (Kg/ha) low RMSE and 24.6%, 24.9% and 24.2% RRMSE, respectively as compared to other models. These models also identify some derived climatic variables as major variables for grapevine yield prediction. The reliability and early-indication ability of these forecast models justify their use by institutions and economists in decision making, adoption of technical improvements, and fraud detection.