Detecting Credit Card Fraud Transactions Using Regularized Forms of Generalized Linear
Models (Lasso regression, Ridge Regression and Elastic Net Regression)
C. Neba, A. Neba, A. Adebisi, P. Kibet, F. Webnda
1,2,3,4,5 Department of Computer Science, Austin Peay State University, Clarksville, Tennessee-USA
cneba@my.apsu.edu, aneba@my.apsu.edu, aadebisi@my.apsu.edu, pkibet@my.apsu.edu, fwegnda@my.apsu.edu
ABSTRACT
This study highlights the problem of credit card fraud and the use of regularized generalized linear models
(GLMs) to detect fraud. GLMs are flexible statistical frameworks that model the relationship between a
response variable and a set of predictor variables. Regularization techniques such as ridge regression, lasso
regression, and elastic net can help mitigate overfitting, resulting in a more parsimonious and interpretable
model. The study used a credit card transaction dataset from September 2013, which included 492 fraud
cases out of 284,807 transactions. The input variables in the dataset were numerical and transformed using
PCA. The results showed that all three models (ridge regression, lasso regression, and elastic net) were
accurate in detecting credit card fraud, with ridge regression being the best with an accuracy of 98%,
followed by lasso regression with 93.2% and elastic net with 93.1%. The study concludes that regularized
GLMs can be useful tools for credit card fraud detection, especially when dealing with high-dimensional
data.
Keywords: Machine Learning, Credit Card Transaction Fraud Detection, Regularized GLM, Ridge
Regression, Elastic Net Regression, Lasso Regression.
1. INTRODUCTION
A notable change in consumer financial services over the past few decades has been the growth of the
use of credit cards, both for payments and as sources of revolving credit. From modest origins in the 1950s
as a convenient way for the relatively well-to-do to settle restaurant and department store purchases
without carrying cash, credit cards have become a ubiquitous financial product held by households in all
economic strata (Credit Cards: Use and Consumer Attitudes, 1970-2000, 2000).
Credit card fraud is a prevalent and challenging problem for financial institutions and consumers
worldwide. In 2021, the Federal Trade Commission (FTC) fielded nearly 390,000 reports of credit card
fraud, making it one of the most common kinds of fraud in the U.S. (Federal Trade Commission, 2020).
Fraudulent transactions can cause significant financial losses, harm the reputation of financial institutions,
and create inconvenience and stress for customers. Therefore, detecting and preventing credit card fraud
is of utmost importance. According to the Nilson Report (December 2021), global credit card and debit
card fraud resulted in losses of $28.58 billion during 2020, with card issuers and merchants incurring 88%
and 12% of those losses, respectively. Card issuer losses occurred mainly at the point of sale from
counterfeit cards while merchant losses occurred mainly on card-not-present (CNP) transactions. The
report also noted that during 2020, credit card and debit card gross fraud losses accounted for roughly
6.81¢ per $100 in total volume, up from 6.78¢ per $100 in 2019. In 2020, the US accounted for 35.83% of
the worldwide payment card fraud losses but generated only 22.40% of total volume. Finally, the Nilson
Report predicted that over the next 10 years, card industry losses to fraud will collectively amount to
$408.50 billion.
One approach to detect credit card fraud is to use Machine Learning techniques such as a generalized
linear model (GLM), which is a flexible statistical framework that allows modeling the relationship
between a response variable and a set of predictor variables. However, GLMs can suffer from overfitting,
which occurs when the model is too complex and fits the noise in the data instead of the underlying signal.
Regularization techniques can help mitigate overfitting by adding a penalty term to the model's objective
function that discourages large coefficients.
In this context, regularized forms of GLMs, such as ridge regression and lasso regression, can be useful
tools for detecting credit card fraud. These techniques allow the model to shrink the coefficients of less
important predictors, leading to a more parsimonious and interpretable model that is less prone to
overfitting. Furthermore, regularized GLMs can handle high-dimensional data with many predictors, a
common scenario in credit card fraud detection, where there are numerous features that may be relevant
to identifying fraudulent transactions.
For this study, we used the credit card transaction dataset from September 2013 which includes
transactions made by European cardholders (Dal Pozzolo, Caelen, Johnson, & Bontempi, 2015). This dataset
covers a two-day period and includes 492 fraud cases out of a total of 284,807 transactions. The dataset
is considered unbalanced because the positive class (frauds) accounts for only 0.172% of all transactions.
The input variables in the dataset are numerical and have been transformed using PCA. The original
features and additional background information about the data cannot be disclosed due to confidentiality
concerns. The dataset includes 28 principal components obtained through PCA, and the 'Time' and
'Amount' features have not been transformed. 'Time' indicates the time in seconds between a given
transaction and the first transaction in the dataset, while 'Amount' indicates the transaction amount and
can be used for cost-sensitive learning. The response variable, 'Class', takes a value of 1 for fraud cases
and 0 for non-fraud cases.
2. A GENERAL REVIEW ON CREDIT CARD FRAUD DETECTION TECHNIQUES
According to Hanagandi, Dhar, and Buescher (1996), historical information on credit card transactions was
used to develop a fraud score model using a radial basis function network and a density-based clustering
approach. The authors applied this methodology to a fraud detection problem and reported satisfactory
preliminary results. The paper is considered an early example of using machine learning techniques for
fraud detection in credit card transactions, which has since become an important application area of
machine learning and data analytics.
Dorronsoro et al. (1997) developed an online system for detecting credit card fraud using a neural
classifier, which was constructed using a nonlinear version of Fisher's discriminant analysis. The authors
reported that the system is currently fully operational and can handle more than 12 million credit card
operations per year, with satisfactory results obtained.
Bentley et al. (2000) proposed a genetic programming-based algorithm for classifying credit card
transactions into suspicious and non-suspicious categories using logic rules. The algorithm was tested on
a database of 4,000 transactions with 62 fields, and the most effective rule was selected based on its
predictability. This algorithm has shown promise in detecting credit card fraud, particularly in the context
of home insurance data. Nonetheless, given the constantly changing nature of fraud tactics, new and
advanced fraud detection methods are continuously being developed to keep pace with this evolving field.
3. METHODOLOGY
The methodology for building the fraud detection model involves data exploration and cleaning, data
preprocessing, model building, model evaluation, and interpretation. The data was checked for missing
values, outliers, and correlations between predictor variables. Standardization was used to avoid bias and
the data was split into training and testing sets. Logistic regression with L1 regularization (Lasso
Regularization), L2 regularization (Ridge Regularization) and a combination of L1 and L2 regularization also
known as Elastic Net Regularization was used and hyperparameters were tuned using cross-validation.
Performance metrics like accuracy, precision, recall, F1-score, ROC-AUC, etc. were used to evaluate the
model, and the coefficients were interpreted to understand the impact of predictor variables.
3.1. Exploratory Data Analysis
This dataset contains 31 variables, including the response variable (Class). The variables V1 to V28 are the numerical principal components obtained from the PCA transformation, while the Time and Amount variables were left untransformed. The Amount variable is a numerical variable representing the transaction amount, and the Class variable is a binary variable representing whether the transaction is fraudulent or not (1 for fraud, 0 for not fraud).
Variable Description
V1 to V28 Numeric principal components from the PCA transformation of the original, confidential transaction features.
Amount Numeric variable representing the amount of the transaction.
Class Binary variable indicating whether the transaction was fraudulent (1) or not (0).
3.1.1. Checking Missing Data
The following R code was used to extract all rows from the "credit_card" data frame that contain missing values; the empty output shows that the dataset has no missing values.
credit_card[!complete.cases(credit_card),]
[1] Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
[18] V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
<0 rows> (or 0-length row.names)
To confirm the absence of missing values in the dataset, the data frame was visualized using the "naniar" package in R; again, we see that there are no missing values.
library(naniar)
vis_miss(credit_card, warn_large_data = FALSE) # cell-level overview of missingness
gg_miss_var(credit_card) # missing-value counts per variable
Fig. 1: Visualization of Missing Values in the Dataset
3.1.2. Frequency Distribution of “Class” Variable
The dataset is highly imbalanced: the positive class (frauds) accounts for just 0.1727486% of all transactions (492 out of 284,807), which is not suitable for this project, hence we will need to balance the data before proceeding with our logistic regression.
Fig. 2: Frequency distribution of “Class” Variable
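The class frequencies quoted above can be reproduced with a quick tabulation (the code is not shown in the paper; a minimal sketch, assuming the data frame is named credit_card):

table(credit_card$Class) # raw counts per class (284,315 non-fraud vs. 492 fraud)
prop.table(table(credit_card$Class)) * 100 # percentages per class (~99.83% vs. ~0.17%)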
Looking at Fig. 3 below, we see that all the fraudulent transactions were made for amounts below roughly 2,500, far less than the largest legitimate transactions, further confirming the imbalanced nature of the dataset.
Fig. 3: Amount vs. Class Fraud Chart
3.1.3. Checking Correlation Between the Variables
Looking at Fig. 4 below, we see that there is very little to no correlation among the variables, except for V2, which has some negative correlation with Amount. We can nonetheless ignore this, since it will not affect the outcome of our analysis in any significant way.
Fig. 4: Correlation Between Variables
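The paper does not include the code behind Fig. 4; a correlation heatmap of this kind can be produced with the corrplot package (a minimal sketch, assuming the data frame is named credit_card):

library(corrplot)
# Correlation matrix over the numeric predictors (all columns except Class)
corr_mat <- cor(credit_card[, setdiff(names(credit_card), "Class")])
corrplot(corr_mat, method = "color", tl.cex = 0.6) # small axis labels to fit ~30 variables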
3.2. Data Processing
Data processing is an important step in machine learning because the quality of the data used to train a model
can significantly impact the accuracy and performance of the resulting model. The following data processing
steps were carried out:
3.2.1. Data Cleaning and Feature Selection:
The raw data did not contain any missing values, hence there was no need for data imputation. The Time variable was not relevant to the outcome we are trying to predict and did not have a significant impact on it, so we removed it.
3.2.2. Data Normalization:
Since the range and distribution of the data can impact the performance of machine learning algorithms, we normalized the data to ensure that the model is not biased towards any feature due to differences in scale. This process was carried out after splitting the dataset into training and testing sets.
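The normalization code is not shown in the paper; a minimal sketch (assuming train and test data frames from the split described in Section 4, with all predictor columns except Class) is:

# Standardize predictors using training-set statistics only, then apply
# the same centering and scaling to the test set to avoid leakage.
num_cols <- setdiff(names(train), "Class")
mu <- sapply(train[num_cols], mean)
sdev <- sapply(train[num_cols], sd)
train[num_cols] <- scale(train[num_cols], center = mu, scale = sdev)
test[num_cols] <- scale(test[num_cols], center = mu, scale = sdev)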
4. DATA PARTITIONING
The dataset was split into training and testing sets in a ratio of 2:1 (the training set contained 190,450 non-fraud cases and 326 fraud cases). After splitting the dataset, we balanced the training data using the downsampling technique. Downsampling creates a new subset of the original training data in which the majority class (non-fraud) is reduced, so that the minority class (fraud) is represented more frequently relative to it. This helps address the class imbalance issues that arise in predictive modeling tasks where one class is significantly more prevalent than the other. After balancing the training data (326 non-fraud cases and 326 fraud cases), we could proceed to building the regularized GLMs.
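The partitioning and balancing code is not included in the paper; one common approach uses the caret package (a minimal sketch, assuming the data frame credit_card with the Time column already dropped; the seed value is an assumption):

library(caret)
set.seed(123) # hypothetical seed, for reproducibility
# Stratified 2:1 train/test split on the Class variable
idx <- createDataPartition(factor(credit_card$Class), p = 2/3, list = FALSE)
train <- credit_card[idx, ]
test <- credit_card[-idx, ]
# Downsample the majority (non-fraud) class to match the minority class
down_train_data <- downSample(x = train[, setdiff(names(train), "Class")],
                              y = factor(train$Class), yname = "Class")
table(down_train_data$Class) # expect equal counts (e.g., 326 and 326)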
5. MODEL FITTING/DATA MODELING
According to Altman et al. (1994) and Flitman (1997), an increasing number of statistical models have been applied to data mining tasks, including regression analysis, multiple discriminant analysis, logistic regression, the Probit method, and others (Hanagandi, Dhar, & Buescher, 1996). In the context of the credit card dataset, regularized GLMs (Ridge, Lasso, and Elastic Net models) can be used to identify the features that are most relevant for detecting fraudulent transactions while also reducing the effects of multicollinearity. These models work by adding a penalty term to the model's objective function (here, the negative log-likelihood of the logistic regression), which shrinks the regression coefficients towards zero. Ridge regression adds the L2 norm of the coefficients as a penalty term, Lasso regression adds the L1 norm, and Elastic Net regression adds a combination of the L1 and L2 norms. Through the L1 penalty, the Lasso and Elastic Net can shrink the coefficients of some features exactly to zero, effectively eliminating them from the model, while Ridge only shrinks coefficients toward zero without eliminating them; in all three cases, the penalty helps address the issue of multicollinearity.
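For reference, the penalized objective that glmnet minimizes for logistic regression can be written as follows (a standard formulation matching the package's parameterization, not reproduced from the paper):

\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta) - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big] \;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \;+\; \alpha\,\lVert\beta\rVert_1\right]

Setting \alpha = 0 gives Ridge, \alpha = 1 gives the Lasso, and 0 < \alpha < 1 gives the Elastic Net; \alpha corresponds directly to the alpha argument passed to cv.glmnet() below.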
For the regression models, we used the following R snippet:
CV <- cv.glmnet(x=X, y=down_train_data$Class, family="binomial", alpha = 0, nlambda = 200)
This performs cross-validation on the model using the cv.glmnet() function from the glmnet package. We then used the snippet coef(CV, s = CV$lambda.min) to retrieve the coefficient estimates for the Ridge model fit at the optimal lambda value selected through cross-validation, allowing us to see which variables are most strongly associated with the response variable in the final model. Lastly, we used coef(CV, s = CV$lambda.1se) to retrieve the coefficient estimates at the lambda value one standard error from the optimum, which yields a more parsimonious model.
5.1. RIDGE REGRESSION MODEL
The following Ridge model predictions were obtained.
head(pred)
s0
2 0.1820028
4 0.1732274
5 0.1394861
8 0.1483054
11 0.1540031
13 0.1547350
tail(pred)
s0
284789 0.1238469
284793 0.1290248
284797 0.2041513
284802 0.1805562
284803 0.1084127
284804 0.1419076
The R code snippets below were used to obtain the above predictions for the Ridge model.
library(glmnet)
# Model formula as a string; factor() ensures Class is treated as a categorical response
formula0 <- "factor(Class) ~ ."
X <- model.matrix(as.formula(formula0), data = down_train_data)[, -1] # drop the intercept column
# Cross-validation method for Ridge (alpha = 0)
CV <- cv.glmnet(x = X, y = down_train_data$Class, family = "binomial", alpha = 0, nlambda = 200)
coef(CV, s = CV$lambda.min)
coef(CV, s = CV$lambda.1se)
b.lambda <- CV$lambda.1se
b.lambda
fit.best <- glmnet(x = X, y = down_train_data$Class, family = "binomial", alpha = 0,
lambda = b.lambda)
(fit.best$beta)
# Ridge prediction on the test set
X.test <- model.matrix(as.formula(formula0), data = test)[, -1]
pred <- predict(fit.best, newx = X.test, type = "response")
head(pred)
# Misclassification rate
yobs <- test$Class # observed test labels (this definition is not shown in the paper)
pred1 <- ifelse(pred > 0.5, 1, 0) # if the predicted probability exceeds 50%, classify as fraud
pred1
(miss.rate <- mean(yobs != pred1))
5.1.1. Ridge Regression Model Evaluation
Misclassification Rate for the Ridge Model
The misclassification rate is 0.005966118, which indicates that the model makes 0.5966% incorrect predictions and 99.4034% correct predictions, so the model is predicting accurately.
Confidence Interval for the Area Under the Curve (AUC)
The AUC is 0.979; with 95% confidence, the true AUC lies between 0.966 and 0.992.
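The paper does not show how the AUC and its confidence interval were computed; one way to obtain them is with the pROC package (a minimal sketch, assuming yobs and pred from the code above):

library(pROC)
roc_obj <- roc(response = yobs, predictor = as.numeric(pred))
auc(roc_obj) # area under the ROC curve
ci.auc(roc_obj) # 95% confidence interval for the AUC (DeLong method by default)
plot(roc_obj) # draws the ROC curve (cf. Fig. 5)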
ROC Curve of the best fit Model for Ridge
Fig. 5: ROC Curve of the best fit Model for Ridge.
The higher the AUC-ROC, the better the model performs at distinguishing between the positive and negative classes; when the AUC lies between 0.5 and 1 (0.5 < AUC < 1), there is a high chance that the model can separate the positive class values from the negative class values. Since our AUC value is 0.979, with a 95% confidence interval between 0.966 and 0.992, the model discriminates well at every trade-off between the FPR (false positive rate) and the TPR (true positive rate).
Recall and Precision Score for the Ridge Model
We obtained a precision of 0.994311, which means 99.4311% of our predictions are relevant. We obtained a recall of 0.9997108, which shows that the model correctly classifies 99.97108% of the total relevant results.
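The paper does not show how these scores were computed; they can be read off the confusion matrix of the thresholded predictions (a minimal sketch, assuming yobs and pred1 from the Ridge code above; which class the paper treats as "positive" is not stated, so it is passed as an argument here):

cm <- table(Predicted = pred1, Observed = yobs)
prec_recall <- function(cm, positive) {
  TP <- cm[positive, positive]
  precision <- TP / sum(cm[positive, ]) # TP / (TP + FP)
  recall <- TP / sum(cm[, positive]) # TP / (TP + FN)
  c(precision = precision, recall = recall)
}
prec_recall(cm, positive = "0") # or positive = "1", depending on the convention used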
Recall And Precision Curves for Ridge Model
Fig. 6: Recall And Precision Curves for Ridge Model
These curves give the shape we would expect: at thresholds with low recall, the precision is correspondingly high, and at very high recall, the precision begins to drop. Looking at the Precision-Recall curve, we notice the curve sustains precision up to about 81%.
5.2. ELASTIC NET REGRESSION MODEL
The following Elastic Net Model predictions were obtained.
(head(pred_elasticnet <- round(pred_elasticnet,
digits = 2)))
s0
2 0.14
4 0.09
5 0.12
8 0.12
11 0.09
13 0.10
(tail(pred_elasticnet <- round(pred_elasticnet,
digits = 2)))
s0
284789 0.11
284793 0.07
284797 0.17
284802 0.13
284803 0.01
284804 0.09
The R code snippets below were used to obtain the above predictions for the Elastic Net model.
# Cross-validation method for Elastic Net (alpha = 0.5)
CV2 <- cv.glmnet(x = X, y = down_train_data$Class, family = "binomial", alpha = 0.5, nlambda = 200)
coef(CV2, s = CV2$lambda.min)
coef(CV2, s = CV2$lambda.1se)
b.lambda2 <- CV2$lambda.1se
b.lambda2
fit.best2 <- glmnet(x = X, y = down_train_data$Class, family = "binomial", alpha = 0.5,
lambda = b.lambda2)
(fit.best2$beta)
# Elastic Net prediction on the test set
X.test <- model.matrix(as.formula(formula0), data = test)[, -1]
pred_elasticnet <- predict(fit.best2, newx = X.test, type = "response")
(head(pred_elasticnet <- round(pred_elasticnet, digits = 2)))
(tail(pred_elasticnet <- round(pred_elasticnet, digits = 2)))
# Misclassification rate for the Elastic Net prediction
pred1_elasticnet <- ifelse(pred_elasticnet > 0.5, 1, 0)
(miss.rate <- mean(yobs != pred1_elasticnet))
5.2.1. Elastic Net Model Evaluation
Misclassification Rate for the Elastic Net Model
The misclassification rate is 0.01261286, which indicates that the model makes 1.2613% incorrect predictions and 98.7387% correct predictions, so the model is predicting accurately.
Confidence Interval for the Area Under the Curve (AUC)
The AUC is 0.931; with 95% confidence, the true AUC lies between 0.880 and 0.982.
ROC Curve of the best fit Model for Elastic Net
Fig.7: ROC Curve of the best fit Model for Elastic Net.
The higher the AUC-ROC, the better the model performs at distinguishing between the positive and negative classes; when the AUC lies between 0.5 and 1 (0.5 < AUC < 1), there is a high chance that the model can separate the positive class values from the negative class values. Since our AUC value is 0.931, with a 95% confidence interval between 0.880 and 0.982, the model discriminates well at every trade-off between the FPR (false positive rate) and the TPR (true positive rate).
Recall and Precision Score for Elastic Net Model
We obtained a precision of 0.9875886, which means 98.75886% of our predictions are relevant. We obtained a recall of 0.9997735, which shows that the model correctly classifies 99.97735% of the total relevant results.
Recall and Precision Curves for Elastic Net Model
Fig.8: Recall and Precision Curves for Elastic Net Model
These curves give the shape we would expect: at thresholds with low recall, the precision is correspondingly high and roughly constant, and at very high recall, the precision begins to drop. Looking at the Precision-Recall curve, we notice the curve sustains precision up to about 78%.
5.3. LASSO REGRESSION
The following Lasso model predictions were obtained.
(head(pred_lasso <- round(pred_lasso, digits
= 2)))
s0
2 0.14
4 0.11
5 0.21
8 0.19
11 0.11
13 0.12
(tail(pred_lasso <- round(pred_lasso, digits =
2)))
s0
284789 0.15
284793 0.09
284797 0.20
284802 0.14
284803 0.01
284804 0.12
The R code snippets below were used to obtain the above predictions for the Lasso model.
# Cross-validation method for Lasso (alpha = 1)
CV1 <- cv.glmnet(x = X, y = down_train_data$Class, family = "binomial", alpha = 1, nlambda = 200)
coef(CV1, s = CV1$lambda.min)
coef(CV1, s = CV1$lambda.1se)
b.lambda1 <- CV1$lambda.1se
b.lambda1
fit.best1 <- glmnet(x = X, y = down_train_data$Class, family = "binomial", alpha = 1,
lambda = b.lambda1)
(fit.best1$beta)
# Lasso prediction on the test set
X.test <- model.matrix(as.formula(formula0), data = test)[, -1]
pred_lasso <- predict(fit.best1, newx = X.test, type = "response")
(head(pred_lasso <- round(pred_lasso, digits = 2)))
(tail(pred_lasso <- round(pred_lasso, digits = 2)))
# Misclassification rate for the Lasso prediction
pred1_lasso <- ifelse(pred_lasso > 0.5, 1, 0) # if the predicted probability exceeds 50%, classify as fraud
(miss.rate <- mean(yobs != pred1_lasso))
5.3.1. Lasso Model Evaluation
Misclassification Rate for the Lasso Model
The misclassification rate is 0.009273537, which indicates that the model makes 0.9274% incorrect predictions and 99.0726% correct predictions, so the model is predicting accurately.
Confidence Interval for the Area Under the Curve (AUC)
The AUC is 0.931; with 95% confidence, the true AUC lies between 0.880 and 0.982.
ROC Curve of the best fit Model for Lasso.
Fig.9: ROC Curve of the best fit Model for Lasso.
The higher the AUC-ROC, the better the model performs at distinguishing between the positive and negative classes; when the AUC lies between 0.5 and 1 (0.5 < AUC < 1), there is a high chance that the model can separate the positive class values from the negative class values. Since our AUC value is 0.931, with a 95% confidence interval between 0.880 and 0.982, the model discriminates well at every trade-off between the FPR (false positive rate) and the TPR (true positive rate).
Recall and Precision Score for Lasso Model
We obtained a precision of 0.9909338, which means 99.09338% of our predictions are relevant. We obtained a recall of 0.9997743, which shows that the model correctly classifies 99.97743% of the total relevant results.
Recall and Precision Curves for Lasso Model
Fig.10: Recall and Precision Curves for Lasso Model
These curves give the shape we would expect: at thresholds with low recall, the precision is correspondingly high and roughly constant, and at very high recall, the precision begins to drop. Looking at the Precision-Recall curve, we notice the curve sustains precision up to about 88%.
CONCLUSION
From the outputs of the Ridge Regression, Elastic Net Regression, and Lasso Regression, we see that all three models are accurate in detecting fraudulent credit card transactions. Among the three models, Ridge Regression is the best, with an accuracy of 98% and a 95% confidence interval between 97% and 99%. Lasso Regression is the next best model, with an accuracy of 93.2% and a 95% confidence interval between 88% and 98%, just slightly above Elastic Net, which comes third with an accuracy of 93.1% and a 95% confidence interval between 87% and 98%. In practice, Lasso and Elastic Net have essentially equal accuracy, while Ridge is the best.
RECOMMENDATION
We recommend deploying the final models to make predictions on new data and monitoring their performance regularly.
REFERENCES
Altman, E. I., Marco, G., & Varetto, F. (1994). Corporate distress diagnosis comparisons using linear
discriminant analysis and neural networks. Journal of Banking and Finance, 18(3), 505-529.
Bentley, P., Kim, J., Jung, G., & Choi, J. (2000). Fuzzy Darwinian Detection of Credit Card Fraud. Proceedings of the 14th Annual Fall Symposium of the Korean Information Processing Society.
Credit Cards: Use and Consumer Attitudes, 1970-2000, 86 Fed. Res. Bull. 623 (2000).
Dal Pozzolo, A., Caelen, O., Johnson, R. A., & Bontempi, G. (2015). Calibrating Probability with
Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and
Data Mining (CIDM) (pp. 1-7). IEEE.
Dorronsoro, J. R., Ginel, F., Sanchez, J. A., & Cruz, J. M. (1997). Building an online system for fraud detection
of credit card operations based on a neural classifier. Proceedings of the International Conference
on Artificial Neural Networks (ICANN'97), 683-688.
Federal Trade Commission. (2020). Consumer Sentinel Network Data Book 2020. Retrieved from https://www.ftc.gov/system/files/documents/reports/consumer-sentinel-network-data-book-2020/csn_data_book_2020.pdf
Flitman, A. M. (1997). Towards analysing student failures: neural networks compared with regression
analysis and multiple discriminant analysis. Computers & Operations Research, 24(4), 367-377.
Hanagandi, V., Dhar, S., & Buescher, K. (1996). Application of classification models on credit card fraud
detection. Proceedings of the International Conference on Neural Networks, 1996. doi:
10.1109/ICNN.1996.548981
Nilson Report. (December 2021). Card Fraud Worldwide. Issue 1209.