Enhancing Credit Card Fraud Detection
with Regularized Generalized Linear Models:
A Comparative Analysis of Down-Sampling
and Up-Sampling Techniques
1Cyril Neba C., 2Gerard Shu F., 3Adrian Neba F., 4Aderonke Adebisi, 5P. Kibet, 6F. Webnda, 7Philip Amouda A.
1,3,4,5,6,7Department of Mathematics and Computer Science, Austin Peay State University, Clarksville, Tennessee, USA
2Montana State University, Gianforte School of Computing, Bozeman, Montana, USA
Abstract:- This study highlights the problem of credit
card fraud and the use of regularized generalized linear
models (GLMs) to detect fraud. GLMs are flexible
statistical frameworks that model the relationship
between a response variable and a set of predictor
variables. Regularization techniques such as ridge
regression, lasso regression, and Elasticnet can help
mitigate overfitting, resulting in a more parsimonious and
interpretable model. The study used a credit card
transaction dataset from September 2013, which included
492 fraud cases out of 284,807 transactions.
The rising prevalence of credit card fraud has led to
the development of sophisticated detection methods, with
machine learning playing a pivotal role. In this study, we
employed three machine learning models: Ridge
Regression, Elasticnet Regression, and Lasso Regression,
to detect credit card fraud using both Down-Sampling
and Up-Sampling techniques. The results indicate that all
three models exhibit accuracy in credit card fraud
detection. Among them, Ridge Regression stands out as
the most accurate model, achieving an impressive 98%
accuracy with a 95% confidence interval between 97%
and 99%. Following closely, Lasso Regression and
Elasticnet Regression both demonstrate solid
performance, with accuracy rates of 93.2% and 93.1%,
respectively, and 95% confidence intervals ranging from
88% to 98%.
When considering the Up-Sampling technique,
Ridge Regression maintains its position as the most
accurate model, achieving an accuracy rate of 98.2% with
a 95% confidence interval spanning from 97% to 99.4%.
Elasticnet Regression follows with an accuracy rate of
94.1% and a confidence interval between 0.8959 and
0.9854, while Lasso Regression exhibits a slightly lower
accuracy of 93.1% with a confidence interval from 0.88 to
0.982. Overall, all three machine learning models (Ridge
Regression, Elasticnet Regression, and Lasso
Regression) demonstrate competence in credit card
fraud detection. Ridge Regression consistently
outperforms the others in both Down-Sampling and Up-
Sampling scenarios, making it a valuable tool for financial
institutions to safeguard against credit card fraud threats
in the United States.
Keywords:- Machine Learning, Credit Card Transaction
Fraud Detection, Regularized GLM, Ridge Regression,
Elasticnet Regression, Lasso Regression, Down-Sampling
and Up-Sampling.
I. INTRODUCTION
A notable change in consumer financial services over
the past few decades has been the growth of the use of credit
cards, both for payments and as sources of revolving credit.
From modest origins in the 1950s as a convenient way for the
relatively well-to-do to settle restaurant and department store
purchases without carrying cash, credit cards have become a
ubiquitous financial product held by households in all
economic strata (Credit Cards: Use and Consumer Attitudes,
1970-2000, 2000)".
Credit card fraud is a prevalent and challenging problem
for financial institutions and consumers worldwide. In 2021,
the Federal Trade Commission (FTC) fielded nearly 390,000
reports of credit card fraud, making it one of the most
common kinds of fraud in the U.S. (Federal Trade
Commission, 2020); it consistently ranks among the top
consumer complaints, with thousands of cases reported each
year. In 2020 alone, the FTC received over 2.2 million fraud
reports, with identity theft and credit card fraud being among
the most frequently cited issues (Federal Trade Commission,
2021).
Fraudulent transactions can cause significant financial
losses, harm the reputation of financial institutions, and create
inconvenience and stress for customers. Therefore, detecting
and preventing credit card fraud is of utmost importance.
According to the Nilson Report (December 2021), global
credit card and debit card fraud resulted in losses of $28.58
billion during 2020, with card issuers and merchants
incurring 88% and 12% of those losses, respectively. Card
issuer losses occurred mainly at the point of sale from
counterfeit cards while merchant losses occurred mainly on
card-not-present (CNP) transactions. The report also noted
that during 2020, credit card and debit card gross fraud losses
accounted for roughly 6.81¢ per $100 in total volume, up
from 6.78¢ per $100 in 2019. In 2020, the US accounted for
35.83% of the worldwide payment card fraud losses but
generated only 22.40% of total volume. Finally, the Nilson
Report predicted that over the next 10 years, card industry
losses to fraud will collectively amount to $408.50 billion.
II. METHODS OF CREDIT CARD FRAUD
DETECTION
Detecting and preventing credit card fraud is an ongoing
challenge, but various methods have been developed to
mitigate its impact. One increasingly vital approach is the
utilization of machine learning and data analytics. These
techniques enable the analysis of vast datasets, identifying
suspicious patterns and anomalies that may indicate
fraudulent activity.
Anomaly Detection: Anomaly detection algorithms, such
as clustering and autoencoders, are used to flag transactions
that significantly deviate from the norm. These anomalies
may indicate fraudulent activities, and machine learning
models can assign risk scores to transactions based on their
deviation from established patterns (Ahmed et al., 2016).
Behavioral Analysis: Machine learning can analyze a
user's behavior over time, creating a profile of their typical
spending habits. Any sudden deviations from this profile
can trigger alerts for further investigation (Ahmed et al.,
2016).
Network Analysis: Credit card fraud often involves
coordinated efforts by criminal networks. Machine
learning models can analyze transaction networks to
identify links between seemingly unrelated accounts and
transactions, uncovering hidden patterns (Gandomi &
Haider, 2015).
Real-time Monitoring: Machine learning models can
operate in real-time, allowing for the immediate detection
of potentially fraudulent transactions. This rapid response
is crucial in preventing losses (Gandomi & Haider, 2015).
Natural Language Processing (NLP): NLP techniques
can be employed to analyze text data, such as customer
service interactions and transaction comments, to identify
suspicious language or phrases associated with fraud
(Dinakar & Nair, 2020).
Pattern Recognition: Machine learning algorithms can be
trained to recognize patterns associated with fraudulent
transactions. By analyzing historical data, these models can
identify deviations from typical spending behavior, such as
unusual purchase locations, transaction amounts, or
frequencies (Acar et al., 2017). One pattern recognition
approach to detecting credit card fraud is to use machine
learning techniques such as a generalized linear model
(GLM), a flexible statistical framework that allows
modeling the relationship between a response variable and
a set of predictor variables. However, GLMs can suffer
from overfitting, which occurs when the model is too
complex and fits the noise in the data instead of the
underlying signal. Regularization techniques can help
mitigate overfitting by adding a penalty term to the model's
objective function that discourages large coefficients.
In this context, regularized forms of GLMs, such as
ridge regression and lasso regression, can be useful tools for
detecting credit card fraud. These techniques allow the
model to shrink the coefficients of less important predictors,
leading to a more parsimonious and interpretable model that
is less prone to overfitting. Furthermore, regularized GLMs
can handle high-dimensional data with many predictors, a
common scenario in credit card fraud detection, where there
are numerous features that may be relevant to identifying
fraudulent transactions.
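In R, such regularized GLMs can be fit with the glmnet package; a minimal sketch (mirroring the Appendix code, with X assumed to be a model matrix of predictors and y the binary fraud label) is:

library(glmnet) # regularized GLMs
# alpha selects the penalty: 0 = Ridge, 1 = Lasso, values in between = Elasticnet
fit_ridge <- glmnet(x = X, y = y, family = "binomial", alpha = 0)
fit_lasso <- glmnet(x = X, y = y, family = "binomial", alpha = 1)
fit_enet <- glmnet(x = X, y = y, family = "binomial", alpha = 0.5)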
For this study, we used the credit card transaction
dataset from September 2013, which includes transactions
made by European cardholders (Dal Pozzolo, Caelen, Johnson,
and Bontempi, 2015). This dataset covers a two-day period
and includes 492 fraud cases out of a total of 284,807
transactions. The dataset is considered unbalanced because
the positive class (frauds) accounts for only 0.172% of all
transactions. The input variables in the dataset are numerical
and have been transformed using PCA. The original features
and additional background information about the data cannot
be disclosed due to confidentiality concerns. The dataset
includes 28 principal components obtained through PCA, and
the 'Time' and 'Amount' features have not been transformed.
'Time' indicates the time in seconds between a given
transaction and the first transaction in the dataset, while
'Amount' indicates the transaction amount and can be used for
cost-sensitive learning. The response variable, 'Class', takes a
value of 1 for fraud cases and 0 for non-fraud cases.
III. A GENERAL REVIEW ON CREDIT CARD
FRAUD DETECTION TECHNIQUES
According to Hanagandi, Dhar, and Buescher (1996),
historical information on credit card transactions was used to
develop a fraud score model using a radial basis function
network and a density-based clustering approach. The authors
applied this methodology to a fraud detection problem and
reported satisfactory preliminary results. The paper is
considered an early example of using machine learning
Techniques for fraud detection in credit card transactions,
which has since become an important application area of
machine learning and data analytics.
Dorronsoro et al. (1997) developed an online system
for detecting credit card fraud using a neural classifier, which
was constructed using a nonlinear version of Fisher's
discriminant analysis. The authors reported that the system is
currently fully operational and can handle more than 12
million credit card operations per year, with satisfactory
results obtained.
Bentley et al. (2000) proposed a genetic programming-
based algorithm for classifying credit card transactions into
suspicious and non-suspicious categories using logic rules.
The algorithm was tested on a database of 4,000 transactions
with 62 fields, and the most effective rule was selected based
on its predictability. This algorithm has shown promise in
detecting credit card fraud, particularly in the context of home
insurance data. Nonetheless, given the constantly changing
nature of fraud tactics, new and advanced fraud detection
methods are continuously being developed to keep pace with
this evolving field.
IV. METHODOLOGY
The methodology for building the fraud detection
model involves data exploration and cleaning, data
preprocessing, model building, model evaluation, and
interpretation. The data was checked for missing values,
outliers, and correlations between predictor variables.
Standardization was used to avoid bias and the data was split
into training and testing sets followed by modeling using
down-sample and up-sample Techniques. Logistic regression
with L1 regularization (Lasso Regularization), L2
regularization (Ridge Regularization) and a combination of
L1 and L2 regularization also known as Elasticnet
Regularization was used and hyperparameters were tuned
using cross-validation. Performance metrics like accuracy,
precision, recall, F1-score, ROC-AUC, etc. were used to
evaluate the model, and the coefficients were interpreted to
understand the impact of predictor variables.
A. Exploratory Data Analysis
This dataset contains 31 variables, including the response
variable (Class). The variables V1 to V28 are the principal
components obtained from the PCA transformation, while the
Time and Amount variables were left untransformed. The
Amount variable is a numerical variable representing the
transaction amount, and the Class variable is a binary
variable representing whether the transaction is fraudulent
or not (1 for fraud, 0 for not fraud).
Table 1: Structure of Variables

Variable   | Description
V1 to V28  | Numeric variables (principal components from PCA) representing different aspects of the transaction.
Amount     | Numeric variable representing the amount of the transaction.
Class      | Binary variable indicating whether the transaction was fraudulent (1) or not (0).
Checking Missing Data
The following R snippet was used to extract all rows from
the "credit_card" data frame that have missing values;
looking at the output, we see that the dataset has no
missing values.
credit_card[!complete.cases(credit_card),]
[1] Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
[18] V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
<0 rows> (or 0-length row.names)
To confirm the absence of missing values in the dataset,
the data frame was visualized using the "naniar" package in
R; again, there are no missing values in the dataset, as can
be seen in the plot below.
Fig. 1: Visualization of Missing Values in the dataset
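The visualization was produced with the naniar package, along these lines (from the Appendix):

install.packages("naniar")
library(naniar)
vis_miss(credit_card, warn_large_data = FALSE) # cell-level map of missing values
gg_miss_var(credit_card)                       # count of missing values per variable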
Frequency Distribution of “Class” Variable
The dataset is highly imbalanced: the positive class
(frauds) accounts for just 0.1727486% of total transactions
(that is, 492 out of a total of 284,807 transactions), which is
not suitable for this project, hence we need to balance the
data before proceeding with our logistic regression, as
checked in the snippet below.
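table(credit_card$Class)             # 284315 non-fraud vs 492 fraud transactions
prop.table(table(credit_card$Class)) # fraud share of about 0.1727%
ggplot(data = credit_card, aes(x = Class)) + geom_bar(col = "red")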
Fig. 2: Frequency distribution of “Class” Variable
Looking at Fig. 3 below, we see that all the fraudulent
transactions were made for less than about 2,500, far less
than the amounts of the legitimate transactions, thereby
confirming the imbalanced nature of the dataset.
Fig. 3: Amount Vs Class Fraud chart
Checking Correlation Between the Variables
Looking at Fig. 4 below, we see that there is very little
or no correlation among the variables, except for V2, which
has some negative correlation with Amount. We can
nonetheless ignore it, since it will not affect the outcome of
our analysis in any significant way.
Fig. 4: Correlation Between Variables
Data Processing
Data processing is an important step in machine learning
because the quality of the data used to train a model can
significantly impact the accuracy and performance of the
resulting model. The following data processing steps were
carried out:
Data Cleaning and Feature Selection:
The raw data did not contain any missing values, hence,
there was no need for any data imputation. The Time variable
was irrelevant to the outcome we are trying to predict and did
not have a significant impact on it, so we took it out.
Data Normalization:
Since the range and distribution of the data could impact
the performance of our machine learning algorithms, we
normalized the data to ensure that the model is not biased
towards any feature due to differences in scale, especially
because the data was highly unbalanced. This step was
carried out after splitting the dataset into training and testing
sets; a sketch of this step is given below.
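The Appendix does not show this step explicitly; a minimal sketch, assuming base R's scale() with training-set statistics reused on the test set (num_cols, train_scaled, and test_scaled are illustrative names), would be:

# hypothetical illustration: standardize predictors with training-set statistics
num_cols <- setdiff(names(train), "Class")
ctr <- sapply(train[num_cols], mean)
sds <- sapply(train[num_cols], sd)
train_scaled <- train
test_scaled <- test
train_scaled[num_cols] <- scale(train[num_cols], center = ctr, scale = sds)
test_scaled[num_cols] <- scale(test[num_cols], center = ctr, scale = sds)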
B. Modeling With Down-Sampling Technique
Data Partitioning With Down-Sampling Technique
The dataset was split into training (190450 for Non-fraud
Cases and 326 for Fraud Cases) and testing sets with a ratio
of 2:1. After splitting the dataset, we balanced the training
data using the down-sampling technique. By down-sampling
the data, we create a new subset of the original training data
in which the positive class (i.e., the minority class) is
represented more frequently relative to the negative class
(i.e., the majority class). This helps address the class
imbalance issues that arise in predictive modeling tasks
where one class is significantly more prevalent than the
other. After balancing the training data (326 non-fraud cases
and 326 fraud cases; see the snippet below), we could
proceed to building the regularized GLMs.
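The balancing uses caret's downSample() (upSample() is the analogue used in Section C); from the Appendix:

library(caret)
train$Class <- as.factor(train$Class) # caret's samplers need a factor response
set.seed(90)                          # for reproducibility
down_train_data <- downSample(x = train[, -30], y = train$Class)
table(down_train_data$Class)          # 326 non-fraud and 326 fraud cases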
Model Fitting/Data Modeling
According to Altman and Marco (1994) and Flitman
(1997), an increasing number of statistical models have been
applied to data mining tasks, including regression analysis,
multiple discriminant analysis, logistic regression, Probit
method, and others (Hanagandi, Dhar, & Buescher, 1996). In
the context of the credit card dataset, regularized GLMs
(Ridge, Lasso and Elasticnet models) can be used to identify
the features that are most relevant for detecting fraudulent
transactions while also reducing the effects of
multicollinearity. These models work by adding a penalty
term to the ordinary least squares regression (OLS) objective
function, which shrinks the regression coefficients towards
zero. The Ridge regression adds the L2 norm of the
coefficients as a penalty term, Lasso regression adds the L1
norm of the coefficients as a penalty term, and Elasticnet
regression adds a combination of L1 and L2 norm of the
coefficients as a penalty term. By adding these penalty terms,
these models can reduce the coefficients of some features to
zero, effectively eliminating them from the model and thus
addressing the issue of multicollinearity.
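For reference, glmnet's standard penalized binomial objective (a textbook formulation, stated here for clarity rather than reproduced from the paper) is

$$\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\left[\,y_i\,(\beta_0 + x_i^{\top}\beta) - \log\!\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right)\right] \;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right],$$

where $\alpha = 0$ corresponds to Ridge, $\alpha = 1$ to Lasso, and $0 < \alpha < 1$ to Elasticnet.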
We performed cross-validation on the model using the
cv.glmnet() function from the glmnet package. We then used
the coef(CV, CV$lambda.min) R snippet to retrieve the
coefficient estimates for the Ridge model fit, with the optimal
lambda value selected through cross-validation, allowing us
to see which variables are most strongly associated with the
response variable in the final model. Lastly, we then used the
R code snippet coef(CV, CV$lambda.1se) to retrieve the
coefficient estimates for the model fit with the lambda value
selected through cross-validation that is one standard error
away from the optimal lambda value, allowing us to see
which variables are most strongly associated with the
response variable in a more parsimonious model.
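Concretely, for the Ridge fit the Appendix uses (alpha = 0.5 and alpha = 1 give the Elasticnet and Lasso fits):

CV <- cv.glmnet(x = X, y = down_train_data$Class, family = "binomial",
alpha = 0, nlambda = 200)          # cross-validated lambda path
coef(CV, CV$lambda.min)            # coefficients at the CV-optimal lambda
coef(CV, CV$lambda.1se)            # sparser fit one standard error away
fit.best <- glmnet(x = X, y = down_train_data$Class, family = "binomial",
alpha = 0, lambda = CV$lambda.1se)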
Ridge Regression Model Down-Sampling Technique
The following Ridge model predictions were obtained.
Top 6:
           s0
2   0.1820028
4   0.1732274
5   0.1394861
8   0.1483054
11  0.1540031
13  0.1547350

Bottom 6:
               s0
284789  0.1238469
284793  0.1290248
284797  0.2041513
284802  0.1805562
284803  0.1084127
284804  0.1419076
Ridge Regression Model Evaluation for Down-Sampling
Technique
Misclassification Rate for Ridge Model for Down-
Sampling Technique
The mean misclassification rate is 0.005966118. A
misclassification rate of 0.005966118 indicates that our
model makes 0.5966118% incorrect predictions, i.e., it has
99.4033882% correct predictions, hence the model is
predicting accurately.
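The rate is obtained by thresholding the predicted probabilities at 0.5 and comparing with the observed test labels (Appendix):

pred <- predict(fit.best, newx = X.test, type = "response") # predicted probabilities
pred1 <- ifelse(pred > 0.5, 1, 0)                           # classify at the 0.5 threshold
(miss.rate <- mean(yobs != pred1))                          # misclassification rate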
Confidence Interval for the Area Under the
Curve (AUC) for Ridge for Down-Sampling Technique
The model attained an AUC of 0.979 (97.9%), with a 95%
confidence interval between 0.966 and 0.992.
ROC Curve of the best fit Model for Ridge for Down-
Sampling Technique
Fig. 5: ROC Curve of the best fit Model for Ridge (Down-Sampling Technique).
The higher the AUC-ROC, the better the model performs
at distinguishing between the positive and negative classes;
when the AUC lies between 0.5 and 1, that is,
0.5 < AUC < 1, there is a high chance that the model can
distinguish the positive class values from the negative class
values. Since our AUC value is 0.979, with a 95%
confidence interval between 0.966 and 0.992, our model can
separate the true positive rate (TPR) from the false positive
rate (FPR).
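The AUC and its 95% confidence interval come from the cvAUC package, and the ROC curve from the verification package (Appendix):

library(cvAUC)
AUC <- ci.cvAUC(predictions = pred, labels = yobs,
folds = 1:NROW(test), confidence = 0.95)
round(AUC$ci, digits = 3) # 95% confidence interval for the AUC
library(verification)
mod.glm <- verify(obs = yobs, pred = pred)
roc.plot(mod.glm, plot.thres = NULL)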
Recall and Precision Score for Ridge Model for Down-
Sampling Technique
We obtained a precision of 0.994311, which means
99.4311% of our predictions are relevant. We obtained a
recall of 0.9997108, which shows that our model has an
accuracy of 99.97108% in correctly classifying the total
relevant results.
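The scores are computed with caret (Appendix); note that posPredValue() and sensitivity() treat the first factor level as the positive class by default:

library(caret)
precision_score <- posPredValue(as.factor(pred1), as.factor(yobs))
recall_score <- sensitivity(as.factor(pred1), as.factor(yobs))
cat("Precision Score:", precision_score, "\n")
cat("Recall Score:", recall_score, "\n")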
Recall And Precision Curves for Ridge Model for
Down-Sampling Technique
Fig. 6: Recall And Precision Curves for Ridge Model (Down-Sampling Technique)
These curves have the shape we would expect. At
thresholds with low recall, the precision is correspondingly
high, and at very high recall, the precision begins to drop.
Looking at the Precision-Recall curve, we notice the curve
gets precision up to about 81%.
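The precision-recall and ROC curves are generated with the precrec package (Appendix):

library(precrec)
precrec_ridge <- evalmod(scores = pred, labels = yobs) # computes ROC and PR curves
autoplot(precrec_ridge)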
Elasticnet Regression Model for Down-Sampling
Technique
The following Elasticnet Model predictions were
obtained.
Top 6:
      s0
2   0.14
4   0.09
5   0.12
8   0.12
11  0.09
13  0.10

Bottom 6:
          s0
284789  0.11
284793  0.07
284797  0.17
284802  0.13
284803  0.01
284804  0.09
Elasticnet Model Evaluation for Down-Sampling Technique
Misclassification Rate for Elasticnet Model for Down-
Sampling Technique
The mean misclassification rate is 0.01261286. A
misclassification rate of 0.01261286 indicates that our model
makes 1.261286% incorrect predictions, i.e., it has
98.738714% correct predictions, hence the model is
predicting accurately.
Confidence Interval for the Area Under the
Curve (AUC) for Elasticnet for Down-Sampling
Technique
The model attained an AUC of 0.931 (93.1%), with a 95%
confidence interval between 0.880 and 0.982.
ROC Curve of the best fit Model for Elasticnet for
Down-Sampling Technique
Fig. 7: ROC Curve of the best fit Model for Elasticnet (Down-Sampling Technique)
The higher the AUC-ROC, the better the model performs
at distinguishing between the positive and negative classes;
when the AUC lies between 0.5 and 1, that is,
0.5 < AUC < 1, there is a high chance that the model can
distinguish the positive class values from the negative class
values. Since our AUC value is 0.931, with a 95%
confidence interval between 0.880 and 0.982, our model can
separate the true positive rate (TPR) from the false positive
rate (FPR).
Recall and Precision Score for Elasticnet Model for
Down-Sampling Technique
We obtained a precision of 0.9875886, which means
98.75886% of our predictions are relevant. We obtained a
recall of 0.9997735, which shows that our model has an
accuracy of 99.97735% in correctly classifying the total
relevant results.
Recall and Precision Curves for Elasticnet Model for
Down-Sampling Technique
Fig. 8: Recall and Precision Curves for Elasticnet Model (Down-Sampling Technique)
These curves have the shape we would expect. At
thresholds with low recall, the precision stays
correspondingly high at a roughly constant level, and at very
high recall, the precision begins to drop. Looking at the
Precision-Recall curve, we notice the curve gets precision up
to about 78%.
Lasso Regression for Down-Sampling Technique
The following Lasso Model predictions were
obtained.
Top 6:
      s0
2   0.14
4   0.11
5   0.21
8   0.19
11  0.11
13  0.12

Bottom 6:
          s0
284789  0.15
284793  0.09
284797  0.20
284802  0.14
284803  0.01
284804  0.12
Lasso Model Evaluation for Down-Sampling Technique
Misclassification Rate for Lasso Model for Down-
Sampling Technique
The mean misclassification rate is 0.009273537. A
misclassification rate of 0.009273537 indicates that our
model makes 0.9273537% incorrect predictions, i.e., it has
99.0726463% correct predictions, hence the model is
predicting accurately.
Confidence Interval for the Area Under the
Curve (AUC) for Lasso for Down-Sampling Technique
The model attained an AUC of 0.931 (93.1%), with a 95%
confidence interval between 0.880 and 0.982.
ROC Curve of the best fit Model for Lasso for Down-
Sampling Technique
Fig. 9: ROC Curve of the best fit Model for Lasso (Down-Sampling Technique)
The higher the AUC-ROC, the better the model performs
at distinguishing between the positive and negative classes;
when the AUC lies between 0.5 and 1, that is,
0.5 < AUC < 1, there is a high chance that the model can
distinguish the positive class values from the negative class
values. Since our AUC value is 0.931, with a 95%
confidence interval between 0.880 and 0.982, our model can
separate the true positive rate (TPR) from the false positive
rate (FPR).
Recall and Precision Score for Lasso Model for Down-
Sampling Technique
We obtained a precision of 0.9909338, which means
99.09338% of our predictions are relevant. We obtained a
recall of 0.9997743, which shows that our model has an
accuracy of 99.97743% in correctly classifying the total
relevant results.
Recall and Precision Curves for Lasso Model for Down-
Sampling Technique
Fig. 10: Recall and Precision Curves for Lasso Model (Down-Sampling Technique)
These curves have the shape we would expect. At
thresholds with low recall, the precision stays
correspondingly high at a roughly constant level, and at very
high recall, the precision begins to drop. Looking at the
Precision-Recall curve, we notice the curve gets precision up
to about 88%.
C. Modeling with Up-Sampling Technique
Ridge Regression
The following Ridge Regression Model predictions were
obtained.
Top 6:
            s0
2   0.15150102
4   0.12316558
5   0.11707270
8   0.08736508
11  0.12797648
13  0.11691316
Ridge Regression Model Evaluation for Up-Sampling
Technique
Misclassification Rate for Ridge Model for Up-Sampling
Technique
The mean misclassification rate is 0.006171316. A
misclassification rate of 0.006171316 indicates that our
model makes 0.6171316% incorrect predictions, i.e., it has
99.3828684% correct predictions, hence the model is
predicting accurately.
Confidence Interval for the Area Under the
Curve (AUC) for Ridge for Up-Sampling Technique
The model attained an AUC of 0.982 (98.2%), with a 95%
confidence interval between 0.970 and 0.994.
ROC Curve of the best fit Model for Ridge for Up-
Sampling Technique
Fig. 11: ROC Curve of the best fit Model for Ridge Model (Up-Sampling Technique)
The higher the AUC-ROC, the better the model performs
at distinguishing between the positive and negative classes;
when the AUC lies between 0.5 and 1, that is,
0.5 < AUC < 1, there is a high chance that the model can
distinguish the positive class values from the negative class
values. Since our AUC value is 0.982, with a 95%
confidence interval between 0.970 and 0.994, our model can
separate the true positive rate (TPR) from the false positive
rate (FPR).
Recall and Precision Score for Ridge Model for Up-
Sampling Technique
We obtained a precision of 0.999732, which means
99.9732% of our predictions are relevant. We obtained a
recall of 0.993512, which shows that our model has an
accuracy of 99.3512% in correctly classifying the total
relevant results.
Fig. 12: Recall And Precision Curves for Ridge Model (Up-Sampling Technique)
These curves have the shape we would expect. At
thresholds with low recall, the precision stays
correspondingly high at a roughly constant level, and at very
high recall, the precision begins to drop. Looking at the
Precision-Recall curve, we notice the curve gets precision up
to about 83%.
Elasticnet Regression Model for Up-Sampling
Technique
The following Elasticnet Regression Model predictions
were obtained.
Top 6:
      s0
2   0.03
4   0.01
5   0.05
8   0.06
11  0.02
13  0.01

Bottom 6:
          s0
284789  0.02
284793  0.02
284797  0.09
284802  0.05
284803  0.00
284804  0.03
Misclassification Rate for Elasticnet Model for Up-
Sampling Technique
The mean misclassification rate is 0.02283435. A
misclassification rate of 0.02283435 indicates that our model
makes 2.283435% incorrect predictions, i.e., it has
97.716565% correct predictions, hence the model is
predicting accurately.
Confidence Interval for the Area Under the
Curve (AUC) for Elasticnet Model for Up-Sampling
Technique
The model attained an AUC of 0.940621 (94.0621%), with a
95% confidence interval between 0.8958664 and 0.9853755.
ROC Curve of the best fit Model for Elasticnet for Up-
Sampling Technique
Fig. 13: ROC Curve of the best fit Model for Elasticnet (Up-Sampling Technique)
The higher the AUC-ROC, the better the model performs
at distinguishing between the positive and negative classes;
when the AUC lies between 0.5 and 1, that is,
0.5 < AUC < 1, there is a high chance that the model can
distinguish the positive class values from the negative class
values. Since our AUC value is 0.941, with a 95%
confidence interval between 0.896 and 0.985, our model can
separate the true positive rate (TPR) from the false positive
rate (FPR).
Recall and Precision Score for Elasticnet Model for Up-
Sampling Technique
We obtained a precision of 0.9998257, which means
99.98257% of our predictions are relevant. We obtained a
recall of 0.9776274, which shows that our model has an
accuracy of 97.76274% in correctly classifying the total
relevant results.
Recall and Precision Curves for Elasticnet Model for Up-
Sampling Technique
Fig. 14: Recall and Precision Curves for Elasticnet Model (Up-Sampling Technique)
Just like the other curves, these curves have the shape
we would expect. At thresholds with low recall, the precision
stays correspondingly high at a roughly constant level, and at
very high recall, the precision begins to drop. Looking at the
Precision-Recall curve, we notice the curve gets precision up
to about 83%.
Lasso Regression for Up-Sampling Technique
The following Lasso Model predictions were obtained.
Top 6:
      s0
2   0.02
4   0.01
5   0.05
8   0.06
11  0.01
13  0.01

Bottom 6:
          s0
284789  0.02
284793  0.02
284797  0.09
284802  0.05
284803  0.00
284804  0.03
Lasso Model Evaluation for Up-Sampling Technique
Misclassification Rate for Lasso Model for Up-
Sampling Technique
The mean misclassification rate is 0.009273537. A
misclassification rate of 0.009273537 indicates that our
model makes 0.9273537% incorrect predictions, i.e., it has
99.0726463% correct predictions, hence the model is
predicting accurately.
Confidence Interval for the Area Under the
Curve (AUC) for Lasso for Up-Sampling Technique
The model attained an AUC of 0.931 (93.1%), with a 95%
confidence interval between 0.880 and 0.982.
ROC Curve of the best fit Model for Lasso for Up-
Sampling Technique
Fig. 15: ROC Curve of the best fit Model for Lasso (Up-Sampling Technique)
The higher the AUC-ROC, the better the model performs
at distinguishing between the positive and negative classes;
when the AUC lies between 0.5 and 1, that is,
0.5 < AUC < 1, there is a high chance that the model can
distinguish the positive class values from the negative class
values. Since our AUC value is 0.931, with a 95%
confidence interval between 0.880 and 0.982, our model can
separate the true positive rate (TPR) from the false positive
rate (FPR).
Recall and Precision Score for Lasso Model for Up-
Sampling Technique
We obtained a precision of 0.9909338, which means
99.09338% of our predictions are relevant. We obtained a
recall of 0.9997743, which shows that our model has an
accuracy of 99.97743% in correctly classifying the total
relevant results.
Recall and Precision Curves for Lasso Model for Up-
Sampling Technique
Fig. 16: Recall and Precision Curves for Lasso Model (Up-Sampling Technique)
These curves have the shape we would expect. At
thresholds with low recall, the precision stays
correspondingly high at a roughly constant level, and at very
high recall, the precision begins to drop. Looking at the
Precision-Recall curve, we notice the curve gets precision up
to about 88%.
V. CONCLUSION
From the outputs obtained from modeling with the
Down-Sampling Technique for Ridge Regression, Elasticnet
Regression and Lasso Regression, we see that all three
models are accurate in credit card fraud detection. Among
the three models, Ridge Regression is the best, with an
accuracy of 98% and a confidence interval between 97% and
99%, at 95% confidence. Lasso Regression is the next best
model, with an accuracy of 93.2% and a confidence interval
between 88% and 98%, at 95% confidence, just slightly
above Elasticnet, which comes third with an accuracy of
93.1% and a confidence interval between 88% and 98.2%, at
95% confidence. Nonetheless, Lasso and Elasticnet have
essentially equal accuracy, while Ridge is the best.
When we look at the outputs obtained from modeling
with the Up-Sampling Technique for Ridge Regression,
Elasticnet Regression and Lasso Regression, we also see that
all three models are accurate in credit card fraud detection,
though with very slight differences. Ridge Regression is still
the best, with an accuracy of 98.2% and a confidence
interval between 0.970 and 0.994, at 95% confidence.
Elasticnet Regression is the next best, with an accuracy of
94.0621% and a confidence interval between 0.8958664 and
0.9853755, at 95% confidence, unlike in the Down-Sampling
Technique, where Lasso was the next best. Lasso Regression
is the third best, with an accuracy of 93.1% and a confidence
interval between 0.880 and 0.982, at 95% confidence, unlike
Elasticnet, which was third in the Down-Sampling
Technique.
RECOMMENDATION
We recommend that the final models be deployed to
make predictions on new data and that their performance be
monitored regularly.
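A minimal sketch of scoring incoming transactions with the fitted Ridge model, assuming a data frame new_data with the same predictor columns as the training set (new_data and flag are illustrative names), is:

# hypothetical scoring of new transactions with the deployed model
X.new <- model.matrix(~ ., data = new_data)[, -1] # same encoding as training
p.new <- predict(fit.best, newx = X.new, type = "response")
flag <- ifelse(p.new > 0.5, 1, 0) # review transactions flagged above the threshold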
REFERENCES
[1.] Acar, A., Aksu, H., & Dogac, A. (2017). A survey of
credit card fraud detection techniques: Data and
technique-oriented perspective. Journal of Computer
and System Sciences, 83(1), 121-136.
[2.] Ahmed, M., Mahmood, A. N., & Hu, J. (2016). A
survey of network anomaly detection techniques.
Journal of Network and Computer Applications, 60,
19-31.
[3.] Altman, E. I., Marco, G., & Varetto, F. (1994).
Corporate distress diagnosis comparisons using linear
discriminant analysis and neural networks. Journal of
Banking and Finance, 18(3), 505-529.
[4.] Bentley, P., Kim, J., Jung. G., & Choi, J. (2000). Fuzzy
Darwinian Detection of Credit Card Fraud. Proc. of
14th Annual Fall Symposium of the Korean
Information Processing Society.
[5.] Credit Cards: Use and Consumer Attitudes, 1970-
2000, 86 Fed. Res. Bull. 623 (2000).
[6.] Dal Pozzolo, A., Caelen, O., Johnson, R. A., &
Bontempi, G. (2015). Calibrating Probability with
Undersampling for Unbalanced Classification. In
Symposium on Computational Intelligence and Data
Mining (CIDM) (pp. 1-7). IEEE.
[7.] Dinakar, K., & Nair, R. R. (2020). An efficient credit
card fraud detection model using machine learning and
natural language processing. Procedia Computer
Science, 171, 695-703.
[8.] Dorronsoro, J. R., Ginel, F., Sanchez, J. A., & Cruz, J.
M. (1997). Building an online system for fraud
detection of credit card operations based on a neural
classifier. Proceedings of the International Conference
on Artificial Neural Networks (ICANN'97), 683-688.
[9.] Federal Trade Commission. (2020). Consumer
Sentinel Network Data Book 2020. Retrieved from
https://www.ftc.gov/system/files/documents/reports/c
onsumer-sentinel-network-data-book-
2020/csn_data_book_2020.pdf
[10.] Federal Trade Commission. (2021). Consumer
Sentinel Network Data Book 2020. Retrieved from
https://www.ftc.gov/reports/consumer-sentinel-
network-data-book-2020
[11.] Flitman, A. M. (1997). Towards analysing student
failures: neural networks compared with regression
analysis and multiple discriminant analysis.
Computers & Operations Research, 24(4), 367-377.
[12.] Gandomi, A., & Haider, M. (2015). Beyond the hype:
Big data concepts, methods, and analytics.
International Journal of Information Management,
35(2), 137-144.
[13.] Hanagandi, V., Dhar, S., & Buescher, K. (1996).
Application of classification models on credit card
fraud detection. Proceedings of the International
Conference on Neural Networks, 1996. doi:
10.1109/ICNN.1996.548981
[14.] Nilson Report. (December 2021). Card Fraud
Worldwide. Issue 1209.
[15.] The Nilson Report. (2020). Card Fraud Losses Reach
$9.47 Billion. Retrieved from
https://www.nilsonreport.com/mention/194/card-
fraud-losses-reach-9.47-billion/
APPENDIX
R-CODES USED FOR THE PROJECT
setwd("C:/Users/nebcy/Documents/Apsu/Apsu/Data Set STAT5140")
credit_card1 = read.csv("creditcard.CSV")
credit_card<-credit_card1
dim(credit_card)
head(credit_card)
# Basic Data Exploration
install.packages(c('ROCR','ggplot2','corrplot','caTools','class',
'randomForest','pROC','imbalance'))
library(ROCR)
library(ggplot2)
library(corrplot)
library(caTools)
library(class)
library(randomForest)
library(pROC)
library(imbalance)
library(rpart)
#Basic Exploratory Data Analysis
#Viewing some initial observations of the dataset
#Summary of the dataset
summary(credit_card)
#Viewing Structure of the dataset
str(credit_card)
sapply(credit_card, FUN=class)
#Checking for missing values
credit_card[!complete.cases(credit_card),]
#Missing values by visualization
install.packages("naniar")
library(naniar)
vis_miss(credit_card, 'warn_large_data' = FALSE)
gg_miss_var(credit_card)
#Viewing the Frequency distribution of "Class" Variable
table(credit_card$Class)
prop.table(table(credit_card$Class))
ggplot(data = credit_card,aes(x=Class))+geom_bar(col="red")
#Plotting the Amount - Fraud chart
ggplot(data = credit_card,aes(y=Amount,x=Class))+geom_point(col="slateblue3") + facet_grid(~Class)
#Checking correlation between the variables
install.packages("corpcor")
library(corpcor)
head(cor2pcor(cov(credit_card)))
library(corrplot)
credit_card$Class <- as.numeric(credit_card$Class)
corr <- cor(credit_card[],method="pearson")
corrplot(corr,method = "circle", type = "lower")
#Taking out Time Column.
#Since the Time variable is not important for this project, we take it out
credit_card<-credit_card1[,-1]
head(credit_card)
######################### Data partition ##################
set.seed(123) # set the seed so the random split is reproducible across runs
n <- nrow(credit_card)
split_data <- sample(x=1:2, size = n, replace=TRUE, prob=c(0.67, 0.33)) #x=1 =training data, x=2 = testing data
train <- credit_card[split_data == 1, ]
test <- credit_card[split_data == 2, ]
y.train <- train$Class
yobs <- test$Class
##################### Balancing data through downsampling
# Load the necessary libraries
install.packages("caret")
library(caret)
# Check the class distribution before downsampling
table(train$Class)
# Check the class of the 'Class' variable
class(train$Class)
# Convert 'Class' to a factor variable if it's not already
train$Class <- as.factor(train$Class)
# Verify that 'Class' is now a factor
class(train$Class)
# Downsample the majority class
set.seed(90) # for reproducibility
down_train_data <- downSample(x = train[, -30], y = train$Class)
# Check the class distribution after downsampling
table(down_train_data$Class)
#Building A logistic model
install.packages("glmnet")
library(glmnet) #Elastic net
formula0 <- factor(Class)~. # treat Class as a factor so the response is binary (1 = fraud, 0 = not fraud)
X <- model.matrix (as.formula(formula0), data = down_train_data)[, -1] # trim off the intercept column, leaving only the predictors
#RIDGE REGRESSION
#Using cross validation to determine the optimal tuning parameter. To chose the best lambda
CV <- cv.glmnet(x=X, y=down_train_data$Class, family="binomial", alpha = 0, # CROSS VALIDATION METHOD
nlambda = 200) # let cv.glmnet generate the lambda sequence and choose the best lambda for us
plot(CV)
coef(CV, CV$lambda.min)
coef(CV, CV$lambda.1se)
b.lambda <- CV$lambda.1se #we can use 1se or min THIS GIVES US THE BEST LAMBDA
b.lambda
fit.best <- glmnet(x=X, y=down_train_data$Class, family="binomial", alpha = 0, # ridge (alpha = 0): coefficients are shrunk but none are set to zero
lambda=b.lambda)
(fit.best$beta)
# Prediction
X.test <- model.matrix (as.formula(formula0), data = test)[, -1]
pred <- predict(fit.best, newx = X.test, type="response") # type = "response" returns predicted probabilities
head(pred)
tail(pred)
dim(pred)
##### Finding misclassification rate # using a 0.5 threshold
pred1 <- ifelse(pred>0.5, 1, 0) # if the probability is bigger than 50%, predict fraud
pred1
(miss.rate <- mean(yobs != pred1))
pred<-pred[, 1] # drop the matrix structure, keeping a plain vector
pred1<-pred1[,1]
#Plotting ROC curve of the fit.best model.
install.packages("cvAUC")
library(cvAUC)
AUC <- ci.cvAUC(predictions = pred, labels = yobs, folds=1:NROW(test), confidence = 0.95)
AUC
(auc.ci <- round(AUC$ci, digits = 3))
install.packages("verification")
library(verification)
mod.glm <- verify(obs = yobs, pred = pred)
roc.plot(mod.glm, plot.thres=NULL)
text(x=0.7, y=0.2, paste("Area under ROC = ", round(AUC$cvAUC, digits = 3), "with 95% CI (",
auc.ci[1], ",", auc.ci[2], ").", sep = " "), col="blue", cex =1.2)
#confusion matrix (not needed here because the data is imbalanced)
#library(caret)
#library(e1071)
#confusionMatrix( as.factor(pred1),as.factor(yobs),positive = "1")
#RECALL AND PRECISION SCORE, WE CAN USE THIS
#precision(as.factor(yobs),as.factor(pred1))
#recall(as.factor(yobs), as.factor(pred1))
# Assuming yobs and pred1 are binary (0/1) vectors or factors
library(caret)
# Calculate precision
precision_score <- posPredValue(as.factor(pred1), as.factor(yobs))
# Calculate recall
recall_score <- sensitivity(as.factor(pred1), as.factor(yobs))
# Print the precision and recall scores
cat("Precision Score:", precision_score, "\n")
cat("Recall Score:", recall_score, "\n")
#RECALL AND PRECISION CURVES
install.packages("precrec")
library(precrec)
precrec_ridge <- evalmod(scores = pred, labels = yobs)
precrec_ridge
autoplot(precrec_ridge)
#################################
#ELASTICNET REGRESSION
#Using cross validation to determine the optimal tuning parameter. To chose the best lambda
CV2 <- cv.glmnet(x=X, y=down_train_data$Class, family="binomial", alpha = 0.5, # cross-validation to choose lambda
nlambda = 200) # let cv.glmnet generate the lambda sequence and choose the best lambda for us
plot(CV2)
coef(CV2, CV2$lambda.min)
coef(CV2, CV2$lambda.1se)
b.lambda2 <- CV2$lambda.1se #we can use 1se or min THIS GIVES US THE BEST LAMBDA
b.lambda2
fit.best2 <- glmnet(x=X, y=down_train_data$Class, family="binomial", alpha = 0.5, # elastic net (alpha = 0.5): some coefficients may be dropped
lambda=b.lambda2)
(fit.best2$beta)
# Prediction
X.test <- model.matrix (as.formula(formula0), data = test)[, -1]
pred_elasticnet <- predict(fit.best2, newx = X.test, type="response") # type = "response" returns predicted probabilities
head(round(pred_elasticnet, digits = 2)) # display rounded values without overwriting the predictions
tail(round(pred_elasticnet, digits = 2))
dim(pred_elasticnet)
##### Finding misclassification rate # using a 0.5 threshold
pred1_elasticnet <- ifelse(pred_elasticnet>0.5, 1, 0) # if the probability is bigger than 50%, predict fraud
pred1_elasticnet
(miss.rate <- mean(yobs != pred1_elasticnet))
pred_elasticnet<-pred_elasticnet[, 1] # drop the matrix structure, keeping a plain vector
pred1_elasticnet<-pred1_elasticnet[,1]
#Plotting ROC curve of the fit.best model.
library(cvAUC)
AUC2 <- ci.cvAUC(predictions = pred_elasticnet, labels = yobs, folds=1:NROW(test), confidence = 0.95) # use the continuous scores, not the thresholded labels
AUC2
(auc.ci_elasticnet <- round(AUC2$ci, digits = 3))
library(verification)
mod.glm_elasticnet <- verify(obs = yobs, pred = pred_elasticnet)
roc.plot(mod.glm_elasticnet, plot.thres=NULL)
text(x=0.7, y=0.2, paste("Area under ROC = ", round(AUC2$cvAUC, digits = 3), "with 95% CI (",
auc.ci_elasticnet[1], ",", auc.ci_elasticnet[2], ").", sep = " "), col="blue", cex =1.2)
#confusion matrix (not needed here because the data is imbalanced)
#library(caret)
#library(e1071)
#confusionMatrix( as.factor(pred1),as.factor(yobs),positive = "1")
# FOR RECALL AND PRECISION SCORE, WE CAN USE THIS. Since our data is imbalanced, we use
# recall and precision to check that the model can predict the minority class (it will easily
# predict the majority class, which has plenty of data to learn from); with an output of
# 0.9917754 we can conclude that the model is predicting well.
#precision(as.factor(yobs),as.factor(pred1_elasticnet))
#recall(as.factor(yobs), as.factor(pred1_elasticnet))
################
#RECALL AND PRECISION SCORE, WE CAN USE THIS
#precision(as.factor(yobs),as.factor(pred1_elasticnet))
#recall(as.factor(yobs), as.factor(pred1_elasticnet))
# Assuming yobs and pred1_elasticnet are binary (0/1) vectors or factors
library(caret)
# Calculate precision
precision_score <- posPredValue(as.factor(pred1_elasticnet), as.factor(yobs))
# Calculate recall
recall_score <- sensitivity(as.factor(pred1_elasticnet), as.factor(yobs))
# Print the precision and recall scores
cat("Precision Score:", precision_score, "\n")
cat("Recall Score:", recall_score, "\n")
#RECALL AND PRECISION CURVES
#install.packages("precrec")
library(precrec)
precrec_elasticnet <- evalmod(scores = pred_elasticnet, labels = yobs)
precrec_elasticnet
autoplot(precrec_elasticnet)
############################################
#LASSO REGRESSION
#Using cross validation to determine the optimal tuning parameter. To chose the best lambda
CV1 <- cv.glmnet(x=X, y=down_train_data$Class, family="binomial", alpha = 1, # cross-validation to choose lambda
nlambda = 200) # let cv.glmnet generate the lambda sequence and choose the best lambda for us
plot(CV1)
coef(CV1, CV1$lambda.min)
coef(CV1, CV1$lambda.1se)
b.lambda1 <- CV1$lambda.1se #we can use 1se or min THIS GIVES US THE BEST LAMBDA
b.lambda1
fit.best1 <- glmnet(x=X, y=down_train_data$Class, family="binomial", alpha = 1, # lasso (alpha = 1): some coefficients are set exactly to zero
lambda=b.lambda1)
(fit.best1$beta)
# Prediction
X.test <- model.matrix (as.formula(formula0), data = test)[, -1]
pred_lasso <- predict(fit.best1, newx = X.test, type="response") # type = "response" returns predicted probabilities
head(round(pred_lasso, digits = 2)) # display rounded values without overwriting the predictions
tail(round(pred_lasso, digits = 2))
dim(pred_lasso)
##### Finding misclassification rate # using a 0.5 threshold
pred1_lasso <- ifelse(pred_lasso>0.5, 1, 0) # if the probability is bigger than 50%, predict fraud
(miss.rate <- mean(yobs != pred1_lasso))
pred_lasso<-pred_lasso[,1] # drop the matrix structure, keeping a plain vector
pred1_lasso<-pred1_lasso[,1]
#Plotting ROC curve of the fit.best model.
library(cvAUC)
AUC1 <- ci.cvAUC(predictions = pred_lasso, labels = yobs, folds=1:NROW(test), confidence = 0.95) # use the continuous scores, not the thresholded labels
AUC1
(auc.ci_lasso <- round(AUC1$ci, digits = 3))
library(verification)
mod.glm_lasso <- verify(obs = yobs, pred = pred_lasso)
roc.plot(mod.glm_lasso, plot.thres=NULL)
text(x=0.7, y=0.2, paste("Area under ROC = ", round(AUC1$cvAUC, digits = 3), "with 95% CI (",
auc.ci_lasso[1], ",", auc.ci_lasso[2], ").", sep = " "), col="purple", cex =1.2)
#confusion matrix (not needed here because the data is imbalanced)
#library(caret)
#library(e1071)
#confusionMatrix( as.factor(pred1),as.factor(yobs),positive = "1")
#FOR RECALL AND PRECISION SCORE, WE CAN USE THIS
#precision(as.factor(yobs),as.factor(pred1_lasso))
#recall(as.factor(yobs), as.factor(pred1_lasso))
#RECALL AND PRECISION CURVES
#install.packages("precrec")
library(precrec)
precrec_lasso <- evalmod(scores = pred_lasso, labels = yobs)
precrec_lasso
autoplot(precrec_lasso)
##############
# Calculate precision
precision_score <- posPredValue(as.factor(pred1_lasso), as.factor(yobs))
# Calculate recall
recall_score <- sensitivity(as.factor(pred1_lasso), as.factor(yobs))
# Print the precision and recall scores
cat("Precision Score:", precision_score, "\n")
cat("Recall Score:", recall_score, "\n")
# Clear the console
cat("\014")
# Restart R
q(save = "no")
setwd("C:/Users/nebcy/Documents/Apsu/Apsu/Data Set STAT5140")
credit_card1 = read.csv("creditcard.CSV")
credit_card<-credit_card1
dim(credit_card)
head(credit_card)
# Basic Data Exploration
install.packages(c('ROCR','ggplot2','corrplot','caTools','class',
'randomForest','pROC','imbalance'))
library(ROCR)
library(ggplot2)
library(corrplot)
library(caTools)
library(class)
library(randomForest)
library(pROC)
library(imbalance)
library(rpart)
#Basic Exploratory Data Analysis
#Viewing some initial observations of the dataset
#Summary of the dataset
summary(credit_card)
#Viewing Structure of the dataset
str(credit_card)
sapply(credit_card, FUN=class)
#Checking for missing values
credit_card[!complete.cases(credit_card),]
#Missing values by visualization
install.packages("naniar")
library(naniar)
vis_miss(credit_card, 'warn_large_data' = FALSE)
gg_miss_var(credit_card)
#Viewing the Frequency distribution of "Class" Variable
table(credit_card$Class)
prop.table(table(credit_card$Class))
ggplot(data = credit_card,aes(x=Class))+geom_bar(col="red")
#Plotting the Amount - Fraud chart
ggplot(data = credit_card,aes(y=Amount,x=Class))+geom_point(col="slateblue3") + facet_grid(~Class)
#Checking correlation between the variables
install.packages("corpcor")
library(corpcor)
head(cor2pcor(cov(credit_card)))
library(corrplot)
credit_card$Class <- as.numeric(credit_card$Class)
corr <- cor(credit_card[],method="pearson")
corrplot(corr,method = "circle", type = "lower")
#Taking out Time Column.
#Since the Time variable is not important for this project, we take it out
credit_card<-credit_card1[,-1]
head(credit_card)
######################### Data partition ##################
set.seed(123) # set the seed so the random split is reproducible across runs
n <- nrow(credit_card)
split_data <- sample(x=1:2, size = n, replace=TRUE, prob=c(0.67, 0.33)) #x=1 =training data, x=2 = testing data
train <- credit_card[split_data == 1, ]
test <- credit_card[split_data == 2, ]
y.train <- train$Class
yobs <- test$Class
##################### Balancing data through upsampling
# Load the necessary libraries
install.packages("caret")
library(caret)
# Check the class distribution before upsampling
table(train$Class)
# Check the class of the 'Class' variable
class(train$Class)
# Convert 'Class' to a factor variable if it's not already
train$Class <- as.factor(train$Class)
# Verify that 'Class' is now a factor
class(train$Class)
# Upsample the minority class
set.seed(90) # for reproducibility
up_train_data <- upSample(x = train[, -30], y = train$Class)
# Check the class distribution after upsampling
table(up_train_data$Class)
#Building A logistic model
install.packages("glmnet")
library(glmnet) #Elastic net
formula0 <- factor(Class)~. # treat Class as a factor so the response is binary (1 = fraud, 0 = not fraud)
X <- model.matrix (as.formula(formula0), data = up_train_data)[, -1] # trim off the intercept column, leaving only the predictors
#RIDGE REGRESSION
#Using cross-validation to determine the optimal tuning parameter (the best lambda)
CV <- cv.glmnet(x=X, y=up_train_data$Class, family="binomial", alpha = 0, # alpha = 0 gives the ridge penalty
nlambda = 200) # let cv.glmnet choose the candidate lambda grid for us
plot(CV)
coef(CV, CV$lambda.min)
coef(CV, CV$lambda.1se)
b.lambda <- CV$lambda.1se # lambda.1se (largest lambda within one SE of the CV minimum); lambda.min is an alternative
b.lambda
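# Quick comparison (illustrative addition): lambda.1se is the largest lambda whose CV
# deviance lies within one standard error of the minimum, so it regularizes more heavily than lambda.min
c(lambda.min = CV$lambda.min, lambda.1se = CV$lambda.1se)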
fit.best <- glmnet(x=X, y=up_train_data$Class, family="binomial", alpha = 0, # ridge (alpha = 0) shrinks coefficients but sets none exactly to zero
lambda=b.lambda)
(fit.best$beta)
# Prediction
X.test <- model.matrix (as.formula(formula0), data = test)[, -1]
pred <- predict(fit.best, newx = X.test, type="response") # type = "response" returns predicted probabilities
head(pred)
tail(pred)
dim(pred)
##### Finding the misclassification rate by applying a classification threshold
pred1 <- ifelse(pred>0.5, 1, 0) # classify as fraud when the predicted probability exceeds 0.5
pred1
(miss.rate <- mean(yobs != pred1))
pred<-pred[, 1] # drop the matrix dimension, keeping a plain vector of probabilities
pred1<-pred1[,1]
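# Illustrative sketch (an assumption about method, not the original script): an accuracy
# estimate with a 95% confidence interval, of the kind reported in the paper, can be
# obtained from the correct-classification count with an exact binomial test (base R)
acc.test <- binom.test(sum(pred1 == yobs), length(yobs))
acc.test$estimate # point estimate of accuracy
acc.test$conf.int # 95% confidence interval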
#Plotting ROC curve of the fit.best model.
install.packages("cvAUC")
library(cvAUC)
AUC <- ci.cvAUC(predictions = pred, labels = yobs, folds=1:NROW(test), confidence = 0.95)
AUC
(auc.ci <- round(AUC$ci, digits = 3))
install.packages("verification")
library(verification)
mod.glm <- verify(obs = yobs, pred = pred)
roc.plot(mod.glm, plot.thres=NULL)
text(x=0.7, y=0.2, paste("Area under ROC = ", round(AUC$cvAUC, digits = 3), "with 95% CI (",
auc.ci[1], ",", auc.ci[2], ").", sep = " "), col="blue", cex =1.2)
#Confusion matrix (optional here; the test data remain imbalanced, so precision and recall are more informative)
#library(caret)
#library(e1071)
#confusionMatrix(as.factor(pred1), as.factor(yobs), positive = "1")
#Alternatively: precision(as.factor(yobs), as.factor(pred1)); recall(as.factor(yobs), as.factor(pred1))
# Assuming yobs and pred1 are binary (0/1) vectors or factors
library(caret)
# Calculate precision
precision_score <- posPredValue(as.factor(pred1), as.factor(yobs), positive = "1")
# Calculate recall
recall_score <- sensitivity(as.factor(pred1), as.factor(yobs), positive = "1")
# Print the precision and recall scores
cat("Precision Score:", precision_score, "\n")
cat("Recall Score:", recall_score, "\n")
#RECALL AND PRECISION CURVES
install.packages("precrec")
library(precrec)
precrec_ridge <- evalmod(scores = pred, labels = yobs)
precrec_ridge
autoplot(precrec_ridge)
#################################
#ELASTICNET REGRESSION
#Using cross-validation to determine the optimal tuning parameter (the best lambda)
CV2 <- cv.glmnet(x=X, y=up_train_data$Class, family="binomial", alpha = 0.5, # alpha = 0.5 mixes the ridge and lasso penalties
nlambda = 200) # let cv.glmnet choose the candidate lambda grid for us
plot(CV2)
coef(CV2, CV2$lambda.min)
coef(CV2, CV2$lambda.1se)
b.lambda2 <- CV2$lambda.1se # lambda.1se; lambda.min is an alternative
b.lambda2
fit.best2 <- glmnet(x=X, y=up_train_data$Class, family="binomial", alpha = 0.5, # elasticnet (alpha = 0.5) can shrink some coefficients exactly to zero
lambda=b.lambda2)
(fit.best2$beta)
# Prediction
X.test <- model.matrix (as.formula(formula0), data = test)[, -1]
pred_elasticnet <- predict(fit.best2, newx = X.test, type="response") # type = "response" returns predicted probabilities
(head(pred_elasticnet <- round(pred_elasticnet, digits = 2)))
(tail(pred_elasticnet <- round(pred_elasticnet, digits = 2)))
dim(pred_elasticnet)
##### Finding the misclassification rate by applying a classification threshold
pred1_elasticnet <- ifelse(pred_elasticnet>0.5, 1, 0) # classify as fraud when the predicted probability exceeds 0.5
pred1_elasticnet
(miss.rate <- mean(yobs != pred1_elasticnet))
pred_elasticnet<-pred_elasticnet[, 1] # drop the matrix dimension, keeping a plain vector of probabilities
pred1_elasticnet<-pred1_elasticnet[,1]
#Plotting ROC curve of the fit.best model.
library(cvAUC)
AUC2 <- ci.cvAUC(predictions = pred_elasticnet, labels = yobs, folds=1:NROW(test), confidence = 0.95) # use the continuous probabilities, not the thresholded 0/1 predictions
AUC2
(auc.ci_elasticnet <- round(AUC2$ci, digits = 3))
library(verification)
mod.glm_elasticnet <- verify(obs = yobs, pred = pred_elasticnet)
roc.plot(mod.glm_elasticnet, plot.thres=NULL)
text(x=0.7, y=0.2, paste("Area under ROC = ", round(AUC2$cvAUC, digits = 3), "with 95% CI (",
auc.ci_elasticnet[1], ",", auc.ci_elasticnet[2], ").", sep = " "), col="blue", cex =1.2)
#Confusion matrix (optional here; the test data remain imbalanced, so precision and recall are more informative)
#library(caret)
#library(e1071)
#confusionMatrix(as.factor(pred1_elasticnet), as.factor(yobs), positive = "1")
#Alternatively: precision(as.factor(yobs), as.factor(pred1_elasticnet)); recall(as.factor(yobs), as.factor(pred1_elasticnet))
# Assuming yobs and pred1_elasticnet are binary (0/1) vectors or factors
library(caret)
# Calculate precision
precision_score <- posPredValue(as.factor(pred1_elasticnet), as.factor(yobs), positive = "1")
# Calculate recall
recall_score <- sensitivity(as.factor(pred1_elasticnet), as.factor(yobs), positive = "1")
# Print the precision and recall scores
cat("Precision Score:", precision_score, "\n")
cat("Recall Score:", recall_score, "\n")
#RECALL AND PRECISION CURVES
#install.packages("precrec")
library(precrec)
precrec_elasticnet <- evalmod(scores = pred_elasticnet, labels = yobs)
precrec_elasticnet
autoplot(precrec_elasticnet)
############################################
#LASSO REGRESSION
#Using cross-validation to determine the optimal tuning parameter (the best lambda)
CV1 <- cv.glmnet(x=X, y=up_train_data$Class, family="binomial", alpha = 1, # alpha = 1 gives the lasso penalty
nlambda = 200) # let cv.glmnet choose the candidate lambda grid for us
plot(CV1)
coef(CV1, CV1$lambda.min)
coef(CV1, CV1$lambda.1se)
b.lambda1 <- CV1$lambda.1se # lambda.1se; lambda.min is an alternative
b.lambda1
fit.best1 <- glmnet(x=X, y=up_train_data$Class, family="binomial", alpha = 1, # lasso (alpha = 1) sets some coefficients exactly to zero
lambda=b.lambda1)
(fit.best1$beta)
# Prediction
X.test <- model.matrix (as.formula(formula0), data = test)[, -1]
pred_lasso <- predict(fit.best1, newx = X.test, type="response") # type = "response" returns predicted probabilities
(head(pred_lasso <- round(pred_lasso, digits = 2)))
(tail(pred_lasso <- round(pred_lasso, digits = 2)))
dim(pred_lasso)
##### Finding the misclassification rate by applying a classification threshold
pred1_lasso <- ifelse(pred_lasso>0.5, 1, 0) # classify as fraud when the predicted probability exceeds 0.5
(miss.rate <- mean(yobs != pred1_lasso))
pred_lasso<-pred_lasso[,1] # drop the matrix dimension, keeping a plain vector of probabilities
pred1_lasso<-pred1_lasso[,1]
#Plotting ROC curve of the fit.best model.
library(cvAUC)
AUC1 <- ci.cvAUC(predictions = pred_lasso, labels = yobs, folds=1:NROW(test), confidence = 0.95) # use the continuous probabilities, not the thresholded 0/1 predictions
AUC1
(auc.ci_lasso <- round(AUC1$ci, digits = 3))
library(verification)
mod.glm_lasso <- verify(obs = yobs, pred = pred_lasso)
roc.plot(mod.glm_lasso, plot.thres=NULL)
text(x=0.7, y=0.2, paste("Area under ROC = ", round(AUC1$cvAUC, digits = 3), "with 95% CI (",
auc.ci_lasso[1], ",", auc.ci_lasso[2], ").", sep = " "), col="purple", cex =1.2)
#Confusion matrix (optional here; the test data remain imbalanced, so precision and recall are more informative)
#library(caret)
#library(e1071)
#confusionMatrix(as.factor(pred1_lasso), as.factor(yobs), positive = "1")
#Alternatively: precision(as.factor(yobs), as.factor(pred1_lasso)); recall(as.factor(yobs), as.factor(pred1_lasso))
#RECALL AND PRECISION CURVES
#install.packages("precrec")
library(precrec)
precrec_lasso <- evalmod(scores = pred_lasso, labels = yobs)
precrec_lasso
autoplot(precrec_lasso)
##############
# Calculate precision
precision_score <- posPredValue(as.factor(pred1_lasso), as.factor(yobs), positive = "1")
# Calculate recall
recall_score <- sensitivity(as.factor(pred1_lasso), as.factor(yobs), positive = "1")
# Print the precision and recall scores
cat("Precision Score:", precision_score, "\n")
cat("Recall Score:", recall_score, "\n")
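# Illustrative wrap-up (assumes the objects computed above are still in the workspace):
# gather the three models' test AUCs and 95% confidence intervals side by side
data.frame(model = c("Ridge", "Elasticnet", "Lasso"),
           AUC = round(c(AUC$cvAUC, AUC2$cvAUC, AUC1$cvAUC), 3),
           lower = round(c(AUC$ci[1], AUC2$ci[1], AUC1$ci[1]), 3),
           upper = round(c(AUC$ci[2], AUC2$ci[2], AUC1$ci[2]), 3))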