Fraud Analysis and Prevention in e-Commerce Transactions
Evandro Caldeira
Federal Center of Technological
Education of Minas Gerais (CEFET-MG)
Computing Department
Belo Horizonte, MG, Brazil
Email: evandrocaldeira@gmail.com
Gabriel Brandão
Federal Center of Technological
Education of Minas Gerais (CEFET-MG)
Computing Department
Belo Horizonte, MG, Brazil
Email: gabrielbrandao@decom.cefetmg.br
Adriano C. M. Pereira
Federal University of
Minas Gerais (UFMG)
Dept. of Computer Science
Belo Horizonte, MG, Brazil
Email: adrianoc@dcc.ufmg.br
Abstract-The volume of electronic transactions has risen significantly in recent years, mainly due to the popularization of electronic commerce (e-commerce), such as online retailers (e.g., Amazon.com, eBay, AliExpress.com). We also observe a significant increase in the number of fraud cases, resulting in billions of dollars in losses each year worldwide. Therefore it is important and necessary to develop and apply techniques that can assist in fraud detection and prevention, which motivates our research. This work aims to apply and evaluate computational intelligence techniques (e.g., data mining and machine learning) to identify fraud in electronic transactions, more specifically in credit card operations performed by Web payment gateways. We evaluate the techniques on a real dataset from the most popular Brazilian electronic payment service. Our results show good performance in fraud detection, with gains of up to 43 percent in an economic metric when compared to the company's current scenario.
Keywords-Fraud Prevention; e-Commerce; e-Business; e-Payment; Machine Learning;
I. INTRODUCTION
Recently we have observed a significant increase in the volume of electronic transactions, mainly due to the popularization of the World Wide Web and electronic commerce, such as online retailers (e.g., www.ebay.com, www.walmart.com, www.amazon.com). We have also witnessed a huge increase in the number of online frauds, resulting in billions of dollars in losses each year worldwide. Therefore it is important and necessary to develop and apply techniques that can assist in fraud detection, which motivates our research.
Bhatla et al. [1] report that the rate at which Internet credit card fraud occurs is 12 to 15 times higher than in face-to-face transactions. The 12th annual online fraud report by CyberSource [2] shows that, for most of the current decade, merchant online fraud losses continued to increase, reaching a peak of $4 billion in 2008. According to Bhattacharyya et al. [3], with the growth in credit card transactions as a share of the payment system, there has also been an increase in credit card fraud, and 70% of U.S. consumers are noted to be significantly concerned about identity fraud.
Moreover, many fraud detection problems occur in huge
amounts of data. For instance, the credit card company
Barclaycard has about 350 million transactions per year just
in the UK. The Royal Bank of Scotland, which has the
largest credit card market in Europe, has more than one
billion transactions per year [4]. The processing of these
datasets, looking for fraudulent operations, requires fast and
efficient algorithms.
In this context, data mining techniques have been relevant in solving this challenge, since they can deal with large amounts of data. In this work we apply and evaluate computational intelligence techniques to identify fraud in electronic transactions, more specifically in credit card operations. In order to evaluate the techniques, we define a concept of economic efficiency and apply them to a real dataset from the most popular Brazilian electronic payment service. Our results can be used to create systems to assist fraud analysts in their jobs. Compared with the company's current scenario, fraud detection performance improves by up to 43% in financial gain, as measured by the economic efficiency metric explained later.
The remainder of this paper is organized as follows. Section II describes some related work. Section III presents a brief description of the computational intelligence techniques that we adopt in this work: Bayesian networks, logistic regression, neural networks and random forest. Section IV describes our case study, using a representative sample of real data, where we present a dataset overview, the experimental methodology and results. Finally, Section V presents the conclusions and future work.
II. RELATED WORK
Due to the importance of the fraud detection problem, several works discuss this subject [3], [5], [6], [7]. Thomas et al. (2004) [8] propose a very simple decision tree that is used to identify general fraud classes. They also propose a first step towards a fraud taxonomy. Vasiu and Vasiu (2004) [9] propose a taxonomy for computer fraud and, to build it, employ a five-phase methodology. According to the authors, the taxonomy presented was prepared from a fraud prevention perspective and may be used in various ways. For them, this methodology can be useful as a tool for awareness and education, and can also
2014 9th Latin American Web Congress
978-1-4799-6953-1/14 $31.00 © 2014 IEEE
DOI 10.1109/LAWeb.2014.23
help those responsible for combating frauds associated with IT to design and implement policies to reduce risks. Chau et al. (2006) [10] propose a methodology called 2-Level Fraud Spotting (2LFS) to model the techniques that fraudsters often use to carry out fraudulent activities and to detect offenders preventively. This methodology is used to characterize online auction users as honest, dishonest, or accomplices. Methodologies that characterize fraud are essential for the first phase of the process, since they are the starting point to create a model of the problem and define the best technique for its solution.
Several studies develop methods to detect fraud [11], [5], [12], and these methodologies can differ significantly due to the peculiarities of each fraud type. However, data mining techniques have been widely used in fraud detection regardless of the methodology adopted, because they allow useful information to be extracted from databases with large volumes of data. Phua et al. [13] conducted an exploratory study of numerous articles related to fraud detection using data mining and explained these methods and techniques. These algorithms follow one of three approaches: a supervised strategy with labeled data, an unsupervised strategy with unlabeled data, or a hybrid approach.
In the supervised strategy with labeled data, algorithms examine every previously labeled transaction to mathematically determine the profile of a fraudulent transaction and estimate its risk. Neural Networks, Support Vector Machines (SVM), Decision Trees and Bayesian Networks are some of the techniques used by this strategy. Maes et al. [14] used the STAGE algorithm for Bayesian networks and the “back propagation” algorithm for neural networks to detect fraud in credit card transactions. The results show that Bayesian networks are more accurate and faster to train, but are slower when applied to new instances.
In the unsupervised strategy with unlabeled data, the methods do not require prior knowledge of fraudulent and non-fraudulent transactions. Instead, changes in behavior are detected or unusual transactions are identified. Examples of these techniques are Clustering and Anomaly Detection. Netmap [15] describes how a clustering algorithm is used to form well-connected data groups and how it led to the capture of real insurance fraudsters. Bolton and Hand [16] proposed an approach to credit card fraud detection based on anomalies detected in transactions. Abnormal spending behaviors are identified, and how often they occur is used to determine which cases may be fraud.
In the hybrid approach (supervised and unsupervised), there are studies that use labeled data with both supervised and unsupervised algorithms to detect fraud in insurance and telecommunications. Unsupervised approaches have been used to segment data into groups that are then used in supervised approaches. Williams and Huang [17] apply a three-step process: k-means for detecting groups, C4.5 for decision making, and statistical summaries and visualization tools to evaluate the rules. It is important to note that the choice of approach depends on the methodology and the available database.
SVMs and random forests are sophisticated data mining techniques, which have been noted in recent years to show superior performance across different applications [18], [19]. SVMs are statistical learning techniques, with a strong theoretical foundation and successful application in a range of problems [20]. They are closely related to neural networks and, through the use of kernel functions, can be considered an alternate way to obtain neural network classifiers. Rather than minimizing empirical error on training data, SVMs seek to minimize an upper bound on the generalization error. Compared with techniques like neural networks, which are prone to local minima, overfitting and noise, SVMs can obtain global solutions with good generalization error. Appropriate parameter selection is, however, important to obtain good results with SVMs. In our application, which has very unbalanced data, SVM does not provide good results.
A very complete work [21] reviews the literature on the application of data mining techniques to the detection of financial fraud. Its authors note that, although financial fraud detection (FFD) is an emerging topic of great importance, a comprehensive literature review of the subject had yet to be carried out, and they present their work as the first systematic, identifiable and comprehensive academic literature review of the data mining techniques that have been applied to FFD. In that review, 49 journal articles on the subject published between 1997 and 2008 were analyzed and classified into four categories of financial fraud (bank fraud, insurance fraud, securities and commodities fraud, and other related financial fraud) and six classes of data mining techniques (classification, regression, clustering, prediction, outlier detection, and visualization). The findings of the review show that data mining techniques have been applied most extensively to the detection of insurance fraud, although corporate fraud and credit card fraud have also attracted a great deal of attention in recent years. The main data mining techniques used for FFD are logistic models, neural networks, the Bayesian belief network, and decision trees, all of which provide primary solutions to the problems inherent in the detection and classification of fraudulent data. The review also addresses the gaps between FFD and the needs of the industry to encourage additional research on neglected topics, and concludes with several suggestions for further FFD research.
These related works have helped us, indicating promising
strategies for detecting and preventing fraud. As the datasets
are different, mainly due to the very unbalanced data of our
scenario, it is not possible to directly compare the results, but
they provide an idea of the efficiency of these approaches.
Moreover, as in our case the main goal is to rank the
transactions to block the ones that have high probability of
being fraud (chargeback), we are going to define a more
precise quality indicator to measure the economic gain of
each computational model.
III. FUNDAMENTALS
This section describes the techniques we apply and evalu-
ate in this work: Bayesian networks (Section III-A), logistic
regression (Section III-B), neural networks (Section III-C),
and random forest (Section III-D).
A. Bayesian Networks
Bayesian Networks (BN) are directed acyclic graphs that represent dependencies between the variables of a probabilistic model, where each node in the graph represents a random variable and the arcs represent the relationships between these variables [22]. Figure 1 shows an example in which event A directly affects event D, which is also directly affected by event B, and so on, while e is an independent event.
Figure 1. Bayesian Network - Description.
The mathematical definition of a BN is derived from Bayes' theorem, which states that the conditional probability of an event $A_i$ given an event $B$ can be calculated by Equation 1, where $P(A_i \mid B)$ is the probability of $A_i$ when $B$ occurs.
$$P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{P(B)} \tag{1}$$
In the fraud detection problem the BN is not known in advance, so its graph must be learned from the data. From the BN graph, we can determine the set of variables on which the occurrence of fraud depends and calculate its conditional probability using Equation 1. Before calculating the conditional probability, we obtain the joint probability of the variables, and hence the probability of fraud, by applying Equation 2 [23].
$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(X_i)) \tag{2}$$
where $Parents(X_i)$ are determined by a graph such as the one shown in Figure 1.
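As an illustration of Equations 1 and 2, the following minimal Python sketch computes a joint probability by multiplying local conditional probabilities and then derives a conditional probability by enumeration. The three-node structure (A and B as parents of D) follows the description of Figure 1 only loosely, and every probability value in the tables is a hypothetical number chosen for the example, not a value from the dataset.

```python
from itertools import product
from math import prod

# Hypothetical structure: each variable maps to the tuple of its parents.
parents = {"A": (), "B": (), "D": ("A", "B")}

# Hypothetical tables: P(variable = True | values of its parents).
cpt = {
    "A": {(): 0.10},
    "B": {(): 0.20},
    "D": {
        (True, True): 0.90,
        (True, False): 0.50,
        (False, True): 0.30,
        (False, False): 0.01,
    },
}


def local_probability(var, assignment):
    """P(var = assignment[var] | Parents(var)) read from the table."""
    parent_values = tuple(assignment[p] for p in parents[var])
    p_true = cpt[var][parent_values]
    return p_true if assignment[var] else 1.0 - p_true


def joint_probability(assignment):
    """Equation 2: product of the local conditional probabilities."""
    return prod(local_probability(v, assignment) for v in parents)


def posterior(query_var, evidence):
    """P(query_var = True | evidence), obtained by summing joint probabilities."""
    names = list(parents)
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the observed evidence
        p = joint_probability(assignment)
        denominator += p
        if assignment[query_var]:
            numerator += p
    return numerator / denominator


p_a_given_d = posterior("A", {"D": True})
```

Enumeration is exponential in the number of variables, so it is only suitable for toy networks such as this one.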
B. Logistic Regression
Logistic Regression (LR) is a statistical technique that produces, from a set of explanatory variables, a model that can predict the values taken by a categorical dependent variable. Thus, a regression model is used to calculate the probability of an event, through the link function described by the following Equation:
$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i}} \tag{3}$$
where $\pi(x)$ is the probability of success when the value of the predictive variable is $x$, $\beta_0$ is a constant used for adjustment and the $\beta_i$ are the coefficients of the predictive variables [24].
In order to understand LR, it is important to explain the concept of Generalized Linear Models (GLM). A GLM consists of three components [25]:
A random component, which contains the probability distribution of the dependent variable (Y).
A systematic component, which corresponds to a linear function of the independent variables.
A link function, which is responsible for describing the mathematical relationship between the systematic component and the random component.
The binary LR model is a special case of the GLM with the logit link function. This function is used to estimate the coefficients [26]. We then apply these coefficients in Equation 3, which yields our fraud probability.
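The following Python sketch shows how Equation 3 turns fitted coefficients into a probability, and how the logit link maps that probability back to the linear predictor. The coefficient and attribute values are hypothetical and do not come from the fitted model.

```python
import math


def fraud_probability(x, beta0, betas):
    """Equation 3: logistic function applied to the linear predictor."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    # Two algebraically equivalent forms, chosen to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(p):
    """Logit link: log-odds of a probability p, the inverse of Equation 3."""
    return math.log(p / (1.0 - p))


# Hypothetical intercept, coefficients and attribute values.
p = fraud_probability([1.0, 2.0], beta0=-3.0, betas=[0.5, 0.25])
```

Applying `logit` to `p` recovers the linear predictor, here -3.0 + 0.5 + 0.5 = -2.0, which is why the model is linear on the log-odds scale.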
C. Neural Networks
A Neural Network (NN) is an interconnected assembly of simple processing elements, units or nodes, whose functionality is loosely based on the animal neuron [27]. The processing ability of the network is stored in the inter-unit connection strengths, or weights, obtained by a process of adaptation to, or learning from, a set of training patterns. Generically, the processing in a neuron consists of a linear combination of inputs ($x_j$), which can be described by Equation 4:
$$net = w_1 x_1 + w_2 x_2 + \dots + w_D x_D = \sum_{j=1}^{D} w_j x_j = w^T x \tag{4}$$
where $w_j$ is the weight associated with the input $x_j$. This weight expresses how strongly a particular input influences the output value. The calculated value ($net$) is passed to an activation function, which can be Linear, Step, Ramp, Sigmoid, Hyperbolic Tangent or Gaussian [28]. The NN model used was the MultiLayer Perceptron (MLP), which has the ability to classify non-linearly separable regions [29], making it appropriate for our fraud detection approach.
The training was done using the Levenberg-Marquardt algorithm [30], because it is fast and can achieve good results. We performed a set of experiments to determine the best NN configuration, which was a network with two layers: the first (hidden layer) containing ten neurons and the second (output layer) containing one neuron.
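A forward pass through a network of this shape can be sketched in Python as follows. This is only an illustration of Equation 4 and of the ten-neuron hidden layer: the weights are random rather than trained, the bias terms and the number of inputs are assumptions, and Levenberg-Marquardt training is not shown.

```python
import math
import random


def neuron_net(weights, inputs, bias=0.0):
    """Equation 4: weighted sum of the inputs (the bias term is an assumption)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer with tanh activation, one linear output neuron."""
    hidden = [math.tanh(neuron_net(w, x, b)) for w, b in zip(hidden_w, hidden_b)]
    return neuron_net(out_w, hidden, out_b)


# Hypothetical sizes: 4 input attributes, 10 hidden neurons as in the text.
rng = random.Random(0)
n_inputs, n_hidden = 4, 10
hidden_w = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
out_w = [rng.uniform(-1, 1) for _ in range(n_hidden)]
out_b = 0.0

score = mlp_forward([0.2, -0.5, 1.0, 0.0], hidden_w, hidden_b, out_w, out_b)
```

Because the output layer is linear, `score` is an unbounded value that can be used to order transactions rather than a probability.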
D. Random Forest
The Random Forest (RF) algorithm was proposed by Breiman [31] and is based on the use of trees to produce a classification. Breiman's definition of the algorithm is: “A RF is a classifier consisting of a collection of tree-structured classifiers $h(x, \theta_k), k = 1, \ldots$ where the $\theta_k$ are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input $x$”.
The classifier quality or performance can be measured by a high value of the probability $P(h(X) = Y)$. The vector $X$ represents the variables of the problem and $Y$ is the response. Given an observed dataset
$$((x_{1,1}, \ldots, x_{1,n}), (x_{2,1}, \ldots, x_{2,n}), \ldots, (x_{k,1}, \ldots, x_{k,n})) = D$$
let $B$ be the number of trees and $m$ the number of features. Algorithm 1 describes the RF.
Algorithm 1 Random Forest Algorithm
for i = 1, ..., B do
    D_i ← bootstrap sample from D
    T_i ← construct tree using D_i
    for node = 1, ..., No.Nodes do
        node_i ← choose random subset m of all features
    end for
end for
X ← take the majority vote of all trees
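The steps of Algorithm 1 can be sketched in pure Python as below. This is a simplified illustration, not the Weka implementation used in the experiments: the split criterion (Gini impurity), the depth limit, the toy data and all function names are assumptions made for the example.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, m, rng, depth=0, max_depth=5):
    """Grow a tree, drawing a random subset of m features at every node."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), m))
    if split is None:
        return majority
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], m, rng, depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], m, rng, depth + 1, max_depth))


def predict_tree(tree, row):
    """Follow the splits until a leaf (a plain class label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def random_forest(rows, labels, n_trees, m, seed=0):
    """Train n_trees trees, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote over the predictions of all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


# Toy data: class 0 has a low first feature, class 1 a high one.
rows = [[1.0, 5.0], [2.0, 3.0], [1.5, 4.0], [8.0, 1.0], [9.0, 2.0], [8.5, 0.5]]
labels = [0, 0, 0, 1, 1, 1]
forest = random_forest(rows, labels, n_trees=25, m=1)
```

Calling `predict_forest(forest, [0.5, 6.0])` returns the majority class among the 25 trees for a new point.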
IV. CASE STUDY
This section presents our case study, where we apply the computational intelligence techniques to detect fraud in electronic transactions, more specifically chargebacks in credit card operations.
A. Dataset Overview
PagSeguro (http://pagseguro.uol.com.br) is a Web service for online payment, owned by the largest Latin American Internet and Web content provider, Universo Online Inc. (UOL, http://www.uol.com.br), which ensures the safety of those who buy and sell on the Web.
In PagSeguro each transaction is composed of tens of attributes of various types. One of these attributes is the status of the transaction, which can be a valid transaction or a chargeback. The purpose of this work is to analyze a set of transactions that occurred in PagSeguro, using the attributes that characterize these transactions to apply computational intelligence techniques, such as Bayesian Networks, Logistic Regression, Random Forest and Neural Networks, to detect fraud (chargeback).
Table I shows a short summary of the PagSeguro dataset. It comprises a significant sample of thousands of valid and chargeback transactions. Due to a confidentiality agreement, further quantitative information about this dataset cannot be presented.
                              Valid    Chargeback
Average Value (US$)           36.33         81.59
Standard Deviation (US$)      80.51        122.74
Median (US$)                  15.00         40.00
Coefficient of Variation       2.22          1.50
Table I. PagSeguro Dataset - Summary.
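The coefficient of variation in Table I is the ratio between the standard deviation and the average. The short Python check below, with a function name chosen only for this example, reproduces the last row of the table from the first two.

```python
def coefficient_of_variation(mean, std_dev):
    """Relative dispersion: standard deviation divided by the mean."""
    return std_dev / mean


# Averages and standard deviations taken from Table I (US$).
valid_cv = coefficient_of_variation(36.33, 80.51)
chargeback_cv = coefficient_of_variation(81.59, 122.74)

print(round(valid_cv, 2), round(chargeback_cv, 2))
```

Valid transactions therefore show more relative dispersion than chargebacks, even though chargebacks have the larger absolute standard deviation.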
Figure 2 shows the relative quantity of chargeback transactions for each month. Although this percentage may seem low, it is very significant, since a chargeback transaction results in a loss of the total transaction value, whereas a valid transaction results in a gain of only a small percentage of the transaction value for the payment service company.
Figure 2. Relative Amount of Chargeback. (x-axis: Month of the Year, 2010-10 to 2011-03; y-axis: Relative Amount of Chargeback (%), 0 to 0.5)
Figure 3 shows the cumulative distribution function (CDF) of the transaction value. Transactions with values lower than US$25 account for 66% of valid transactions but only 32% of chargebacks. Thus, we can see that valid transactions generally have lower values than chargebacks.
From the dataset we selected 21 attributes to be used as candidates for the techniques. The most important attributes that we use are described as follows:
Value: a numeric attribute that represents the value of the transaction.
Score: a literal attribute that helps to identify successful, unsuccessful and incomplete transactions.
Hour: a numeric attribute that refers to the transaction creation time.
Figure 3. Cumulative Distribution Function (CDF) of Transaction Value. (x-axis: Transaction Value (US$), 0 to 1200; y-axis: Cumulative amount (%), 0 to 100; series: valid transactions, chargeback transactions)
Buyer Type: a literal attribute that helps to identify the type of buyer.
Buyer Registration Time: a numeric attribute that gives the buyer's registration time.
Day: a numeric attribute that gives the day on which the transaction occurs.
Buyer's Points: a numeric attribute that helps to identify users who had successful transactions in the past.
Registered (flg registered): a flag attribute. In PagSeguro, unregistered users can also buy; this flag identifies the registered buyers.
Seller Registration Time: a numeric attribute that gives the seller's registration time.
Store's Main Category: a numeric attribute related to the main category of the store. The category refers to the main type of product sold by the store.
Credit Card Operator: a numeric attribute that identifies the credit card operator.
Credit Card Owner Age: a numeric attribute that represents the age, in years, of the credit card owner.
Quantity of Installments (num installment qty): a numeric attribute that gives the number of installments used in the purchase.
Status at Serasa (idt serasa status; see footnote 3): a literal attribute that shows the status of the buyer at Serasa.
Had Response from Serasa (flg has answer sr): a flag attribute that shows whether the query to Serasa returned an answer about the buyer.
CPF (flg cpf; see footnote 4): a flag attribute that indicates whether the CPF of the buyer is the same as the CPF of the credit card owner.
3 Serasa is a private company that owns one of the largest databases in the world and devotes its activity to the provision of services of general interest. The institution is recognized by the consumer protection code as an entity of public nature.
4 The CPF is a document that identifies the individual taxpayer before the Federal Revenue Secretariat of Brazil (FRSB). It holds the registration information provided by the individual taxpayer and by the other data systems of the FRSB.
DDD (long-distance call code): a flag attribute that indicates whether the DDD of the user registered in PagSeguro is consistent with the DDD of the credit card owner.
Federation Unit: a nominal attribute that refers to the Federation Unit provided by the user.
B. Methodology
We used the same methodology for all techniques, starting with a characterization of our dataset, which allowed us to remove items with lower significance and categorize some numeric variables. We selected the attributes most relevant to fraud detection using “Forward Stepwise Regression”, which is based on the likelihood concept [32]. We also used InfoGain, which shows the relative gain of each variable, as implemented in Weka (http://www.cs.waikato.ac.nz/ml/weka/). Weka is free software under the GPL license and has many data mining and classification algorithms in its toolbox.
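The quantity behind an InfoGain ranking is the reduction in class entropy obtained by knowing an attribute. The Python sketch below computes it for a categorical attribute; it is a simplified illustration and not Weka's implementation, and the attribute values and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the class minus its entropy conditioned on the attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical labels (1 = chargeback) and two categorical attributes.
status = [0, 0, 1, 1]
informative = ["a", "a", "b", "b"]    # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]  # says nothing about the class

gain_high = information_gain(informative, status)
gain_low = information_gain(uninformative, status)
```

Attributes would then be ordered by decreasing gain, so `informative` would rank above `uninformative`. Numeric attributes need to be discretized before this calculation, which the sketch does not do.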
Following this process we define the training and test sets to evaluate the algorithms. We use the first 3 weeks of each month for training and the remainder for testing. This reproduces a real-world scenario and supports the generality of the model. We also use “K-fold Cross-Validation” to validate the quality of our experiments, with the number of sub-samples (K) set to 5.
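The two evaluation schemes can be sketched in Python as follows. The interpretation of “the first 3 weeks” as days 1 to 21 of the month, the record layout and the interleaved fold assignment are assumptions made for this example.

```python
from datetime import date


def temporal_split(transactions, train_days=21):
    """Train on the first days of the month, test on the remaining days."""
    train = [t for t in transactions if t["date"].day <= train_days]
    test = [t for t in transactions if t["date"].day > train_days]
    return train, test


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, folds[i]


# Hypothetical records: one transaction per day of March 2011.
transactions = [{"date": date(2011, 3, d)} for d in range(1, 32)]
train, test = temporal_split(transactions)
splits = list(k_fold_indices(10, k=5))
```

The temporal split keeps all test transactions later than the training ones, which mirrors how a deployed model would score transactions it has not seen yet.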
To evaluate the fraud detection techniques we use different environments, each with its own parameters. These parameters were fine-tuned using an exhaustive search that tested different values for each technique. Next, we describe some details about the techniques and experiments, such as the parameters used for each technique.
We use the software R (http://www.r-project.org/) to build the LR model. For binary LR we use the glm function with the following parameters: “FORMULA”, where we set the response variable to chargeback and the independent variables to the others; “FAMILY”, defined as binomial; and “LINK”, defined as logit.
For the BN we use Weka. We use the parameter -Q to select the Hill Climbing algorithm to search for the network topology. As sub-options we set “-P”, the maximum number of parent nodes, to 1, because this is the number of our response variables, and “-S”, which defines the score used to build the Bayes probability table.
For the NN we use the MATLAB Neural Network Toolbox. The network is an MLP with one hidden layer of ten neurons and an output layer of one neuron. The activation function is the hyperbolic tangent sigmoid for the hidden layer and linear for the output layer. For the training stage, we use the Levenberg-Marquardt algorithm.
For the RF we use Weka. This implementation only permits the manipulation of the “Max Depth” of the trees, the “Number of Features” to be used in the random selection and the “Number of Trees”. We set “Max Depth” to unlimited as well as the “Number of Features” and “Number of Trees” to 10.
3 DDD: Long Distance Call.
4 Weka: http://www.cs.waikato.ac.nz/ml/weka/
5 R: http://www.r-project.org/
After the execution of the techniques we build a ranking based on the fraud score assigned to each transaction. At the top of the ranking are the transactions with the highest probability of being fraud. We then apply Equation 7 over many ranking ranges to obtain the best result.
Beyond precision, we measure recall, which is the ability to find all existing frauds. We also define a fitness function called Economic Efficiency (EE), given by Equation 5. The Gain (G) represents the financial value of true positive transactions, the rate (r) is the percentage that the company earns on a successful transaction and the Loss (L) is the financial value of false negative transactions. Applying this formula along the ranking, we find the position that maximizes the profit for a given algorithm.
$$EE_{Technique} = \sum_{j=1}^{n} \left( G_j \times r - L_j \times (1 - r) \right) \tag{5}$$
Equation 6 is a simplification of the company's profit. The value $r$ has the same meaning as in Equation 5, $NF$ represents the financial value of non-fraud transactions and $F$ is the financial value of fraud transactions.
$$EE_{Real} = \sum_{j=1}^{n} \left( NF_j \times r - F_j \times (1 - r) \right) \tag{6}$$
Equation 7 gives a relative gain, where 100% represents the maximum gain and 0% is the current scenario without the use of any technique. $EE_{Max}$ is the maximum gain that the company could have if no fraud occurred. We use this equation in Section IV-C to compare all techniques.
$$EE = \frac{EE_{Technique} - EE_{Real}}{EE_{Max} - EE_{Real}} \tag{7}$$
We do not use the precision rate to measure how efficient a technique is, because the dataset is unbalanced. A random model would obtain a very low precision for chargebacks, less than 0.5%. This is why we use the EE, which is the most relevant factor in our scenario. Using this concept we also avoid the misleading impression of high precision that arises when all transactions are classified as valid and none as chargeback.
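The effect of class imbalance on the usual metrics can be illustrated with the Python sketch below. The sample of 1,000 transactions with 5 chargebacks (0.5%) is hypothetical and only mirrors the order of magnitude mentioned in the text.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), with chargeback as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return tp, fp, fn, tn


def precision(tp, fp):
    """Share of flagged transactions that are chargebacks (0.0 if none flagged)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of chargebacks that were flagged."""
    return tp / (tp + fn) if tp + fn else 0.0


def accuracy(tp, fp, fn, tn):
    """Share of transactions classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical sample: 5 chargebacks among 1,000 transactions.
y_true = [True] * 5 + [False] * 995

# Model 1 labels everything as valid; model 2 flags everything.
all_valid = confusion_counts(y_true, [False] * 1000)
flag_all = confusion_counts(y_true, [True] * 1000)

acc_all_valid = accuracy(*all_valid)
rec_all_valid = recall(all_valid[0], all_valid[2])
prec_flag_all = precision(flag_all[0], flag_all[1])
```

Labeling every transaction as valid is correct 99.5% of the time yet finds no chargeback, while flagging transactions without any information yields a precision equal to the chargeback rate. Neither number reflects the financial outcome, which is what the EE captures.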
C. Results
Table II summarizes the results for the techniques described in Section III. The best result among all techniques was obtained by the NN in March, with 43.66% of EE. Except BN, October is the worst month for all techniques. It is important to emphasize that the most important measurement is the Economic Efficiency (EE), obtained at the ranking position reported as Rank. We also report precision and recall, which are traditional classification metrics; however, in our problem we want to rank transactions according to a fraud score, so it is not a typical classification problem.
Month  Metric     BN      LR      NN      RF
Oct.   Prec.     7.05    4.10    7.00   10.17
       Rec.     18.93   27.52    9.00   11.47
       Rank.     0.79    1.98    0.36    0.33
       EE       25.28   12.03   11.69    8.13
Nov.   Prec.    14.70    8.33    5.00   19.02
       Rec.     32.38   36.67   39.00   27.01
       Rank.     0.73    1.47    2.57    0.47
       EE       29.70   28.73   33.64   22.42
Dec.   Prec.     7.40    3.53    5.00   14.17
       Rec.     21.08   30.20   23.00   14.55
       Rank.     1.16    3.49    1.75    0.42
       EE       16.61   10.64   20.04   18.02
Jan.   Prec.     8.78    9.70    6.00   13.11
       Rec.     25.56   21.19   21.00   10.60
       Rank.     1.30    0.98    1.30    0.32
       EE       16.57   15.54   11.98    9.90
Feb.   Prec.     7.78    6.06    9.00    7.55
       Rec.     42.96   44.62   19.00   18.36
       Rank.     3.10    4.13    1.03    1.12
       EE       27.40   25.75   24.03   12.01
Mar.   Prec.     9.93    5.38    6.00    4.24
       Rec.     43.01   49.94   34.00   32.32
       Rank.     2.22    4.76    3.18    3.91
       EE       35.53   35.61   43.66   13.48
Table II. Comparative Results for All Techniques on the Whole Dataset. Abbreviations: “Prec.” is precision, “Rec.” is recall, “Rank.” is ranking.
BN has its lowest gain in October with 14.33% of EE and 7.05% of precision at position 0.79% of the ranking. Its best result was achieved in March with 35.53% of EE, 9.93% of precision and a ranking coverage of 2.22%. Its highest precision was obtained in November, with 32.38% of recall. Its highest recall is in March, with 43.01%.
LR has its lowest gain in October with 12.03% of EE and 4.10% of precision, and its best EE is 35.61% in March.
NN presents its worst results in October and January with 11.69% and 11.98% of EE, respectively. Its best result is in March with 43.66% of EE and 6% of precision.
RF has its worst EE in October with 8.13%. Its best is 22.42% of EE in November with precision of 19.02% at 0.47% of the ranking.
Figure 4 shows the EE up to 8% of the ranking in March. NN presents the best performance up to 5.80% of the ranking; after that it drops and stays below the BN curve. RF stays below the others throughout. These results show that all four algorithms can bring gains to the company: even the least effective technique reaches at least 8% of Economic Efficiency gain. This methodology of fraud detection can be used by e-commerce companies to reduce the risk in credit card operations. If we compare the techniques to choose the one that would best avoid chargebacks, we identify Bayesian Networks (BN) as the best one, since Neural Networks (NN) presented lower values in some months of the real dataset. Therefore BN has been chosen as the best technique for this scenario, presenting significant gains for all months of data.
Figure 4. March - EE versus Ranking Position
V. CONCLUSION
In this work we build different fraud detection models to predict fraud in online transactions, more specifically credit card operations. We apply and evaluate four different computational intelligence techniques, chosen after an initial set of experiments with several distinct techniques. In order to evaluate the techniques, we apply them to a real dataset, containing thousands of transactions per day, from the most popular Brazilian electronic payment service, called PagSeguro.
We confirm that the imbalance between the fraud and non-fraud classes is a factor that directly impacts the prediction gains. The achieved results present significant gains when compared to the current scenario of the company, which adopts some fraud detection procedures. In order to compare the techniques, we adopt an Economic Efficiency (EE) function, which describes the financial improvement relative to the company's current scenario. In the best case, we achieved a gain of 43.66%.
We observe that the worst results were obtained in the months with fewer fraudulent transactions. Neural Networks and Bayesian Networks achieved the best results. The Logistic Regression approach reached its best result in March with 35.61% of EE, slightly better than Bayesian Networks and worse than Neural Networks, with 43.66% of EE. The worst technique was Random Forest, with gains in the range of 8.13% to 22.42%.
One of the challenges of this research is the nature of the data,
which are highly imbalanced: the minority class accounts for
less than 1% of the data. As future work, we intend to use techniques
to deal with this data imbalance while preserving the generality of
the model. One possible solution would be to assign
weights to classes, where the minority class receives the larger
weight [33]. Moreover, Xie et al. [34] have proposed an
improvement to the Random Forest technique that combines
balanced random forest and weighted random forest.
Thus, one idea is to optimize the computational techniques;
another proposal to improve the gains is to use hybrid
models composed of ensembles of techniques.
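The class-weighting idea mentioned above can be illustrated with a minimal sketch. It assumes a common inverse-frequency heuristic (weight = number of samples / (number of classes × class count)); this is not necessarily the scheme of [33], and it is not part of the experiments reported in this work. The function name `balanced_class_weights` and the toy label counts are hypothetical, chosen only to mimic a minority class below 1%.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using an inverse-frequency heuristic.

    weight(c) = n_samples / (n_classes * count(c)), so rarer classes
    receive proportionally larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy sample (not data from this study):
# 995 legitimate transactions (0) and 5 fraudulent ones (1).
labels = [0] * 995 + [1] * 5
weights = balanced_class_weights(labels)

# Total weight carried by each class after reweighting.
total_weight_legit = weights[0] * 995
total_weight_fraud = weights[1] * 5
```

Under this heuristic, both classes carry the same total weight in the training loss, so misclassifying a fraudulent transaction costs far more than misclassifying a legitimate one. Such a dictionary could then be passed to any learner that accepts per-class weights.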
ACKNOWLEDGMENT
This research was supported by the Brazilian National
Institute of Science and Technology for the Web (CNPq
grant numbers 573871/2008-6 and 477709/2012-5), CAPES,
CNPq, Finep, and Fapemig.
REFERENCES
[1] V. P. Tej Paul Bhatla and A. Dua, Understanding Credit Card
Frauds, 2003.
[2] C. Mindware Research Group, 2011 Online Fraud Report,
12th ed., 2011. [Online]. Available: http://www.cybersource.
com
[3] S. Bhattacharyya, S. Jha, K. Tharakunnel, and
J. C. Westland, “Data mining for credit card fraud: A
comparative study,” Decis. Support Syst., vol. 50, pp. 602–613,
February 2011.
[4] R. J. Bolton and D. J. Hand, “Statistical fraud detection: A
review,” p. 2002, 2002.
[5] R. Maranzato, A. Pereira, M. Neubert, and A. P. do Lago,
“Fraud detection in reputation systems in e-markets using
logistic regression and stepwise optimization,” SIGAPP Appl.
Comput. Rev., vol. 11, pp. 14–26, June 2010. [Online].
Available: http://doi.acm.org/10.1145/1869687.1869689
[6] P. Ravisankar, V. Ravi, G. R. Rao, and I. Bose, “Detection
of financial statement fraud and feature selection using data
mining techniques,” Decision Support Systems, vol. 50, no. 2,
pp. 491–500, 2011.
[7] E. W. T. Ngai, Y. Hu, Y. H. Wong, Y. Chen, and X. Sun,
“The application of data mining techniques in financial fraud
detection: A classification framework and an academic review
of literature,” Decision Support Systems, vol. 50, no. 3, pp.
559–569, 2011.
[8] B. Thomas, J. Clergue, A. Schaad, and M. Dacier, “A com-
parison of conventional and online fraud,” in CRIS’04, 2nd
International Conference on Critical Infrastructures, October
25-27, 2004 - Grenoble, France, 10 2004.
[9] L. Vasiu and I. Vasiu, “Dissecting computer fraud: From
definitional issues to a taxonomy,” in Proceedings of
the 37th Annual Hawaii International
Conference on System Sciences (HICSS’04) - Track 7 -
Volume 7, ser. HICSS ’04. Washington, DC, USA: IEEE
Computer Society, 2004, pp. 70170.3–. [Online]. Available:
http://portal.acm.org/citation.cfm?id=962755.963148
[10] D. H. Chau, S. Pandit, and C. Faloutsos, “Detecting fraudulent
personalities in networks of online auctioneers,” in Proc.
ECML/PKDD, 2006, pp. 103–114.
[11] T. Fawcett and F. Provost, “Adaptive fraud detection. data
mining and knowledge discovery,” 1997.
[12] E. L. Barse, H. Kvarnström, and E. Jonsson, “Synthesizing
test data for fraud detection systems,” in Proceedings of the
19th Annual Computer Security Applications Conference,
ser. ACSAC ’03. Washington, DC, USA: IEEE Computer
Society, 2003, pp. 384–. [Online]. Available: http://portal.
acm.org/citation.cfm?id=956415.956464
[13] C. Phua, V. Lee, K. Smith-Miles, and R. Gayler, “A com-
prehensive survey of data mining-based fraud detection re-
search,” 2005.
[14] S. Maes, K. Tuyls, B. Vanschoenwinkel, and B. Manderick,
“Credit card fraud detection using bayesian and neural net-
works,” in In: Maciunas RJ, editor. Interactive image-guided
neurosurgery. American Association Neurological Surgeons,
1993, pp. 261–270.
[15] Netmap, “Fraud and crime example brochure,” 2004.
[16] R. J. Bolton and D. J. Hand, “Unsupervised Profiling
Methods for Fraud Detection,” Statistical Science, vol. 17,
no. 3, pp. 235–255, 2002. [Online]. Available: http:
//citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.5743
[17] G. J. Williams and Z. Huang, “Mining the knowledge mine:
The hot spots methodology for mining large real world
databases,” 1997.
[18] B. Larivière and D. Van den Poel, “Predicting customer
retention and profitability by using random forests and
regression forests techniques,” Expert Syst. Appl., vol. 29,
no. 2, pp. 472–484, Aug. 2005. [Online]. Available:
http://dx.doi.org/10.1016/j.eswa.2005.04.043
[19] A. R. Statnikov, L. Wang, and C. F. Aliferis,
“A comprehensive comparison of random forests and
support vector machines for microarray-based cancer
classification.” BMC Bioinformatics, vol. 9, 2008. [Online].
Available: http://dblp.uni-trier.de/db/journals/bmcbi/bmcbi9.
html#StatnikovWA08
[20] C.-C. Chang and C.-J. Lin, “Libsvm: A library for support
vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2,
no. 3, pp. 27:1–27:27, May 2011. [Online]. Available:
http://doi.acm.org/10.1145/1961189.1961199
[21] E. W. T. Ngai, Y. Hu, Y. H. Wong, Y. Chen, and
X. Sun, “The application of data mining techniques in
financial fraud detection: A classification framework and
an academic review of literature,” Decis. Support Syst.,
vol. 50, no. 3, pp. 559–569, Feb. 2011. [Online]. Available:
http://dx.doi.org/10.1016/j.dss.2010.08.006
[22] S. Maes, K. Tuyls, B. Vanschoenwinkel, and B. Manderick,
“Credit card fraud detection using bayesian and neural
networks,” Vrije Universiteit Brussel, 2001.
[23] G. Cooper and E. Herskovits, “A bayesian method for the
induction of probabilistic networks from data,” Machine
Learning, vol. 9, no. 4, pp. 309–347, 1992.
[24] D. W. Hosmer, Applied Logistic Regression, 2nd ed. New
York: Wiley, 2000.
[25] A. J. Dobson, An Introduction to Generalized Linear Models.
London:Chapman and Hall, 1990.
[26] W. N. Venables, D. M. Smith, and the R Development
Core Team, “An introduction to R,” http://www.cran.r-project.
org, [Online; Accessed: July 20, 2014].
[27] K. Gurney, An Introduction to Neural Networks.
CRC Press, 1997.
[28] A. P. Engelbrecht, Computational Intelligence: An Introduc-
tion, 2nd ed. Wiley, 2007.
[29] A. Konar, Computational Intelligence: Principles, Techniques
and Applications. Springer-Verlag New York, 2005.
[30] M. I. A. Lourakis, “A brief description of the levenberg-
marquardt algorithm,” vol. 3, p. 2, 2005.
[31] L. Breiman, “Random forests,” Machine learning, vol. 45,
no. 1, pp. 5–32, 2001.
[32] R Development Core Team, “R: A language and environment
for statistical computing,” http://www.r-project.org, [Online;
Accessed: July 20, 2014].
[33] C. Chen, A. Liaw, and L. Breiman, “Using random forest to
learn imbalanced data,” Discovery, no. 1999, pp. 1–12, 2004.
[34] Y. Xie, X. Li, E. Ngai, and W. Ying, “Customer churn
prediction using improved balanced random forests,” Expert
Systems with Applications, vol. 36, no. 3, pp. 5445–5449,
2009.