A Frequency-domain-based Pattern Mining
for Credit Card Fraud Detection
Roberto Saia and Salvatore Carta
Dipartimento di Matematica e Informatica, Università di Cagliari, Italy
{roberto.saia,salvatore}@unica.it
Keywords: Business Intelligence, Fraud Detection, Pattern Mining, Fourier, Metrics.
Abstract: Nowadays, the prevention of credit card fraud represents a crucial task, since almost all the operators in the E-commerce environment accept payments made through credit cards, aware that some of them could be fraudulent. The development of approaches able to face this problem effectively represents a hard challenge due to several issues, the most important of which are the heterogeneity and the imbalanced class distribution of the data: they reduce the effectiveness of the most widely used techniques, making it difficult to define effective models able to evaluate the new transactions. This paper proposes a new strategy able to face the aforementioned problems, based on a model defined by using the Discrete Fourier Transform in order to exploit frequency patterns, instead of the canonical ones, in the evaluation process. Such an approach presents some advantages, since it allows us to face the imbalanced class distribution and the cold-start issues by involving only the past legitimate transactions, and it reduces the data heterogeneity problem thanks to the frequency-domain-based data representation, which is less influenced by the data variation. A practical implementation of the proposed approach is given by presenting an algorithm able to classify a new transaction as reliable or unreliable on the basis of the aforementioned strategy.
1 INTRODUCTION
Studies conducted by the American Association of Fraud Examiners¹ show that financial frauds represent 10-15% of all fraud cases, while involving 75-80% of the entire financial value, with an estimated average loss per fraud case of 2 million dollars in the USA alone. Fraud represents one of the major issues related to the use of credit cards, an important aspect considering the exponential growth of E-commerce transactions. For these reasons, the research of effective approaches able to detect frauds has become a crucial task, because it allows the involved operators to eliminate, or at least reduce, the economic losses.

Since the fraudulent transactions are typically far fewer than the legitimate ones, the data distribution is highly unbalanced, and this reduces the effectiveness of many machine learning strategies Japkowicz and Stephen (2002). Such a problem is worsened by the scarcity of information that characterizes a typical financial transaction, a scenario that leads toward an overlapping of the classes of expense of a user Holte et al. (1989).

¹ http://www.acfe.com
There are many state-of-the-art techniques designed to perform the fraud detection task, for instance, those that exploit Data Mining Lek et al. (2001), Artificial Intelligence Hoffman and Tessendorf (2005), Fuzzy Logic Lenard and Alam (2005), Machine Learning Whiting et al. (2012), and Genetic Programming Assis et al. (2010) techniques.

Almost all the aforementioned techniques mainly rely on the detection of outliers in the transactions under analysis, a basic approach that could lead toward many wrong classifications (i.e., reliable transactions classified as unreliable). Most of these wrong classifications happen due to the absence of extensive criteria during the evaluation process, since many techniques are not able to manage some non-numeric transaction features: for example, even one of the best performing approaches, Random Forests, is not able to manage types of data that involve a large number of categories.
The idea behind this paper is a new representation of the data, obtained by using the Fourier transformation Duhamel and Vetterli (1990) in order to move a time series (the sequence of discrete-time data given by the feature values of a transaction) into the frequency domain, allowing us to analyze the data from a new point of view.

It should be observed that the proposed evaluation process involves only the past legitimate transactions, presenting some advantages: first, it operates in a proactive way, by facing the imbalanced class distribution and the cold-start (i.e., scarcity or total absence of fraudulent transaction cases) problems; second, it reduces the problems related to the data heterogeneity, since the data representation in the frequency domain is more stable than the canonical one, in terms of capability of recognizing a peculiar pattern, regardless of the values assumed by the transaction features.
The contributions of this paper are as follows:

(i) definition of the time series to use in the Fourier process, on the basis of the past legitimate transactions;

(ii) formalization of the comparison process between the time series of an unevaluated transaction and those of the past legitimate transactions, in terms of the difference between their frequency magnitudes;

(iii) formulation of an algorithm, based on the previous comparison process, able to classify a new transaction as reliable or unreliable.
The remainder of the paper is organized as fol-
lows: Section 2 introduces the background and related
work; Section 3 provides a formal notation, makes
some assumptions, and defines the faced problem;
Section 4 describes the steps necessary to define the
proposed approach; Section 5 gives some concluding
remarks.
2 BACKGROUND AND RELATED WORK
Many studies consider frauds as the biggest problem in the E-commerce environment. The challenge faced by the fraud detection techniques is the classification of a financial transaction as reliable or unreliable, on the basis of the analysis of its features (e.g., description, date, total amount, etc.).

The study presented in Assis et al. (2010) indicates that in the fraud detection field there is a lack of public real-world datasets, a relevant issue for those who deal with the research and development of new and more effective fraud detection techniques. This scenario mainly depends on the restrictive policies adopted by the financial operators, which for competitive or legal reasons do not provide information about their business activities. These policies are also adopted because the financial data are composed of real information about their customers, which even when anonymized may reveal potential vulnerabilities related to the E-commerce infrastructure.
Supervised and Unsupervised Approaches. In Phua et al. (2010) it is underlined how the unsupervised fraud detection strategies still represent a very big challenge in the field of E-commerce. In spite of the fact that every supervised strategy in fraud detection needs a reliable training set, the work proposed in Bolton and Hand (2002) takes into consideration the possibility to adopt an unsupervised approach during the fraud detection process, when no reference dataset containing an adequate number of transactions (legitimate and non-legitimate) is available.
Data Heterogeneity. Pattern recognition can be considered an important branch of the machine learning field. Its main task is the detection of patterns and regularities in a data stream, in order to define an evaluation model to exploit in a large number of real-world applications Garibotto et al. (2013). One of the most critical problems related to the pattern recognition tasks is the data heterogeneity. The literature describes the data heterogeneity issue as the incompatibility among similar features, which results in the same data being represented differently in different datasets Chatterjee and Segev (1991).
Data Unbalance. One of the most important problems that make the definition of effective models for fraud detection difficult is the imbalanced class distribution of data Japkowicz and Stephen (2002); He and Garcia (2009). This issue arises because the data used to train the models are characterized by a small number of default cases and a large number of non-default ones, a distribution that limits the performance of the classification techniques Japkowicz and Stephen (2002); Brown and Mues (2012).
Cold Start. The cold-start problem Donmez et al. (2007) arises when there is not enough information to train a reliable model for a domain. In the context of fraud detection, such a scenario appears when the data used to train the model are not representative of all the classes of data Attenberg and Provost (2010) (i.e., default and non-default cases).
Detection Models. The static approach Pozzolo et al. (2014) represents a canonical way to detect fraudulent events in a stream of transactions. This approach divides the data stream into blocks of the same size, and the user model is trained by using a certain number of initial and contiguous blocks of the sequence, which are used to infer the future blocks. The updating approach Wang et al. (2003), instead, trains the user model, when a new block appears, by using a certain number of the latest contiguous blocks of the sequence; then the model can be used to infer the future blocks, or it can be aggregated into a big model composed of several models. In another strategy, based on the so-called forgetting approach Gao et al. (2007), a user model is defined at each new block, by using a small number of non-fraudulent transactions extracted from the last two blocks, but by keeping all the previous fraudulent ones. Also in this case, the model can be used to infer the future blocks, or it can be aggregated into a big model composed of several models.

Figure 1: Time and Frequency Domains (a time series and the magnitudes of its frequency components f1, f2, ..., fX).

The main disadvantages related to these approaches of user modeling are: the incapacity to track the changes in the users' behavior, in the case of the static approach; the ineffectiveness when operating on small classes, in the case of the updating approach; and the computational complexity, in the case of the forgetting approach. However, regardless of the adopted approach, the problem of the heterogeneity and unbalance of the data remains unaltered.
Discrete Fourier Transform. The basic idea behind the approach proposed in this paper is to move the process of evaluation of the new transactions (time series) from their canonical time domain to the frequency one, in order to obtain a representative pattern composed of their frequency components, as shown in Figure 1. This operation is performed by recurring to the Discrete Fourier Transform (DFT), whose formalization is reported in Equation 1, where i is the imaginary unit.

$$F_n \stackrel{\mathrm{def}}{=} \sum_{k=0}^{N-1} f_k \cdot e^{-2\pi i n k / N}, \quad n \in \mathbb{Z} \qquad (1)$$

As a result we obtain a set of sinusoidal functions, each corresponding to a particular frequency component. It is possible to return to the original time domain by using the inverse Fourier transform shown in Equation 2.

$$f_k = \frac{1}{N} \sum_{n=0}^{N-1} F_n \cdot e^{2\pi i k n / N}, \quad k \in \mathbb{Z} \qquad (2)$$
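As an illustrative aside (not part of the original paper), the following short Python sketch, assuming NumPy and purely numeric feature values, checks that numpy.fft.fft follows the sign convention of Equation 1 and that the inverse transform of Equation 2 recovers the original series:

import numpy as np

# Hypothetical time series: the M feature values of one transaction.
f = np.array([1.0, 3.0, 2.0, 4.0, 1.5, 2.5])
N = len(f)

# Direct computation of Equation 1: F_n = sum_k f_k * exp(-2*pi*i*n*k/N).
F_manual = np.array([sum(f[k] * np.exp(-2j * np.pi * n * k / N) for k in range(N))
                     for n in range(N)])

F_numpy = np.fft.fft(f)        # same sign convention as Equation 1
f_back = np.fft.ifft(F_numpy)  # Equation 2: (1/N) * sum_n F_n * exp(2*pi*i*k*n/N)

assert np.allclose(F_manual, F_numpy)
assert np.allclose(f_back.real, f)
print(np.abs(F_numpy))         # the frequency magnitudes used by the approach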
3 PRELIMINARIES
Formal notation, assumptions, and problem defi-
nition are stated in the following:
3.1 Formal Notation
Given a set of classified transactions T = {t1, t2, ..., tN}, and a set of features V = {v1, v2, ..., vM} that compose each t ∈ T, we denote as T+ ⊆ T the subset of legitimate transactions, and as T− ⊆ T the subset of fraudulent ones. We also denote as T̂ = {t̂1, t̂2, ..., t̂U} a set of unclassified transactions. It should be observed that a transaction can only belong to one class c ∈ C, where C = {reliable, unreliable}. Finally, we denote as F = {f1, f2, ..., fX} the frequency components of each transaction obtained through the DFT process.
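To make the notation concrete, a minimal sketch (ours, not the authors') could represent these sets with plain Python structures, assuming purely numeric feature values:

from typing import List

Transaction = List[float]                 # the feature values v1, ..., vM of one transaction

T_plus: List[Transaction] = [             # hypothetical past legitimate transactions (T+)
    [12.0, 1.0, 3.5, 40.0],
    [11.0, 1.0, 3.0, 42.0],
]
T_hat: List[Transaction] = [              # hypothetical unclassified transactions (T̂)
    [12.5, 1.0, 3.2, 41.0],
]
CLASSES = ("reliable", "unreliable")      # the two classes in C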
3.2 Assumptions
A periodic wave is characterized by a frequency f and a wavelength λ (i.e., the distance in the medium between the beginning and the end of a cycle, λ = w / f0, where w stands for the wave velocity), which are defined by its repeating pattern. The non-periodic waves that we take into account during the Discrete Fourier Transform process do not have a frequency and a wavelength: their fundamental period T is the period over which the wave values were taken, and sr denotes their number over this time (i.e., the acquisition frequency).

Assuming that the time interval between the acquisitions is constant, on the basis of the previous definitions applied in the context of this paper, the considered non-periodic wave is given by the sequence of values v1, v2, ..., vM, with v ∈ V, which composes each transaction t ∈ T+ (i.e., the past legitimate transactions) and t̂ ∈ T̂ (i.e., the unevaluated transactions), and which represents the time series taken into account. Their fundamental period T starts with v1 and ends with vM; thus we have that sr = |V|, while the sample interval si is given by the fundamental period T divided by the number of acquisitions, i.e., si = T / |V|.

We compute the Discrete Fourier Transform of each time series t ∈ T+ and t̂ ∈ T̂, by converting their representation from the time domain to the frequency one. The obtained frequency-domain representation provides information about the signal's magnitude and phase at each frequency. For this reason, the output (denoted as x) of the DFT computation is a series of complex numbers composed of a real part xr and an imaginary part xi, thus x = (xr + i·xi). We can obtain the magnitude of x as |x| = √(xr² + xi²) and its phase as φ(x) = arctan(xi / xr), although in the context of this paper we take into account only the frequency magnitude.
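As a small illustration (our sketch, assuming NumPy and numeric feature values), the quantities defined above can be computed as follows:

import numpy as np

v = np.array([12.0, 1.0, 3.5, 40.0, 2.0, 7.0])  # hypothetical feature values of one transaction
T_period = 1.0                                  # fundamental period (arbitrary time unit)
sr = len(v)                                     # sr = |V| (acquisition frequency)
si = T_period / len(v)                          # si = T / |V| (sample interval)

x = np.fft.fft(v)                               # complex DFT output, x = xr + i*xi
magnitude = np.sqrt(x.real ** 2 + x.imag ** 2)  # |x| = sqrt(xr^2 + xi^2)
phase = np.arctan2(x.imag, x.real)              # phi(x); arctan2 also handles xr = 0
# Only the magnitudes are used in the remainder of the approach.
print(sr, si, magnitude)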
3.3 Problem Definition
On the basis of a process of comparison (denoted as Θ) performed between the frequency patterns of the time series related to the set T+ and to the set T̂, the goal of the proposed approach is to classify each transaction t̂ ∈ T̂ as reliable or unreliable.

Given a function evaluation(t̂, Θ), created to evaluate the correctness of the classification of t̂, which returns a boolean value β (0 = misclassification, 1 = correct classification), we can formalize our objective as the maximization of the sum of the results, as shown in Equation 3.

$$\max_{0 \le \beta \le |\hat{T}|} \beta = \sum_{u=1}^{|\hat{T}|} evaluation(\hat{t}_u, \Theta) \qquad (3)$$
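A minimal sketch of this objective (ours; classify stands in for the comparison process Θ defined in Section 4, and the ground-truth labels are assumed to be available only for evaluation purposes):

from typing import Callable, List, Sequence

def evaluation(t_hat: Sequence[float], true_label: str,
               classify: Callable[[Sequence[float]], str]) -> int:
    # Returns 1 for a correct classification of t_hat, 0 for a misclassification.
    return 1 if classify(t_hat) == true_label else 0

def objective(T_hat: List[Sequence[float]], labels: List[str],
              classify: Callable[[Sequence[float]], str]) -> int:
    # beta = sum over all unevaluated transactions of evaluation(t_hat_u, Theta), as in Equation 3.
    return sum(evaluation(t, y, classify) for t, y in zip(T_hat, labels))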
4 PROPOSED APPROACH
The implementation of our approach is carried out through the following steps:

Data Definition: definition of the time series in terms of sequences of transaction features;

Data Evaluation: comparison of the frequency patterns of two transactions, made by processing the related time series;

Data Classification: presentation of the algorithm able to classify a new transaction as reliable or unreliable, on the basis of the previous comparison process.

In the following, we provide a detailed description of each of these steps.
4.1 Data Definition
In the first step of our approach we define the time series to use in the Discrete Fourier Transform process. Formally, a time series represents a series of data points stored by following the time order; usually it is a sequence captured at successive, equally spaced points in time, thus it can be considered a sequence of discrete-time data.

In the context of the proposed approach, the time series taken into account are defined by using the set of features V that compose each transaction in the T+ and T̂ sets, as shown in Equation 4, by following the criterion reported in Equation 5.

$$T^+ = \begin{pmatrix} v_{1,1} & v_{1,2} & \ldots & v_{1,M} \\ v_{2,1} & v_{2,2} & \ldots & v_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ v_{N,1} & v_{N,2} & \ldots & v_{N,M} \end{pmatrix} \qquad \hat{T} = \begin{pmatrix} v_{1,1} & v_{1,2} & \ldots & v_{1,M} \\ v_{2,1} & v_{2,2} & \ldots & v_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ v_{U,1} & v_{U,2} & \ldots & v_{U,M} \end{pmatrix} \qquad (4)$$

$$\begin{array}{l} (v_{1,1}, v_{1,2}, \ldots, v_{1,M}), (v_{2,1}, v_{2,2}, \ldots, v_{2,M}), \cdots, (v_{N,1}, v_{N,2}, \ldots, v_{N,M}) \\ (v_{1,1}, v_{1,2}, \ldots, v_{1,M}), (v_{2,1}, v_{2,2}, \ldots, v_{2,M}), \cdots, (v_{U,1}, v_{U,2}, \ldots, v_{U,M}) \end{array} \qquad (5)$$

The time series related to an item t̂ ∈ T̂ will be compared to the time series related to all the items t ∈ T+, by following the criteria explained in the next steps.
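The sketch below (ours, assuming NumPy and purely numeric feature values) shows how each row of the matrices in Equation 4 can be treated as a time series and converted into its frequency pattern:

import numpy as np

def to_time_series(transaction):
    # Each transaction is treated as a discrete-time series v1, ..., vM.
    return np.asarray(transaction, dtype=float)

def to_frequency_pattern(transaction):
    # Frequency magnitudes of the transaction's time series (its frequency pattern).
    return np.abs(np.fft.fft(to_time_series(transaction)))

# Hypothetical T+ (legitimate) and T̂ (unevaluated) sets with numeric features.
T_plus = [[12.0, 1.0, 3.5, 40.0], [11.0, 1.0, 3.0, 42.0]]
t_hat = [12.5, 1.0, 3.2, 41.0]

legit_patterns = np.array([to_frequency_pattern(t) for t in T_plus])
new_pattern = to_frequency_pattern(t_hat)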
4.2 Data Evaluation
The frequency-domain representation allows us to perform a transaction analysis in terms of the magnitude assumed by each frequency component that characterizes the transaction, allowing us to detect some patterns in the features that are not discoverable otherwise. As preliminary work, we compared the two different representations of a transaction (i.e., those obtained in the time and frequency domains), observing some interesting properties for the context taken into account in this paper, which are described in the following:

The phase invariance property, shown in Figure 2, proves that also in the case of a translation² between transactions, a specific pattern still exists in the frequency domain. In other words, by working in the frequency domain we can detect a specific pattern also when it shifts along the features that compose a transaction.

The amplitude correlation property, shown in Figure 3, evidences that a direct correlation exists between the feature values in the time domain and the magnitudes assumed by the frequency components in the frequency domain. It grants that our approach is able to differentiate the transactions on the basis of the values assumed by the transaction features.

² A translation in the time domain corresponds to a change in phase in the frequency domain.
Practically, the process of analysis is performed by moving the time series of the transactions to compare from their time domain to the frequency one, by recurring to the DFT introduced in Section 2.
Figure 2: Phase Invariance Property.
Figure 3: Amplitude Correlation Property.
The process of comparison between a transaction t̂ ∈ T̂ to evaluate and a past legitimate transaction t ∈ T+ is performed by measuring the difference Δ between the magnitude |f| of each frequency component f ∈ F of the involved transactions. It is shown in Equation 6, where f¹x and f²x denote, respectively, the same frequency component of an item t ∈ T+ and of an item t̂ ∈ T̂.

$$\Delta = |f^1_x| - |f^2_x|, \quad \text{with} \ |f^1_x| \ge |f^2_x| \qquad (6)$$
It should be noted that, as described in Section 4.3, for each transaction t̂ ∈ T̂ to evaluate, the aforementioned process is repeated by comparing it to each transaction t ∈ T+. This allows us to evaluate the variation in the context of all the legitimate past cases.
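As an illustration of this comparison (our sketch, assuming NumPy and numeric features; since Equation 6 subtracts the smaller magnitude from the larger one, the difference is computed here as an absolute value):

import numpy as np

def frequency_magnitudes(transaction):
    return np.abs(np.fft.fft(np.asarray(transaction, dtype=float)))

def magnitude_deltas(t_legit, t_new):
    # One difference per frequency component, as in Equation 6.
    m1, m2 = frequency_magnitudes(t_legit), frequency_magnitudes(t_new)
    return np.abs(m1 - m2)

# Hypothetical transactions with numeric features.
t_legit = [12.0, 1.0, 3.5, 40.0]
t_new = [12.5, 1.0, 3.2, 41.0]
print(magnitude_deltas(t_legit, t_new))  # the comparison is repeated for every t in T+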
4.3 Data Classification
The proposed approach is based on Algorithm 1. It takes as input the set T+ of past legitimate transactions and a transaction t̂ to evaluate. It returns a boolean value that indicates the classification of the transaction t̂ (true = reliable, false = unreliable).

From step 1 to step 21 we process the unevaluated transaction t̂, starting with the extraction of its time series (step 2), which is processed at step 3 in order to obtain its frequency components. From step 4 to step 14 we instead process each non-default transaction t ∈ T+, by extracting its time series (step 5) and by obtaining its frequency components (step 6).
Algorithm 1 Transaction classification
Input: T+ = Legitimate past transactions, t̂ = Unevaluated transaction
Output: β = Classification of the transaction t̂
1: procedure TRANSACTIONCLASSIFICATION(T+, t̂)
2:   ts1 ← getTimeseries(t̂)
3:   F1 ← getDFT(ts1)
4:   for each t in T+ do
5:     ts2 ← getTimeseries(t)
6:     F2 ← getDFT(ts2)
7:     for each f in F do
8:       if (|F2(f)| − |F1(f)|) ∈ getVariationRange(T+, f) then
9:         reliable ← reliable + 1
10:      else
11:        unreliable ← unreliable + 1
12:      end if
13:    end for
14:  end for
15:  if reliable > unreliable then
16:    β ← true
17:  else
18:    β ← false
19:  end if
20:  return β
21: end procedure
Steps 7 to 13 verify whether the difference between the magnitude of each frequency component f ∈ F of the non-default transactions and the corresponding component of the current transaction is within the interval given by the minimum and maximum variation measured in the set T+, obtained by comparing all the magnitudes of the current frequency component f. On the basis of the result of this operation, we increase the reliable value (when the difference is within the interval) or the unreliable one (otherwise) (steps 9 and 11). The reliable and unreliable values determine the classification of the transaction under evaluation (steps 15 to 19), and the result is returned by the algorithm at step 20.
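For concreteness, the following Python sketch is one possible reading of Algorithm 1 (ours, not the authors' implementation); in particular, getVariationRange is interpreted, following the description above, as the minimum and maximum magnitude difference observed between legitimate transactions for a given frequency component, and numeric features are assumed:

import numpy as np
from itertools import combinations

def frequency_magnitudes(transaction):
    return np.abs(np.fft.fft(np.asarray(transaction, dtype=float)))

def get_variation_range(legit_magnitudes, f):
    # Min/max magnitude difference for component f measured within T+.
    diffs = [abs(a[f] - b[f]) for a, b in combinations(legit_magnitudes, 2)]
    return (min(diffs), max(diffs)) if diffs else (0.0, 0.0)

def transaction_classification(T_plus, t_hat):
    F1 = frequency_magnitudes(t_hat)                    # steps 2-3
    legit = [frequency_magnitudes(t) for t in T_plus]   # steps 5-6, for every t in T+
    reliable = unreliable = 0
    for F2 in legit:                                    # steps 4-14
        for f in range(len(F1)):                        # steps 7-13
            low, high = get_variation_range(legit, f)
            if low <= abs(F2[f] - F1[f]) <= high:       # step 8
                reliable += 1                           # step 9
            else:
                unreliable += 1                         # step 11
    return reliable > unreliable                        # steps 15-20

# Usage with hypothetical transactions:
T_plus = [[12.0, 1.0, 3.5, 40.0], [11.0, 1.0, 3.0, 42.0], [12.5, 1.2, 3.4, 39.0]]
print(transaction_classification(T_plus, [12.2, 1.1, 3.3, 40.5]))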
5 CONCLUSIONS
Fraud detection techniques play a crucial role in many financial contexts, since they are able to reduce the losses due to fraud, suffered directly by the traders or indirectly by the credit card issuers.

This paper introduces a novel fraud detection approach aimed at classifying the new transactions as reliable or unreliable by evaluating their characteristics (patterns) in the frequency domain instead of the canonical one. This is performed through the Fourier transformation, defining our model by only using the past legitimate user transactions.

Such an approach allows us to avoid the data unbalance problem that affects the canonical classification approaches, because it only uses one class of data during the process of definition of the model, allowing us to operate in a proactive way and also reducing the cold-start problem.

Even the problems related to the data heterogeneity are reduced, thanks to the adoption of a more stable model (based on the frequency components) able to recognize peculiar patterns in the transaction features, regardless of the values assumed by them.

Future work will be oriented toward the implementation of the proposed approach in a real-world context, by comparing its performance to that of the most widely used state-of-the-art approaches.
ACKNOWLEDGEMENTS
This research is partially funded by Regione Sardegna under project "Next generation Open Mobile Apps Development" (NOMAD), "Pacchetti Integrati di Agevolazione" (PIA) - Industria Artigianato e Servizi - Annualità 2013.
REFERENCES
Assis, C., Pereira, A. M., de Arruda Pereira, M., and Carrano, E. G. (2010). Using genetic programming to detect fraud in electronic transactions. In Prazeres, C. V. S., Sampaio, P. N. M., Santanchè, A., Santos, C. A. S., and Goularte, R., editors, A Comprehensive Survey of Data Mining-based Fraud Detection Research, volume abs/1009.6119, pages 337–340.

Attenberg, J. and Provost, F. J. (2010). Inactive learning?: difficulties employing active learning in practice. SIGKDD Explorations, 12(2):36–41.

Bolton, R. J. and Hand, D. J. (2002). Statistical fraud detection: A review. Statistical Science, pages 235–249.

Brown, I. and Mues, C. (2012). An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Syst. Appl., 39(3):3446–3453.

Chatterjee, A. and Segev, A. (1991). Data manipulation in heterogeneous databases. ACM SIGMOD Record, 20(4):64–68.

Donmez, P., Carbonell, J. G., and Bennett, P. N. (2007). Dual strategy active learning. In ECML, volume 4701 of Lecture Notes in Computer Science, pages 116–127. Springer.

Duhamel, P. and Vetterli, M. (1990). Fast Fourier transforms: a tutorial review and a state of the art. Signal Processing, 19(4):259–299.

Gao, J., Fan, W., Han, J., and Yu, P. S. (2007). A general framework for mining concept-drifting data streams with skewed distributions. In Proceedings of the Seventh SIAM International Conference on Data Mining, April 26-28, 2007, Minneapolis, Minnesota, USA, pages 3–14. SIAM.

Garibotto, G., Murrieri, P., Capra, A., Muro, S. D., Petillo, U., Flammini, F., Esposito, M., Pragliola, C., Leo, G. D., Lengu, R., Mazzino, N., Paolillo, A., D'Urso, M., Vertucci, R., Narducci, F., Ricciardi, S., Casanova, A., Fenu, G., Mizio, M. D., Savastano, M., Capua, M. D., and Ferone, A. (2013). White paper on industrial applications of computer vision and pattern recognition. In ICIAP (2), volume 8157 of Lecture Notes in Computer Science, pages 721–730. Springer.

He, H. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Trans. Knowl. Data Eng., 21(9):1263–1284.

Hoffman, A. J. and Tessendorf, R. E. (2005). Artificial intelligence based fraud agent to identify supply chain irregularities. In Hamza, M. H., editor, IASTED International Conference on Artificial Intelligence and Applications, part of the 23rd Multi-Conference on Applied Informatics, Innsbruck, Austria, February 14-16, 2005, pages 743–750. IASTED/ACTA Press.

Holte, R. C., Acker, L., and Porter, B. W. (1989). Concept learning and the problem of small disjuncts. In Sridharan, N. S., editor, Proceedings of the 11th International Joint Conference on Artificial Intelligence, Detroit, MI, USA, August 1989, pages 813–818. Morgan Kaufmann.

Japkowicz, N. and Stephen, S. (2002). The class imbalance problem: A systematic study. Intell. Data Anal., 6(5):429–449.

Lek, M., Anandarajah, B., Cerpa, N., and Jamieson, R. (2001). Data mining prototype for detecting e-commerce fraud. In Smithson, S., Gricar, J., Podlogar, M., and Avgerinou, S., editors, Proceedings of the 9th European Conference on Information Systems, Global Co-operation in the New Millennium, ECIS 2001, Bled, Slovenia, June 27-29, 2001, pages 160–165.

Lenard, M. J. and Alam, P. (2005). Application of fuzzy logic fraud detection. In Khosrow-Pour, M., editor, Encyclopedia of Information Science and Technology (5 Volumes), pages 135–139. Idea Group.

Phua, C., Lee, V. C. S., Smith-Miles, K., and Gayler, R. W. (2010). A comprehensive survey of data mining-based fraud detection research. CoRR, abs/1009.6119.

Pozzolo, A. D., Caelen, O., Borgne, Y. L., Waterschoot, S., and Bontempi, G. (2014). Learned lessons in credit card fraud detection from a practitioner perspective. Expert Syst. Appl., 41(10):4915–4928.

Wang, H., Fan, W., Yu, P. S., and Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Getoor, L., Senator, T. E., Domingos, P. M., and Faloutsos, C., editors, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24-27, 2003, pages 226–235. ACM.

Whiting, D. G., Hansen, J. V., McDonald, J. B., Albrecht, C. C., and Albrecht, W. S. (2012). Machine learning methods for detecting patterns of management fraud. Computational Intelligence, 28(4):505–527.