A Discrete Wavelet Transform Approach
to Fraud Detection
Roberto Saia
Department of Mathematics and Computer Science
University of Cagliari, Via Ospedale 72 - 09124 Cagliari, Italy
roberto.saia@unica.it
Abstract. The exponential growth in the number of operations carried out in the e-commerce environment is directly related to the growth in the number of operations performed through credit cards, since practically all commercial operators allow their customers to pay with them. This scenario leads to a high level of risk related to the fraudulent activities that fraudsters can perform by exploiting this powerful payment instrument illegitimately. A large number of state-of-the-art approaches have been designed to address this problem, but they must face some common issues, the most important of which are the imbalanced distribution and the heterogeneity of data. This paper presents a novel fraud detection approach based on the Discrete Wavelet Transform, which is exploited in order to define an evaluation model able to address the aforementioned problems. This objective is achieved by using only legitimate transactions in the model definition process, an operation made possible by the more stable data representation offered by the new domain. The performed experiments show that the performance of our approach is comparable to that of one of the best state-of-the-art approaches, Random Forests, demonstrating that such a proactive strategy is also able to face the cold-start problem.
Keywords: Business intelligence · Fraud detection · Pattern mining · Wavelet
1 Introduction
A study performed by the American Association of Fraud Examiners1 shows that credit card frauds (i.e., purchases made without authorization or with counterfeit credit cards) represent 10–15% of all fraud cases, but account for a financial value close to 75–80% of the total. In the USA alone, such frauds lead to an estimated average loss of 2 million dollars per fraud case, and for this reason recent years have seen an increase in the researchers' efforts to define effective fraud detection techniques. The literature presents several state-of-the-art techniques for this task, but all of them have to face some common problems, e.g., the imbalanced distribution of data and the heterogeneity of the information that composes a transaction. This scenario is worsened by the scarcity of information that usually characterizes a transaction, a problem that leads to an overlapping of the classes of expense.
1 http://www.acfe.com
The core idea of the proposed approach is the adoption of a new evaluation model based on the data obtained by processing the transactions through a Discrete Wavelet Transform (DWT) [1]. Considering that this process involves only the previous legitimate transactions, it operates proactively, facing the cold-start issue (i.e., the scarcity or absence of fraudulent examples during the model definition) and also reducing the problems related to data heterogeneity, since the new model is less influenced by the data variations.
The scientific contributions given by this paper are as follows:
(i) definition of the time series to be used as input of the DWT process, in terms of the sequence of values assumed by the features of a credit card transaction;
(ii) formalization of the process that compares the DWT output of a new transaction with those of the previous legitimate ones;
(iii) classification of the new transactions as legitimate or fraudulent through an algorithm based on the previous comparison process.
The paper is organized as follows: Section 2 introduces the background and related work of the fraud detection scenario; Section 3 reports the formal notation adopted in this paper and defines the faced problem; Section 4 gives all the details about our approach; Section 5 describes the experimental environment, the used dataset and metrics, the adopted strategy and the competitor approach, ending with the presentation of the experimental results; Section 6 provides some concluding remarks and future work.
2 Background and Related Work
Fraud Detection Techniques: the strategy adopted by fraud detection systems can be of two types: supervised or unsupervised [2]. A system that follows the supervised strategy uses the previous fraudulent and non-fraudulent transactions in order to define its evaluation model. This strategy needs a set of examples of both classes of transactions, and its effectiveness is usually restricted to the recognition of patterns already present in the training set. A system that follows the unsupervised strategy analyzes the new transactions with the aim of detecting anomalous values in their features, where by anomaly we mean a value outside the range of values assumed by that feature in the set of previous legitimate cases.
The static approach [3] represents the most common way to detect fraudulent transactions related to a credit card activity. Following this approach, the data stream is divided into blocks of equal size and the model is trained by using only a limited number of initial and contiguous blocks. Differently from the static approach, the updating approach [4] updates its model at each new block, performing this activity by using a certain number of the latest contiguous blocks. A forgetting approach [5] can also be followed: in this case the model is updated when a new block appears, performing this operation by using all the previous fraudulent transactions, but only the legitimate transactions present in the last two blocks. The models defined on the basis of these approaches can be used individually, or they can be aggregated in order to define a larger evaluation model. Some of the problems related to the aforementioned approaches are the inability to model the users' behavior (static approach), the inability to manage small classes of data (updating approach), and the computational complexity (forgetting approach), in addition to the common issues described in the following.
Open Problems: a series of problems, reported below, makes the work of researchers operating in this field harder.
(i) Lack of public real-world datasets: this happens for several reasons, the first of them being the restrictive policies adopted by commercial operators, aimed at not revealing information about their business for privacy, competition, or legal issues [6].
(ii) Non-adaptability: caused by the inability of the evaluation models to correctly classify new transactions whose patterns differ from those used during the model training [7].
(iii) Data heterogeneity: this problem is related to the incompatibility between similar features, resulting in the same data being represented differently in different datasets [8].
(iv) Unbalanced distribution of data: it is certainly the most important issue [9], and it happens because the information available to train the evaluation models is usually composed of a large number of legitimate cases and a small number of fraudulent ones, resulting in a data configuration that reduces the effectiveness of the classification approaches.
(v) Cold-start: another problem is related to those scenarios where the data used to train the evaluation model does not contain enough information on the domain taken into account, leading to the definition of unreliable models [10]. Basically, this happens when the data available for the model training does not contain representative examples of all classes of information.
Proposed Approach: the core idea of this work is to move the evaluation process from the canonical domain to a new domain by exploiting the Discrete Wavelet Transform (DWT) [11]. In more detail, we use the DWT process in a time series data mining context, where a time series usually refers to a sequence of values acquired by measuring the variation in time of a specific data type (e.g., temperature, amplitude, etc.).
The DWT process transforms a time series by exploiting a set of functions named wavelets [12], and in the literature it is usually performed in order to reduce the data size or the data noise (e.g., in image compression and filtering tasks). The time-scale multiresolution offered by the DWT allows us to observe the original time series from different points of view, each of them containing interesting information on the original data. The capability, in the new domain, to observe the data at multiple scales (multiple resolution levels) allows our approach to define a more stable and representative model of the transactions with respect to the canonical state-of-the-art approaches.
In our approach, we define a time series as the sequence of values assumed by the features of a credit card transaction, the frequency as the number of occurrences of a value in a time series over a unit of time, and the scale as the time interval that characterizes a time series.
Formally, a Continuous Wavelet Transform (CWT) is defined as shown in Equation 1, where $\psi(t)$ represents a continuous function in both the time and frequency domains (called mother wavelet) and $*$ denotes the complex conjugate.

$$X_w(a,b) = \frac{1}{|a|^{1/2}} \int_{-\infty}^{+\infty} x(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt \qquad (1)$$
Given the impossibility of analyzing the data by using all the wavelet coefficients, it is usually sufficient to consider a discrete subset of the upper half-plane in order to reconstruct the data from the corresponding wavelet coefficients. The considered discrete subset of the half-plane consists of all the points $(a^m, n\,a^m b)$, with $m, n \in \mathbb{Z}$, and this allows us to define the so-called child wavelets as shown in Equation 2.

$$\psi_{m,n}(t) = \frac{1}{\sqrt{a^m}}\, \psi\!\left(\frac{t - n\,b\,a^m}{a^m}\right) \qquad (2)$$
The use of small scales (i.e., corresponding to large frequencies, since the scale is given by $\frac{1}{frequency}$) compresses the data, giving us an overview of the involved information, while large scales (i.e., low frequencies) expand the data, offering a detailed analysis of the information. Although the characteristics of the wavelet transform make it possible to use many basis functions as mother wavelet (e.g., Daubechies, Meyer, Symlets, Coiflets, etc.), for the scope of our approach we decided to use one of the simplest and oldest wavelet formalizations, the Haar wavelet [13]. It is shown in Equation 3 and it allows us to measure the contrast directly from the responses of the low and high frequency sub-bands.
$$\psi(t) = \begin{cases} 1, & 0 \le t < \frac{1}{2} \\ -1, & \frac{1}{2} \le t < 1 \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
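To make the role of the two Haar sub-bands concrete, the following minimal sketch decomposes a short series of feature values. It is only illustrative: it uses Python with NumPy and PyWavelets instead of the Java/JWave implementation described in Section 5, and the input values are invented.

import numpy as np
import pywt

# Toy feature values of a single transaction, treated as a short time series.
x = np.array([5.0, 7.0, 6.0, 8.0, 1.0, 2.0, 1.5, 2.5])

# Single-level Haar decomposition: cA is the low-frequency (approximation)
# sub-band, cD the high-frequency (detail) sub-band holding local contrasts.
cA, cD = pywt.dwt(x, 'haar')
print(cA)  # smoothed overview of the series
print(cD)  # differences between adjacent values (contrast)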
Competitor Approach: considering that the most effective fraud detection approaches in the literature need both fraudulent and legitimate examples to train their model, we have chosen not to compare our approach to many of them, limiting the comparison to only one of the most used and effective ones, Random Forests [14]. Our intention is to demonstrate the capability of the proposed approach to define an effective evaluation model by using a single class of transactions, overcoming some well-known issues.
Random Forests represents one of the most effective state-of-the-art approaches, since in most of the cases reported in the literature it outperforms the other ones in this particular field [15, 16]. It is an ensemble learning method for classification and regression based on the construction of a number of randomized decision trees during the training phase, with the final classification inferred by aggregating the results of the individual trees.
3 Notation and Problem Definition
Given a set of classified transactions $T = \{t_1, t_2, \dots, t_N\}$, and a set of features $V = \{v_1, v_2, \dots, v_M\}$ that compose each $t \in T$, we denote as $T^+ = \{t_1, t_2, \dots, t_K\}$ the subset of legitimate transactions (hence $T^+ \subseteq T$), and as $T^- = \{t_1, t_2, \dots, t_J\}$ the subset of fraudulent ones (hence $T^- \subseteq T$). We also denote as $\hat{T} = \{\hat{t}_1, \hat{t}_2, \dots, \hat{t}_U\}$ a set of unevaluated transactions. It should be observed that a transaction can only belong to one class $c \in C$, where $C = \{legitimate, fraudulent\}$. Finally, we denote as $F = \{f_1, f_2, \dots, f_X\}$ the output of the DWT process.
Denoting as $\Xi$ the process of comparison between the DWT output of the time series in the set $T^+$ (i.e., the sequences of feature values in the previous legitimate transactions) and the DWT output of the time series related to the unevaluated transactions in the set $\hat{T}$ (processed one at a time), the objective of our approach is the classification of each transaction $\hat{t} \in \hat{T}$ as legitimate or fraudulent. Defining a function $Evaluation(\hat{t}, \Xi)$ that performs this operation based on our approach, returning a boolean value $\beta$ (0 = misclassification, 1 = correct classification) for each classification, we can formalize our objective function (Equation 4) in terms of maximization of the sum of the results.
$$\max_{0 \le \beta \le |\hat{T}|} \beta = \sum_{u=1}^{|\hat{T}|} Evaluation(\hat{t}_u, \Xi) \qquad (4)$$
4 Proposed Approach
Step 1 of 3 - Data Definition: a time series is a series of events acquired during a certain period of time, where each of these events is characterized by a value. The set composed of all the acquisitions refers to a single variable, since it contains data of the same type. In our approach we consider as time series ($ts$) the sequences of values assumed by the features $v \in V$ in the sets $T^+$ (previous legitimate transactions) and $\hat{T}$ (unevaluated transactions), as shown in Equation 5.
$$T^+ = \begin{bmatrix} v_{1,1} & v_{1,2} & \dots & v_{1,M} \\ v_{2,1} & v_{2,2} & \dots & v_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ v_{K,1} & v_{K,2} & \dots & v_{K,M} \end{bmatrix} \qquad \hat{T} = \begin{bmatrix} v_{1,1} & v_{1,2} & \dots & v_{1,M} \\ v_{2,1} & v_{2,2} & \dots & v_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ v_{U,1} & v_{U,2} & \dots & v_{U,M} \end{bmatrix}$$

$$ts(T^+) = (v_{1,1}, v_{1,2}, \dots, v_{1,M}), (v_{2,1}, v_{2,2}, \dots, v_{2,M}), \dots, (v_{K,1}, v_{K,2}, \dots, v_{K,M})$$
$$ts(\hat{T}) = (v_{1,1}, v_{1,2}, \dots, v_{1,M}), (v_{2,1}, v_{2,2}, \dots, v_{2,M}), \dots, (v_{U,1}, v_{U,2}, \dots, v_{U,M}) \qquad (5)$$
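As a concrete reading of Equation 5, the sketch below extracts the per-transaction time series from toy feature matrices. It is only a sketch in Python/NumPy (not the paper's Java code), and the feature values are invented.

import numpy as np

# Hypothetical feature matrices: K legitimate transactions (T_plus) and
# U unevaluated ones (T_hat), each row holding the M feature values.
T_plus = np.array([[12.5, 1.0, 3.0, 0.0],
                   [11.0, 1.0, 2.5, 0.0],
                   [13.2, 2.0, 3.1, 1.0]])
T_hat  = np.array([[12.0, 1.0, 2.8, 0.0]])

def ts(transactions):
    # One time series per transaction: the ordered sequence of its feature
    # values, i.e., one tuple per row of Equation 5.
    return [tuple(row) for row in transactions]

print(ts(T_plus))
print(ts(T_hat))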
Step 2 of 3 - Data Processing: the time series previously defined are here used as input of the DWT process. Without going deeply into the formal properties of the wavelet transform, we want to exploit the following two:
(i) Dimensionality reduction: the DWT process can reduce the time series data, since its orthonormal transformation reduces their dimensionality, providing a compact representation that preserves the original information in its coefficients. By exploiting this property, a fraud detection system can reduce the computational complexity of the involved processes;
(ii) Multiresolution analysis: the DWT process allows us to define separate time series on the basis of the original one, distributing the information among them in terms of wavelet coefficients. The orthonormal transformation carried out by the DWT preserves the original information, allowing us to return to the original data representation. A fraud detection system can exploit this property in order to detect rapid changes in the data under analysis, observing the data series from two different points of view, one approximated and one detailed. The first provides an overview of the data, while the second provides useful information for evaluating the data changes.
Our approach exploits both the aforementioned properties, transforming the time series through the Haar wavelet process. The approximation coefficients at level $\frac{N}{2}$ were preferred to a more detailed representation in order to define a more stable evaluation model, less influenced by the data heterogeneity.
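A minimal sketch of this processing step is given below, again in Python with PyWavelets rather than the Java/JWave implementation of Section 5. Reading the level $\frac{N}{2}$ as half of the maximum admissible decomposition level is our assumption, as is the helper name dwt_features.

import numpy as np
import pywt

def dwt_features(series, wavelet='haar'):
    # Approximation (low-frequency) coefficients of the series at a coarse
    # level, taken here as half of the maximum admissible level (assumed
    # reading of the paper's "level N/2").
    series = np.asarray(series, dtype=float)
    max_level = pywt.dwt_max_level(len(series), pywt.Wavelet(wavelet).dec_len)
    level = max(1, max_level // 2)
    coeffs = pywt.wavedec(series, wavelet, level=level)
    return coeffs[0]  # coeffs[0] is the approximation sub-band at that level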
Step 3 of 3 - Data Classification: a new transaction $\hat{t} \in \hat{T}$ is evaluated by comparing the output of the DWT process applied to each time series extracted from the set $T^+$ (previous legitimate transactions) with the output of the same process applied to the time series of the transaction $\hat{t}$ under evaluation.
The comparison is performed in terms of cosine similarity between the output vectors (i.e., the values in the set $F$), as shown in Equation 6, where $\Delta$ is the similarity, $\alpha$ is a threshold experimentally defined, and $c$ is the resulting classification. We repeat this process for each transaction $t \in T^+$, evaluating the classification of the transaction $\hat{t}$ on the basis of the average of all the comparisons.

$$\Delta = Cosim(F(t), F(\hat{t})), \quad \text{with } c = \begin{cases} \Delta \ge \alpha, & legitimate \\ \Delta < \alpha, & fraudulent \end{cases} \qquad (6)$$
Algorithm 1 takes as input the past legitimate transactions in $T^+$, the transaction $\hat{t}$ to evaluate, and the threshold $\alpha$, returning as output a boolean value that indicates the classification of $\hat{t}$ (i.e., true = legitimate, false = fraudulent).
Algorithm 1 Transaction evaluation
Require: $T^+$ = previous legitimate transactions, $\hat{t}$ = unevaluated transaction, $\alpha$ = threshold
Ensure: $\beta$ = classification of the transaction $\hat{t}$
1: procedure transactionEvaluation($T^+$, $\hat{t}$, $\alpha$)
2:   $ts_1 \leftarrow$ getTimeseries($\hat{t}$)
3:   $sp_1 \leftarrow$ getDWT($ts_1$)
4:   for each $t$ in $T^+$ do
5:     $ts_2 \leftarrow$ getTimeseries($t$)
6:     $sp_2 \leftarrow$ getDWT($ts_2$)
7:     $cos \leftarrow cos +$ getCosineSimilarity($sp_1$, $sp_2$)
8:   end for
9:   $avg \leftarrow cos\, /\, |T^+|$
10:  if $avg > \alpha$ then $\beta \leftarrow$ true else $\beta \leftarrow$ false
11:  return $\beta$
12: end procedure
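For readers who prefer an executable form, the sketch below mirrors Algorithm 1 in Python (NumPy and PyWavelets), again as an illustration of the procedure rather than the authors' Java implementation; the fixed decomposition level and the helper names are assumptions.

import numpy as np
import pywt

def get_dwt(series, level=1):
    # Haar approximation sub-band of the transaction time series.
    return pywt.wavedec(np.asarray(series, dtype=float), 'haar', level=level)[0]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def transaction_evaluation(T_plus, t_hat, alpha):
    # Algorithm 1: average cosine similarity between the DWT output of the
    # unevaluated transaction and the DWT outputs of all legitimate ones;
    # the transaction is classified as legitimate when the average exceeds alpha.
    sp1 = get_dwt(t_hat)
    avg = np.mean([cosine_similarity(sp1, get_dwt(t)) for t in T_plus])
    return avg > alpha  # True = legitimate, False = fraudulent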
5 Experiments
5.1 Environment
The proposed approach was developed in Java, by using the JWave2 library for the Discrete Wavelet Transform. The competitor approach (i.e., Random Forests) and the metrics used for its evaluation have been implemented in R3, by using the randomForest, DMwR, and ROCR packages. For reproducibility reasons, the R function set.seed() has been used, and the Random Forests parameters were tuned by finding those that maximize the performance. Statistical differences between the results were assessed by independent-samples two-tailed Student's t-tests (p < 0.05).
5.2 Dataset
The public real-world dataset used for the evaluation of the proposed approach is related to a series of credit card transactions made by European cardholders4 over two days of September 2013, for a total of 492 frauds out of 284,807 transactions. It is a highly unbalanced dataset [17], since the fraudulent cases represent only 0.172% of all the transactions.
For confidentiality reasons, all dataset fields have been made public in anonymized form, except the time, the amount, and the classification ones.
5.3 Metrics
Cosine Similarity: it measures the similarity ($Cosim$) between two non-zero vectors $v_1$ and $v_2$ in terms of the cosine of the angle between them, as shown in Equation 7. It allows us to evaluate the similarity between the vectors of values returned by the DWT processes.

$$Cosim(v_1, v_2) = \cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \cdot \|v_2\|} \qquad (7)$$
F-score: it is the harmonic mean of the Precision and Recall metrics, a largely used metric in the statistical analysis of binary classification that returns a value in the range [0,1], where 0 is the worst value and 1 the best one. More formally, given two sets $T^{(P)}$ and $T^{(R)}$, where $T^{(P)}$ denotes the set of performed classifications of transactions and $T^{(R)}$ the set that contains their actual classifications, it is defined as shown in Equation 8.

$$\text{F-score}(T^{(P)}, T^{(R)}) = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$
$$\text{with} \quad Precision(T^{(P)}, T^{(R)}) = \frac{|T^{(R)} \cap T^{(P)}|}{|T^{(P)}|}, \qquad Recall(T^{(P)}, T^{(R)}) = \frac{|T^{(R)} \cap T^{(P)}|}{|T^{(R)}|} \qquad (8)$$
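As a small illustration of Equation 8, the snippet below computes the F-score from two sets of transaction identifiers, reading $T^{(P)}$ as the transactions predicted to be in the positive class and $T^{(R)}$ as those actually in it (our reading of the notation; the identifiers are invented).

def f_score(predicted, actual):
    # predicted, actual: Python sets of transaction ids (assumed non-empty,
    # with a non-empty intersection, to avoid division by zero).
    tp = len(predicted & actual)
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)

print(f_score({1, 2, 3, 5}, {2, 3, 4, 5}))  # 0.75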
AUC: the Area Under the Receiver Operating Characteristic curve (AUC) is a performance measure used to evaluate the predictive power of a classification model. Its result is in the range [0,1], where 1 indicates the best performance. More formally, given the subsets of previous legitimate transactions $T^+$ and previous fraudulent ones $T^-$, its formalization is reported in Equation 9, where $\Theta$ indicates the comparison function applied to pairs of transactions from the two subsets $T^+$ and $T^-$; the result is obtained by averaging over all possible comparisons.
2 https://github.com/cscheiblich/JWave/
3 https://www.r-project.org/
4 https://www.kaggle.com/dalpozz/creditcardfraud
$$\Theta(t^+, t^-) = \begin{cases} 1, & \text{if } t^+ > t^- \\ 0.5, & \text{if } t^+ = t^- \\ 0, & \text{if } t^+ < t^- \end{cases} \qquad AUC = \frac{1}{|T^+| \cdot |T^-|} \sum_{1}^{|T^+|} \sum_{1}^{|T^-|} \Theta(t^+, t^-) \qquad (9)$$
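A direct transcription of Equation 9 is sketched below: it averages the pairwise comparison function $\Theta$ over all (legitimate, fraudulent) pairs of model scores. Treating the score assigned to each transaction as the quantity being compared, with higher scores for the legitimate class, is our assumption about the orientation of the comparison.

def auc_pairwise(scores_legit, scores_fraud):
    # scores_legit, scores_fraud: lists of model scores for the transactions
    # of T+ and T- respectively (toy values below).
    total = 0.0
    for s_pos in scores_legit:
        for s_neg in scores_fraud:
            if s_pos > s_neg:
                total += 1.0
            elif s_pos == s_neg:
                total += 0.5
    return total / (len(scores_legit) * len(scores_fraud))

print(auc_pairwise([0.9, 0.8, 0.7], [0.6, 0.8]))  # 0.75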
5.4 Strategy
Cross-validation: in order to improve the reliability of the obtained results and reduce the impact of data dependency, the experiments followed a k-fold cross-validation criterion, with k=10: the dataset is divided into k subsets, each subset is used once as the test set while the other k-1 subsets are used as the training set, and the final result is given by the average of the k results.
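A compact sketch of this validation loop is shown below, using scikit-learn's KFold purely for illustration (the paper does not state which tooling was used for the splits); evaluate_fold is a hypothetical callback that trains on the training folds and returns the metric on the test fold.

import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, evaluate_fold, k=10, seed=0):
    # X: transaction features, y: labels; the metric returned by
    # evaluate_fold is averaged over the k folds.
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        scores.append(evaluate_fold(X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
    return float(np.mean(scores))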
Threshold Tuning: according to Algorithm 1, we need to define the optimal value of the $\alpha$ parameter, since the classification process depends on it (Equation 6). It is the average value of the cosine similarity calculated between all the pairs of legitimate transactions in the set $T^+$ ($\alpha = 0.91$ in our case).
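The following sketch shows one way to obtain $\alpha$ as described above: the average cosine similarity over all pairs of DWT outputs of the legitimate transactions. It is an illustrative Python/NumPy sketch, and dwt_outputs is assumed to be the list of vectors produced in Step 2.

import numpy as np
from itertools import combinations

def tune_alpha(dwt_outputs):
    # dwt_outputs: list of DWT output vectors, one per legitimate transaction.
    sims = []
    for a, b in combinations(dwt_outputs, 2):
        sims.append(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))
    return float(np.mean(sims))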
5.5 Competitor
The state-of-the-art approach chosen as our competitor is Random Forests. It was implemented in the R language by using the randomForest and DMwR packages. The DMwR package was used to face the class imbalance problem through the Synthetic Minority Over-sampling Technique (SMOTE) [18], a popular sampling method that creates new synthetic data by randomly interpolating pairs of nearest neighbors.
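For completeness, an equivalent competitor pipeline is sketched below in Python with scikit-learn and imbalanced-learn; this is a substitution for illustration only, since the paper's competitor was implemented in R with the randomForest and DMwR packages, and the parameter values here are not the tuned ones used in the experiments.

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

def train_competitor(X_train, y_train, random_state=0):
    # Rebalance the training set with SMOTE, then fit a Random Forest.
    X_bal, y_bal = SMOTE(random_state=random_state).fit_resample(X_train, y_train)
    clf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    clf.fit(X_bal, y_bal)
    return clf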
5.6 Results
Analyzing the experimental results, we can make the following considerations:
(i) the first set of experiments, whose results are shown in Figure 1.a, was focused on the evaluation of our approach (denoted as WT) in terms of F-score. We can observe that it achieves performance close to that of its competitor Random Forests, despite the adoption of a proactive strategy (i.e., not using previous fraudulent transactions during the model training), demonstrating its ability to define an effective model by exploiting only one class of transactions (i.e., the legitimate one);
(ii) the second set of experiments, whose results are shown in Figure 1.b, was instead aimed at evaluating the performance of our approach in terms of AUC. This metric measures the predictive power of a classification model.
[Fig. 1. F-score and AUC performance: (a) F-score: RF 0.95, WT 0.92; (b) AUC: RF 0.98, WT 0.78]
The results indicate that our approach, also in this case, offers performance levels close to those of its competitor RF, while not using previous fraudulent cases to define its model;
(iii) summarizing all the results, the first consideration that arises is related to the capability of our approach to face the data imbalance and cold-start problems by adopting a proactive strategy that needs only one transaction class for the model definition. A last but not least important consideration is that such proactivity allows a fraud detection system to operate without needing previous examples of fraudulent cases, with all the advantages that derive from this.
6 Conclusions and Future Work
Nowadays, credit cards represent an irreplaceable instrument of payment, and this scenario obviously leads to an increase in the related fraud cases, making it necessary to design effective fraud detection techniques.
Instead of aiming to outperform the existing state-of-the-art approaches, with this paper we want to demonstrate that, through a new data representation, it is possible to design a fraud detection system that operates without the need of previous fraudulent examples. The goal was to prove that our evaluation model, defined by using a single class of transactions, is able to offer a level of performance similar to that of one of the best state-of-the-art approaches based on a model defined by using all classes of transactions (i.e., Random Forests), overcoming some important issues such as data imbalance and cold-start.
We can consider the obtained results to be very interesting, given that our competitor, in addition to using both classes of transactions to train its model, adopts a data balancing mechanism (i.e., SMOTE).
For the aforementioned reasons, future work will focus on the definition of a hybrid fraud detection approach able to combine the advantages of the non-proactive state-of-the-art approaches with those of our proactive alternative.
Acknowledgments. This research is partially funded by Regione Sardegna under the project Next generation Open Mobile Apps Development (NOMAD), Pacchetti Integrati di Agevolazione (PIA) - Industria Artigianato e Servizi (2013).
References
1. Chaovalit, P., Gangopadhyay, A., Karabatis, G., Chen, Z.: Discrete wavelet
transform-based time series analysis and mining. ACM Comput. Surv. 43(2) (2011)
6:1–6:37
2. Bolton, R.J., Hand, D.J.: Statistical fraud detection: A review. Statistical Science
(2002) 235–249
3. Pozzolo, A.D., Caelen, O., Borgne, Y.L., Waterschoot, S., Bontempi, G.: Learned
lessons in credit card fraud detection from a practitioner perspective. Expert Syst.
Appl. 41(10) (2014) 4915–4928
4. Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using
ensemble classifiers. In Getoor, L., Senator, T.E., Domingos, P.M., Faloutsos,
C., eds.: Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, Washington, DC, USA, August 24 - 27,
2003, ACM (2003) 226–235
5. Gao, J., Fan, W., Han, J., Yu, P.S.: A general framework for mining concept-
drifting data streams with skewed distributions. In: Proceedings of the Seventh
SIAM International Conference on Data Mining, April 26-28, 2007, Minneapolis,
Minnesota, USA, SIAM (2007) 3–14
6. Phua, C., Lee, V., Smith, K., Gayler, R.: A comprehensive survey of data mining-
based fraud detection research. (2010)
7. Sorournejad, S., Zojaji, Z., Atani, R.E., Monadjemi, A.H.: A survey of credit
card fraud detection techniques: Data and technique oriented perspective. CoRR
abs/1611.06439 (2016)
8. Chatterjee, A., Segev, A.: Data manipulation in heterogeneous databases. ACM
SIGMOD Record 20(4) (1991) 64–68
9. Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study.
Intell. Data Anal. 6(5) (2002) 429–449
10. Donmez, P., Carbonell, J.G., Bennett, P.N.: Dual strategy active learning. In:
ECML. Volume 4701 of Lecture Notes in Computer Science., Springer (2007) 116–
127
11. Chernick, M.R.: Wavelet methods for time series analysis. Technometrics 43(4)
(2001) 491
12. Percival, D.B., Walden, A.T.: Wavelet methods for time series analysis. Volume 4.
Cambridge university press (2006)
13. Mallat, S.: A theory for multiresolution signal decomposition: The wavelet repre-
sentation. IEEE Trans. Pattern Anal. Mach. Intell. 11(7) (1989) 674–693
14. Breiman, L.: Random forests. Machine Learning 45(1) (2001) 5–32
15. Lessmann, S., Baesens, B., Seow, H., Thomas, L.C.: Benchmarking state-of-the-
art classification algorithms for credit scoring: An update of research. European
Journal of Operational Research 247(1) (2015) 124–136
16. Brown, I., Mues, C.: An experimental comparison of classification algorithms for
imbalanced credit scoring data sets. Expert Syst. Appl. 39(3) (2012) 3446–3453
17. Dal Pozzolo, A., Caelen, O., Johnson, R.A., Bontempi, G.: Calibrating probability
with undersampling for unbalanced classification. In: Computational Intelligence,
2015 IEEE Symposium Series on, IEEE (2015) 159–166
18. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic
minority over-sampling technique. Journal of artificial intelligence research 16
(2002) 321–357