Combining Unsupervised and Supervised Learning in
Credit Card Fraud Detection
Fabrizio Carcillo (a), Yann-Aël Le Borgne (a), Olivier Caelen (b), Yacine Kessaci (b),
Frédéric Oblé (b), Gianluca Bontempi (a)
(a) Machine Learning Group, Computer Science Department, Faculty of Sciences,
Université Libre de Bruxelles, Brussels, Belgium.
(email: {fcarcill, yleborgn, gbonte}@ulb.ac.be)
(b) R&D Worldline, Worldline, France.
(email: {olivier.caelen, yacine.kessaci, frederic.oble}@worldline.com).
Abstract
Supervised learning techniques are widely employed in credit card fraud de-
tection, as they make use of the assumption that fraudulent patterns can be
learned from an analysis of past transactions. The task becomes challeng-
ing, however, when it has to take account of changes in customer behavior
and fraudsters’ ability to invent novel fraud patterns. In this context, un-
supervised learning techniques can help the fraud detection systems to find
anomalies. In this paper we present a hybrid technique that combines su-
pervised and unsupervised techniques to improve the fraud detection accu-
racy. Unsupervised outlier scores, computed at different levels of granularity,
are compared and tested on a real, annotated, credit card fraud detection
dataset. Experimental results show that the combination is efficient and does
indeed improve the accuracy of the detection.
Keywords: Fraud Detection, Ensemble Learning, Outlier Detection,
Semi-supervised Learning, Contextual Outlier Detection
1. Introduction
Credit card fraud detection aims to decide whether or not a transaction
is fraudulent on the basis of historical data. The decision is notoriously
difficult because of changes in customer spending behaviors, for example
during holiday periods, and in fraudsters' own techniques, particularly those
that they use to adapt to fraud detection techniques. It is now known that
machine learning techniques offer an effective approach to tackling problems
like these [12].
[Preprint submitted to Special Issue on Business Analytics: Emerging Trends and Challenges. April 13, 2019.]
A typical Fraud Detection System (FDS) includes multiple layers of con-
trol, each of which can either be automated or supervised by humans [6, 11].
Part of the automated layer embraces machine learning algorithms that build
predictive models based on annotated transactions. In the last decade, in-
tensive machine learning research for credit card fraud detection has led
to the development of supervised, unsupervised, and semi-supervised tech-
niques [27, 28]. In our previous works on credit card fraud detection, we
investigated supervised [6, 11], unsupervised [8], and semi-supervised tech-
niques [8, 7].
Supervised techniques rely on the set of past transactions for which the
label (also referred to as outcome or class) of the transaction is known. In
credit card fraud detection problems, the label is either genuine (the trans-
action was made by the cardholder) or fraudulent (the transaction was made
by a fraudster). The label is usually known a posteriori, either because a
customer complained or as a result of an investigation by the credit card
company. Supervised techniques make use of labeled past transactions to
learn a fraud prediction model, which returns, for any new transaction, the
probability of it being a fraud. However, not all labels are available immedi-
ately [6, 13].
Unsupervised outlier detection techniques do not require knowledge of
the label of transactions, and aim at characterizing the data distribution of
transactions. They rely on the assumption that outliers of the transaction
distribution are frauds; they can therefore be used to detect unseen types
of frauds because they do not rely on transactions labeled fraudulent in the
past. It is worth noting that their use also extends to clustering and com-
pression algorithms [10]. Clustering allows the identification of separate data
distributions for which different predictive models should be used, while com-
pression reduces the dimensionality of the learning problem; both algorithms
tend to improve the performances of supervised techniques.
The two approaches are complementary: supervised techniques learn from
past fraudulent behaviors, while unsupervised techniques target the detection
of new types of fraud. These two complementary approaches are combined
in the semi-supervised techniques [8, 37] often used when there are many
unlabeled data points and few labeled ones. They aim to perform better than
a supervised model that uses only the dataset of the few available labeled
data points, or an unsupervised model that does not profit from the few
labels.
This paper concerns the integration of unsupervised techniques with su-
pervised credit card fraud detection classifiers.
In particular we present a number of criteria to compute outlier scores at
different levels of granularity (from highly granular card-specific scores to
low-granularity global scores) and we assess their added value in terms of
accuracy once integrated as features in a supervised learning strategy. As
discussed in Section 2, the combination of unsupervised and supervised
learning is not new in the literature. In particular, our global approach,
as we refer to it, is inspired
by the best-of-both-worlds principle proposed by Michenkova et al. in [24].
What is original in this paper is the adoption of this principle in a credit
card fraud detection setting and specifically the design and assessment of
several outlier scores adapted to the specific nature of our problem. Section 3
introduces the standard unsupervised outlier scores used in the experimental
section (Section 3.1), three original approaches to consider different levels of
granularity when computing the outlier scores (Section 3.2), and the metrics
used to compare the different approaches (Section 3.3). The experimental
comparison is performed in Section 4, while the discussion and conclusion
are presented in Section 5 and Section 6.
2. The state-of-the-art
The use of ensemble learning is popular in the supervised learning com-
munity for such techniques as boosting [15], bagging [4]; it is also common in
unsupervised outlier detection, where ensemble strategies improve the esti-
mation of the outlier scores [25, 38]. Sequential [26] and parallel [32] ensemble
strategies have also been proposed to combine supervised and unsupervised
outlier-detection algorithms.
The integration of supervised and unsupervised techniques has already
been discussed in the fraud detection literature. In [32], Veeramachaneni
et al. introduce the AI² system, which concatenates results from the anomaly
detection approach with those from the supervised learning approach. The
analysis in [32] begins with the concurrent use of a supervised model
(Random Forest) and an ensemble of unsupervised models. The results from
these models are then merged by selecting the top n/2 results from the
supervised model and the top n/2 results from the unsupervised ensemble.
Note that this method requires a strategy to combine the scores deriving
from different outlier detection methods and to manage the observations
common to both subsets (n/2 unsupervised and n/2 supervised outputs). To
tackle this issue, the authors propose to project the different scores into
the same space, for example by normalizing the scores to the [0,1] interval.
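As a rough illustration (not the authors' implementation, and with hypothetical function names), the normalize-and-merge step described above can be sketched as follows:

```python
import numpy as np

def minmax_normalize(scores):
    """Project raw outlier scores onto the [0, 1] interval."""
    s = np.asarray(scores, dtype=float)
    lo, hi = s.min(), s.max()
    return np.zeros_like(s) if hi == lo else (s - lo) / (hi - lo)

def combine_top_n(supervised_scores, unsupervised_scores, n):
    """Merge alerts: the top n/2 indices from each model, duplicates kept once."""
    sup = minmax_normalize(supervised_scores)
    uns = minmax_normalize(unsupervised_scores)
    top_sup = np.argsort(sup)[::-1][: n // 2]
    top_uns = np.argsort(uns)[::-1][: n // 2]
    # A transaction flagged by both models appears only once in the alert set.
    return sorted(set(top_sup) | set(top_uns))
```

Normalization makes the two score distributions comparable before the two top-n/2 lists are merged; observations selected by both models are de-duplicated.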
In [35], Yamanishi and Takeuchi developed a two-stage online outlier
detection algorithm based on unsupervised learning. In the first step, the
algorithm trains a Gaussian mixture model to score an unsupervised dataset
and imputes it by giving positive labels to highly scored data. In the second
step, the labeled dataset is used to learn a supervised outlier detector.
The best-of-both-worlds principle is a sequential approach proposed by
Michenkova et al. in [24]. This team applied multiple unsupervised outlier
detection algorithms to transform an initial dataset using a collection of
outlier scores. This sequential approach includes unsupervised learning in
the first stage and supervised learning in the second. The outlier score vector
s_o, obtained by the unsupervised model over the original dataset DS, is used
to augment DS: DS' = (DS, s_o). Next, the team compared the results
in terms of AUC-ROC using a logistic regression model in three different
settings: original dataset alone, outlier scores alone, and original dataset +
outlier scores. Using two datasets, they showed that the classifier improves
its accuracy when it uses outlier scores in addition to standard features. The
goal of adding multiple outlier scores to standard features is to highlight
the different aspects of feature space outlierness. The key advantage of this
approach is that through it, we do not need to normalize or combine scores
generated by heterogeneous methods. Furthermore, the supervised method
is expected to automatically extract information from these scores.
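A minimal sketch of this augmentation, DS' = (DS, s_o), using scikit-learn's IsolationForest as one possible unsupervised detector (the function name and the synthetic data are illustrative, not the setup of [24]):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

def augment_with_outlier_scores(X_train, X_test, detectors):
    """Append one outlier-score column per unsupervised detector: DS' = (DS, s_o)."""
    train_cols, test_cols = [X_train], [X_test]
    for det in detectors:
        det.fit(X_train)  # unsupervised first stage
        train_cols.append(det.score_samples(X_train).reshape(-1, 1))
        test_cols.append(det.score_samples(X_test).reshape(-1, 1))
    return np.hstack(train_cols), np.hstack(test_cols)

# Second, supervised stage: a classifier trained on the augmented feature space.
rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(200, 5)), rng.normal(size=(50, 5))
y_tr = rng.integers(0, 2, size=200)
Xa_tr, Xa_te = augment_with_outlier_scores(X_tr, X_te, [IsolationForest(random_state=0)])
clf = LogisticRegression(max_iter=1000).fit(Xa_tr, y_tr)
```

The supervised model is then free to weigh the score columns against the original features, which is why no explicit score normalization is needed.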
Recently, a class of outlier detection algorithms emerged called contextual
outlier detection [10, 22, 29]. This class of algorithms aims to find outliers
given a context. A context is a subset of the original dataset, and it is usually
identified by one or more contextual attributes. Besides contextual attributes,
there are behavioral attributes which are used to identify the outlier score
for each instance. Two instances with exactly the same behavioral attributes
but that are defined in two different contexts may be identified as outlier and
inlier instances, respectively.
3. Our Approach
Given the nature of the fraud detection problem, particularly the one-
to-many relationship between cards and transactions [8, 7], we propose an
extension of the best-of-both-worlds principle introduced by Michenkova et al.
in [24]. This extension consists of the definition of a number of outlier scores
(Section 3.1) with consideration to different levels of granularity (Section 3.2)
and their integration into the supervised approach.
3.1. Outlier scores
An outlier score vector can be generated using different unsupervised
techniques. In this section, we introduce the outlier scores used in our ex-
periments: Z-score,PC-1,PCA-RE-1,IF, and GM-1.
Given a dataset X of f features and N observations, the multivariate
Z-score of a vector x ∈ R^f is

    sum_{i=1}^{f} ((x_i - μ̂_i) / σ̂_i)^2

where μ̂_i and σ̂_i are the sample mean and standard deviation of the ith
feature, respectively.
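For instance, the score above can be computed with NumPy (a straightforward sketch, not the paper's production code):

```python
import numpy as np

def multivariate_z_score(X, x):
    """Sum of squared per-feature standardizations of x against dataset X."""
    mu = X.mean(axis=0)            # sample mean of each feature
    sigma = X.std(axis=0, ddof=1)  # sample standard deviation of each feature
    return float(np.sum(((x - mu) / sigma) ** 2))
```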
Principal Component Analysis (PCA) is another well-known technique for
outlier detection [19, 34] which works by transforming the original dataset X
(after normalization) into a set T = XW of f linearly uncorrelated variables
called principal components. The matrix W is square and its ith column
W_i is the ith eigenvector of X^T X. We consider two scores of a vector x ∈ R^f
based on PCA: the first is the value of the first component

    PC-1 = W_1^T x

and the second

    PCA-RE-1 = ||x - W_1 W_1^T x||

is the reconstruction error obtained by using the first principal component.
Variations of these two scores are denoted by changing the suffix of the
score name. So, PC-2 will denote the second principal component, and PCA-
RE-2 will be the reconstruction error in the case of values reconstructed using
the first two principal components together.
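A possible NumPy sketch of PC-m and PCA-RE-m via an SVD of the centered data (the helper name is ours, x is assumed pre-normalized, and the sign of a principal direction is arbitrary):

```python
import numpy as np

def pca_scores(X, x, m=1):
    """Return (PC-m, PCA-RE-m) for vector x against dataset X."""
    Xc = X - X.mean(axis=0)
    # Columns of W are eigenvectors of X^T X, i.e. the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt.T
    Wm = W[:, :m]                              # first m principal components
    pc = W[:, m - 1] @ x                       # PC-m: projection on component m
    re = np.linalg.norm(x - Wm @ (Wm.T @ x))   # PCA-RE-m: reconstruction error
    return pc, re
```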
The score IF is based on Isolation Forest [23], and uses the length of the
path between the root and the leaves of a random forest [5] as an indicator of
outlierness. Finally, GM-m is given by the density in xof a Gaussian Mixture
(GM) model fit to the dataset, where the suffix mdenotes the number of
mixtures.
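Both scores are available off the shelf in scikit-learn; a sketch on synthetic data (the data and hyper-parameters are illustrative only):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))

# IF: shorter average isolation paths => more anomalous (lower score_samples).
if_model = IsolationForest(random_state=0).fit(X)
if_scores = if_model.score_samples(X)

# GM-m: log-density under a Gaussian mixture with m components (here m = 1).
gm_model = GaussianMixture(n_components=1, random_state=0).fit(X)
gm_scores = gm_model.score_samples(X)
```

In both cases a lower score indicates a more outlying observation.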
3.2. Global, local, and cluster granularity
The approach described by Michenkova et al. in [24] for augmenting
the dataset DS takes outlier scores computed on the whole DS set into
consideration. As cardholder behaviors are very diverse, computing outlier
scores in a global fashion may be a sub-optimal solution. Based on contextual
outlier detection, this section proposes three main approaches to defining
contexts and computing outlier scores at different levels of granularity:
1. Global granularity: All transactions are considered to be samples of a
unique global distribution for which outlier scores can be computed. A
transaction is considered anomalous if it lies outside the overall multi-
variate pattern of the entire set of transactions. This approach is the
closest to the one discussed in [24], since no specificity relating to the
credit-card problem (e.g., the fact that transactions belong to different
customers) is taken into account.
2. Local granularity: The outlier scores are computed on a per-card basis,
and a transaction is considered anomalous only if it abnormally differs
from past transactions carried out by the same card.
3. Cluster granularity: This is a compromise between the two previous
approaches. This method’s rationale is that both the global and the
card-based approaches have intrinsic limitations. It is not realistic to
think that all genuine cardholders behave in the same manner. In
statistical terms, this means that a global approach leads to a biased
estimator of the distribution. At the same time, however, few histor-
ical examples are available at the card level with negative impact on
the quality (i.e. large variance) of the related estimation. The cluster
approach aims to select the optimal aggregation level at which a rea-
sonable trade-off between the bias and the variance of the two extreme
approaches can be reached. Clustering is completed at the card level,
and it is based on a set of features which describe customer behavior,
such as the amount of money spent over the last 24 hours.
Figure 1 is an illustrative example demonstrating the three approaches.
In this example, we consider the amount of money involved in a transaction
to calculate the outlier score. Our goal is to detect the most suspicious
transaction or the most extreme transaction by considering only the amount
of money spent in each transaction.
Figure 1: This illustrative example presents the behaviors of five cardholders (CH_i, i =
1,...,5) over a fixed period of time. The dotted line represents average expenditure, and
the bars represent the amount of money spent in each transaction.
From a global perspective, to detect the most suspicious transaction in a
transaction history, we have to consider the average amount of money that is
spent by the cardholders throughout their transactions (65.4 in the example).
The highest value recorded for cardholder CH2 (185) is the value that
diverges the most from this global average. Accordingly, cardholder CH2
receives an alert regarding this transaction.
In the local approach case, a cardholder’s most suspicious transaction is
determined according to the difference between individual transaction values
and the average amount of money spent by a cardholder in their transaction
history. In this example, cardholder CH3 is alerted, since of all the card-
holders, the greatest difference is detected between one of their transaction
values (110) and their average transaction amount (26.4).
Let us now cluster the cardholders in two groups based on their average
spending amounts. The first cluster will contain cardholders with high av-
erage expenditures (CH1 and CH2), and the second will contain cardholders
with low average expenditures (CH3, CH4, and CH5). The average trans-
action amounts are 99.8 in the high expenditure group and 39.1 in the low
expenditure group. In this case, the cardholder CH2 is again alerted, since
his transaction in the amount of 185 diverges the most from his cluster’s
average amount (99.8).
While running the global and local approaches is fairly straightforward,
the cluster approach requires that some elements, such as the clustering
algorithm and the features used in the cluster metric, be set. We choose to use
the k-means algorithm [18], as it is simple to interpret, it runs quickly on large
datasets, and it offers us the ability to decide a priori an arbitrary number of
clusters to be identified (allowing us to easily control the aggregation level).
The matter of choosing the features on which clustering will be performed
(i.e. the contextual attributes) is not trivial; this decision can have a major
impact on the final accuracy. We considered two different sets of features:
the first describes the cardholder’s behavior and the second summarizes the
cardholder’s personal data. In the first case, we examined cardholders’ aver-
age transaction expenditures and their total numbers of transactions over the
last 24 hours. In the second case we consider the age, the nationality (1), and
the gender of the cardholder. Since considering cardholder behavior leads
to better accuracy, we only present clusters created according to cardholder
behavior features in the experimental part. The principal hyper-parameter
of the k-means algorithm is the knumber of clusters to be created. This
hyper-parameter is also important for our case study, as it defines the outlier
score’s level of granularity. In our experiments, we allow the hyper-parameter
to vary from a minimum low granularity of 10 clusters to a maximum high
granularity of 5000 clusters.
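A sketch of this cluster-granularity step with scikit-learn's KMeans on hypothetical behavioral features (the feature distributions below are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical per-card behavioral features (the contextual attributes):
# [average amount spent over the last 24h, number of transactions over the last 24h].
card_usage = np.column_stack([
    rng.gamma(2.0, 30.0, size=1000),
    rng.poisson(3, size=1000).astype(float),
])

k = 10  # aggregation level: from 10 (coarse) up to 5000 (fine) in the paper
cluster_label = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(card_usage)
```

Each card's cluster label then defines the context within which its outlier scores are computed.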
Algorithm 1 defines the process of outlier score construction at different
levels of granularity. The function Score (row 1) receives a training set, a
test set, and a list of outlier models to be computed as input. The output
is the training and test sets augmented with related outlier scores. When
(1) Note that nationality is not encoded as a categorical variable but as a
continuous variable given by the a priori risk of fraud associated with the
nationality, estimated by the conditional frequency in the training set.
Algorithm 1 Outlier scores at different levels of granularity
Require: gr         granularity: global, local, or cluster
Require: k          number of clusters, if gr == cluster
Require: cardUsage  statistics on the card usage, if gr == cluster
Require: ot         outlier models
Require: Dtr        training set
Require: Dte        test set
 1: function Score(subDtr, subDte, ot)
 2:   subOutDtr ← subDtr
 3:   subOutDte ← subDte
 4:   for t in ot do
 5:     outlierModel ← fit t to subDtr
 6:     trainingScore ← get score of subDtr using outlierModel
 7:     testScore ← get score of subDte using outlierModel
 8:     subOutDtr ← append trainingScore to subOutDtr
 9:     subOutDte ← append testScore to subOutDte
10:   end for
11:   return subOutDtr, subOutDte
12: end function
13: if (gr == "global") then                  ⊳ global granularity
14:   (DOuttr, DOutte) ← Score(Dtr, Dte, ot)
15: end if
16: if (gr == "local") then                   ⊳ local granularity
17:   DOuttr, DOutte ← empty datasets
18:   for card in Dtr do
19:     (subDtr, subDte) ← Score(Dtr[cardID == card], Dte[cardID == card], ot)
20:     DOuttr ← {DOuttr, subDtr}
21:     DOutte ← {DOutte, subDte}
22:   end for
23: end if
24: if (gr == "cluster") then                 ⊳ cluster granularity
25:   DOuttr, DOutte ← empty datasets
26:   clusterLabel ← k-means(cardUsage, k)
27:   Dtr ← {Dtr, clusterLabel}
28:   Dte ← {Dte, clusterLabel}
29:   for i from 1 to k do
30:     (subDtr, subDte) ← Score(Dtr[cluster == i], Dte[cluster == i], ot)
31:     DOuttr ← {DOuttr, subDtr}
32:     DOutte ← {DOutte, subDte}
33:   end for
34: end if
35: DOuttr, DOutte                            ⊳ augmented training and test set
global granularity is considered (row 13), the entire training and test sets are
passed directly to the function Score; in the other two cases, only a portion
(i.e. the one corresponding to the specific card or cluster) is passed to the
function Score. In the local approach, this split is completed at the card
level (row 18). In the cluster approach, the split is completed at the level of
the cluster computed in row 26.
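Algorithm 1 can be approximated in pandas/scikit-learn as follows (a simplified sketch: the column and function names are ours, the cluster case is folded into the same group-by loop as the local case, and cards absent from the test set are assumed filtered out, as in Section 4):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def score(sub_dtr, sub_dte, outlier_models, feats):
    """Rows 1-12: fit each model on the training part, append scores to both sets."""
    sub_dtr, sub_dte = sub_dtr.copy(), sub_dte.copy()
    for name, model in outlier_models.items():
        model.fit(sub_dtr[feats])  # each model is refit per subset
        sub_dtr[name] = model.score_samples(sub_dtr[feats])
        sub_dte[name] = model.score_samples(sub_dte[feats])
    return sub_dtr, sub_dte

def outlier_scores(dtr, dte, outlier_models, feats, gr="global", group_col=None):
    if gr == "global":  # row 13: the entire sets are scored at once
        return score(dtr, dte, outlier_models, feats)
    parts_tr, parts_te = [], []
    # rows 18 and 29: split by card id (local) or by cluster label (cluster)
    for g in dtr[group_col].unique():
        s_tr, s_te = score(dtr[dtr[group_col] == g],
                           dte[dte[group_col] == g],
                           outlier_models, feats)
        parts_tr.append(s_tr)
        parts_te.append(s_te)
    return pd.concat(parts_tr), pd.concat(parts_te)
```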
3.3. Metrics
Several metrics have been proposed in the literature to measure the qual-
ity of the detection. These include i) the Area Under the Receiver Operat-
ing Characteristic Curve (AUC-ROC) [3, 9, 13, 30], ii) the Area Under the
Precision-Recall Curve (AUC-PR) [9, 21], iii) the F-measure [1, 3, 16, 31], iv)
specificity [3, 30], v) the Recall [3, 30, 31], and vi) the Precision [3, 13, 31, 33].
In this work, we will focus on analyzing the metrics that are considered the
most relevant in matters of fraud detection (2): Top-n Precision (i.e. the
Precision associated with the n highest-risk transactions returned by the
algorithm) and AUC-PR. As the ultimate goal of the fraud detection process
is to provide the maximum possible number of true positives within the alerts
issued for fraud investigators, the most pertinent metric is Top-n Precision.
This variant of the Precision metric refers to the ratio of the number of true
positive alerts to the number of total alerts. We set n = 100 in accordance
with our industrial partner, as it is possible for a group of investigators to
analyze this number of suspicious credit cards in a day. The dependency of
Top-n Precision on the value of n has been studied in [3, 33], which showed
that decreasing n increases Precision at the cost of a reduced Recall. The metrics
that refer to the entire test set (and not simply to alerts) are the AUC-ROC
and the AUC-PR. The Area Under the Curve (AUC) is a value in [0,1] that
summarizes the relation between two metrics: the Recall and the False Pos-
itive Rate in the case of AUC-ROC and Precision and Recall in the case of
AUC-PR. The AUC-ROC is equivalent to the probability that a randomly
chosen fraudulent transaction has a score higher than that of a randomly
chosen genuine transaction [17]. Though the two metrics may look similar,
it has been shown that AUC-PR is more effective in cases of high class im-
balance [14]. Furthermore, it is known that optimization of the AUC-ROC
does not guarantee optimization of the AUC-PR (and vice versa) [14].
(2) This choice was made in agreement with our industrial partner Worldline.
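Both metrics can be computed as follows (a sketch on synthetic scores; scikit-learn's `average_precision_score` is a standard estimate of the AUC-PR):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def top_n_precision(y_true, scores, n=100):
    """Fraction of true frauds among the n highest-risk items."""
    top = np.argsort(scores)[::-1][:n]
    return float(np.mean(np.asarray(y_true)[top]))

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.0036).astype(int)  # ~0.36% fraud rate, as in our dataset
s = rng.random(10_000) + y                     # synthetic, informative risk scores
p100 = top_n_precision(y, s, n=100)
auc_pr = average_precision_score(y, s)
```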
4. Experiments
To compute a consistent outlier score for the local approach, we decided
to consider only those cards with histories of at least 10 transactions in the
training set. This threshold was set to preserve statistical accuracy in this
approach, since the computation of a local outlier score (even the simplest)
using fewer transactions would inevitably be affected by large variance, which
would undermine the overall accuracy of the approach. This limitation is not
present in the cluster and global approaches. To ensure a fair comparison
between the three approaches, we also used the same set of cards in the global
and cluster approaches. This is beneficial for the cluster approach, since
it requires a minimum number of transactions to track customer behavior.
Furthermore, we excluded the cards in the test set that were not present in
the training set, since it would not be possible in this case to pre-compute
outlier scores in the training set.
In accordance with the literature, we adopted a random forest model as a
baseline, given its superiority in credit card fraud detection [1, 3, 30, 36]. We
used a particular implementation of the random forest, Balanced Random
Forest (BRF) [6], in which each tree is shaped on a balanced subset of the
original sample (and undersampling is used to balance the two classes). A
single model is trained over the whole training set and tested on 54 days’
worth of data (Static approach [13]). The testing interval begins one week
after the end of the training set period in order to emulate verification la-
tency [6].
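A minimal sketch of the balanced-bootstrap idea behind BRF (not the implementation of [6]): each tree is grown on all frauds plus an equal-sized random undersample of genuine transactions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class BalancedRandomForest:
    """Each tree sees all minority (fraud) cases plus an equal-size
    random undersample of the majority (genuine) class."""

    def __init__(self, n_trees=50, seed=0):
        self.n_trees, self.rng = n_trees, np.random.default_rng(seed)
        self.trees = []

    def fit(self, X, y):
        pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
        for _ in range(self.n_trees):
            sub_neg = self.rng.choice(neg, size=len(pos), replace=False)
            idx = np.concatenate([pos, sub_neg])  # balanced training subset
            tree = DecisionTreeClassifier(
                max_features="sqrt",
                random_state=int(self.rng.integers(1 << 31)))
            self.trees.append(tree.fit(X[idx], y[idx]))
        return self

    def predict_proba_fraud(self, X):
        # Average the per-tree fraud probabilities.
        return np.mean([t.predict_proba(X)[:, 1] for t in self.trees], axis=0)
```

Undersampling inside each bootstrap, rather than once globally, lets the ensemble still see every genuine transaction across trees while keeping each tree's classes balanced.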
The dataset used for the experiments includes information from 334 days
of transactions recorded from February 2 to December 31, 2016. This set was
provided to us by our industrial partner, Worldline, a leading company in
transactional services, and includes 76 million transactions. The percentage
of fraudulent transactions in this dataset is 0.36%. We use transactions
until September 30 for the training data, while those that occurred between
October 8 and December 31 are used as test set data. The week of October 1
to October 7, 2016 represents the verification latency period, and data from
this period is not used for training or testing purposes.
4.1. Baseline
In this first group of experiments, we trained classifiers using datasets
taken from 2, 4, 6, and 8 months prior to the cut-off date (October 1, 2016)
while using the original set of features alone (in a basic set) or with aggregated
Figure 2: Average daily accuracy in a 54-day test set and of different training lengths:
(a) average daily Top100 Card Precision, (b) average daily AUC-PR for fraudulent card
detection, (c) the Top5000 Precision for fraudulent transaction detection, and (d) the
AUC-PR for fraudulent transaction detection.
features (in an extended set). While the basic set includes the raw features
obtained by our industrial partner, Worldline, the extended set includes the
features acquired through feature engineering [2]. Examples of engineered
features include the total sums of money spent by the cardholder and the
number of transactions executed by a cardholder over the last 24 hours, as
we mentioned previously.
We considered two metrics: the Top-n Precision and the AUC-PR. In
Figure 2a, the Top100 Precision is plotted, and the Top5000 Precision is
reported in 2c. There is a major conceptual difference between these two
metrics: the Top100 Precision is computed every day and then averaged
over the number of days in the test period, while the Top5000 Precision
is computed according to the whole test set without considering the daily
label. The same difference exists in the case of the AUC-PR of Figure 2b
and Figure 2d.
A second difference is that the detection approach in Figure 2a and Fig-
ure 2b aims to detect fraudulent cards, while that of Figure 2c and Figure 2d
aims to detect fraudulent transactions. This second metric is typically con-
sidered less relevant, as once a fraudulent transaction is detected, the related
card is typically blocked, and no further transactions from that card can
be considered. According to Figure 2, it appears that, independently of the
considered metric, the larger the training set, the better the accuracy [20].
Furthermore, an extended set of features leads to a greater accuracy than
that of a basic set.
4.2. Global approach
In this section, we use the best performing configuration of the baseline
model (i.e., the configuration trained over an eight-month period using the
extended feature set; see Figure 2). This configuration is compared to a
model trained on the same set but with the additional features derived from
the outlier scores.
In Figure 3, we can see that, in comparison to the baseline model, the
use of global outlier scores does not significantly improve fraud detection. In
the worst case (i.e., "All Outliers", where we use only the outlier scores and
do not consider the baseline feature space), we observe a strong deterioration
in accuracy. The outlier scores do not provide additional information,
even if they are used in combination with baseline features (baseline + "All
Outliers"). In this case, additional features increase the risk of overfitting.
The incapacity of global outliers to contribute useful information to the pre-
dictive model may be related to the fact that such scores are highly general
and therefore unable to capture the specificity of a given fraud behavior.
It is worth recalling that the aforementioned "All Outliers" and "baseline +
All Outliers" correspond respectively to the "proposed" and "proposed+"
modalities presented by Michenkova et al. in [24].
4.3. Local approach
Figure 4 reports the results of the experiments based on local outlier
scores. Regardless of the metric in question, the use of local outlier scores is
detrimental to fraud detection. Among all the outliers, the IF local outlier
score is the best performing; the PCA-RE outlier score performs as well as
the baseline model in terms of Precision Top5000 scores computed using the
entire test set. Our interpretation is that while global outlier scores refer to
sets that are too large (which introduces large bias), local outlier scores are
likely computed on a set of transactions that is too restricted to be useful
(introducing large variance). As for the global outlier score approach, the
use of solely outlier scores (without considering the baseline feature space)
produces a very low accuracy.
4.4. Cluster approach
In the experiments based on cluster outlier scores, we allowed the num-
ber of clusters in k-means to vary from 10 to 5000. In the first analysis, all
unsupervised scores presented in Section 3.1 are used to augment the origi-
nal dataset (Figure 5). Unfortunately, it appears that this approach is not
beneficial to fraud detection and that its accuracy is even lower than that of
the global case. If we consider the Top100 Precision metric, we can see that
the case with 10 clusters performs better than other cases, even if it does not
perform better than the baseline.
The second analysis concerns a study on the relevance of outlier scores
with respect to the original features. Several methods exist to assess a fea-
ture’s relevance. One of the fastest ways is to rely on the relevance returned
by random forests. In Figure 6, we show the features used by this classifier,
ordered by importance (outlier scores are shown in red). We note that the
first two features have a significant impact on the model: they refer to the
risk of the shop receiving the payment (Shop Risk ) and the risk of the shop
that received the previous payment (Last Shop Risk). Many of the outlier
Figure 3: Accuracy obtained on a test set of data from 54 days while using global outlier
scores as additional features: (a) average daily Top100 Card Precision, (b) average daily
AUC-PR for card detection, (c) Top5000 Transaction Precision for the entire test set, and
(d) AUC-PR for the entire test set.
Figure 4: Accuracy obtained on a test set of data from 54 days while using local outlier
scores as additional features: (a) average daily Top100 Card Precision, (b) average daily
AUC-PR for card detection, (c) Top5000 Transaction Precision for the entire test set, and
(d) AUC-PR for the entire test set.
Figure 5: Accuracy obtained on data from a test set of 54 days while using cluster outlier
scores as additional features: (a) average daily Top100 Card Precision, (b) average daily
AUC-PR for card detection, (c) Top5000 Transaction Precision on the entire test set, and
(d) AUC-PR on the entire test set.
Figure 6: Features ranked by importance, obtained from the random forest classifier of
Figure 5 (a 10-cluster case). The red bars refer to the scores returned by an outlier score
technique.
scores are ranked just after these two important features, indicating that
outlier scores could potentially play a key role in risk prediction.
For this reason, the third analysis focuses on augmenting the original
dataset with a single outlier score. We consider the highest-ranked outlier
score (PC-1 in Figure 7) and the second highest (GM-2 in Figure 8).
The interest of GM-2 is also supported by the fact that this score appears
among those with the highest Top100 Precision in the global perspective (see
Figure 3a).
While no improvement is seen in Top100 Card Precision (Figures 7a
and 8a), a significant improvement is visible in terms of Card AUC-PR and
Transaction AUC-PR. In the case of GM-2, the Top5000 Transaction Precision
is also higher.
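As an illustration of how a single unsupervised score can be appended to the feature set, the sketch below computes a PCA-style reconstruction error and a Gaussian-mixture negative log-likelihood on synthetic data. The variable names (pc_score, gm_score) and the exact score definitions are our assumptions, not necessarily the paper's PC-1 and GM-2:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))  # hypothetical standardized transaction features

# PCA-style score: reconstruction error using a single principal component
pca = PCA(n_components=1).fit(X)
recon = pca.inverse_transform(pca.transform(X))
pc_score = np.linalg.norm(X - recon, axis=1)

# Gaussian-mixture score: negative log-likelihood under a fitted mixture
gm = GaussianMixture(n_components=2, random_state=1).fit(X)
gm_score = -gm.score_samples(X)

# Either score can be appended as a single extra column for the classifier
X_plus_gm = np.hstack([X, gm_score[:, None]])
print(X_plus_gm.shape)
```

Higher values of both scores flag transactions that are poorly explained by the bulk of the data.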
5. Discussion
Several inconsistencies in the behavior of the Top100 Precision and
AUC-PR metrics used to compare the cluster approach with the baseline
model warrant deeper analysis. For this reason, we report the Precision-Recall
curves for the baseline approach and for the 10-cluster approach based on a
combination of baseline features and GM-2 outlier scores (shown in blue and
red, respectively, in Figure 9a).

Figure 7: Accuracy obtained on a test set of 54 days of data while using cluster-based
PC-1 outlier scores as additional features: (a) average daily Top100 Card Precision,
(b) average daily AUC-PR for card detection, (c) Top5000 Transaction Precision on the
entire test set, and (d) AUC-PR on the entire test set.

Figure 8: Accuracy obtained on a test set of 54 days of data while using cluster-based
GM-2 outlier scores as additional features: (a) average daily Top100 Card Precision,
(b) average daily AUC-PR for card detection, (c) Top5000 Transaction Precision on the
entire test set, and (d) AUC-PR on the entire test set.
First, we note that the red curve is higher than the baseline curve for
Recall values ranging from 0.1 to 0.3. This is in line with the AUC-PR accuracy
observed in Figures 8b and 8d, in which the cluster approach outperforms
the baseline approach.

Figure 9b shows a close-up of Figure 9a, focusing on the Recall interval
between 0 and 0.01. We observe here that the baseline PR curve often rises
above the 10-cluster PR curve. This is consistent with the Top100 Card
Precision (related to a low-Recall configuration) illustrated in Figure 8a,
where the baseline approach outperforms the "baseline + GM-2 outlier score"
approach.
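The divergence between the two metrics can be reproduced on synthetic scores: AUC-PR integrates precision over all Recall levels, while Top-N precision probes only the very low-Recall head of the ranking. A sketch under invented score assumptions:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(2)
y_true = (rng.random(5000) < 0.01).astype(int)   # ~1% fraud, hypothetical
y_score = rng.random(5000) + 0.3 * y_true        # imperfect detector

# AUC-PR (average precision) summarizes the full Precision-Recall curve
auc_pr = average_precision_score(y_true, y_score)

# Top-N precision evaluates only the N highest-scored transactions,
# i.e. a very low-Recall operating point
def top_n_precision(y_true, y_score, n):
    top = np.argsort(y_score)[::-1][:n]
    return y_true[top].mean()

print(round(auc_pr, 3), top_n_precision(y_true, y_score, 100))
```

Two detectors can trade places depending on which of these two numbers is used, which is exactly the inconsistency discussed above.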
The cluster approach is shown to be the most promising method; however,
it also has several technical limitations. First of all, it requires the choice
of hyper-parameters. Selecting the appropriate contextual attributes is not
a trivial matter (see Section 3.2), and the adoption of k-means requires the
setting of a number of clusters k, which has an impact on the outlier score’s
level of granularity.
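The cluster-level granularity can be sketched as follows: transactions are first grouped by k-means on contextual attributes, and an unsupervised detector is then fitted per cluster, so each transaction is scored against its own context. This is an illustrative sketch only; the contextual attributes, the choice of IsolationForest as detector, and k = 10 are our assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # hypothetical transaction features
context = X[:, :2]             # hypothetical contextual attributes

k = 10
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(context)

# One unsupervised detector per cluster: each outlier score is computed
# only against transactions sharing the same context
scores = np.zeros(len(X))
for c in range(k):
    mask = clusters == c
    if mask.sum() == 0:  # k-means can, in principle, leave a cluster empty
        continue
    iso = IsolationForest(random_state=0).fit(X[mask])
    scores[mask] = -iso.score_samples(X[mask])  # higher = more anomalous

# The score becomes one extra feature for the supervised classifier
X_augmented = np.hstack([X, scores[:, None]])
print(X_augmented.shape)
```

The choice of k directly controls the granularity of the score, from a global detector (k = 1) toward increasingly local ones.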
A further disadvantage of this method relates to the minimum number
of transactions necessary for a card to be analyzed. As was mentioned in
Section 4, in the local approach, we were restricted to considering cards in
the training set with histories of more than 10 transactions. Note that the
use of such a filter could make some fraudulent patterns non-observable.
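The minimum-history filter mentioned above amounts to a simple group-wise count. A toy sketch (column names are hypothetical; the paper's threshold is 10 transactions, lowered here so the toy data shows the effect):

```python
import pandas as pd

# Hypothetical transaction log
tx = pd.DataFrame({
    "card_id": [1, 1, 2, 2, 2, 3],
    "amount": [10.0, 25.0, 5.0, 7.5, 3.0, 99.0],
})

# Keep only cards with enough history for a per-card (local) outlier model
MIN_TX = 2
counts = tx.groupby("card_id")["card_id"].transform("size")
tx_local = tx[counts > MIN_TX]
print(tx_local["card_id"].unique().tolist())  # → [2]
```

Cards that fall below the threshold are excluded from the local analysis, which is precisely how some fraudulent patterns can become non-observable.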
6. Conclusion
This article proposes a hybrid approach that uses unsupervised outlier
scores to extend the feature set of a fraud detection classifier. The novelty
of the contribution, beyond its application to a real and sizeable dataset of
credit card transactions, is the implementation and assessment of different
levels of granularity for the definition of an outlier score. The granularity
in question ranges from the card level to the global level, considering
intermediate levels of card aggregation through clustering.
The results for the global and local approaches are not convincing. Our
interpretation is that neither approach has the level of granularity necessary
to take advantage of unsupervised information. A more promising outcome is
obtained with the cluster approach (notably in terms of AUC-PR), though it
appears that augmenting the dataset with too many scores could be detrimental
due to overfitting and variance issues.

Figure 9: Precision-Recall curves obtained from a single day of test set data using only
the baseline features and using a combination of cluster-based GM-2 outlier scores and
baseline features (10 clusters): (a) Precision-Recall curve for card detection and (b) zoomed
Precision-Recall curve for card detection.

The obtained results open the way for several potential research directions:
- The fact that the best-of-both-worlds method provides improvements
in terms of AUC-PR but not in terms of Top-n Precision indicates that
the added value of unsupervised measures may depend on the adopted
accuracy criterion.

- Additional work with the clustering metric (in terms of different
clustering algorithms and different sets of features) could shed further
light on the relevance of this approach.

- Though many outlier scores seem to provide information about fraud
risk (see the relevance plot in Figure 6), using many of them at the
same time is detrimental to the approach's final accuracy.

- The impact of granularity on the approach's accuracy indicates the
importance of analyzing datasets in a stratified manner, not only in an
unsupervised manner but also in a supervised manner (e.g., by introducing
some notion of locality).
Acknowledgement
The authors FC, YLB, and GB acknowledge the funding of the BruFence
and DefeatFraud projects; both are supported by INNOVIRIS (Brussels In-
stitute for the Encouragement of Scientific Research and Innovation).