  • Norwegian University of Science and Technology at Gjøvik, Norway

Abstract

The lack of legitimate datasets on mobile money transactions for research in the domain of fraud detection is a major problem in the scientific community today. Part of the problem is the intrinsically private nature of financial transactions, which means that no datasets are publicly available. This leaves researchers with the burden of first obtaining a dataset before performing the actual research on it. This paper proposes an approach to this problem that we have named the PaySim simulator. PaySim is a financial simulator that simulates mobile money transactions based on an original dataset. In this paper, we present a solution that makes it possible to simulate mobile money transactions in such a way that they become similar to the original dataset. Using agent-based simulation techniques and mathematical statistics, we show that the simulated data can be as useful for research as the original dataset.
Edgar Alonso Lopez-Rojas (a), Ahmad Elmir (b), and Stefan Axelsson (c)
(a), (b) Blekinge Institute of Technology, (c) The Norwegian University of Science and Technology
Keywords: Multi-Agent Based Simulation, Financial Data, Fraud Detection, Retail Fraud, Synthetic Data.
Obtaining access to datasets of mobile transactions for research is very hard because of the intrinsically private nature of such transactions. Scientists and researchers must today spend time and effort obtaining permits and access to relevant datasets before they can do research on them. This is time consuming and distracts researchers from the main problem, which is performing experiments on the data and finding novel ways to solve problems such as the one that inspired this paper: fraud detection on financial data.
The work presented in this paper provides a tool and a method to generate synthetic data with the help of a simulator that we named PaySim. PaySim generates synthetic datasets similar to real datasets of mobile money transactions. This is done by means of computer simulation, in particular agent-based simulation. Agent-based simulation is of great benefit in this context, partly because the models created represent human behaviour during transactions accurately and are flexible enough to be adapted easily to new constraints.
PaySim simulates mobile money transactions based on a sample of real transactions extracted from the logs of a mobile money service implemented in an African country. With the help of statistical analysis and social network analysis, PaySim is able to generate results consistent with the original dataset.
The scope of this paper covers the design and construction of the simulator as well as the evaluation of the quality of the data generated. The injection of fraudulent behaviour and the application of different fraud detection methods are outside the scope of this paper and are topics for further work with the PaySim simulator.
Outline This paper is structured as follows: Section 2 presents the background and previous work in simulating financial data. Section 3 states the problem, and in Sections 4 and 5 we present the implementation of PaySim and the results of the simulations. Finally, Section 6 presents the conclusions and future work.
In many parts of Africa, the adoption of mobile money as a means of sending and receiving funds has improved the lives of merchants and customers alike. In Tanzania, for instance, which according to the World Bank is one of the fastest growing economies in the world, the adoption of mobile money as a payment solution has had a positive effect on the overall economy. During December 2013 alone, 100 million transactions were made, with a total volume of $1.8 billion (Seetharam and Johnson, 2015).
The domain of mobile money transfer has grown substantially in the last few years and has attracted greater attention from users, specifically in areas where banking solutions may not be as accessible as in developed countries. Many solutions have been deployed in many places for this purpose. Mobile money services exist in more than 10 African countries, with a coverage of 14% of all mobile subscribers (Rieke et al., 2013).
The ever-growing usage of mobile money has increased the likelihood that criminals will commit fraud, attempting to circumvent the security measures of mobile money transfer services for personal financial gain. There is therefore great pressure to research the potential security pitfalls that can be exploited, with the ultimate goal of developing countermeasures against the attacks.
Due to the large number of transactions and the ever-changing characteristics of fraud, the current measures against fraud lack effectiveness. Many current systems still base their detection mechanism on simple thresholds assigned arbitrarily. There is therefore a need to push forward, investigate the effect of fraud and stop wrongdoers from profiting from it.
With PaySim, we aim to address this problem by providing a simulation tool and a method to generate synthetic datasets of mobile transactions. The benefits of using a simulator to address fraud detection were first presented by Lopez-Rojas and Axelsson (2012b). That work states the problem of obtaining access to financial datasets and proposes using synthetic datasets based on simulations. The method proposed is based on the concept of MABS (Multi-Agent Based Simulation). MABS has the benefit of allowing the agents to incorporate financial behaviour similar to that present in domains such as bank transactions and mobile payments.
The first implementation of a simulator for financial transactions was introduced by Lopez-Rojas and Axelsson (2012a) with a mobile money transaction simulator. This simulator was implemented because of the difficulty of implementing proper fraud detection controls on a mobile money system that was under development. That paper was the first to present an alternative that addresses the lack of real data. The synthetic dataset generated by the simulator was used to test the performance of different machine learning algorithms in finding patterns of money laundering.
The work by Gaber et al. (2013) introduced a similar technique to generate synthetic logs for fraud detection. The main difference was that this time real data was available to calibrate the simulator and to assess the quality of its results. The purpose of that study was to generate test data that researchers can use to evaluate different approaches. That work differs significantly from ours because we present a different method for analysing the data and place special attention on evaluating the quality of the resulting synthetic dataset.
There has been some work in the domain of financial transactions for retail stores, the most prominent of which is that of Lopez-Rojas et al. (2013). The work done in that paper is very similar to the work done in this paper. A large collection of data was gathered from Sweden's biggest shoe retailer, and the techniques involved complex machine-learning algorithms in an attempt to find fraudulent behaviour in clients. Among other things, the paper showed results from social network analysis, which described the relationship between the clients and the sellers for each store. A definition of what was perceived as "fraudulent" was made, and based on that the machine-learning algorithms were trained to detect that type of behaviour.
Public databases of financial transactions are almost non-existent. However, the work of Lopez-Rojas and Axelsson (2014) on a simulator called BankSim presents a MABS of financial payments. BankSim is implemented in a similar way to the RetSim simulator and to our simulator, using social network analysis in addition to statistical analysis. BankSim is based on aggregated financial information on payments made during 6 months in the two main cities of Spain, provided by a Spanish bank for the purpose of developing applications of different kinds that benefit from this sort of data. Our work differs from BankSim because the source of the data is different and because bank payments and mobile transactions have different characteristics, as presented in the following sections.
The key common aspect of previous work is the use of the Multi-Agent Based Simulation paradigm, which incorporates the main customer logic into the behaviour of the agents in order to reach results similar to the real world. It is important to recognize that a simulation is not an actual "replication" of the original dataset. Rather, a simulation will, with the aid of statistical methods, generate a dataset very similar to the original one. The degree of variance will largely depend on how the data in the original dataset is structured; hence, different simulations based on different seeds will generate different output datasets that are nonetheless consistent with the real world.
The problem formulation for this research paper tackles the question of whether synthetic financial data can replace real financial data while yielding comparable results when used as the source dataset for research. This is of primary concern for any researcher who wishes to perform scientific tests but has no or only limited access to a real financial dataset.
The main goal of the simulation is to yield another, completely self-sufficient dataset with statistical properties similar to those of the original dataset. To yield such results, the simulator must go through several steps.
In order to simulate the mobile money service, we need to properly simulate the different kinds of transactions that the system supports. We decided to cover 5 of the most important transaction types: CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
CASH-IN is the process of increasing the balance of an account by paying in cash to a merchant.
CASH-OUT is the opposite of CASH-IN: it means withdrawing cash from a merchant, which decreases the balance of the account.
DEBIT is a process similar to CASH-OUT and involves sending money from the mobile money service to a bank account.
PAYMENT is the process of paying merchants for goods or services, which decreases the balance of the account and increases the balance of the receiver.
TRANSFER is the process of sending money to another user of the service through the mobile money platform.
There are other types of transactions that we decided to exclude from the simulation because they make up a low percentage of the sample.
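The effect of the five transaction types on account balances can be sketched as follows. This is only an illustration in Python, not PaySim's own code (PaySim is written in Java on top of MASON); the names `TransactionType` and `apply_transaction` are hypothetical, and the sketch deliberately performs no overdraft or limit checks.

```python
from enum import Enum


class TransactionType(Enum):
    CASH_IN = "CASH-IN"
    CASH_OUT = "CASH-OUT"
    DEBIT = "DEBIT"
    PAYMENT = "PAYMENT"
    TRANSFER = "TRANSFER"


# The only type that adds money to the originating account.
CREDITING_TYPES = {TransactionType.CASH_IN}
# Types whose amount is also credited to a receiving account in the service.
TWO_PARTY_TYPES = {TransactionType.PAYMENT, TransactionType.TRANSFER}


def apply_transaction(kind, amount, origin_balance, dest_balance=0.0):
    """Return (new_origin_balance, new_dest_balance) after one transaction."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    if kind in CREDITING_TYPES:
        return origin_balance + amount, dest_balance
    # CASH-OUT, DEBIT, PAYMENT and TRANSFER all reduce the origin balance.
    new_dest = dest_balance + amount if kind in TWO_PARTY_TYPES else dest_balance
    return origin_balance - amount, new_dest
```

For example, `apply_transaction(TransactionType.TRANSFER, 100, 500, 50)` moves 100 units from the origin to the destination, while a DEBIT of the same amount only reduces the origin balance because the receiving bank account lies outside the service.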
PaySim uses the MABS toolkit MASON version 19, which is implemented in Java (Luke, 2005). We selected MASON because it is multi-platform, supports parallelisation, and executes fast in comparison with other agent frameworks. This is especially important for computationally expensive simulations that are run multiple times, such as PaySim (Railsback et al., 2006).
4.1. Overview, Design Concepts and Details (ODD)
The design of PaySim is based on the ODD model introduced by Grimm et al. (2006). ODD contains 3 main parts: Overview, Design Concepts and Details.
4.1.1. ODD Overview
The purpose of this simulator is to simulate payments made in the realm of mobile transactions. The simulator should generate synthetic data that is very similar to a batch of real transactional data provided by Ericsson. The goal is to have a generator that can produce data on the fly that the scientific community can later use for research on fraud detection.
The model has one primary type of entity, the Client. Each client has a profile that describes the behaviour allowed for the client, such as the daily and yearly limits on transactions, the transaction limit and the maximum balance. Furthermore, the number of transactions, withdrawals, transfers and deposits is stored for each client. A client can further be classified by age as young, adult or senior. Each client has a base currency in which its transactions are denominated. The client can perform transactions in the form of deposits, withdrawals and transfers. Every transaction that is made is stored within the client.
4.1.2. Process Overview and Scheduling
The client has several processes that alter its internal state. At each step of the simulator, the type of transaction to be performed by the client is chosen using a random variable governed by calculated probabilities. A deposit increases the balance of the client, a withdrawal decreases it, and a transfer withdraws money from the originating client and deposits it into the destination client's account.
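Choosing a transaction type from calculated probabilities amounts to a weighted random draw. The sketch below shows one way to do this with Python's standard library; the function name `choose_transaction_type` and the probability values are hypothetical and are not taken from the original data.

```python
import random


def choose_transaction_type(probabilities, rng):
    """Pick one transaction type name according to its probability weight."""
    types = list(probabilities)
    weights = [probabilities[t] for t in types]
    return rng.choices(types, weights=weights, k=1)[0]


# Hypothetical probabilities for a single step; real values would be
# derived from the original dataset.
step_probabilities = {
    "CASH-IN": 0.20,
    "CASH-OUT": 0.35,
    "TRANSFER": 0.10,
    "PAYMENT": 0.30,
    "DEBIT": 0.05,
}

rng = random.Random(42)  # fixed seed so that the draw sequence is repeatable
draws = [choose_transaction_type(step_probabilities, rng) for _ in range(10_000)]
share_cash_out = draws.count("CASH-OUT") / len(draws)
```

Over many draws, the share of each type approaches its configured probability, which is the property a simulator of this kind relies on to reproduce the original frequencies.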
The concepts behind the model are based on statistical analysis of a large batch of real data. From this batch of data, the probability of each action was calculated and incorporated into the model to generate synthetic information as close as possible to the real data. The client agent has some adaptive behaviours that alter its way of acting; for instance, a client that has reached its daily limit cannot withdraw any more money that day. This adaptive behaviour is a direct result of the transfer process mentioned above. There is interaction between agents, since there is a probability that at a particular step of the simulation an agent transfers money to another agent and thus alters its own and the other agent's internal state.
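The daily-limit behaviour described above can be illustrated with a small sketch. It assumes that the limit is a maximum number of withdrawals per day; the paper does not say whether the limit counts transactions or caps an amount, and the class and field names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Client:
    balance: float
    daily_limit: int            # assumed: maximum number of withdrawals per day
    withdrawals_today: int = 0

    def can_withdraw(self, amount):
        """A withdrawal is allowed below the daily limit and within the balance."""
        return self.withdrawals_today < self.daily_limit and amount <= self.balance

    def withdraw(self, amount):
        """Withdraw if allowed; return True on success, False if refused."""
        if not self.can_withdraw(amount):
            return False
        self.balance -= amount
        self.withdrawals_today += 1
        return True

    def start_new_day(self):
        """Reset the per-day counter when the simulation reaches a new day."""
        self.withdrawals_today = 0
```

A client created with `Client(balance=1000, daily_limit=2)` accepts two withdrawals and refuses the third until `start_new_day()` is called.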
4.2. Inputs
There are multiple inputs required for the simulator to function. As initial input, the number of client neighbours for each agent is assigned. A profile is then attached to each agent based on a probability. Their location in space, along with their neighbours, is also initialized.
Parameter File: This file contains all of the parameters that the simulator needs to start. Among these parameters are the seed and, perhaps most relevant, the paths where the input files and the output files are placed on the current machine.
Aggregated Transaction File: This file contains the distribution of the transactions from the original dataset. More precisely, it contains how many transactions were made at any given day/hour combination (step), what the average amount was, what type of transaction it was, etc. This is of paramount importance for the simulator, since its statistical data is generated from the information in this file.
Repetitions File: This file contains the frequency of transactions that the original clients had per type of transaction. This means that some of the agents are scheduled more often than others, based on a social network analysis of the indegree and outdegree of the customers.
Since the simulator uses MASON as the framework for performing the simulation, it is important to define what each step represents. For this simulation, each day/hour combination represents one step. At each step, a Client, which is the agent of the simulator, is generated. The client is placed in an environment in which it makes decisions based on the information it perceives. The Client is created with the statistical distribution of the probabilities of performing each transaction type for a specific day/hour combination. The client then randomly performs (based on this distribution) different transaction types in relation to the other clients in the simulator. Also, for each client generated, there is a probability P that the client makes further transactions at later steps. This probability is derived from the database of the original dataset.
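Because each step stands for one day/hour combination, a step index can be converted to a day and an hour with integer division. The sketch below assumes zero-based steps, one-based days and hours from 0 to 23; the paper does not state which indexing PaySim uses.

```python
HOURS_PER_DAY = 24


def step_to_day_hour(step):
    """Map a zero-based step to (day, hour): day starts at 1, hour is 0..23."""
    if step < 0:
        raise ValueError("step must be non-negative")
    day, hour = divmod(step, HOURS_PER_DAY)
    return day + 1, hour
```

Under these assumptions a run of 744 steps covers 31 days: the last step, 743, maps to day 31, hour 23.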
4.3. Initiation Stage
In this stage, the PaySim simulator must load the necessary data from the original dataset:
Load the Parameters: The first and most important step is to load the value of each parameter in the parameter file. Among other things, these contain the file paths of the source data inputs that the simulator needs to load.
Load Aggregate File: This is the original aggregate file that is used as the base for the simulator to generate statistically similar results in terms of what to simulate, on what day and for what amount. One such extraction could be, for instance: at day 1 and hour 15, simulate 8703 transactions of type PAYMENT with an average transaction size of 180000 and a standard deviation of 15000.
Load Initial Balance Container: Apart from the statistical distribution for each transaction type, the client has another important input, namely its initial balance. Each client generated in the simulation must have an initial balance attached to it. This balance is generated from the Balance Container file, which consists of the probabilities of the different initial balance ranges.
Load Maximum Repetitions File: As mentioned previously, each client has a probability P of making further transactions at future steps of the simulation. This file makes sure that no client makes more repetitions than allowed. As with the Balance Container, the probabilities in this file are derived from the database of the original dataset.
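The aggregate example and the Balance Container can be illustrated with a short sampling sketch. The count, mean and standard deviation are the example values quoted above; drawing amounts from a normal distribution, drawing uniformly inside a balance range, and the ranges themselves are assumptions made for illustration only, since the paper does not specify them.

```python
import random


def sample_amounts(count, mean, std, rng):
    """Draw `count` non-negative amounts from a normal distribution (assumed shape)."""
    return [max(0.0, rng.gauss(mean, std)) for _ in range(count)]


def sample_initial_balance(ranges, rng):
    """Pick a (low, high) range by its probability, then draw uniformly inside it."""
    bounds = [(low, high) for low, high, _ in ranges]
    weights = [prob for _, _, prob in ranges]
    low, high = rng.choices(bounds, weights=weights, k=1)[0]
    return rng.uniform(low, high)


rng = random.Random(7)

# Example from the text: day 1, hour 15, 8703 PAYMENT transactions.
amounts = sample_amounts(8703, 180000, 15000, rng)
mean_amount = sum(amounts) / len(amounts)

# Made-up balance ranges as (low, high, probability) tuples.
balance_ranges = [(0, 1_000, 0.5), (1_000, 10_000, 0.3), (10_000, 100_000, 0.2)]
balance = sample_initial_balance(balance_ranges, rng)
```

With several thousand draws the sample mean stays close to the configured average, which is what allows the per-step averages of a synthetic dataset to track those of the original.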
4.4. Execution Stage
Once the Initiation Stage is complete and all of the parameters have been loaded successfully, the simulator proceeds to the execution stage. It is at this stage that the simulator performs the actual simulation and yields the simulated transaction results.
4.4.1. Generating the Clients
The agents are the building blocks of the agent-based simulator. In this context, the agent represents the Client. At each step of the simulation, PaySim converts the step to a day/hour combination. This is then used as input to extract the statistical distributions from the original dataset. Based on the Aggregated Transaction File, PaySim obtains the probability P of performing each transaction and saves it into the model of the client. With this information, the client knows the following:
Number of Transactions: The total number of transactions that this generated client will make.
Make Future Steps: Whether the client is to participate in future steps, which means scheduling further transactions at later steps.
Statistical Distribution: The probabilities loaded into the client, which give the probability P of performing each action.
Initial Balance: The balance that the client has once generated.
4.4.2. Performing the Transactions
After each client is generated, it decides what type of transaction to make; again, this is derived entirely from the distribution loaded. The client is in an environment that allows it to interact freely with other clients in the simulation. Some transaction types depend on that interaction, TRANSFER for instance. A TRANSFER is an exchange of money from one client to another; hence, the client has to interact with other clients to simulate the actual exchange of funds.
4.5. Finalization Stage
After all clients have completed their role in the simulation and performed all of the transactions allotted to them, the results must be saved. Each simulation generates 4 outputs, each of which serves a specific purpose, and which together allow the exact repetition of the simulation with the same initial properties.
Logfile: Each transaction that is made produces a record with the metadata of that transaction: which client performed which action, towards which other client, the amount of the transaction, and the change in balance for all clients involved. Each such record is saved in a logfile unique to the specific simulation.
MySQL Database: Apart from the logfile, the record of each transaction is also saved into a MySQL database. The purpose is to allow easier queries when the results are analysed.
Aggregate Dump: An aggregate dump similar to the aggregate dump of the original dataset is also generated. It is these two files that are used to generate the plots and graphs representing the results of the transactions.
Parameter File History: This file contains the exact properties needed for the simulation to reproduce exactly the same results. This is important because each simulation must be reproducible, which is not possible without the original seed.
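The role of the seed in reproducibility can be shown with a toy example. The function `run_simulation` below is a hypothetical stand-in for a full run and only draws random amounts; it is meant to illustrate why the seed has to be stored with the other parameters.

```python
import random


def run_simulation(seed, steps=5, clients=3):
    """Toy stand-in for a simulation run: one random amount per client and step."""
    rng = random.Random(seed)  # all randomness flows from this single seed
    return [
        [round(rng.uniform(0, 1000), 2) for _ in range(clients)]
        for _ in range(steps)
    ]


first = run_simulation(seed=12345)
second = run_simulation(seed=12345)   # same seed: identical output
other = run_simulation(seed=54321)    # different seed: different output
```

Two runs with the same seed produce identical output, whereas a different seed produces a different but statistically comparable run.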
We ran PaySim several times using random seeds for 744 steps, representing one month of real-time data. Each run took around 30 minutes on an Intel i7 processor. We selected the dataset whose values differed least from the original dataset provided. The selected synthetic dataset was arbitrarily named PS41840. PS41840 contains around 23 million records divided into the 5 transaction types presented before. Table 1 shows the types of transactions, the count and the average amount generated with the simulator. The amount values are given in a currency that we cannot disclose.
Figure 1: Visualization of transaction type CASH-IN
Figure 2: Visualization of transaction type CASH-OUT
Figure 3: Visualization of transaction type TRANSFER
Figure 4: Visualization of transaction type PAYMENT
Figure 5: Visualization of transaction type DEBIT
Table 1: Simulated PS41840
TYPE        Count        avgAmount
CASH-IN     4 496 947    153 019
CASH-OUT    9 014 407    155 989
TRANSFER    2 030 969    630 810
PAYMENT     8 955 794     10 793
DEBIT         139 935      5 016
The quality of the generated datasets was first evaluated using the sum of squared errors (SSE) computed on the quantities of the different datasets. The one with the lowest error was PS41840.
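Selecting the run with the lowest sum of squared errors can be sketched as follows. The series here are small made-up per-step counts; which quantities PaySim actually compared and how they were aggregated is not detailed in the paper, so the function names and the data are illustrative only.

```python
def sum_of_squared_errors(original, synthetic):
    """SSE between two equally long series of per-step quantities."""
    if len(original) != len(synthetic):
        raise ValueError("series must have the same length")
    return sum((o - s) ** 2 for o, s in zip(original, synthetic))


def best_run(original, runs):
    """Return the name of the synthetic run with the lowest SSE."""
    return min(runs, key=lambda name: sum_of_squared_errors(original, runs[name]))


# Made-up per-step transaction counts for the original data and two runs.
original_counts = [100, 120, 90, 80]
runs = {
    "run_a": [98, 125, 91, 79],
    "run_b": [110, 100, 95, 60],
}

selected = best_run(original_counts, runs)
```

In this toy data `run_a` has an SSE of 31 and `run_b` an SSE of 925, so `run_a` is selected.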
In order to verify that the simulation was working properly, we plotted the distributions to visually identify significant differences between the original and the synthetic dataset. Figures 1, 2, 3, 4 and 5 show the visualization per type of transaction. Each figure contains, for each step, the count of transactions, the total sum of the transactions, the average and the standard deviation. The red continuous line represents the original data distribution and the blue dashed line represents the synthetic dataset PS41840.
One thing we noted is that during the first 14 days of the simulation the activity in the system is higher than during the remaining days. This is perhaps due to the arrival of income during the first days of the month.
PaySim is a simulation of mobile money transactions with the objective of generating a synthetic transactional dataset that can be used for research into fraud detection. The datasets generated with PaySim can help academia, financial organisations and governmental agencies to test their fraud detection methods, or to compare the performance of different methods under similar conditions using a common, publicly available and standard synthetic dataset.
We argue that PaySim is ready to be used as a tool to generate synthetic transactions that resemble the original, private dataset supplied. By using PaySim we protect the privacy of the customers of the service, while interesting results can be shared with other researchers without the constraints and legal boundaries of the original data.
The results presented in Section 5 help to visually appreciate that the generated dataset captures the process and the frequencies of the different transaction types of the mobile money service.
Future work on the simulator is to add fraudulent agents to the model and run different scenarios to test the efficacy and accuracy of diverse fraud detection methods. We also want to make a synthetic dataset available to other researchers so that diverse results can be compared and shared.
This work is part of the research project "Scalable resource-efficient systems for big data analytics" funded by the Knowledge Foundation (grant: 20140032) in Sweden.
Chrystel Gaber, Baptiste Hemery, Mohammed Achemlal, Marc Pasquet, and Pascal Urien. Synthetic logs generator for fraud detection in mobile transfer services. In 2013 International Conference on Collaboration Technologies and Systems (CTS), pages 174–179. IEEE, May 2013. ISBN 978-1-4673-6404-1. doi:
Volker Grimm, Uta Berger, Finn Bastiansen, Sigrunn Eliassen, Vincent Ginot, Jarl Giske, John Goss-Custard, Tamara Grand, Simone K. Heinz, Geir Huse, Andreas Huth, Jane U. Jepsen, Christian Jørgensen, Wolf M. Mooij, Birgit Müller, Guy Pe'er, Cyril Piou, Steven F. Railsback, Andrew M. Robbins, Martha M. Robbins, Eva Rossmanith, Nadja Rüger, Espen Strand, Sami Souissi, Richard A. Stillman, Rune Vabø, Ute Visser, and Donald L. DeAngelis. A standard protocol for describing individual-based and agent-based models. Ecological Modelling, 198(1-2):115–126, September 2006. ISSN 03043800. doi: 10.1016/j.ecolmodel.2006.04.023. URL
Edgar Lopez-Rojas and Stefan Axelsson. Multi agent based simulation (MABS) of financial transactions for anti money laundering (AML). In Audun Josang and Bengt Carlsson, editors, Nordic Conference on Secure IT Systems, pages 25–32, Karlskrona, 2012a.
Edgar Alonso Lopez-Rojas and Stefan Axelsson. Money Laundering Detection using Synthetic Data. In Lars Karlsson and Julien Bidot, editors, The 27th workshop of (SAIS), pages 33–40, Örebro, 2012b. Linköping University Electronic Press.
Edgar Alonso Lopez-Rojas and Stefan Axelsson. Social Simulation of Commercial and Financial Behaviour for Fraud Detection Research. In Advances in Computational Social Science and Social Simulation, Barcelona, 2014. ISBN 9789172952782.
Edgar Alonso Lopez-Rojas, Stefan Axelsson, and Dan Gorton. RetSim: A Shoe Store Agent-Based Simulation for Fraud Detection. In The 25th European Modeling and Simulation Symposium, number c, page 10, Athens, Greece, 2013.
S. Luke. MASON: A Multiagent Simulation Environment. Simulation, 81(7):517–527, July 2005. ISSN 0037-5497. doi: 10.1177/0037549705058073. URL
S. F. Railsback, S. L. Lytinen, and S. K. Jackson. Agent-based Simulation Platforms: Review and Development Recommendations. Simulation, 82(9):609–623, September 2006. ISSN 0037-5497. doi: 10.1177/0037549706073695. URL http://sim.sagepub.
Roland Rieke, Maria Zhdanova, Jurgen Repp, Romain Giot, and Chrystel Gaber. Fraud Detection in Mobile Payments Utilizing Process Behavior Analysis. In 2013 International Conference on Availability, Reliability and Security, pages 662–669. IEEE, September 2013. ISBN 978-0-7695-5008-4. doi: 10.1109/ARES.2013.
Balachandran Seetharam and Drew Johnson. Mobile Money's Impact on Tanzanian Agriculture. 2015.
MSc. Edgar A. Lopez-Rojas
Edgar Lopez is a PhD student in Computer Science at Blekinge Institute of Technology in Sweden. His research areas are Multi-Agent Based Simulation and machine learning techniques with applied visualization for fraud detection and Anti Money Laundering (AML) in the domains of retail stores, payment systems and financial transactions. He obtained a Bachelor's degree in Computer Science from EAFIT University in Colombia (2004). After that he worked for 5 more years at EAFIT University as a systems analyst and developer and partly as a lecturer. He obtained a Master's degree in Computer Science from Linköping University in Sweden in 2011 and a licentiate degree in computer science (a degree halfway between a Master's degree and a PhD) in
MSc. Ahmad Elmir
Ahmad obtained a Master's degree in computer science with a speciality in security from the Blekinge Institute of Technology. His master's thesis was about the design and construction of PaySim, under the supervision of the main author of this paper. Previously he studied natural sciences at the gymnasium for three years, specialising in computer science and programming. He has a keen interest in scientific inquiry in the domain of security, as it is in his opinion an ever-developing field. He has also worked for 9 months with software development in a cor-
Dr. Stefan Axelsson
Stefan Axelsson is a senior lecturer at NTNU - Norwegian University of Science and Technology in Norway. He received his M.Sc. in computer science and engineering in 1993, and his Ph.D. in computer science in 2005, both from Chalmers University of Technology in Gothenburg, Sweden. His research interests revolve around computer security, especially the detection of anomalous behaviour in computer networks, financial transactions and ship/cargo movements, to name a few. He is also interested in how to combine machine learning and information visualization to better aid the operator in understanding how the system classifies a certain behaviour as anomalous. Stefan has ten years of industry experience, most of it working with systems security issues at