A Discretized Enriched Technique to Enhance Machine Learning
Performance in Credit Scoring
Roberto Saia, Salvatore Carta, Diego Reforgiato Recupero, Gianni Fenu and Marco Saia
Department of Mathematics and Computer Science,
University of Cagliari, Via Ospedale 72 - 09124 Cagliari, Italy
{roberto.saia, salvatore, diego.reforgiato, fenu}@unica.it, m.saia@studenti.unica.it
Keywords: Business Intelligence, Decision Support System, Credit Scoring, Machine Learning, Algorithms.
Abstract: Automated credit scoring tools play a crucial role in many financial environments, since they are able to perform a real-time evaluation of a user (e.g., a loan applicant) on the basis of several solvency criteria, without the aid of human operators. Such automation allows those who work and offer services in the financial area to make quick decisions with regard to different services, first and foremost those concerning consumer credit, whose requests have increased exponentially over recent years. In order to face some well-known problems related to the state-of-the-art credit scoring approaches, this paper formalizes a novel data model that we call Discretized Enriched Data (DED), which operates by transforming the original feature space in order to improve the performance of credit scoring machine learning algorithms. The idea behind the proposed DED model revolves around two processes: the first aimed at reducing the number of feature patterns through a data discretization process, and the second aimed at enriching the discretized data by adding several meta-features. The data discretization faces the problem of heterogeneity, which characterizes such a domain, whereas the data enrichment works on the related loss of information by adding meta-features that improve the data characterization. Our model has been evaluated on real-world datasets with different sizes and levels of data unbalance, which are considered a benchmark in the credit scoring literature. The obtained results indicate that it is able to improve the performance of one of the best-performing machine learning algorithms widely used in this field, opening up new perspectives for the definition of more effective credit scoring solutions.
1 INTRODUCTION
Over the past decades, credit scoring techniques have assumed great importance in many financial sectors (Siddiqi, 2017), since they are able to take decisions in real-time, avoiding the need for human operators to evaluate the available information about people who request certain financial services, such as, for instance, a loan.
In such a context, it should be noted that the major financial losses of an operator that offers financial services are those related to an incorrect evaluation of the customers' reliability (Bijak et al., 2015). For instance, in the consumer credit context (Livshits, 2015), such reliability is expressed in terms of user solvency, and the losses are related to loans that have not been fully or partially repaid.
Many sector studies have reported that consumer credit has increased exponentially over recent years, as shown in Figure 1, which reports a study on the Euro area performed by Trading Economics¹ on the basis of European Central Bank (ECB)² data. The Euro area has been used by way of example, since a similar trend is also registered in other world areas such as, for instance, Russia and the USA.
Other aspects related to the role of the Credit Rating
Agencies (CRAs)³ with regard to the globalization of
the financial markets have been investigated and dis-
cussed in (Doumpos et al., 2019).
For the aforementioned reasons, we are witnessing a significant increase in investments, in terms of money and number of researchers, aimed at developing increasingly effective credit scoring techniques. Ideally, these technologies should
be able to correctly classify each user as reliable or
unreliable, on the basis of the available information
¹ https://tradingeconomics.com/
² https://www.ecb.europa.eu/
³ Also referred to as rating services, these are companies that assign credit ratings.
[Figure 1: Euro Area Consumer Credit — consumer credit volume in millions of euros (×1000), April 2018 to April 2019.]
(e.g., age, current job, status of previous loans, etc.).
Basically, these techniques can be considered statistical approaches (Mester et al., 1997) focused on evaluating the probability that a user will not repay (or will only partially repay) a credit. On the basis of this probability, typically calculated in real-time, a financial operator can decide whether or not to grant the requested financial service (Hassan et al., 2018).
Similarly to other data domains such as, for in-
stance, those related to the Fraud Detection or the In-
trusion Detection tasks (Dal Pozzolo et al., 2014; Saia
et al., 2017; Saia, 2017), the information that is usu-
ally available to train a credit scoring model is charac-
terized by an unbalanced distribution of data (Rodda
and Erothi, 2016; Saia et al., 2018b).
Therefore, the data available to define the evaluation model (from now on denoted as instances) is composed of a large number of reliable samples with respect to the unreliable ones (Khemakhem
et al., 2018). Several studies in literature prove that
such a data unbalance reduces the effectiveness of
classification algorithms (Haixiang et al., 2017; Khe-
makhem et al., 2018).
1.1 Research Motivation
On the basis of our previous experience (Saia and
Carta, 2016c; Saia and Carta, 2016a; Saia and Carta,
2016b; Saia and Carta, 2017; Saia et al., 2018a), we designed the proposed Discretized Enriched Data (DED) model to face some well-known problems related to this data domain. The first
of them is given by the heterogeneity of the patterns
used to define a classification model, since they de-
pend on the available information about the users,
which is previously collected. Such information is
characterized by a number of features that could be
very different, even when they define the same class
of information.
Through our DED model we perform a twofold process: the first step aims to reduce the number of feature patterns by adopting a discretization criterion, whereas the second aims to enrich the discretized features by adding a series of meta-features able to better characterize the related class of information (i.e., reliable or unreliable). In order to assess the real advantages
related to our approach, without the risk of results
being biased by over-fitting (Hawkins, 2004), differ-
ently from the majority of related works in literature,
we do not use a canonical cross-validation criterion.
This is because a canonical cross-validation criterion does not guarantee a complete separation between the data used to define the evaluation model and the data used to evaluate its performance. For this reason, we have adopted a criterion, largely used in other
domains, which focuses on the importance of assess-
ing the real performance of an evaluation/prediction
model (e.g. financial market forecasting (Henrique
et al., 2019)). Specifically, we define the DED model on one portion of the data (conventionally denoted as in-sample) and assess its performance on never-before-seen data (conventionally denoted as out-of-sample).
The canonical cross-validation criterion has been used
only in the context of these two sets of data.
The scientific contributions of our work are the following:
- formalization of the Discretized Enriched Data
(DED) model, which is aimed to improve the effec-
tiveness of the machine learning algorithms in the
credit scoring data domain;
- implementation of the DED model in the context
of a machine learning classifier we selected on the
basis of its effectiveness through a series of exper-
iments performed by using the in-sample part of
each credit scoring dataset;
- evaluation of the DED model performance, per-
formed by using the out-of-sample part of each
credit scoring dataset, comparing it with the perfor-
mance of the same machine learning algorithm that
uses the canonical data model.
The rest of the paper has been structured as fol-
lows: Section 2 provides information about the back-
ground and the related work of the credit scoring do-
main; Section 3 introduces the formal notation used in
this paper and defines the problem we address; Sec-
tion 4 provides the formalization and the implemen-
tation details of the proposed data model; Section 5
describes the experimental environment, the datasets,
the experimental strategy, and the used metrics, re-
porting and discussing the experimental results; Sec-
tion 6 provides some concluding remarks and directions for future work.
2 BACKGROUND AND RELATED
WORK
This section provides an overview of the concepts re-
lated to the credit scoring research field and the state-
of-the-art solutions, by also describing problems that
are still unsolved and by introducing the idea that
stands behind the DED model proposed in this paper.
2.1 Credit Risk Models
In accordance with several studies in the literature (Crook et al., 2007), we start off by identifying the following types of credit risk models with regard to a default⁴ event: the Probability of Default (PD)
model, which is aimed to evaluate the likelihood of a
default over a specified period; the Exposure At De-
fault (EAD) model, which is aimed to evaluate the
total value a financial operator is exposed to when a
loan defaults; the Loss Given Default (LGD) model,
which is aimed to evaluate the amount of money a fi-
nancial operator loses when a loan defaults.
In this paper we take into account the first of these
credit risk models (i.e., the Probability of Default),
expressing it in terms of binary classification of the
evaluated users, as reliable or unreliable.
2.2 Approaches and Strategies
The current literature offers a number of approaches
and strategies, designed to perform the credit scoring
task, such as:
- those based on statistical methods, where the au-
thors, for instance, exploit the Logistic Regression
(LR) (Sohn et al., 2016) method in order to de-
fine a fuzzy credit scoring model able to predict
the default possibility of a loan, or perform this op-
eration by using the Linear Discriminant Analysis
(LDA) (Khemais et al., 2016);
- those that rely on Machine Learning (ML) algo-
rithms (Barboza et al., 2017), such as the Support
Vector Machines (SVM) method employed in (Har-
ris, 2015), where the authors adopt a Clustered Sup-
port Vector Machine (CSVM) approach to perform
the credit scoring, or in (Zhang et al., 2018), where
instead the authors exploit an optimized Random
Forest (RF) approach to perform such a task;
- those that exploit Artificial Intelligence (AI) strate-
gies (Liu et al., 2019), such as the Artificial Neural
Network (ANN) (Bequé and Lessmann, 2017);
⁴ This term denotes the failure to meet the legal obligations/conditions related to a loan.
- those that rely on transformed data domains, such
as in (Saia and Carta, 2017; Saia et al., 2018a),
where the authors exploit, respectively, the Fourier
and Wavelet transforms;
- those where specific aspects, such as data
entropy (Saia and Carta, 2016a), linear-
dependence (Saia and Carta, 2016c; Saia and
Carta, 2016b), or word embeddings (Boratto et al.,
2016) have been taken into account to perform
credit scoring tasks (Zhao et al., 2019);
- those based on hybrid approaches (Ala’raj and Ab-
bod, 2016; Tripathi et al., 2018) where several dif-
ferent approaches and strategies have been com-
bined in order to define a model able to improve
the credit scoring performance.
The literature also provides many surveys in which the performance of state-of-the-art credit scoring solutions has been compared, such as that in (Lessmann et al., 2015b).
2.3 Open Problems
Regardless of the approach and strategy used to per-
form credit scoring tasks, there are several common
problems that have to be addressed. The most impor-
tant of them are:
-Datasets Availability: the literature highlights the limited availability of public datasets to use in the validation process (Lessmann et al., 2015a). This issue is mainly related to the fact that financial operators often refuse to share their data, or to privacy reasons, e.g., in many countries legal provisions related to the protection of privacy prevent the creation of publicly available datasets (Jappelli et al., 2000);
-Data Unbalance: the imbalance between the samples related to the reliable cases and those related to the unreliable ones is a common characteristic of the available datasets (Brown and Mues, 2012). A data configuration of this kind reduces the performance of evaluation models that are trained with these unbalanced sets of data (Chawla et al., 2004);
-Samples Unavailability: it is related to the well-
known cold start problem that affects many re-
search areas (Li et al., 2019). It happens when no samples are available for a class of information (e.g., the unreliable one), making it impossible to train an evaluation model.
2.4 Evaluation Metrics
Several studies have also been performed in literature,
in order to identify the best performance evaluation criteria to adopt for a correct evaluation of credit scoring models, such as that in (Chen et al., 2016).

[Figure 2: Discretization Process — continuous feature values (5, 19, 41, 71) in the range [0, 100] mapped onto the discretization range {0, 1, ..., 10}.]

Some
of the most used metrics for assessing the effective-
ness of a credit scoring model are reported in the fol-
lowing:
- those based on the confusion matrix⁵, such as the
Accuracy, the Sensitivity, the Specificity, or the
Matthews Correlation Coefficient (MCC) (Powers,
2011);
- those based on the error analysis, such as the Mean
Square Error (MSE), the Root Mean Square Error
(RMSE), or the Mean Absolute Error (MAE) (Chai
and Draxler, 2014);
- those based on the Receiver Operating Charac-
teristic (ROC) curve, such as the Area Under the
ROC Curve (AUC) (Huang and Ling, 2005).
Considering that some of these metrics do not
work well with data unbalance (Jeni et al., 2013),
such as, for instance, the majority of metrics based on the confusion matrix, many works in the literature addressing the problem of unbalanced datasets (e.g., as happens in the credit scoring context taken into account in this paper) adopt more than one metric to correctly evaluate their results.
2.5 Data Transformation
Some basic concepts, related to data discretization
and enrichment processes, are briefly introduced in
this section, along with the reasons why we decided
to use them for defining the proposed DED model.
2.5.1 Discretization
Many algorithms must have knowledge of the type and domain of the data on which they operate and, in addition, some of them (e.g., Decision Trees) require categorical feature values (García et al., 2016), constraining us to preprocess the continuous feature values through a discretization method.
The process of data discretization is largely
adopted in literature as an effective data preprocess-
⁵ The 2×2 matrix that contains the number of True Negatives (TN), False Negatives (FN), True Positives (TP), and False Positives (FP).
ing technique (Liu et al., 2002). Its goal is to transform the feature values from a quantitative to a qualitative form, by dividing the range of each feature into a discrete number of non-overlapping intervals. This means that each numerical feature value (continuous or discrete) is mapped into one of these intervals, improving the effectiveness of many machine learning algorithms (Wu and Kumar, 2009) that deal with real-world data usually characterized by continuous values.
However, beyond the algorithms that require a discretized data input, the discretization process presents additional advantages, such as the data dimensionality reduction that leads towards faster and more accurate learning (García et al., 2016), or the improvement in terms of data understandability given by the discretization of the original continuous values (Liu et al., 2002).
On the one hand, the main disadvantage of a discretization process is the loss of information that occurs during the transformation of continuous values into discrete ones; on the other hand, an optimal discretization of the original data is an NP-complete⁶ problem.
In the DED model proposed in this paper, the data
discretization process produces a twofold advantage:
the first one related to the aforementioned benefits for
the involved machine learning algorithms; the second
one related to the reduction of the possible feature
patterns, since all the continuous values have been
mapped to a limited range of discrete values.
By way of example, Figure 2 shows the discretization of four feature values defined in a continuous range of values [0, 100] into the discrete range {0, 1, ..., 10}.
2.5.2 Enrichment
The literature describes data enrichment as a process adopted in order to improve a data domain through additional information, such as
meta-features. For instance, the work presented
in (Giraud-Carrier et al., 2004) defines a set of meta-
features able to improve the prediction performance
of the learning algorithms taken into account.
The meta-features are usually defined by aggre-
gating some original features, according to a specific
metric (e.g., minimum value, maximum value, mean value, standard deviation, etc.), which can be calcu-
lated in the space of a single dataset instance (row)
or in the context of the entire dataset (Castiello et al.,
2005).
⁶ Computational complexity theory defines a problem as NP-complete when its solution requires a restricted class of brute-force search algorithms.
It should be observed that the meta-features are
largely used in the field of Meta Learning (Vilalta and
Drissi, 2002), a branch of machine learning that ex-
ploits automatic learning algorithms on meta-data in
the context of machine learning processes.
For the purposes of this paper, we exploit them to counterbalance the loss of information that is a consequence of the applied discretization process, adding further information aimed at better characterizing the involved classes of information (i.e., reliable and unreliable). More formally, given a set of discretized features {d1, d2, ..., dX}, we add a series of meta-features {m1, m2, ..., mY} to them, obtaining a new set of features, as shown in Equation 1.

$$d_{1,1}, d_{1,2}, \ldots, d_{1,X}, m_{1,X+1}, m_{1,X+2}, \ldots, m_{1,X+Y} \qquad (1)$$
3 NOTATION AND PROBLEM
DEFINITION
This section describes the formal notation adopted in
this paper and defines the addressed problem.
3.1 Formal Notation
Given a set I of already classified instances, composed of a subset I+ ⊆ I of reliable cases and a subset I− ⊆ I of unreliable cases, and a set Î of unclassified instances, considering that an instance is composed of a set of features F and that it belongs to only one class in the set C, we define the formal notation adopted in this paper as reported in Table 1.

Table 1: Formal Notation.

  Notation                              Description                        Note
  I  = {i1, i2, ..., iX}                Set of classified instances
  I+ = {i+1, i+2, ..., i+Y}             Subset of reliable instances       I+ ⊆ I
  I− = {i−1, i−2, ..., i−W}             Subset of unreliable instances     I− ⊆ I
  Î  = {î1, î2, ..., îZ}                Set of unclassified instances
  F  = {f1, f2, ..., fN}                Set of instance features
  C  = {reliable, unreliable}           Set of instance classifications
3.2 Problem Definition
We can formalize our objective as shown in Equation 2, where the function f(î, I) evaluates the classification of the instance î, performed by exploiting the available information in the set I. It returns a binary value β, where 0 denotes a misclassification and 1 denotes a correct classification. Therefore, our objective is the maximization of the value σ, which represents the sum of the β values returned by the function f(î, I).

$$\max_{0 \le \sigma \le |\hat{I}|} \sigma = \sum_{z=1}^{|\hat{I}|} f(\hat{i}_z, I) \qquad (2)$$
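By way of a worked illustration of Equation 2, the following minimal Python sketch (with hypothetical labels; the variable names are not part of the original formulation) counts the correct classifications over the out-of-sample set:

```python
import numpy as np

# Hypothetical ground-truth and predicted classes for the unclassified set
# (0 = reliable, 1 = unreliable, as in the binarization of Section 5.4.3).
y_true = np.array([0, 0, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 1])

# f(i, I) returns beta = 1 for a correct classification and 0 otherwise;
# sigma, the quantity to maximize, is the sum of the beta values.
beta = (y_true == y_pred).astype(int)
sigma = int(beta.sum())
print(f"sigma = {sigma} out of |I_hat| = {len(y_true)}")
```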
4 APPROACH FORMALIZATION
The proposed DED model has been defined and im-
plemented in a credit scoring system by performing
the following four steps:
-Data Discretization: the values of all features in the sets I and Î are discretized according to a range ∆ defined through a series of experiments performed using the in-sample data;
-Data Enrichment: a series of meta-features are defined and added to the discretized features of each instance i ∈ I and î ∈ Î;
-Data Model: the DED model to use in the context of a credit scoring machine learning algorithm is defined on the basis of the previous data processes;
-Data Classification: the DED model is implemented in a classification algorithm aimed at classifying each instance î ∈ Î as reliable or unreliable.
4.1 Data Discretization
Each feature f ∈ F in the sets I and Î has been processed in order to move the original range of values of each feature to a defined range of discrete integer values {0, 1, ..., ∆} ⊂ Z, where the value of ∆ has been determined experimentally, as reported in Section 5.4.5.

Denoting the data discretization process as f → d, we operate in order to move each feature f ∈ F from its original value to one of the values in the discrete range of integers {d1, d2, ..., d∆}. Such a process reduces the number of possible patterns of values (continuous and discrete) given by the original feature vector F, with respect to the value ∆, as shown in Equation 3.

$$\{f_1, f_2, \ldots, f_N\} \rightarrow \{d_1, d_2, \ldots, d_N\}, \quad \forall\, i \in I$$
$$\{f_1, f_2, \ldots, f_N\} \rightarrow \{d_1, d_2, \ldots, d_N\}, \quad \forall\, \hat{i} \in \hat{I} \qquad (3)$$
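A minimal sketch of this discretization step is given below; since the paper does not specify the binning scheme, uniform-width binning based on per-feature minimum and maximum values is assumed here, and the function name is illustrative.

```python
import numpy as np

def discretize(X, delta, lo=None, hi=None):
    """Map every feature value to an integer in {0, ..., delta} via
    equal-width bins; lo/hi are per-feature bounds (e.g., estimated on
    the in-sample data and reused on the out-of-sample data)."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0) if lo is None else lo
    hi = X.max(axis=0) if hi is None else hi
    span = np.where(hi > lo, hi - lo, 1.0)        # avoid division by zero
    scaled = np.clip((X - lo) / span, 0.0, 1.0)   # original range -> [0, 1]
    return np.floor(scaled * delta).astype(int)

# Example: continuous values in [0, 100] mapped to {0, ..., 10} (cf. Figure 2).
values = np.array([[5.0], [19.0], [41.0], [71.0]])
print(discretize(values, delta=10, lo=np.array([0.0]), hi=np.array([100.0])).ravel())
```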
4.2 Data Enrichment
After performing the discretization process previously described, the new feature vector {d1, d2, ..., dN} of each instance in the sets I and Î has been enriched by adding some meta-features we denote as µ. These meta-features have been calculated in the context of each feature vector and they are: Minimum (m), Maximum (M), Average (A), and Standard Deviation (S), so that µ = {m, M, A, S}.

This new process mitigates the pattern reduction operated during the discretization by adding a series of information able to improve the characterization of the instances, and it is formalized in Equation 4.

$$\mu = \begin{cases} m = \min(d_1, d_2, \ldots, d_N) \\ M = \max(d_1, d_2, \ldots, d_N) \\ A = \frac{1}{N}\sum_{n=1}^{N} d_n \\ S = \sqrt{\frac{1}{N-1}\sum_{n=1}^{N} (d_n - \bar{d})^2} \end{cases} \qquad (4)$$
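The enrichment of Equation 4 can be sketched per instance (row) as follows; the helper name is illustrative, and the sample standard deviation (ddof=1) matches the 1/(N−1) term.

```python
import numpy as np

def add_meta_features(D):
    """Append the per-row minimum, maximum, average and standard
    deviation of the discretized features as four extra columns."""
    D = np.asarray(D, dtype=float)
    m = D.min(axis=1, keepdims=True)
    M = D.max(axis=1, keepdims=True)
    A = D.mean(axis=1, keepdims=True)
    S = D.std(axis=1, ddof=1, keepdims=True)  # ddof=1 -> 1/(N-1), as in Equation 4
    return np.hstack([D, m, M, A, S])

D = np.array([[0, 1, 4, 7],
              [2, 2, 3, 9]])
print(add_meta_features(D))
```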
4.3 Data Model
As a result of the data discretization and data enrichment processes, we obtain our new DED data model, where the original values f ∈ F assumed by each instance feature have been transformed into a new value, in accord with the experimentally defined value ∆, and the number of features has been extended with a series µ = {m, M, A, S} of new meta-features, as formalized in Equation 5. It should be noted that, for the sake of simplicity, the equation refers to the set I only, but the formalization is the same for the set Î.

$$DED(I) = \begin{pmatrix}
d_{1,1} & d_{1,2} & \ldots & d_{1,N} & m_{1,N+1} & M_{1,N+2} & A_{1,N+3} & S_{1,N+4} \\
d_{2,1} & d_{2,2} & \ldots & d_{2,N} & m_{2,N+1} & M_{2,N+2} & A_{2,N+3} & S_{2,N+4} \\
\vdots  & \vdots  & \ddots & \vdots  & \vdots    & \vdots    & \ddots    & \vdots    \\
d_{X,1} & d_{X,2} & \ldots & d_{X,N} & m_{X,N+1} & M_{X,N+2} & A_{X,N+3} & S_{X,N+4}
\end{pmatrix} \qquad (5)$$
4.4 Data Classification
Finally, in the last step of our approach, we implement the new DED model in the classification Algorithm 1, in order to perform the classification of each unclassified instance î ∈ Î.

At step 1, the procedure takes the following parameters as input: a classification algorithm alg, the set of classified instances I, and the set of unclassified instances Î. The data transformation related to our DED approach is performed on these sets of data at steps 2 and 3, and the transformed data of the set I is used to train the algorithm alg at step 4. The classification process is performed at steps 5 to 8 for each instance in the set Î, and the classifications are stored in out. At the end of the classification process, the results are returned at step 9.
5 EXPERIMENTS
This section presents the experimental environment,
the adopted real-world datasets, the assessment met-
rics, the experimental strategy, and the obtained re-
sults.
Algorithm 1: Instance classification.
Require: alg = Classifier, I = Classified instances, Î = Unclassified instances
Ensure: out = Classification of instances in Î
 1: procedure INSTANCECLASSIFICATION(alg, I, Î)
 2:     I″ ← getDED(I)
 3:     Î″ ← getDED(Î)
 4:     model ← ClassifierTraining(alg, I″)
 5:     for each î″ ∈ Î″ do
 6:         c ← getClass(model, î″)
 7:         out.add(c)
 8:     end for
 9:     return out
10: end procedure
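The sketch below mirrors Algorithm 1 using scikit-learn's GradientBoostingClassifier (the classifier later selected in Section 5.4.4) and a compact, assumed implementation of the DED transform (uniform discretization plus row-wise meta-features); it is an illustrative reconstruction under these assumptions, not the authors' released code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def get_ded(X, delta):
    """Illustrative DED transform: equal-width discretization into
    {0, ..., delta} followed by per-row meta-features (min, max, mean, std)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    D = np.floor(np.clip((X - lo) / span, 0.0, 1.0) * delta)
    meta = np.column_stack([D.min(axis=1), D.max(axis=1),
                            D.mean(axis=1), D.std(axis=1, ddof=1)])
    return np.hstack([D, meta])

def instance_classification(alg, X_train, y_train, X_unclassified, delta):
    """Algorithm 1: transform both sets with getDED, train alg, classify."""
    model = alg.fit(get_ded(X_train, delta), y_train)
    return model.predict(get_ded(X_unclassified, delta))

# Toy usage (0 = reliable, 1 = unreliable); random_state=1 as in Section 5.1.
rng = np.random.default_rng(1)
X_in, y_in = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)
X_out = rng.normal(size=(20, 5))
clf = GradientBoostingClassifier(random_state=1)
print(instance_classification(clf, X_in, y_in, X_out, delta=25))
```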
5.1 Environment
The code related to the performed experiments has been written in the Python language, using the scikit-learn⁷ library. In addition, in order to guarantee the reproducibility of the experiments, we have fixed the seed of the pseudo-random number generator to 1 in the scikit-learn code (i.e., the random_state parameter).
5.2 Datasets
The German Credit (GC) and the Default of Credit Card Clients (DC) are the real-world datasets we selected in order to validate the proposed DED model. They represent two benchmarks in the credit scoring research context and they are characterized by different sizes and levels of data unbalance (as shown in Table 2), reproducing different data configurations in the credit scoring scenario. Both of them are freely downloadable from the UCI Repository of Machine Learning Databases⁸.
Noting that each instance (i.e., each dataset row) in these datasets is numerically classified as reliable or unreliable, in the following we briefly describe them:
the GC dataset is composed of 1,000 instances,
of which 700 classified as reliable (70.00%) and
300 classified as unreliable (30.00%), and each
instance is characterized by 20 features, as de-
tailed in Table 3.
the DC dataset is composed of 30,000 instances,
of which 23,364 classified as reliable (77.88%)
and 6,636 classified as unreliable (22.12%), and
each instance is characterized by 23 features, as
detailed in Table 4;
⁷ http://scikit-learn.org
⁸ ftp://ftp.ics.uci.edu/pub/machine-learning-databases/statlog/
Table 2: Datasets composition.

  Dataset name   Total instances   Reliable instances   Unreliable instances   Number of features
  GC             1,000             700                  300                    21
  DC             30,000            23,364               6,636                  23
Table 3: Features of GC Dataset.
Field Feature Field Feature
01 Status of checking account 11 Present residence since
02 Duration 12 Property
03 Credit history 13 Age
04 Purpose 14 Other installment plans
05 Credit amount 15 Housing
06 Savings account/bonds 16 Existing credits
07 Present employment since 17 Job
08 Installment rate 18 Maintained people
09 Personal status and sex 19 Telephone
10 Other debtors/guarantors 20 Foreign worker
Table 4: Features of DC Dataset.
Field Feature Field Feature
01 Credit amount 13 Bill statement in August 2005
02 Gender 14 Bill statement in July 2005
03 Education 15 Bill statement in June 2005
04 Marital status 16 Bill statement in May 2005
05 Age 17 Bill statement in April 2005
06 Repayments in September 2005 18 Amount paid in September 2005
07 Repayments in August 2005 19 Amount paid in August 2005
08 Repayments in July 2005 20 Amount paid in July 2005
09 Repayments in June 2005 21 Amount paid in June 2005
10 Repayments in May 2005 22 Amount paid in May 2005
11 Repayments in April 2005 23 Amount paid in April 2005
12 Bill statement in September 2005
5.3 Metrics
In order to assess the performance of the proposed DED model with regard to a canonical data model, we have adopted three different metrics.
The first one is the Sensitivity, a metric based on the confusion matrix that reports the true positive rate of the performed classification, i.e., the capability of the evaluation model to correctly classify the reliable instances.
The second one is the Matthews Correlation Coefficient, which is also based on the confusion matrix and is able to evaluate the effectiveness of the evaluation model in terms of distinguishing the reliable instances from the unreliable ones; for this reason, it is commonly used to evaluate the performance of a binary evaluation model.
The third metric is based on the Receiver Operating Characteristic (ROC) curve. It is the Area Under the ROC Curve (AUC), a metric largely used for its capability to evaluate the predictive power of an evaluation model, even when the involved data is characterized by a high degree of unbalance.
All the aforementioned metrics are formalized in
the following sections.
5.3.1 Sensitivity
According to the formal notation provided in Section 3.1, the formalization of the Sensitivity metric is shown in Equation 6, where Î denotes the set of unclassified instances, TP is the number of instances correctly classified as reliable, and FN is the number of reliable instances wrongly classified as unreliable. This gives us the proportion of reliable instances which are correctly classified by an evaluation model (Bequé and Lessmann, 2017).

$$Sensitivity(\hat{I}) = \frac{TP}{TP + FN} \qquad (6)$$
5.3.2 Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) performs a balanced evaluation and it also works well with data imbalance (Luque et al., 2019; Boughorbel et al., 2017). Its formalization is shown in Equation 7 and its result is a value in the range [−1, +1], with +1 when all the classifications are correct and −1 when all the classifications are wrong, whereas 0 indicates the performance related to a random predictor. It should be observed how such a metric can be seen as a discretization of the Pearson correlation (Benesty et al., 2009) for binary variables.

$$MCC = \frac{(TP \cdot TN) - (FP \cdot FN)}{\sqrt{(TP+FP)\cdot(TP+FN)\cdot(TN+FP)\cdot(TN+FN)}} \qquad (7)$$
5.3.3 AUC
As reported in a large number of studies in the literature (Abellán and Castellano, 2017; Powers, 2011), the Area Under the Receiver Operating Characteristic curve (AUC) represents a reliable metric for the evaluation of the performance of a credit scoring model. More formally, given the subsets of reliable and unreliable instances in the set I, respectively I+ and I−, the possible comparisons κ of the scores of each instance i are formalized in Equation 8, whereas the AUC is obtained by averaging over them, as formalized in Equation 9. It returns a value in the range [0, 1], where 1 denotes the best performance.

$$\kappa(i^+, i^-) = \begin{cases} 1, & \text{if } i^+ > i^- \\ 0.5, & \text{if } i^+ = i^- \\ 0, & \text{if } i^+ < i^- \end{cases} \qquad (8)$$

$$AUC = \frac{1}{|I^+| \cdot |I^-|} \sum_{1}^{|I^+|} \sum_{1}^{|I^-|} \kappa(i^+, i^-) \qquad (9)$$
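For reference, all three metrics are available in scikit-learn; a brief sketch with hypothetical labels and scores follows, where the reliable class is encoded as 1 (the positive class) purely for illustration:

```python
import numpy as np
from sklearn.metrics import recall_score, matthews_corrcoef, roc_auc_score

# Hypothetical ground truth and classifier outputs; here 1 marks the
# positive (reliable) class and y_score is the score assigned to it.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.2, 0.6, 0.7, 0.1, 0.95])

print("Sensitivity:", recall_score(y_true, y_pred, pos_label=1))  # TP / (TP + FN), Eq. 6
print("MCC        :", matthews_corrcoef(y_true, y_pred))          # Eq. 7
print("AUC        :", roc_auc_score(y_true, y_score))             # Eq. 9
```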
5.4 Strategy
Here we report all the details of the experimental strategy, from the choice of the best state-of-the-art algorithm to the definition of the discretization range ∆.
5.4.1 Algorithm Selection
In order to evaluate the benefits of our DED model, we perform a series of experiments aimed at selecting the most performing state-of-the-art algorithm to use as a competitor. This means that we compare the performance of this algorithm before and after applying our data model.
For this task we have taken into account the
following five machine learning algorithms, since
they represent the most performing and widely used
ones in credit scoring literature: Gradient Boost-
ing (GBC) (Chopra and Bhilare, 2018); Adaptive
Boosting (ADA) (Xia et al., 2017); Random Forests
(RFA) (Malekipirbazari and Aksakalli, 2015); Multi-
layer Perceptron (MLP) (Luo et al., 2017); Decision
Tree (DTC) (Damrongsakmethee and Neagoe, 2019).
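All five candidate algorithms are available in scikit-learn; the configuration below is only a plausible setup, since the paper does not report hyper-parameters beyond the fixed seed of Section 5.1.

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Candidate classifiers with the fixed seed; all other hyper-parameters
# are left at their scikit-learn defaults (an assumption of this sketch).
candidates = {
    "GBC": GradientBoostingClassifier(random_state=1),
    "ADA": AdaBoostClassifier(random_state=1),
    "RFA": RandomForestClassifier(random_state=1),
    "MLP": MLPClassifier(random_state=1, max_iter=500),
    "DTC": DecisionTreeClassifier(random_state=1),
}
```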
5.4.2 Evaluation Criteria
The proposed DED model has been evaluated by dividing each dataset into two parts: the first one (in-sample) is used to identify the most performing approach to use as a competitor and to define the ∆ parameter of our model, which will be applied to the selected algorithm in order to assess its benefits, whereas the second one (out-of-sample) is used for the performance comparison.
This kind of strategy, analogously to other stud-
ies in the literature (Rapach and Wohar, 2006), allows
us to evaluate the results, by preventing the algorithm
selection and parameter definition process from intro-
ducing bias by over-fitting (Hawkins, 2004), a risk re-
lated to the use of a canonical k-fold cross-validation
process of data validation.
For this reason, each of the adopted datasets (i.e., GC and DC) has been divided into an in-sample part (50%) and an out-of-sample part (50%). In addition, with the aim of further reducing the impact of data dependency, within each of these subsets we have adopted a k-fold cross-validation criterion (k=5).
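A sketch of this validation scheme, under the assumption of a random 50/50 split with the fixed seed of Section 5.1, is reported below; X and y stand for the feature matrix and labels of one dataset (toy data is used here).

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# X, y: feature matrix and labels of one dataset (toy data used here).
rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 20)), rng.integers(0, 2, size=1000)

# 50% in-sample (algorithm/parameter selection), 50% out-of-sample (final test).
X_in, X_out, y_in, y_out = train_test_split(X, y, test_size=0.5, random_state=1)

# 5-fold cross-validation restricted to the in-sample part only.
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5).split(X_in)):
    print(f"fold {fold}: {len(train_idx)} training / {len(val_idx)} validation instances")
```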
5.4.3 Data Preprocessing
Before the experiments, we preprocessed the datasets
through a binarization method aimed to transform
each instance classification (when required) from its
original form to the binary form 0=reliable and 1=un-
reliable.
The literature (Ghodselahi, 2011; Wang and Huang, 2009) suggests converting the feature values to the same range of values in order to better expose the data structure to the machine learning algorithms, allowing them to achieve better performance or to converge faster. Accordingly, we decided to verify the performance improvement related to the adoption of two largely used preprocessing methods: normalization and standardization.
The first method rescales each feature value f into the range [0, 1], whereas the second one (also known as Z-score normalization) rescales the feature values so that they assume the properties of a Gaussian distribution with µ = 0 and σ = 1, where µ denotes the mean and σ the standard deviation from that mean, according to Equation 10.

$$f'' = \frac{f - \mu}{\sigma} \qquad (10)$$
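Both preprocessing methods correspond to standard scikit-learn transformers; a brief sketch:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_norm = MinMaxScaler().fit_transform(X)   # each feature rescaled into [0, 1]
X_std = StandardScaler().fit_transform(X)  # each feature with mean 0, unit variance (Eq. 10)
print(X_norm)
print(X_std)
```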
Table 5 reports the mean performance (i.e., in terms of the Accuracy, MCC, and AUC metrics) measured on all datasets and all algorithms after the application of the aforementioned data preprocessing methods, along with that measured without any data preprocessing. The best performances are highlighted in bold; also in this case, all the performed experiments involve only the in-sample part of each dataset.
The results indicate that data preprocessing through the normalization and standardization methods does not lead to a significant improvement in terms of overall mean performance, since 5 times out of 10 we obtain a better performance without using any data preprocessing (against 3 out of 10 and 2 out of 10). For this reason, we decided not to apply any data preprocessing method in the paper experiments.
5.4.4 Competitor Selection
On the basis of the algorithms introduced in Sec-
tion 5.4.1, the evaluation criteria defined in Sec-
tion 5.4.2, and the data preprocessing performed
as described in Section 5.4.3, we selected Gradient
Boosting (GBC) as the competitor algorithm to use
in order to evaluate the effectiveness of the proposed
DED model.
It has been selected since the mean value of the Gradient Boosting performance (i.e., in terms of Sensitivity, MCC, and AUC) measured on all datasets is better than that of the other algorithms taken into account, as shown in Table 6.
[Figure 3: Out-of-sample Discretization Range Definition — Performance as a function of the discretization range ∆ (0 to 200) for the GC and DC datasets.]
5.4.5 Discretization Range Definition
According to the previous steps, this new set of experiments is aimed at defining the optimal discretization range ∆ by using the selected algorithm (i.e., Gradient Boosting) in the context of the in-sample part of each dataset.

The obtained results are shown in Figure 3, where Performance denotes the average value of the Accuracy, MCC, and AUC metrics, i.e., (Accuracy + MCC + AUC)/3. They indicate 25 and 187 as the optimal ∆ values for the GC and DC datasets, respectively.
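A sketch of how such a ∆ selection can be carried out on the in-sample part, using the average of Accuracy, MCC, and AUC as the selection score, is shown below; the candidate grid and the get_ded helper (the illustrative DED transform sketched after Algorithm 1) are assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef, roc_auc_score
from sklearn.model_selection import KFold

def delta_score(X_in, y_in, delta, get_ded, n_splits=5):
    """Mean of Accuracy, MCC and AUC over a k-fold run on the in-sample data."""
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits).split(X_in):
        clf = GradientBoostingClassifier(random_state=1)
        clf.fit(get_ded(X_in[train_idx], delta), y_in[train_idx])
        X_val = get_ded(X_in[val_idx], delta)
        pred = clf.predict(X_val)
        proba = clf.predict_proba(X_val)[:, 1]
        scores.append((accuracy_score(y_in[val_idx], pred)
                       + matthews_corrcoef(y_in[val_idx], pred)
                       + roc_auc_score(y_in[val_idx], proba)) / 3.0)
    return float(np.mean(scores))

# Illustrative grid search over candidate ranges, e.g. delta in {5, 10, ..., 200}:
# best_delta = max(range(5, 201, 5), key=lambda d: delta_score(X_in, y_in, d, get_ded))
```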
Table 5: Mean Performance After Features Preprocessing.
Algorithm Dataset Non-preprocessed Normalized Standardized
GBC GC 0.5614 0.5942 0.6007
ADA GC 0.5766 0.6246 0.5861
RFA GC 0.5540 0.5614 0.5579
MLP GC 0.6114 0.5649 0.5589
DTC GC 0.5796 0.5456 0.5521
GBC DC 0.6087 0.5442 0.6076
ADA DC 0.6031 0.5361 0.5980
RFA DC 0.5613 0.4909 0.5586
MLP DC 0.4613 0.6177 0.5985
DTC DC 0.4982 0.4572 0.5185
Table 6: Algorithms Performance.

  Algorithm   Sensitivity   MCC      AUC      Mean
  GBC         0.8325        0.7065   0.4463   0.6617
  ADA         0.8204        0.6943   0.4283   0.6477
  RFA         0.8216        0.6939   0.4344   0.6500
  MLP         0.7501        0.6038   0.2317   0.5285
  DTC         0.8222        0.6791   0.3605   0.6206
5.5 Results
This section presents and discusses the results of the
experiments, with the aim to assess the effectiveness
of the proposed model with regard to a canonical one.
[Figure 4: Out-of-sample Classification Performance (canonical model vs. DED model). Sensitivity: GC 0.7889 vs. 0.8029, DC 0.8308 vs. 0.8330; MCC: GC 0.2708 vs. 0.3393, DC 0.3555 vs. 0.3764; AUC: GC 0.6244 vs. 0.6534, DC 0.6399 vs. 0.6464.]
5.5.1 Results Presentation
In this set of experiments, we apply the algorithm and the ∆ value selected through the previous experiments, described in Section 5.4, in order to evaluate the capability of the proposed DED model with regard to a canonical data model based on the original feature space.
5.5.2 Results Analysis
The analysis of the experimental results leads toward
the following considerations:
- in terms of single metrics of evaluation, Figure 4
shows that our DED model outperforms the canoni-
cal one in terms of Sensitivity,MCC, and AUC met-
rics, in both datasets;
- the improvement measured in terms of Sensitivity is not related to a degradation of the MCC and AUC performance, meaning that the increase in the true positive rate is not accompanied by an increase in the false positive rate;
- considering that the used GC and DC datasets are
characterized by different size (respectively, 1,000
and 30,000 samples) and level of data unbalance
(respectively, 30.00% and 22.12% of unreliable
samples), our model has proved its effectiveness in
different credit scoring scenarios;
- it should be noted that the adopted in-sample/out-of-sample validation strategy has further increased the data unbalance in the GC dataset, since its out-of-sample part contains 27.20% of unreliable samples (22.51% in the DC dataset);

[Figure 5: Overall Performance — mean value of all metrics for the canonical model and the DED model on the GC and DC datasets.]
- combining the in-sample/out-of-sample validation
strategy with the canonical k-fold cross-validation
criterion allowed us to verify the effectiveness of
the proposed model on never seen before data,
therefore without over-fitting;
- the DED model proposed in this paper has proved its effectiveness in terms of instance characterization, by exploiting a combined approach based on data discretization/enrichment, outperforming the best machine learning algorithm based on a canonical data model, which we selected on the same data (i.e., the in-sample part) used to tune the discretization range ∆;
- in conclusion, since the proposed data model is
able to improve the overall performance (i.e., mean
value of all metrics) of a machine learning algo-
rithm, as shown in Figure 5, it can be exploited in
many state-of-the-art solutions based on machine
learning algorithms, such as, for instance, those
based on hybrid or ensemble configurations.
6 CONCLUSIONS AND FUTURE
WORK
The growth in importance and use of credit scoring tools has led to an increasing number of research activities aimed at devising more and more effective methods and strategies.
Similarly to other scenarios, characterized by a
data unbalance such as, for instance, the Fraud Detec-
tion or the Intrusion Detection ones, even in this sce-
nario a slight performance improvement of a classifi-
cation model produces enormous advantages, which
in our case are related to the reduction of the financial
losses.
The DED model proposed in this paper has proved
that the transformation of the original feature space,
made by applying a discretization and an enrichment
process, improves the performance of one of the most performing machine learning algorithms (i.e., Gradient Boosting).
This result opens up new perspectives for the
definition of more effective credit scoring solutions,
considering that many state-of-the-art approaches are
based on machine learning algorithms (e.g., those that
perform credit scoring in the context of rating agen-
cies, financial institutions, etc.).
As future work we want to test the effectiveness
of the proposed data model in the context of credit
scoring solutions that implement more than a sin-
gle machine learning algorithm, such as, for exam-
ple, the homogeneous and heterogeneous ensemble
approaches.
ACKNOWLEDGEMENTS
This research is partially funded by Italian Ministry
of Education, University and Research - Program
Smart Cities and Communities and Social Innovation
project ILEARNTV (D.D. n.1937 del 05.06.2014,
CUP F74G14000200008 F19G14000910008). We
gratefully acknowledge the support of NVIDIA Cor-
poration with the donation of the Titan Xp GPU used
for this research.
REFERENCES
Abellán, J. and Castellano, J. G. (2017). A compara-
tive study on base classifiers in ensemble methods
for credit scoring. Expert Systems with Applications,
73:1–10.
Ala’raj, M. and Abbod, M. F. (2016). A new hybrid ensem-
ble credit scoring model based on classifiers consen-
sus system approach. Expert Systems with Applica-
tions, 64:36–55.
Barboza, F., Kimura, H., and Altman, E. (2017). Machine
learning models and bankruptcy prediction. Expert
Systems with Applications, 83:405–417.
Benesty, J., Chen, J., Huang, Y., and Cohen, I. (2009).
Pearson correlation coefficient. In Noise reduction in
speech processing, pages 1–4. Springer.
Bequé, A. and Lessmann, S. (2017). Extreme learning ma-
chines for credit scoring: An empirical evaluation. Ex-
pert Systems with Applications, 86:42–53.
Bijak, K., Mues, C., So, M.-C., and Thomas, L. (2015).
Credit card market literature review: Affordability and
repayment.
Boratto, L., Carta, S., Fenu, G., and Saia, R. (2016). Us-
ing neural word embeddings to model user behavior
and detect user segments. Knowledge-based systems,
108:5–14.
Boughorbel, S., Jarray, F., and El-Anbari, M. (2017). Opti-
mal classifier for imbalanced data using matthews cor-
relation coefficient metric. PloS one, 12(6):e0177678.
Brown, I. and Mues, C. (2012). An experimental compari-
son of classification algorithms for imbalanced credit
scoring data sets. Expert Systems with Applications,
39(3):3446–3453.
Castiello, C., Castellano, G., and Fanelli, A. M. (2005).
Meta-data: Characterization of input features for
meta-learning. In International Conference on Mod-
eling Decisions for Artificial Intelligence, pages 457–
468. Springer.
Chai, T. and Draxler, R. R. (2014). Root mean square er-
ror (rmse) or mean absolute error (mae)?–arguments
against avoiding rmse in the literature. Geoscientific
model development, 7(3):1247–1250.
Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Special
issue on learning from imbalanced data sets. ACM
Sigkdd Explorations Newsletter, 6(1):1–6.
Chen, N., Ribeiro, B., and Chen, A. (2016). Financial credit
risk assessment: a recent review. Artificial Intelli-
gence Review, 45(1):1–23.
Chopra, A. and Bhilare, P. (2018). Application of ensemble
models in credit scoring models. Business Perspec-
tives and Research, 6(2):129–141.
Crook, J. N., Edelman, D. B., and Thomas, L. C. (2007).
Recent developments in consumer credit risk assess-
ment. European Journal of Operational Research,
183(3):1447–1465.
Dal Pozzolo, A., Caelen, O., Le Borgne, Y.-A., Wa-
terschoot, S., and Bontempi, G. (2014). Learned
lessons in credit card fraud detection from a practi-
tioner perspective. Expert systems with applications,
41(10):4915–4928.
Damrongsakmethee, T. and Neagoe, V.-E. (2019). Principal
component analysis and relieff cascaded with decision
tree for credit scoring. In Computer Science On-line
Conference, pages 85–95. Springer.
Doumpos, M., Lemonakis, C., Niklis, D., and Zopounidis,
C. (2019). Credit scoring and rating. In Analytical
Techniques in the Assessment of Credit Risk, pages
23–41. Springer.
García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M.,
and Herrera, F. (2016). Big data preprocessing: meth-
ods and prospects. Big Data Analytics, 1(1):9.
Ghodselahi, A. (2011). A hybrid support vector machine
ensemble model for credit scoring. International
Journal of Computer Applications, 17(5):1–5.
Giraud-Carrier, C., Vilalta, R., and Brazdil, P. (2004). Intro-
duction to the special issue on meta-learning. Machine
learning, 54(3):187–193.
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue,
H., and Bing, G. (2017). Learning from class-
imbalanced data: Review of methods and applica-
tions. Expert Systems with Applications, 73:220–239.
Harris, T. (2015). Credit scoring using the clustered sup-
port vector machine. Expert Systems with Applica-
tions, 42(2):741–750.
Hassan, M. K., Brodmann, J., Rayfield, B., and Huda, M.
(2018). Modeling credit risk in credit unions using
survival analysis. International Journal of Bank Mar-
keting, 36(3):482–495.
Hawkins, D. M. (2004). The problem of overfitting. Jour-
nal of chemical information and computer sciences,
44(1):1–12.
Henrique, B. M., Sobreiro, V. A., and Kimura, H. (2019).
Literature review: Machine learning techniques ap-
plied to financial market prediction. Expert Systems
with Applications.
Huang, J. and Ling, C. X. (2005). Using auc and accuracy
in evaluating learning algorithms. IEEE Transactions
on knowledge and Data Engineering, 17(3):299–310.
Jappelli, T., Pagano, M., et al. (2000). Information sharing
in credit markets: a survey. Technical report, CSEF
working paper.
Jeni, L. A., Cohn, J. F., and De La Torre, F. (2013). Fac-
ing imbalanced data–recommendations for the use of
performance metrics. In 2013 Humaine Association
Conference on Affective Computing and Intelligent In-
teraction, pages 245–251. IEEE.
Khemais, Z., Nesrine, D., Mohamed, M., et al. (2016).
Credit scoring and default risk prediction: A compar-
ative study between discriminant analysis & logistic
regression. International Journal of Economics and
Finance, 8(4):39.
Khemakhem, S., Ben Said, F., and Boujelbene, Y. (2018).
Credit risk assessment for unbalanced datasets based
on data mining, artificial neural network and support
vector machines. Journal of Modelling in Manage-
ment, 13(4):932–951.
Lessmann, S., Baesens, B., Seow, H., and Thomas, L. C.
(2015a). Benchmarking state-of-the-art classifica-
tion algorithms for credit scoring: An update of re-
search. European Journal of Operational Research,
247(1):124–136.
Lessmann, S., Baesens, B., Seow, H.-V., and Thomas,
L. C. (2015b). Benchmarking state-of-the-art classi-
fication algorithms for credit scoring: An update of
research. European Journal of Operational Research,
247(1):124–136.
Li, Q., Wu, Q., Zhu, C., Zhang, J., and Zhao, W. (2019).
Unsupervised user behavior representation for fraud
review detection with cold-start problem. In Pacific-
Asia Conference on Knowledge Discovery and Data
Mining, pages 222–236. Springer.
Liu, C., Huang, H., and Lu, S. (2019). Research on personal
credit scoring model based on artificial intelligence. In
International Conference on Application of Intelligent
Systems in Multi-modal Information Analytics, pages
466–473. Springer.
Liu, H., Hussain, F., Tan, C. L., and Dash, M. (2002). Dis-
cretization: An enabling technique. Data mining and
knowledge discovery, 6(4):393–423.
Livshits, I. (2015). Recent developments in consumer credit
and default literature. Journal of Economic Surveys,
29(4):594–613.
Luo, C., Wu, D., and Wu, D. (2017). A deep learn-
ing approach for credit scoring using credit default
swaps. Engineering Applications of Artificial Intel-
ligence, 65:465–470.
Luque, A., Carrasco, A., Martín, A., and de las Heras, A.
(2019). The impact of class imbalance in classifica-
tion performance metrics based on the binary confu-
sion matrix. Pattern Recognition, 91:216–231.
Malekipirbazari, M. and Aksakalli, V. (2015). Risk assess-
ment in social lending via random forests. Expert Sys-
tems with Applications, 42(10):4621–4631.
Mester, L. J. et al. (1997). What’s the point of credit scor-
ing? Business review, 3:3–16.
Powers, D. M. (2011). Evaluation: from precision, recall
and f-measure to roc, informedness, markedness and
correlation.
Rapach, D. E. and Wohar, M. E. (2006). In-sample vs.
out-of-sample tests of stock return predictability in the
context of data mining. Journal of Empirical Finance,
13(2):231–247.
Rodda, S. and Erothi, U. S. R. (2016). Class imbal-
ance problem in the network intrusion detection sys-
tems. In 2016 International Conference on Electrical,
Electronics, and Optimization Techniques (ICEEOT),
pages 2685–2688. IEEE.
Saia, R. (2017). A discrete wavelet transform approach to
fraud detection. In International Conference on Net-
work and System Security, pages 464–474. Springer.
Saia, R. and Carta, S. (2016a). An entropy based algorithm
for credit scoring. In International Conference on Re-
search and Practical Issues of Enterprise Information
Systems, pages 263–276. Springer.
Saia, R. and Carta, S. (2016b). Introducing a vector space
model to perform a proactive credit scoring. In In-
ternational Joint Conference on Knowledge Discov-
ery, Knowledge Engineering, and Knowledge Man-
agement, pages 125–148. Springer.
Saia, R. and Carta, S. (2016c). A linear-dependence-based
approach to design proactive credit scoring models. In
KDIR, pages 111–120.
Saia, R. and Carta, S. (2017). A fourier spectral pattern
analysis to design credit scoring models. In Proceed-
ings of the 1st International Conference on Internet of
Things and Machine Learning, page 18. ACM.
Saia, R., Carta, S., et al. (2017). A frequency-domain-
based pattern mining for credit card fraud detection.
In IoTBDS, pages 386–391.
Saia, R., Carta, S., and Fenu, G. (2018a). A wavelet-based
data analysis to credit scoring. In Proceedings of the
2nd International Conference on Digital Signal Pro-
cessing, pages 176–180. ACM.
Saia, R., Carta, S., and Recupero, D. R. (2018b). A
probabilistic-driven ensemble approach to perform
event classification in intrusion detection system. In
KDIR, pages 139–146. SciTePress.
Siddiqi, N. (2017). Intelligent credit scoring: Building and
implementing better credit risk scorecards. John Wi-
ley & Sons.
Sohn, S. Y., Kim, D. H., and Yoon, J. H. (2016). Technol-
ogy credit scoring model with fuzzy logistic regres-
sion. Applied Soft Computing, 43:150–158.
Tripathi, D., Edla, D. R., and Cheruku, R. (2018). Hy-
brid credit scoring model using neighborhood rough
set and multi-layer ensemble classification. Journal
of Intelligent & Fuzzy Systems, 34(3):1543–1549.
Vilalta, R. and Drissi, Y. (2002). A perspective view and
survey of meta-learning. Artificial intelligence review,
18(2):77–95.
Wang, C.-M. and Huang, Y.-F. (2009). Evolutionary-based
feature selection approaches with new criteria for data
mining: A case study of credit approval data. Expert
Systems with Applications, 36(3):5900–5908.
Wu, X. and Kumar, V. (2009). The top ten algorithms in
data mining. CRC press.
Xia, Y., Liu, C., Li, Y., and Liu, N. (2017). A boosted de-
cision tree approach using bayesian hyper-parameter
optimization for credit scoring. Expert Systems with
Applications, 78:225–241.
Zhang, X., Yang, Y., and Zhou, Z. (2018). A novel credit
scoring model based on optimized random forest. In
2018 IEEE 8th Annual Computing and Communica-
tion Workshop and Conference (CCWC), pages 60–
65. IEEE.
Zhao, Y., Shen, Y., and Huang, Y. (2019). Dmdp: A
dynamic multi-source default probability prediction
framework. Data Science and Engineering, 4(1):3–
13.
... Based on our previous experience [13][14][15][16][17][18] to deal with credit scoring, we present in this work a solution for data heterogeneity in credit scoring datasets. We do that by assessing the performance of a Two-Step Feature Space Transforming (TSFST) method we previously proposed in [19] to improve credit scoring systems. Our approach to improve features information has a twofold process, composed of (i) enrichment; and (ii) discretization phases. ...
... The work presented in this paper is based on our previously published one [19]. Notwithstanding, it has been extended in such a way to add the following new discussions and contributions: ...
... 1. Extension of the Background and Related Work section by discussing more relevant and recent state-of-the-art approaches, extending the information related to this research field with the aim to provide the readers a quite exhaustive overview on the credit scoring scenario. 2. We changed the order of operations reported in our previous work [19], as we realized that it achieves better results. 3. Inclusion of a new real-world dataset, which allows us to evaluate the performance using a dataset characterized by a low number of instances (690, which is lower than 1000 and 30000 from the other datasets) and features (14, which is also a low number if compared to 21 and 23 from the other datasets). ...
Chapter
Full-text available
The increasing amount of credit offered by financial institutions has required intelligent and efficient methodologies of credit scoring. Therefore, the use of different machine learning solutions to that task has been growing during the past recent years. Such procedures have been used in order to identify customers who are reliable or unreliable, with the intention to counterbalance financial losses due to loans offered to wrong customer profiles. Notwithstanding, such an application of machine learning suffers with several limitations when put into practice, such as unbalanced datasets and, specially, the absence of sufficient information from the features that can be useful to discriminate reliable and unreliable loans. To overcome such drawbacks, we propose in this work a Two-Step Feature Space Transforming approach, which operates by evolving feature information in a twofold operation: (i) data enhancement; and (ii) data discretization. In the first step, additional meta-features are used in order to improve data discrimination. In the second step, the goal is to reduce the diversity of features. Experiments results performed in real-world datasets with different levels of unbalancing show that such a step can improve, in a consistent way, the performance of the best machine learning algorithm for such a task. With such results we aim to open new perspectives for novel efficient credit scoring systems.
... In more detail, the adopted Data Aggregation Approach (DAA) aggregates the transaction of each user in terms of: number of transactions, activity days, minimum handled amount, maximum handled amount, total handled amount, mean handled amount, and standard deviation measured in all the transactions. Meta-feature Addition: similarly to other previous works Carta et al., 2020;Carta et al., 2019a;Saia et al., 2019a), which exploit preprocessing techniques in order to improve the performance of the machine learning algorithms, we propose a Meta-features Addition Approach (MAA) aimed to better characterize each instance in the sets I andÎ, improving the credit scoring performance. In more detail, we added a series of metafeatures MF = {m f 1 , m f 2 , . . . ...
Conference Paper
The Payment Services Directive 2 (PSD2), recently issued by the European Union, allows banks to share their customers' data if the customers authorize the operation. On the one hand, this opportunity offers interesting perspectives to financial operators, allowing them to evaluate the customers' reliability (Credit Scoring) even in the absence of the canonical information typically used (e.g., age, current job, total income, or previous loans). On the other hand, the state-of-the-art approaches and strategies still train their Credit Scoring models using the canonical information. This scenario is further worsened by the scarcity of proper datasets needed for research purposes and by the class imbalance between the reliable and unreliable cases, which biases the reliability of the classification models trained on this information. The proposed work is aimed at experimentally investigating the possibility of defining a Credit Scoring model based on the bank transactions of a customer, instead of using the canonical information, comparing the performance of the two models (canonical and transaction-based), and proposing an approach to improve the performance of the transaction-based model through the introduction of meta-features. The performed experiments show the feasibility of a Credit Scoring model based only on banking transactions and the possibility of improving its performance by introducing simple meta-features.
... Based on our previous experience (Saia et al., 2019b; Saia et al., 2019a), where we observed the positive effects resulting from the transformation of the original data feature space, here we propose a revised and improved approach, named Twofold Feature Space Transformation (TFST). It is aimed at obtaining a better characterization of the network events through a twofold process: (i) addition of meta-information in order to better characterize the network events and discriminate the normal activities from the intrusion ones; (ii) discretization of the new extended feature space in order to reduce the number of potential event patterns, decreasing the false alarm rate and improving the IDS performance. ...
... For these reasons, after having positively evaluated the effectiveness of data transformation in other fields [32][33][34], where it has been shown to achieve a better characterization of the involved information and a clear reduction of the possible feature patterns, we experimented with this strategy in the IDS domain. In more detail, the proposed LFE strategy operates by introducing a series of new features calculated on the basis of the existing ones, along with a discretization of each feature value according to an experimentally defined numeric range. ...
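As a loose illustration of this kind of strategy (an assumption-laden sketch, not the exact LFE formulation), the Python snippet below derives a few new features locally from the values of a single event and then discretizes every value into a fixed number of integer levels, assuming the features are already scaled into [0, 1].

    import numpy as np

    def local_feature_engineering(event: np.ndarray, n_levels: int = 20) -> np.ndarray:
        # event: the feature vector of a single network event, assumed here to be
        # already scaled into [0, 1]; the numeric range and n_levels are assumptions
        # standing in for the experimentally defined values mentioned in the excerpt.
        # (i) new features computed only from the values of this single event:
        extras = np.array([event.mean(), event.std(), event.max() - event.min()])
        extended = np.concatenate([event, extras])
        # (ii) discretization of every value into n_levels integer levels,
        # joining near-identical events into the same pattern.
        return np.clip(np.floor(extended * n_levels).astype(int), 0, n_levels - 1)

    # Toy usage on one random event with six features in [0, 1].
    rng = np.random.default_rng(1)
    pattern = local_feature_engineering(rng.random(6))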
Article
The dramatic increase in devices and services that has characterized modern societies in recent decades, boosted by the exponential growth of ever faster network connections and the predominant use of wireless connection technologies, has given rise to a crucial challenge in terms of security. Anomaly-based Intrusion Detection Systems, which for a long time have represented one of the most efficient solutions for detecting intrusion attempts on a network, now have to face this new and more complicated scenario. Well-known problems, such as the difficulty of distinguishing legitimate activities from illegitimate ones due to their similar characteristics and their high degree of heterogeneity, have today become even more complex, considering the increase in network activity. After providing an extensive overview of the scenario under consideration, this work proposes a Local Feature Engineering (LFE) strategy aimed at facing such problems through the adoption of a data preprocessing strategy that reduces the number of possible network event patterns while improving their characterization. Unlike the canonical feature engineering approaches, which take into account the entire dataset, it operates locally in the feature space of each single event. The experiments conducted on real-world data show that this strategy, based on the introduction of new features and the discretization of their values, improves the performance of the canonical state-of-the-art solutions.
Conference Paper
Anomaly-based Intrusion Detection Systems (IDSs) represent one of the most efficient methods for countering intrusion attempts against the ever-growing number of network-based services. Despite the central role they play, their effectiveness is jeopardized by a series of problems that reduce it in a real-world context, mainly due to the difficulty of correctly classifying attacks whose characteristics are very similar to normal network activity or, again, to the difficulty of countering novel forms of attack (zero-days). Such problems have been faced in this paper by adopting a Twofold Feature Space Transformation (TFST) approach aimed at gaining a better characterization of the network events and a reduction of their potential patterns. The idea behind such an approach is based on: (i) the addition of meta-information, improving the event characterization; (ii) the discretization of the new feature space in order to join together patterns that lead back to the same events, reducing the number of false alarms. The validation process performed using a real-world dataset indicates that the proposed approach is able to outperform the canonical state-of-the-art solutions, improving their intrusion detection capability.
... The literature proposes a large number of Credit Scoring techniques [15,16,17] to maximize Equation 1, along with several studies focused on comparing their performance on several real-world datasets. We discuss some of these solutions in the remainder of this section. ...
Article
Lenders, such as credit card companies and banks, use credit scores to evaluate the potential risk posed by lending money to consumers and, therefore, to mitigate losses due to bad debt. Within the financial technology domain, an ideal approach should be able to operate proactively, without the need of knowing the behavior of non-reliable users. In practice, this does not happen, because the most used techniques need to train their models with both reliable and non-reliable data in order to classify new samples. Such a scenario might be affected by the cold-start problem in datasets, where there is a scarcity or total absence of non-reliable examples, which is further worsened by the potentially unbalanced distribution of the data that reduces the classification performance. In this paper, we overcome the aforementioned issues by proposing a proactive approach, composed of a combined entropy-based method that is trained considering only the reliable cases and the sample under investigation. Experiments performed on different real-world datasets show performance competitive with several state-of-the-art approaches that use the entire dataset of reliable and unreliable cases.
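One way to picture an entropy-based evaluation of this kind is sketched below: it measures how much the per-feature Shannon entropy of the (discretized) reliable cases grows when the sample under investigation is appended, a small increase suggesting that the sample resembles the reliable population. This is purely an illustrative assumption, not the exact method of the cited paper.

    import numpy as np

    def column_entropy(column: np.ndarray) -> float:
        # Shannon entropy of a discretized feature column.
        _, counts = np.unique(column, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def entropy_increase(reliable: np.ndarray, sample: np.ndarray) -> float:
        # Average per-feature entropy growth caused by appending the new sample
        # to the set of reliable (discretized) instances; lower values mean the
        # sample looks more similar to the reliable population.
        extended = np.vstack([reliable, sample])
        deltas = [
            column_entropy(extended[:, j]) - column_entropy(reliable[:, j])
            for j in range(reliable.shape[1])
        ]
        return float(np.mean(deltas))

    # Toy usage: discretized reliable instances and one candidate applicant.
    rng = np.random.default_rng(2)
    reliable = rng.integers(0, 5, size=(50, 4))
    candidate = rng.integers(0, 5, size=4)
    score = entropy_increase(reliable, candidate)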
Article
Background and objectives: Diagnosis of Pulmonary Rifampicin-Resistant Tuberculosis (RR-TB) with the Drug-Susceptibility Test (DST) is costly and time-consuming. Furthermore, GeneXpert for rapid diagnosis is not widely available in Indonesia. This study aims to develop and evaluate the performance of the CUHAS-ROBUST model, an artificial-intelligence-based RR-TB screening tool. Methods: A cross-sectional study involved suspected patients of all types of RR-TB with complete sputum Lowenstein-Jensen DST (reference) and 19 clinical, laboratory, and radiology parameter results, retrieved from medical records in hospitals under the Faculty of Medicine, Hasanuddin University, Indonesia, from January 2015 to December 2019. Artificial Neural Network (ANN) models were built along with other classifiers. The model was tested on participants recruited from January 2020 to October 2020 and deployed into the CUHAS-ROBUST (index test) application. Sensitivity, specificity, and accuracy were obtained for assessment. Results: A total of 487 participants (32 multidrug-resistant/MDR, 57 RR-TB, 398 drug-sensitive) were recruited for model building and 157 participants (23 MDR and 21 RR) for prospective testing. The full ANN model yields the highest values of accuracy (88% (95% CI 85–91)) and sensitivity (84% (95% CI 76–89)) compared to the other models, which show sensitivity below 80% (Logistic Regression 32%, Decision Tree 44%, Random Forest 25%, Extreme Gradient Boost 25%). However, this ANN has a lower specificity than the other models (90% (95% CI 86–93)), with Logistic Regression demonstrating the highest (99% (95% CI 97–99)). This ANN model was selected for the CUHAS-ROBUST application, although its sensitivity is still lower than that of global GeneXpert results (87.5%). Conclusion: The ANN-based CUHAS-ROBUST outperforms the other AI classifier models in detecting all types of RR-TB, and by deploying it into the application, health staff can utilize the tool for screening purposes, particularly at the primary care level where the GeneXpert examination is not available. Trial registration: NCT04208789.
Article
In this paper, we propose a dynamic forecasting framework, named DMDP (dynamic multi-source default probability prediction), to predict the default probability of a company. The default probability is a very important factor for assessing the credit risk of companies listed on a stock market. Aiming at aiding financial institutions in decision making, our DMDP framework not only analyzes financial data to capture the historical performance of a company, but also utilizes a long short-term memory model to dynamically incorporate daily news from social media, taking the perceptions of market participants and public opinion into consideration. The study of this paper makes two key contributions. First, we make use of unstructured news crawled from social media to alleviate the impact of financial fraud on default probability prediction. Second, we propose a neural network method to integrate both structured financial factors and unstructured social media data with appropriate time alignment for default probability prediction. Extensive experimental results demonstrate the effectiveness of DMDP in predicting the default probability of listed companies in mainland China, compared with various baselines.
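The general idea of merging structured financial factors with a recurrent summary of daily news can be sketched as follows in PyTorch; the layer sizes, the pre-embedded news input, and the plain concatenation are assumptions made for illustration, not the DMDP architecture itself.

    import torch
    import torch.nn as nn

    class NewsAwareDefaultModel(nn.Module):
        # Structured financial factors plus an LSTM over daily news embeddings,
        # merged before a sigmoid head that outputs a default probability.
        def __init__(self, n_financial: int = 16, news_dim: int = 64, hidden: int = 32):
            super().__init__()
            self.news_encoder = nn.LSTM(news_dim, hidden, batch_first=True)
            self.head = nn.Sequential(
                nn.Linear(n_financial + hidden, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, financial: torch.Tensor, news_seq: torch.Tensor) -> torch.Tensor:
            # news_seq: (batch, days, news_dim); keep the last hidden state.
            _, (h_n, _) = self.news_encoder(news_seq)
            merged = torch.cat([financial, h_n[-1]], dim=1)
            return torch.sigmoid(self.head(merged)).squeeze(1)

    # Toy forward pass: 8 companies, 30 days of (pre-embedded) daily news.
    model = NewsAwareDefaultModel()
    p_default = model(torch.randn(8, 16), torch.randn(8, 30, 64))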
Conference Paper
Nowadays, it is clear that network services represent a widespread element that is absolutely essential for every category of user, professional and non-professional. Such a scenario requires constant research activity aimed at ensuring the security of the involved services, so as to prevent any fraudulent exploitation of the related network resources. This is not a simple task, because new threats arise day by day, forcing the research community to face them by developing new, specific countermeasures. The Intrusion Detection System (IDS) plays a central role in this scenario, as its main task is to detect intrusion attempts through an evaluation model designed to classify each new network event as normal or as an intrusion. This paper introduces a Probabilistic-Driven Ensemble (PDE) approach that operates by using several classification algorithms, whose effectiveness is improved on the basis of a probabilistic criterion. A series of experiments, performed using real-world data, shows that such an approach outperforms the state-of-the-art competitors, proving its better capability to detect intrusion events with respect to the canonical solutions.
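A simple way to picture a probability-driven combination of several classifiers is sketched below, where the per-class probabilities of the base models are averaged and the most probable class is assigned; the choice of base learners, the synthetic data, and the plain averaging criterion are assumptions, not the PDE criterion of the cited paper.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Synthetic, unbalanced data standing in for a network event dataset.
    X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Several base classifiers, each producing class probabilities.
    models = [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0), GaussianNB()]
    probas = []
    for m in models:
        m.fit(X_tr, y_tr)
        probas.append(m.predict_proba(X_te))

    # Probability-driven combination: average the class probabilities and
    # label each event with the most probable class.
    avg_proba = np.mean(probas, axis=0)
    y_pred = avg_proba.argmax(axis=1)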
Article
Loan default is a serious problem in the banking industry. Banking systems have strong processes in place for the identification of customers with poor credit risk scores; however, most credit scoring models need to be constantly updated with newer variables and statistical techniques for improved accuracy. While totally eliminating default is almost impossible, loan risk teams can minimize the rate of default, thereby protecting banks from its adverse effects. Credit scoring models have used logistic regression and linear discriminant analysis for the identification of potential defaulters. Newer, contemporary machine learning techniques have the ability to outperform these classic techniques. This article conducts an empirical analysis on a publicly available bank loan dataset to study banking loan default, using a decision tree as the base learner and comparing it with ensemble tree learning techniques such as bagging, boosting, and random forests. The results of the empirical analysis suggest that the gradient boosting model outperforms the base decision tree learner, indicating that ensemble models work better than individual models. The study recommends that risk teams adopt newer contemporary techniques to achieve better accuracy, resulting in effective loan recovery strategies. © 2018, K.J. Somaiya Institute of Management Studies and Research.
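A compact scikit-learn sketch of the kind of comparison the article describes is given below, run on a synthetic, unbalanced stand-in for a loan dataset; the data, hyperparameters, and evaluation protocol are assumptions, not those of the original study.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic, unbalanced stand-in for a loan default dataset.
    X, y = make_classification(n_samples=5000, n_features=15, weights=[0.85, 0.15], random_state=0)

    models = {
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "bagging": BaggingClassifier(random_state=0),
        "random_forest": RandomForestClassifier(random_state=0),
        "gradient_boosting": GradientBoostingClassifier(random_state=0),
    }
    for name, model in models.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}: mean ROC-AUC = {auc:.3f}")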
Article
A major issue in the classification of class-imbalanced datasets involves the determination of the most suitable performance metrics to be used. In previous work using several examples, it has been shown that imbalance can exert a major impact on the value and meaning of accuracy and of certain other well-known performance metrics. In this paper, our approach goes beyond simply studying case studies and develops a systematic analysis of this impact by simulating the results obtained using binary classifiers. A set of functions and numerical indicators is obtained which enables the comparison of the behaviour of several performance metrics based on the binary confusion matrix when they are faced with imbalanced datasets. Throughout the paper, a new way to measure the imbalance is defined which surpasses the Imbalance Ratio used in previous studies. From the simulation results, several clusters of performance metrics have been identified, involving the use of Geometric Mean or Bookmaker Informedness as the best null-biased metrics if their focus on classification successes (dismissing the errors) presents no limitation for the specific application where they are used. However, if classification errors must also be considered, then the Matthews Correlation Coefficient arises as the best choice. Finally, a set of null-biased multi-perspective Class Balance Metrics is proposed which extends the concept of Class Balance Accuracy to other performance metrics.
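For reference, the three metrics singled out above can all be written in terms of the binary confusion matrix: the Geometric Mean is sqrt(TPR x TNR), Bookmaker Informedness is TPR + TNR - 1, and the Matthews Correlation Coefficient is (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The short Python sketch below computes them from raw counts; the example counts are invented to show how a seemingly accurate classifier on an unbalanced test set scores poorly on these metrics.

    from math import sqrt

    def imbalance_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
        # Metrics built on the binary confusion matrix, all less biased by
        # class imbalance than plain accuracy.
        tpr = tp / (tp + fn)            # sensitivity / recall
        tnr = tn / (tn + fp)            # specificity
        g_mean = sqrt(tpr * tnr)
        informedness = tpr + tnr - 1    # Bookmaker Informedness (Youden's J)
        mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
        return {"g_mean": g_mean, "informedness": informedness, "mcc": mcc}

    # Example: 95.5% accuracy on 950 negatives and 50 positives, yet most of
    # the minority class is missed, which these metrics make visible.
    print(imbalance_metrics(tp=10, fp=5, fn=40, tn=945))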
Article
The search for models to predict the prices of financial markets is still a highly researched topic, despite major related challenges. The prices of financial assets are non-linear, dynamic, and chaotic; thus, they are financial time series that are difficult to predict. Among the latest techniques, machine learning models are some of the most researched, given their capabilities for recognizing complex patterns in various applications. With the high productivity in the machine learning area applied to the prediction of financial market prices, objective methods are required for a consistent analysis of the most relevant bibliography on the subject. This article proposes the use of bibliographic survey techniques that highlight the most important texts for an area of research. Specifically, these techniques are applied to the literature about machine learning for predicting financial market values, resulting in a bibliographical review of the most important studies about this topic. Fifty-seven texts were reviewed, and a classification was proposed for markets, assets, methods, and variables. Among the main results, of particular note is the greater number of studies that use data from the North American market. The most commonly used models for prediction involve support vector machines (SVMs) and neural networks. It was concluded that the research theme is still relevant and that the use of data from developing markets is a research opportunity.
Article
Purpose: Credit scoring datasets are generally unbalanced: the number of repaid loans is higher than that of defaulted ones. Therefore, the classification of these data is biased toward the majority class, which practically means that it tends to attribute a mistaken "good borrower" status even to "very risky borrowers". In addition to the use of statistical and machine learning classifiers, this paper aims to explore the relevance and performance of sampling models combined with statistical prediction and artificial intelligence techniques to predict and quantify the default probability based on real-world credit data. Design/methodology/approach: A real database from a Tunisian commercial bank was used, and unbalanced data issues were addressed by random over-sampling (ROS) and the synthetic minority over-sampling technique (SMOTE). Performance was evaluated in terms of the confusion matrix and the receiver operating characteristic curve. Findings: The results indicated that the combination of intelligent and statistical techniques with re-sampling approaches is promising for default rate management and provides accurate credit risk estimates. Originality/value: This paper empirically investigates the effectiveness of ROS and SMOTE in combination with logistic regression, artificial neural networks and support vector machines. The authors address the role of sampling strategies in the Tunisian credit market and their impact on credit risk. These sampling strategies may help financial institutions to reduce the erroneous classification costs in comparison with the unbalanced original data and may serve as a means of improving the bank's performance and competitiveness.
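A minimal sketch of this kind of pipeline, using the imbalanced-learn library with a logistic regression downstream, is shown below; the synthetic data stands in for the non-public Tunisian credit data, and the sampler settings are default assumptions rather than the study's configuration.

    from imblearn.over_sampling import RandomOverSampler, SMOTE
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Unbalanced synthetic data standing in for the real credit dataset.
    X, y = make_classification(n_samples=3000, n_features=12, weights=[0.93, 0.07], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    for name, sampler in [("ROS", RandomOverSampler(random_state=0)), ("SMOTE", SMOTE(random_state=0))]:
        # Re-sample only the training portion, then fit the classifier.
        X_res, y_res = sampler.fit_resample(X_tr, y_tr)
        clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
        auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
        print(f"{name}: test ROC-AUC = {auc:.3f}")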
Chapter
Credit scoring usually refers to models and systems that provide a numerical credit score for each borrower, mostly for internal use by financial institutions and corporate clients. Credit ratings provide risk classifications for corporate loans, bond issues, and countries (e.g., sovereign credit ratings). This chapter describes the basic characteristics of both schemes. The presentation begins with a discussion of the different contexts of scoring and rating (through-the-cycle and point-in-time assessments, issuer ratings and issue-specific ratings, behavioral and profit scoring, and social lending). Then, the main modeling requirements are outlined, and the model development process is explained. The chapter closes with a brief discussion of the credit rating industry, focusing on the major credit rating agencies (CRAs), which play a crucial role due to the globalization of the financial markets and the wide range of debt issues, whose monitoring and risk assessment pose challenges to investors, financial institutions, supervisors, and other stakeholders.
Conference Paper
Nowadays, the dramatic growth in consumer credit has made the methods based on human intervention, aimed at assessing the potential solvency of loan applicants, ineffective. For this reason, the development of approaches able to automate this operation represents today an active and important research area named Credit Scoring. In such a scenario, it should be noted that the design of effective approaches represents a hard challenge, due to a series of well-known problems such as data imbalance, data heterogeneity, and cold start. The Centroid wavelet-based approach proposed in this paper faces these issues by moving the data analysis from its canonical domain to a new time-frequency one, where this operation is performed through three different metrics of similarity. Its main objective is to achieve a better characterization of the loan applicants on the basis of the information previously gathered by the Credit Scoring system. The performed experiments demonstrate that such an approach outperforms the state-of-the-art solutions.
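As a loose illustration of moving the comparison into a time-frequency domain, the sketch below uses the PyWavelets package to transform each applicant's feature vector with a discrete wavelet transform and then compares a new applicant against a centroid of previously gathered reliable applicants through cosine similarity; the wavelet choice, the single similarity metric, and the centroid built by simple averaging are assumptions, not the exact formulation of the cited approach.

    import numpy as np
    import pywt

    def wavelet_coefficients(x: np.ndarray, wavelet: str = "haar") -> np.ndarray:
        # Move an applicant's feature vector into a time-frequency representation.
        approx, detail = pywt.dwt(x, wavelet)
        return np.concatenate([approx, detail])

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Centroid of the previously gathered reliable applicants (simple average
    # of their wavelet coefficients), then similarity of a new applicant to it.
    rng = np.random.default_rng(3)
    reliable = rng.random((100, 16))
    centroid = np.mean([wavelet_coefficients(r) for r in reliable], axis=0)
    new_applicant = rng.random(16)
    score = cosine_similarity(wavelet_coefficients(new_applicant), centroid)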