A Combined Entropy-based Approach for a Proactive
Credit Scoring
Salvatore Carta, Anselmo Ferreira, Diego Reforgiato Recupero, Marco Saia,
Roberto Saia
Department of Mathematics and Computer Science
University of Cagliari, Via Ospedale 72 - 09124 Cagliari, Italy
Abstract
Lenders, such as credit card companies and banks, use credit scores to evaluate
the potential risk posed by lending money to consumers and, therefore, miti-
gating losses due to bad debt. Within the financial technology domain, an ideal
approach should be able to operate proactively, without the need of knowing the
behavior of non-reliable users. Actually, this does not happen because the most
used techniques need to train their models with both reliable and non-reliable
data in order to classify new samples. Such a scenario might be affected by
the cold-start problem in datasets, where there is a scarcity or total absence of
non-reliable examples, which is further worsened by the potential unbalanced
distribution of the data that reduces the classification performances. In this pa-
per, we overcome the aforementioned issues by proposing a proactive approach,
composed of a combined entropy-based method that is trained considering only
reliable cases and the sample under investigation. Experiments done in differ-
ent real-world datasets show competitive performances with several state-of-art
approaches that use the entire dataset of reliable and unreliable cases.
Keywords: FinTech, Trust Management, Business Intelligence, Credit
Scoring, Data Mining, Entropy
Email addresses: {salvatore, anselmo.ferreira, diego.reforgiato, roberto.saia}@unica.it,
m.saia@studenti.unica.it (Salvatore Carta, Anselmo Ferreira, Diego
Reforgiato Recupero, Marco Saia, Roberto Saia)
Preprint submitted to Engineering Applications of Artificial Intelligence October 12, 2019
1. Introduction
The main task of a Credit Scoring system is the evaluation of new loan appli-
cations (from now on named instances) in terms of their potential reliability. Its
goal is to lead the financial operators toward a decision about accepting or not a
new credit, on the basis of a reliability score assigned by the Credit Scoring sys-
tem [1]. In a nutshell, the Credit Scoring system is a statistical approach able to
evaluate the probability that a new instance is considered reliable (non-default)
or unreliable (default), by exploiting a model defined on the basis of previous
instances [2, 3]. Banks and credit card companies use credit scores to determine
who qualifies for a loan, at what interest rate, and at what credit limits. There-
fore, Credit Scoring systems reduce losses due to default cases [4], and, for this
reason, they represent a crucial instrument. Although similar technical issues
are shared, Credit Scoring is different from Fraud detection, which consists of a
set of activities undertaken to prevent money or property from being obtained
through false pretenses.
Thanks to their capability to analyze all the components that contribute to
determine default cases [5], Credit Scoring techniques can also be considered a
powerful instrument for risk assessment and real-time monitoring [6].
Moreover, lenders may also use credit scores to determine which customers
are likely to bring in the most revenue. However, as usually happens with other
similar contexts (e.g., Fraud Detection [7]), the main problem that limits the
effectiveness of Credit Scoring classification techniques is represented by the
unbalanced distribution of data [8]. This happens because the default cases
available for training the evaluation model are fewer than the non-default ones,
hampering the performances of machine learning approaches applied to Credit
Scoring [9]. Note that the unbalanced distribution of data is one of the
factors that contribute to the cold-start problem. As such, approaches for balancing
data mitigate the cold-start problem as well.
To overcome such an issue, in this paper we evaluate the instances in terms of
the entropy of their features, defining a metric able to measure their level of reliability
considering only non-default cases and the instance under investigation. More
formally, we evaluate the reliability of a new instance by comparing the
Shannon Entropy (from now on referred to simply as entropy) measured within a
set of previous non-default instances before and after adding the instance under
investigation. As the entropy measures the uncertainty of a random variable,
a larger entropy in the set including the investigated sample indicates that it
contains similar data in its features, which increases the level of equiprobability,
and we then tend to classify it as reliable. Otherwise, it contains different data
and we consider the instance as unreliable. Such a process allows us to operate
proactively, overcoming the issue related to the unbalanced distribution of the
data and, at the same time, mitigating the cold-start problem (i.e., the scarcity
or total absence of default examples).
We report comparisons between our approach and Random Forests, which
are considered state-of-the-art approaches for credit scoring tasks [10, 11, 12].
For that we used two real-world datasets, characterized by different distributions
of data (unbalanced and slightly unbalanced). Experimental results show that,
although our approach is trained on reliable cases only, it performs similarly
to Random Forests.
Therefore, the main scientific contributions given by this paper are listed
below:
(i) Calculation of the Local Entropy in the process of credit scoring, a process
aimed to measure the entropy achieved by each feature in the previous
non-default instances, in order to evaluate the entropy variations in terms
of single features of an instance.
(ii) Calculation of the Global Entropy in the process of credit scoring, a meta-feature
obtained by calculating the integral (i.e., the area under the curve) given
by the local entropies, which allows us to evaluate the entropy variations
in terms of all features of an instance.
(iii) Definition of the Entropy Difference Approach, an algorithm able to classify
the new instances as reliable or unreliable by exploiting both the Local
Entropy and Global Entropy information.
This paper is based on a previous work [13], which has been completely re-
vised, rewritten, improved and extended with the following novel contributions:
1. We updated our proposed approach by defining a threshold of differences
in the process of instance classification, aiming at optimizing the perfor-
mance on the basis of the specific operative context, differently from our
previous formalization [13] based on comparing two counters.
2. A feature selection step is now done in our proposed approach, in order to
select instance features based on a twofold criterion (i.e., basic and mutual
entropy). Additionally, experiment comparisons between the performance
achieved by our approach before and after we performed the proposed
feature selection process are reported, to better highlight the benefits of
such a pre-processing step.
3. A complexity analysis is added by evaluating the asymptotic time complexity
of the proposed algorithm, in order to determine its impact in
some particular contexts such as real-time Credit Scoring systems, an analysis
not done in our previous work [13].
4. One more dataset, which is more suitable for the scenario taken into ac-
count (i.e., the Australian Credit Approval dataset), is added to the ex-
periments, allowing us to better evaluate the performance of our approach
in two different data configurations (highly unbalanced and slightly un-
balanced).
5. We added a new metric of evaluation (i.e., Sensitivity) in the experiments,
which allows us to have a detailed overview of the proposed approach
performance.
6. We added experimental results of the parameter tuning process aimed at
finding the best threshold of the proposed algorithm, which were not
reported in our previous work [13].
7. We added three more baselines based on improved Naive-Bayes classifiers
as competitors.
8. We performed an experiment varying the number of default (minority
class) samples available to the classifiers, better highlighting the benefits
of the proposed approach in a real-world credit scoring scenario.
The remainder of the paper is organized as follows. Section 2 discusses
the background and related works of credit scoring. Section 3 describes the
implementation of the proposed approach. Section 4 provides details on the
experimental environment, the adopted datasets and metrics, as well as on the
implementation of the proposed approach and the competitors. Section 5 shows
the experimental results and, finally, some concluding remarks and future work
are given in Section 6.
2. Related Works
The research related to the Credit Scoring has grown quite significantly in
recent years, in coincidence with the exponential increase of consumer credit [14].
The literature proposes a large number of Credit Scoring techniques [15, 16, 17]
to maximize the objective in Equation 1 (Section 3), along with several studies focused on
comparing their performance on several real-world datasets. We discuss some of these
solutions in the remainder of this section.
The work in [18] used the Wavelet transform and three metrics to perform
credit scoring. Similarly, the approach in [19] moved the credit scoring from the
canonical time domain to the frequency one, by comparing differences of mag-
nitudes after Fourier Transform conversion of time-series data. An interesting
approach was proposed in [20], which compares non-square matrix
determinants to identify the reliability of user data before granting a loan. The
work in [21] used a score based on outlier parameters for each transaction, together
with an isolation forest classifier to detect unreliable users. Kolmogorov-
Smirnov statistics were used in [22] to cluster unreliable and reliable users. Au-
thors of [23] used data preprocessing and a Random Forest optimized through
a grid search step. A three-way decisions approach with probabilistic rough
sets is proposed in [24]. In [25], a deep learning Convolutional Neural Network
approach is used for the first time for credit scoring, which is applied to features
that are pre-processed with the Relief feature selection technique and converted
into grayscale images. An application of kernel-free fuzzy quadratic surface
Support Vector Machines is proposed in [26], and an interesting comparison
of different neural networks, such as Multilayer Perceptrons and Convolutional
Neural Networks for Credit Scoring is done in [27]. An extensive work in this
sense was done in [10], where a large scale benchmark of forty-one methods for
the instance classification has been performed on eight Credit Scoring datasets.
Another type of problem, related to the optimization of the parameters involved
in these approaches was instead tackled in [28], which also reports a discussion
about the canonical metrics used to measure the performance [29].
Machine learning techniques can also be combined in order to build hybrid
approaches of Credit Scoring as, for instance, those presented in [30, 31], which
exploit a two-stage hybrid model with artificial neural networks and a multivariate
adaptive regression splines model, or that described in [32], which instead
exploits neural networks with the k-means clustering method. Another kind of classifier
combination, commonly known as ensembles, has also been extensively
studied in the literature. The work in [33] used several classifiers, including
SVMs and logistic regression, in order to validate a feature selection approach,
called group penalty function, which penalizes the use of variables from the same
source of information in the final features. In [34], a multi-step data process-
ing operation that includes normalization and dimensionality reduction, allied
with an ensemble of five classifiers optimized by a Bayesian algorithm, are used
in the pipeline. The work in [35] ensembles five classifiers (logistic regression,
support vector machine, neural network, gradient boosting decision tree and
random forest) using a genetic algorithm and fuzzy assignment. In [36], a set of
classifiers are joined in an ensemble according to their soft probabilities. In [37],
an ensemble is used with a feature selection step based on feature clustering,
and the final result is a weighted voting approach.
Other works are closely related and can be integrated into Credit Scoring
applications. For example, in user profiling, users can be classified as good or
bad borrowers, not only according to core credit information, but also according to their
behavior in social networks. In this sense, the work in [38] used a Naive-Bayes
based classifier on both hard features (credit information) and soft features (friendship
and group information). Linguistic-based features are coupled with machine
learning classifiers in [39] to detect a person’s behavior. Finally, the work in
[40] used deep learning through Long Short Term Memory networks on texts to
define the personality of a person.
Notwithstanding, several issues and limitations are still considered open
problems in Credit Scoring tasks. We discuss all of them in the following:
1. Data Scarcity Problem: this issue refers to the lack of data to validate
machine learning models [41]. This happens mainly due to the policies
and constraints adopted by researchers working in this field, which do
not allow them to release information about their business activities for
privacy, competition, or legal reasons.
2. Non-adaptability Problem: this problem concerns the inability of the
Credit Scoring models to correctly classify the new instances, especially
when their features generate different patterns with respect to the patterns used
to define the evaluation model. All the Credit Scoring approaches are
affected by this problem that leads toward misclassification, due to their
inability to identify new patterns in the instances under analysis.
3. Data Heterogeneity Problem: the pattern recognition process used to
detect some specific patterns on the basis of a model previously defined
represents a very important branch of machine learning, since it can
be used to solve a large number of real-world problems [42]. However, it
should be noted that the effectiveness of these processes can be reduced
by the heterogeneity of the involved data. Such a problem, also known
in the literature as the instance identification or naming problem, is due to the
fact that the same data are often represented in different ways in different
datasets [43].
4. Cold-start Problem: such an issue arises when the set of data used
to train an evaluation model does not contain enough information about
the domain taken into account, making it impossible to define a reliable
model [44, 45, 46]. In other words, this happens when the training data are
not representative of all the involved classes of information [47, 48], which
in the application discussed herein are represented by the default and non-
default cases. More formally, within the credit scoring domain, the cold
start problem consists of the following three cases: (i) New community.
When a catalogue of financial indicators exists but almost no users are
present and the lack of user interaction makes it very hard to provide
reliable suggestions. (ii) New financial feature. A new financial feature
is added to the system but there are no interactions (financial features
applicable to a given user) present. (iii) New user. A new user registers
but he/she has not provided any interaction yet, therefore it is not possible
to provide personalized analysis.
5. Data Unbalance Problem: without underestimating the other prob-
lems, we can state that the main complicating factor in a Credit Scoring
process is the imbalanced class distribution of data [49, 9], caused by the
fact that the default cases are far fewer than the non-default ones.
This means that the information available to train an evaluation model is
typically composed of a large number of legitimate cases and a small num-
ber of fraudulent ones, a data configuration that reduces the effectiveness
of the most common classification approaches [9, 11]. A common solution
adopted in order to face this problem is the artificial balance of data [50].
It consists of an over-sampling or under-sampling operation. In the first
case the balance is obtained by duplicating some of the instances that
occur the least (usually, the default ones), while in the second case it is
obtained by removing some of the instances that occur the most (usually,
the non-default ones). An analysis of the advantages and disadvantages
related to this preprocessing phase has been presented in [51, 52].
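The two balancing operations described in item 5 can be illustrated with a minimal Python sketch. It assumes purely random duplication and removal; the function names `random_oversample` and `random_undersample` and the toy instances are hypothetical and are not taken from the cited works. Note that over-sampling has nothing to duplicate when no default example exists, which is exactly the cold-start case.

```python
import random


def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority instances until both classes have the same size."""
    rng = random.Random(seed)
    # With no minority examples (cold start) there is nothing to duplicate.
    if not minority or len(minority) >= len(majority):
        return list(majority), list(minority)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def random_undersample(majority, minority, seed=0):
    """Remove randomly chosen majority instances until both classes have the same size."""
    rng = random.Random(seed)
    if len(majority) <= len(minority):
        return list(majority), list(minority)
    return rng.sample(majority, len(minority)), list(minority)


# Toy data: 8 non-default (majority) and 2 default (minority) instances.
non_default = [("nd", i) for i in range(8)]
default = [("d", i) for i in range(2)]

_, over = random_oversample(non_default, default)
under, _ = random_undersample(non_default, default)
print(len(over), len(under))  # 8 2
```

Over-sampling keeps every original instance but repeats information, while under-sampling discards majority-class information; both still require at least some default examples.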
Some works have focused on the problem of imbalanced learning in datasets.
In [53], the authors presented a technique that clones the minority class instances
according to the similarity between them and the minority class mode. The work
in [54] proposed cost-sensitive Bayesian network classifiers, which incorporate
an instance weighting method giving different classification errors to different
classes. Authors in [55] proposed undersampling and oversampling approaches
based on a novel class imbalance metric, which splits the imbalance problem
into multiple balanced subproblems. Then, weak classifiers trained in a bagging
manner are used in a boosting fashion. The approach proposed in [56] captures
the covariance structure of the minority class in order to generate synthetic sam-
ples with Mahalanobis Distance-based Over-sampling and Generalized Singular
Value Decomposition. The research performed in [57] studied potential bias
characteristics of imbalanced crowdsourcing labeled datasets. Then, the au-
thors proposed a novel consensus algorithm based on weighted majority voting
of four classifiers. This algorithm uses the frequency of the minority class to obtain
a bias rate, assigning weights to the majority and minority classes. The authors
of [58] enhanced a multi-class classifier based on fuzzy rough sets. Firstly, they
propose an adaptive weight setting for the binary classifiers involved, addressing
the varying characteristics of sub-problems. Then, a new dynamic aggregation
method combines the predictions of binary classifiers with a global class affinity
method before making a final decision. Finally, authors in [59] evolved one-
vs-one schemes for multi-class imbalance classification problems, by applying
binary ensemble learning approaches with an aggregation approach.
However, differently from all of these previous approaches, our method
does not need any samples from the minority class in the proposed pipeline. Such samples
may be unavailable, especially when the cold-start problem arises (i.e., there
are no default cases in the dataset). Our approach faces these problems by training
its evaluation model using only one class of data (the non-default cases, or
the majority class), comparing the behavior of entropy-based metrics
before and after non-evaluated samples are added to a set of previous non-default samples.
Therefore, our proposed approach adopts a proactive methodology that is aware of the
limitations of the environment. We
discuss further details of our proposed approach in the next section.
3. Proposed Approach
Before we discuss our solution for credit scoring in more detail, let us
define the problem of Credit Scoring more formally. Given a set of classified
instances $T = \{t_1, t_2, \ldots, t_K\}$ and a set of features $F = \{f_1, f_2, \ldots, f_M\}$ that
compose each $t \in T$, we denote as $T_+ = \{t_1, t_2, \ldots, t_N\}$ the subset of non-default
instances (then $T_+ \subseteq T$), and as $T_- = \{t_1, t_2, \ldots, t_J\}$ the subset of
default ones (then $T_- \subseteq T$). We also denote as $\hat{T} = \{\hat{t}_1, \hat{t}_2, \ldots, \hat{t}_U\}$ a set
of unclassified instances and as $E = \{e_1, e_2, \ldots, e_U\}$ these instances after the
classification process (thus $|\hat{T}| = |E|$). It should be observed that an instance
can only belong to one class $c \in C$, where $C = \{reliable, unreliable\}$. So,
the Credit Scoring problem is to define a function $eval(\hat{t}_u)$ that returns
a binary value assessing the correctness of the classification of $\hat{t}_u$
(i.e., 0 = misclassification, 1 = correct classification), so as to maximize the sum $\sigma$ of these values, or
$$\max_{0 \le \sigma \le |\hat{T}|} \sigma = \sum_{u=1}^{|\hat{T}|} eval(\hat{t}_u) \tag{1}$$
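As a small illustration of Equation 1, the following Python sketch computes $\sigma$ for a toy batch of classifications. The labels and the helper name `eval_instance` are hypothetical; in practice the true class is only known for the instances used in the experiments.

```python
def eval_instance(predicted, actual):
    """Return 1 for a correct classification and 0 for a misclassification."""
    return 1 if predicted == actual else 0


# Hypothetical predictions and ground truth for four unclassified instances.
predicted = ["reliable", "unreliable", "reliable", "reliable"]
actual = ["reliable", "unreliable", "unreliable", "reliable"]

# sigma is the sum in Equation 1; it ranges from 0 to the number of instances.
sigma = sum(eval_instance(p, a) for p, a in zip(predicted, actual))
print(sigma)  # 3
```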
Given such concepts, the implementation of our approach has been carried
out through the following four steps:
1. Feature Selection Process: evaluation of each instance feature in order
to assess its contribution to the definition of our
evaluation model.
2. Local Entropy Calculation: calculation of the local entropy $\Lambda$, which
gives information about the level of entropy assumed by each single feature
in the set $T_+$.
3. Global Entropy Calculation: calculation of the global entropy $\gamma$, a
meta-information defined by calculating the integral of the area under the
$\Lambda$ curve.
4. Entropy Difference Approach: definition of the Entropy Difference
Approach (EDA), able to classify the new instances on the basis of the $\Lambda$
and $\gamma$ information.
A pipeline of the proposed EDA approach is shown in Figure 1. In the
first step, the set of previous non-default instances $T_+$ and the set of instances
to be evaluated $\hat{T}$ are preprocessed, performing a feature selection task aimed
at excluding from the evaluation process the features with a low level of
characterization of the instances. This step reduces the computational complexity
and returns the sets with reduced features $T'_+$ and $\hat{T}'$. In the next steps, the local
entropy is calculated for each feature of the set $T'_+$, as well as the global entropy
of all the features in $T'_+$. The last step compares
the local and global entropy previously calculated for the set $T'_+$ with the same
information calculated after adding each element of the set $\hat{T}'$ to $T'_+$, classifying
the non-evaluated instances on the basis of the threshold $\Theta$. The result of the
entire process is then stored in the set $E$.
Algorithm 1 describes the general idea of the approach and is composed of
two steps. It receives as input the set $T_+$ of reliable instances, the set $\hat{T}$ of
non-evaluated instances and three thresholds: $min_1$ and $min_2$ for the feature
selection approach, and $\Theta$ for the proposed EDA approach. The first step
calculates the reduced features, using the basic and mutual Shannon entropy metrics
to eliminate features according to the thresholds $min_1$ and $min_2$ (a process further
discussed in Section 3.1). The transformed sets $T'_+$ of reliable instances and $\hat{T}'$
of non-evaluated instances are then the input of the proposed EDA approach
(Section 3.4), which classifies each $\hat{t}' \in \hat{T}'$ by applying the threshold $\Theta$ to comparisons of
the Local Maximum Entropy (Section 3.2) and Global Maximum Entropy (Section
3.3) values calculated before and after adding the non-evaluated instance $\hat{t}' \in \hat{T}'$
to $T'_+$. Then, the set $E$ contains the classification of each non-evaluated
sample $\hat{t}' \in \hat{T}'$. In the following subsections, we describe in detail all the
aforementioned steps.
Figure 1: EDA High-level Architecture (blocks: Feature Selection, Local Entropy Evaluation, Global Entropy Evaluation, Classification; inputs: $T_+$, $\hat{T}$, $\Theta$; output: $E$)
Algorithm 1 Proactive Credit Scoring Approach
Input: $T_+$ = Set of non-default instances; $\hat{T}$ = Set of instances to evaluate; $min_1$, $min_2$ = Basic and mutual entropy thresholds; $\Theta$ = EDA threshold
Output: $E$ = Set of classified instances
1: procedure ProactiveCreditScoring($T_+$, $\hat{T}$, $min_1$, $min_2$, $\Theta$)
2:     $T'_+, \hat{T}' \leftarrow$ FeatureSelection($T_+$, $\hat{T}$, $min_1$, $min_2$)    (see Section 3.1)
3:     $E \leftarrow$ InstancesEvaluation($T'_+$, $\hat{T}'$, $\Theta$)    (see Sections 3.2, 3.3 and 3.4)
4:     return $E$
5: end procedure
3.1. Feature Selection
Many studies [60] have discussed how the performance of a Credit Scoring
model is strongly influenced by the features used during the process of its
definition. This process is known as Feature Selection and it can be performed
by using different techniques, on the basis of the characteristics of the context
taken into account. It means that the choice of the best features to use during
the model definition is not based on a unique criterion, but rather it exploits
several criteria with the aim to evaluate, as best as possible, the influence of
each feature in the process of defining the Credit Scoring model. This represents
an important preprocessing step, since it can reduce the complexity of the final
model, decreasing the training times and increasing the generalization of the
model at the same time. Further, it can also reduce overfitting,
a problem that occurs when a statistical model describes random
error or noise instead of the underlying relationship. This frequently happens
during the definition of excessively complex models, which involve many parameters
with respect to the number of training data.
In the proposed approach, the feature selection is performed by exploiting a
dual entropy-based approach that evaluates the importance of the features both
individually and mutually. For that, we use two metrics, defined as follows.
Basic Shannon Entropy. It measures the uncertainty associated with a random
variable by evaluating the average minimum number of bits needed to
encode a string of symbols based on their frequency. High values of entropy indicate
a high level of uncertainty in the data prediction process, whereas
low values of entropy indicate a lower degree of uncertainty in this process. More
formally, given a set of values $f \in F$, the entropy $H(F)$ is defined as shown in
Equation 2, where $P(f)$ is the probability that the element $f$ is present in
the set $F$.
$$H(F) = -\sum_{f \in F} P(f) \log_2 [P(f)] \tag{2}$$
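A minimal Python sketch of Equation 2 is given below. It assumes that the probabilities $P(f)$ are estimated as the relative frequencies of the values observed for one feature; the function name `shannon_entropy` is illustrative only.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Shannon entropy (in bits) of a list of discrete values, as in Equation 2."""
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    # log2(n / c) equals -log2(c / n), so this is -sum(P * log2(P)).
    return sum((c / n) * log2(n / c) for c in counts.values())


print(shannon_entropy(["a", "a", "a", "a"]))  # 0.0 (no uncertainty)
print(shannon_entropy(["a", "b"]))            # 1.0 (one bit)
```

A constant feature yields zero entropy, while a feature whose values are all equally frequent yields the maximum entropy for that number of distinct values.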
Mutual Shannon Entropy. It measures the amount of information a random
variable gives about another one. High mutual information values indicate
a large reduction in uncertainty, while low mutual information values indicate a
small reduction in uncertainty. A value of zero indicates that the variables are
independent. More formally, given two discrete variables $X$ and $Y$ whose joint
probability distribution is $P_{XY}(x, y)$, denoting as $\mu(X;Y)$ the mutual information
between $X$ and $Y$, the Mutual Shannon Entropy is calculated as shown in
Equation 3 below.
$$\mu(X;Y) = \sum_{x,y} P_{XY}(x,y) \log \frac{P_{XY}(x,y)}{P_X(x) P_Y(y)} = E_{P_{XY}} \left[ \log \frac{P_{XY}}{P_X P_Y} \right] \tag{3}$$
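Equation 3 can be sketched in Python as follows. The sketch assumes that the joint and marginal probabilities are estimated from co-occurrence frequencies of two equally long columns, and it uses a base-2 logarithm (the equation does not fix the base); the name `mutual_information` is illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete columns of equal length."""
    if len(xs) != len(ys):
        raise ValueError("both columns must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 (fully dependent)
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0 (independent)
```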
With these two metrics in mind, we perform the feature selection through
the following steps:
1. The basic entropy of each single feature is measured, evaluating its con-
tribution in the instance characterization.
2. The mutual entropy of each feature with respect to the other features is
evaluated.
3. Results of the previous two steps are combined, selecting the features to
be used within the model definition process.
Such an approach allows us to evaluate the contribution of each feature from
a dual point of view, deciding when we can exclude it in order to reduce
the computational complexity, an important preprocessing task in the case of large
datasets.
The feature selection process is detailed in Algorithm 2. It takes as input a
set $T_+$ of previous non-default instances, the set $\hat{T}$ of instances to evaluate and
the $min_1$ and $min_2$ values, which represent the thresholds used to determine when
an entropy value must be considered relevant (as previously described). The
algorithm then returns two sets of instances, $T'_+$ and $\hat{T}'$, which contain only the
features that have not been removed by the algorithm, in order to use them in
the model definition process. In step 2 of the algorithm, we extract the features
related to the dataset $T_+$, processing them in steps 4-10. Such a process
calculates the basic and mutual entropy (steps 5 and 6) in the set of values
assumed by each feature in the dataset $T_+$, removing (steps 8 and 9) from $T_+$
and $\hat{T}$ the features in $T_+$ that present a basic entropy above the $min_1$ value and
a mutual entropy below the $min_2$ value (step 7). At step 12, the sets $T'_+$ and
$\hat{T}'$ with reduced features are returned by the algorithm.
Algorithm 2 Feature Selection
Input: $T_+$ = Set of non-default instances; $\hat{T}$ = Set of instances to evaluate; $min_1$, $min_2$ = Basic and mutual entropy thresholds
Output: $T'_+$ = Set of non-default instances with selected features; $\hat{T}'$ = Set of instances to evaluate with selected features
1: procedure FeatureSelection($T_+$, $\hat{T}$, $min_1$, $min_2$)
2:     $F_+ \leftarrow$ getAllFeatures($T_+$)
3:     $\hat{F} \leftarrow$ getAllFeatures($\hat{T}$)
4:     for each $f$ in $F_+$ do
5:         $be \leftarrow$ getBasicEntropy($F_+$, $f$)
6:         $me \leftarrow$ getMutualEntropy($F_+$, $f$)
7:         if $be > min_1$ AND $me < min_2$ then
8:             $T'_+ \leftarrow$ removeFeature($f$, $F_+$)
9:             $\hat{T}' \leftarrow$ removeFeature($f$, $\hat{F}$)
10:        end if
11:    end for
12:    return $T'_+$, $\hat{T}'$
13: end procedure
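The feature selection of Algorithm 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: instances are tuples of discrete values, the mutual entropy of a feature "with respect to the other features" is taken as the mean mutual information with every other feature (the text does not specify the aggregation), removals are accumulated across iterations, and all function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(column):
    """Basic Shannon entropy (Equation 2) of one feature column, in bits."""
    n = len(column)
    return sum((c / n) * log2(n / c) for c in Counter(column).values())


def mutual_information(xs, ys):
    """Mutual information (Equation 3) of two columns, using log base 2."""
    n = len(xs)
    px, py, joint = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items())


def feature_selection(t_plus, t_hat, min1, min2):
    """Drop features with basic entropy > min1 AND mutual entropy < min2 (step 7)."""
    columns = list(zip(*t_plus))  # one tuple of values per feature of T+
    keep = []
    for i, column in enumerate(columns):
        be = entropy(column)
        # Assumption: mutual entropy = mean mutual information with the other features.
        others = [mutual_information(column, o) for j, o in enumerate(columns) if j != i]
        me = sum(others) / len(others) if others else 0.0
        if not (be > min1 and me < min2):
            keep.append(i)

    def project(row):
        return tuple(row[i] for i in keep)

    return [project(r) for r in t_plus], [project(r) for r in t_hat], keep


# Toy data: features 0 and 1 are identical, feature 2 is independent of both.
t_plus = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
t_hat = [(1, 0, 1)]
reduced_plus, reduced_hat, kept = feature_selection(t_plus, t_hat, min1=0.5, min2=0.25)
print(kept)  # [0, 1]
```

In this toy case the third feature has a basic entropy of 1 bit and a mean mutual information of 0 with the others, so it is the only one removed from both sets.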
3.2. Local Maximum Entropy Calculation
Denoting as $H(f')$ the entropy measured in the values assumed by a feature
$f' \in F'$ in the set $T'_+$, we define the set $\Lambda$ as the entropy achieved by each
$f' \in F'$, so we have that $|\Lambda| = |F'|$. Such a calculation is performed as shown in
Equation 4.
$$\Lambda = \{\lambda_1 = \max(H(f'_1)), \lambda_2 = \max(H(f'_2)), \ldots, \lambda_M = \max(H(f'_M))\} \tag{4}$$
In our proposed Entropy Difference Approach, such a metric is calculated
twice, before and after we add to $T'_+$ a non-evaluated instance $\hat{t}' \in \hat{T}'$.
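The following Python sketch illustrates how $\Lambda$ can be computed before and after adding a non-evaluated instance. It assumes that each $\lambda_m$ is simply the Shannon entropy of the values observed for feature $m$ (the role of the max operator in Equation 4 is not detailed here), and the names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(column):
    """Shannon entropy (in bits) of one feature column."""
    n = len(column)
    return sum((c / n) * log2(n / c) for c in Counter(column).values())


def local_entropy(instances):
    """Return the list Lambda with one entropy value per feature."""
    return [entropy(column) for column in zip(*instances)]


# Toy set of non-default instances with two reduced features.
t_plus = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
candidate = ("a", "z")  # instance under evaluation

lambda_a = local_entropy(t_plus)                # before adding the candidate
lambda_b = local_entropy(t_plus + [candidate])  # after adding the candidate
print(lambda_a)  # [1.0, 1.0]
```

Here the first feature of the candidate repeats a known value, which lowers the entropy of that feature, while its second feature introduces an unseen value, which raises it.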
3.3. Global Maximum Entropy Calculation
We denote as global maximum entropy $\gamma$ the integral of the area under the curve
of the local entropy $\Lambda$ (previously defined in Section 3.2), as shown in Figure 2.
Figure 2: Global Entropy $\gamma$ (x-axis: Features ($F$), $f_1, f_2, \ldots, f_M$; y-axis: Entropy ($\Lambda$), $\lambda_1, \lambda_2, \ldots, \lambda_M$; $\gamma$ is the area under the curve)
More formally, the value of $\gamma$ is calculated by using the trapezium rule, as
shown in Equation 5.
$$\gamma = \int_{\lambda_1}^{\lambda_M} f(x)\,dx \approx \frac{\Delta x}{2} \sum_{n=1}^{|\Lambda|} \left( f(x_{n+1}) + f(x_n) \right), \quad \text{with } \Delta x = \frac{(\lambda_M - \lambda_1)}{|\Lambda|} \tag{5}$$
The global entropy is a meta-feature that gives us information about the
entropy achieved by all the features in $T'_+$, before and after we add to it a
non-evaluated instance. We use this information during the evaluation process,
jointly with that given by $\Lambda$ in Equation 4.
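A minimal Python sketch of the trapezium rule applied to the local entropies is shown below. It treats the $\lambda$ values as the heights $f(x_n)$ of the curve at consecutive features and, as an assumption, uses a unit spacing between features by default; the spacing `dx` is exposed as a parameter so that the $\Delta x$ of Equation 5 can be passed instead. The sum stops at the last pair of adjacent values, and the function name is illustrative.

```python
def global_entropy(lambdas, dx=1.0):
    """Area under the local-entropy curve, approximated with the trapezium rule."""
    # Each pair of adjacent values forms one trapezium of width dx.
    pairs = (lambdas[n] + lambdas[n + 1] for n in range(len(lambdas) - 1))
    return dx / 2 * sum(pairs)


lambdas = [1.0, 2.0, 4.0]
print(global_entropy(lambdas))  # 4.5
```

With a single feature there are no adjacent pairs, so this sketch returns 0.0.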
3.4. Entropy Difference Approach
Our proposed Entropy Difference Approach (EDA) is based on the Algo-
rithm 3, which is able to evaluate and classify as reliable or unreliable a set of
non evaluated (new) instances. It takes as input a set T0
+of known non-default
16
instances with features reduced, a set ˆ
T0of non evaluated instances with the
same features reduced and a previously trained threshold Θ. Then, it returns
as output a set E, containing all the instances in ˆ
T0classified as reliable or
unreliable, depending on the Λ and γinformation.
In step 3 of the algorithm, we calculate the $\Lambda_a$ value by using the reduced features of the non-default instances in $T'_+$ only, as described in Section 3.2, while in step 4 we obtain the global entropy $\gamma_a$ (Section 3.3) on the same set. Steps 5 to 23 process each instance $\hat{t}' \in \hat{T}'$ to be classified, with reduced features. After calculating the $\Lambda_b$ and $\gamma_b$ values (steps 7 and 8) by adding the current instance $\hat{t}'$ to the set $T'_+$ of non-default instances with reduced features, steps 9 to 13 compare each $\lambda_a \in \Lambda_a$ with the corresponding $\lambda_b \in \Lambda_b$, counting how many times the value of $\lambda_b$ is less than or equal to $\lambda_a$. This is stored in a counter variable count (step 11). Steps 14-16 perform the same operation on the global entropy $\gamma$. At the end of these sub-processes, steps 17 to 21 classify the current instance as reliable or unreliable on the basis of the count value and the $\Theta$ threshold, and count is then reset to zero (step 22). The resulting set $E$ is returned at the end of the entire process, at step 24.
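The procedure of Algorithm 3 can be sketched in Python as follows. This is a hedged illustration rather than the authors' Java implementation: all function names are hypothetical, the entropy of a feature is assumed to be the Shannon entropy of the empirical distribution of its (discrete) values, and the global entropy is assumed to use a trapezium rule with unit spacing between features.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def local_entropies(rows):
    """One entropy value per feature (column) of the given instances."""
    return [shannon_entropy(column) for column in zip(*rows)]


def global_entropy(lams, dx=1.0):
    """Trapezium-rule area under the per-feature entropies (unit spacing assumed)."""
    return dx / 2.0 * sum(a + b for a, b in zip(lams, lams[1:]))


def eda_classify(reliable_rows, candidates, theta):
    """Label each candidate by comparing entropies before and after adding it."""
    lam_a = local_entropies(reliable_rows)
    gam_a = global_entropy(lam_a)
    results = []
    for cand in candidates:
        lam_b = local_entropies(reliable_rows + [cand])
        gam_b = global_entropy(lam_b)
        # Count the features whose entropy did not increase, plus the global check.
        count = sum(1 for a, b in zip(lam_a, lam_b) if b <= a)
        if gam_b <= gam_a:
            count += 1
        results.append((cand, "reliable" if count > theta else "unreliable"))
    return results


# Toy data, purely illustrative: four known reliable instances with two features.
known = [(1, 0), (1, 0), (1, 1), (1, 1)]
labels = eda_classify(known, [(1, 0), (9, 7)], theta=2)
```

In this toy example the first candidate repeats values already seen among the reliable instances, so no entropy increases, while the second introduces unseen values in both features and raises every entropy. The counter is initialized per instance here, which has the same effect as the reset performed at step 22.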
Algorithm 3 Entropy Difference Approach (EDA)
Input: $T'_+$ = Non-default instances with features reduced (see Section 3.1); $\hat{T}'$ = Instances to evaluate with reduced features (see Section 3.1); $\Theta$ = Threshold
Output: $E$ = Set of classified instances
1: procedure InstancesEvaluation($T'_+$, $\hat{T}'$, $\Theta$)
2:   $F'_+$ ← getAllFeatures($T'_+$)
3:   $\Lambda_a$ ← getLocalMaxEntropy($F'_+$)
4:   $\gamma_a$ ← getGlobalMaxEntropy($\Lambda_a$)
5:   for each $\hat{t}'$ in $\hat{T}'$ do
6:     $\hat{f}'$ ← getAllFeatures($\hat{t}'$)
7:     $\Lambda_b$ ← getLocalMaxEntropy($F'_+ + \hat{f}'$)
8:     $\gamma_b$ ← getGlobalMaxEntropy($\Lambda_b$)
9:     for each $\lambda$ in $\Lambda$ do
10:       if $\lambda_b \leq \lambda_a$ then
11:         count ← count + 1
12:       end if
13:     end for
14:     if $\gamma_b \leq \gamma_a$ then
15:       count ← count + 1
16:     end if
17:     if count > $\Theta$ then
18:       $E$ ← ($\hat{t}'$, reliable)
19:     else
20:       $E$ ← ($\hat{t}'$, unreliable)
21:     end if
22:     count ← 0
23:   end for
24:   return $E$
25: end procedure

In this paper, we also evaluate the computational complexity of classifying a single instance $\hat{t}'$, because this information allows us to determine the performance of Algorithm 3 in the context of a real-time Credit Scoring system [61], a scenario where the response time is a primary aspect. We do so by analyzing the theoretical complexity of the classification Algorithm 3, previously formalized. Let $N$ be the size of the set $T'_+$ (i.e., $N = |T'_+|$) and $M$ the size of the set $F'_+$ (i.e., $M = |F'_+|$). The asymptotic time complexity of a single evaluation, in terms of Big O notation, can be determined on the basis of the following observations:
(i) as shown in Figure 3, Algorithm 3 presents two nested loops: the outer loop that starts at step 5 (L1 loop) executes $N$ times the inner loop L2 that starts at step 9, together with other operations (i.e., getLocalMaxEntropy, getGlobalMaxEntropy, plus comparison and assignment operations, respectively with complexity $O(N)$, $O(M)$, $O(1)$, and $O(1)$);
(ii) the inner loop L2 executes $M$ times comparison and assignment operations, respectively with complexity $O(1)$ and $O(1)$; and
(iii) the complexity related to the other operations executed by the algorithm (i.e., getLocalMaxEntropy and getGlobalMaxEntropy in steps 3 and 4) is, respectively, $O(N)$ and $O(M)$.
The aforementioned considerations allow us to determine that the asymptotic time complexity of the proposed algorithm is $O(N \times M)$, a complexity that can be effectively reduced by running the process in parallel over several machines, e.g., by exploiting large-scale distributed computing models such as MapReduce [62].
Figure 3: Algorithm Nested Loops (labels: Basic Entropy (evaluation), L2; Global Entropy (evaluation), L1)
4. Experimental Setup
This section describes the datasets and metrics considered in the experiments, the adopted experimental methodology, and the implementation details of the state-of-the-art approaches considered and of the proposed approach.
4.1. Datasets
The datasets used during the experiments have been chosen for two reasons: first, they represent two benchmarks in this research field; second, they represent two different distributions of data (i.e., unbalanced and slightly unbalanced). The first one is the German Credit (GC) dataset (unbalanced data distribution) and the second one is the Australian Credit Approval (ACA) dataset (slightly unbalanced data distribution). Both datasets are freely available at the UCI Repository of Machine Learning Databases¹. These datasets are released with all the attributes modified to protect the confidentiality of the data, and we used a version suitable for algorithms that cannot operate with categorical variables (i.e., a version with all numeric attributes). It should be noted that, for other datasets that contain categorical variables, the conversion to numeric form is straightforward.
Table 1: Datasets Overview
Dataset name   Total cases $|T|$   Non-default $|T_+|$   Default $|T_-|$   Attributes $|F|$   Classes $|C|$
GC             1,000               700                   300               21                 2
ACA            690                 307                   383               15                 2
The datasets' characteristics are summarized in Table 1 and detailed in the following:
German Credit (GC). It contains 1,000 instances: 700 of them are non-default instances (70.00%) and 300 are default instances (30.00%). Each instance is composed of 20 features, which are described in Table 2, and a binary class variable (reliable or unreliable).
Australian Credit Approval (ACA). It contains 690 instances: 307 of them are non-default instances (44.5%) and 383 are default instances (55.5%). Each instance is composed of 14 features and a binary class variable (reliable or unreliable). In order to protect the data confidentiality, all feature names and values of this dataset have been changed to meaningless symbols, as shown in Table 3, which reports the feature type instead of its description.
¹ ftp://ftp.ics.uci.edu/pub/machine-learning-databases/statlog/
Table 2: Dataset GC Features
Feature   Description                   Feature   Description
1         Status of checking account    11        Present residence since
2         Duration                      12        Property
3         Credit history                13        Age
4         Purpose                       14        Other installment plans
5         Credit amount                 15        Housing
6         Savings account/bonds         16        Existing credits
7         Present employment since      17        Job
8         Installment rate              18        Maintained people
9         Personal status and sex       19        Telephone
10        Other debtors/guarantors      20        Foreign worker
Table 3: Dataset ACA Features
Feature   Type                Feature   Type
1         Categorical field   8         Categorical field
2         Continuous field    9         Categorical field
3         Continuous field    10        Continuous field
4         Categorical field   11        Categorical field
5         Categorical field   12        Categorical field
6         Categorical field   13        Continuous field
7         Continuous field    14        Continuous field
4.2. Metrics
This section introduces the metrics used to compare our proposed approach with the competitors in the experiments.
Accuracy. This metric reports the fraction of instances correctly classified and is calculated as:
$$\text{Accuracy}(\hat{T}) = \frac{|\hat{T}^{(+)}|}{|\hat{T}|}, \tag{6}$$
where $|\hat{T}|$ corresponds to the total number of instances, and $|\hat{T}^{(+)}|$ to the number of instances correctly classified.
Sensitivity. This metric measures the fraction of reliable instances correctly classified as reliable, providing important information since it allows evaluating the predictive power of our approach in terms of capability to identify the default cases. It is calculated as
$$\text{Sensitivity}(\hat{T}) = \frac{|\hat{T}^{(TP)}|}{|\hat{T}^{(TP)}| + |\hat{T}^{(FN)}|}, \tag{7}$$
where $|\hat{T}^{(TP)}|$ corresponds to the number of instances correctly classified as reliable and $|\hat{T}^{(FN)}|$ to the number of reliable instances erroneously classified as unreliable.
F-score. The F-score represents the harmonic mean of the Precision and Recall metrics and is considered an effective performance measure for unbalanced datasets [63]. Such a metric is calculated as
$$\text{F-score}(T^{(P)}, T^{(R)}) = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
with
$$\text{Precision}(T^{(P)}, T^{(R)}) = \frac{|T^{(R)} \cap T^{(P)}|}{|T^{(P)}|}, \qquad \text{Recall}(T^{(P)}, T^{(R)}) = \frac{|T^{(R)} \cap T^{(P)}|}{|T^{(R)}|}, \tag{8}$$
where $T^{(P)}$ denotes the set of performed classifications of instances, and $T^{(R)}$ the set that contains the actual classifications of them.
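The three metrics above can be computed from paired lists of actual and predicted labels, as in the following Python sketch. It is an illustration under stated assumptions: the function names are hypothetical, the reliable class is treated as the positive class when reading Equations 7 and 8, and the denominators are assumed to be non-zero.

```python
def confusion_counts(actual, predicted, positive="reliable"):
    """Return (tp, fp, fn, tn) with `positive` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(actual, predicted):
    """Fraction of instances whose predicted label matches the actual one."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


def sensitivity(tp, fn):
    """True positives over all actual positives."""
    return tp / (tp + fn)


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, purely illustrative.
actual = ["reliable", "reliable", "reliable", "unreliable", "unreliable"]
predicted = ["reliable", "reliable", "unreliable", "reliable", "unreliable"]
tp, fp, fn, tn = confusion_counts(actual, predicted)
acc = accuracy(actual, predicted)
sens = sensitivity(tp, fn)
f1 = f_score(tp, fp, fn)
```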
Area Under the Receiver Operating Characteristic (AUC). This metric is a performance measure used to evaluate the effectiveness of a classification model [64, 65]. It is calculated as
$$\Theta(t_+, t_-) = \begin{cases} 1, & \text{if } t_+ > t_- \\ 0.5, & \text{if } t_+ = t_- \\ 0, & \text{if } t_+ < t_- \end{cases} \qquad \text{AUC} = \frac{1}{|T_+| \cdot |T_-|} \sum_{1}^{|T_+|} \sum_{1}^{|T_-|} \Theta(t_+, t_-), \tag{9}$$
where $T_+$ is the set of non-default instances, $T_-$ is the set of default instances, and $\Theta$ indicates all possible comparisons between the instances of the two subsets $T_+$ and $T_-$. The final result is obtained by averaging all the comparisons.
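The pairwise form of Equation 9 translates directly into code. The sketch below assumes that each instance has been assigned a numeric score in which higher values mean "more likely non-default"; how such a score is obtained from a given classifier is not specified here, and the function name is hypothetical.

```python
def pairwise_auc(scores_non_default, scores_default):
    """Average of the pairwise comparisons between the two groups of scores."""
    total = 0.0
    for s_pos in scores_non_default:
        for s_neg in scores_default:
            if s_pos > s_neg:
                total += 1.0   # non-default instance ranked above default one
            elif s_pos == s_neg:
                total += 0.5   # tie counts as half
    return total / (len(scores_non_default) * len(scores_default))


# Toy scores, purely illustrative.
auc = pairwise_auc([0.9, 0.8, 0.4], [0.5, 0.4])
```

The double loop costs $|T_+| \cdot |T_-|$ comparisons, which is acceptable for datasets of the size used here.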
4.3. Methodology, Competitors and Proposed Approach Implementation Details
The experiments have been performed using k-fold cross-validation, with k=10. This approach allows us to reduce the impact of data dependency, improving the reliability of the results. For this setup, we chose the Random Forest classifier [66] and three improved Naive Bayes classifiers [67, 68, 69] as competitors.
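A 10-fold split of this kind can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the paper: it uses a plain shuffled split (no stratification), the seed value is arbitrary, and both function names are hypothetical. The helper `reliable_only` illustrates how a training fold could be restricted to non-default instances for an approach that trains on reliable cases only.

```python
import random


def k_fold_indices(n, k=10, seed=1):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test


def reliable_only(train_indices, labels, reliable_label="reliable"):
    """Keep only the training indices whose label is the reliable class."""
    return [i for i in train_indices if labels[i] == reliable_label]


# Illustrative labels with the GC class proportions (700 reliable, 300 unreliable).
labels = ["reliable"] * 700 + ["unreliable"] * 300
splits = list(k_fold_indices(len(labels), k=10))
first_train, first_test = splits[0]
first_train_reliable = reliable_only(first_train, labels)
```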
The Random Forests [66] approach represents one of the most common and powerful state-of-the-art techniques used for Credit Scoring tasks, since in most cases it outperforms the other ones [10, 11, 12]. It is an ensemble learning approach for classification and regression based on the construction of a number of randomized decision trees during the training phase. The conclusion is inferred by averaging the obtained results, and this technique can be used to solve a wide range of prediction problems. Naive Bayes classifiers apply Bayes' theorem to predict the probability that the input data belongs to a particular class; the class with the highest probability is considered the most likely one. We included this kind of classifier as a competitor because it was used for a similar problem before [38]. Therefore, we also compare the proposed approach with some improved Naive Bayes algorithms: Hidden Naive Bayes [67] (we will refer to this competitor as HNB), Deep Feature Weighted Naive Bayes [68] (we will refer to this competitor as DFWNB) and Correlation-based Feature Weighted Naive Bayes [69] (we will refer to this competitor as CFWNB). The implementation used to evaluate the performance of all the baselines in our experiments was the one provided by the Waikato Environment for Knowledge Analysis (WEKA) machine learning package². The parameters of these classifiers are shown in Table 4.
Table 4: Competitor Algorithms Parameters
Algorithm   Parameter                    Value       Description
RF          bagSizePercent               100         Size of each bag as a percentage of the training set size
RF          batchSize                    100         The preferred number of instances to process if batch prediction is being performed
RF          maxDepth                     Unlimited   The maximum depth of the tree
RF          numIterations                100         The number of iterations to be performed
RF          numDecimalPlaces             2           The number of decimal places to be used for the output of numbers in the model
RF          seed                         1           The random number seed to be used
HNB         batchSize                    100         The preferred number of instances to process if batch prediction is being performed
HNB         numDecimalPlaces             2           The number of decimal places to be used for the output of numbers in the model
DFWNB       bagSizePercent               50          Size of each bag as a percentage of the training set size
DFWNB       batchSize                    100         The preferred number of instances to process if batch prediction is being performed
DFWNB       classifier                   DFWNB       The base classifier to be used
DFWNB       ignoreBelowDepth             0           Set to zero weight the attributes below this depth in the trees (0=disable)
DFWNB       numBaggingIterations         10          Number of bagging iterations
DFWNB       useCFSBasedWeighting         True        Use CFS-Based Feature Weighting
DFWNB       useGainRatioBasedWeighting   False       Use Gain-Ratio-Based Weighting
DFWNB       useInfoGainBasedWeighting    False       Use Info-Gain-Based Weighting
DFWNB       useCFSBasedWeighting         True        Use CFS-Based Weighting
DFWNB       useLogDepthWeighting         False       Use Log-Depth Weighting
DFWNB       usePrunedTrees               False       Use Pruned Trees for bagging
DFWNB       useReliefBasedWeighting      False       Use Relief-Based Weighting
DFWNB       useZeroOneWeights            False       Use Zero-One Weights
DFWNB       numDecimalPlaces             2           The number of decimal places to be used for the output of numbers in the model
DFWNB       seed                         1           The random number seed to be used
CFWNB       batchSize                    100         The preferred number of instances to process if batch prediction is being performed
CFWNB       numDecimalPlaces             2           The number of decimal places to be used for the output of numbers in the model
² https://www.cs.waikato.ac.nz/ml/
The proposed approach was developed in Java. The entropy measures needed for the approach presented in this paper have been computed by using JavaMI³, a Java port of MIToolbox⁴.
5. Experiments
In this section, we discuss the experimental results. We divide this section into two subsections: in the first, we present the experiments done to find the parameters of the proposed approach; in the second, we discuss the final experimental results, comparing the proposed approach against its version without feature selection and against the competitors on real-world credit scoring datasets.
5.1. Parameter Tuning Experiments
In this subsection, we discuss the experimental results that helped us find the best parameters of the proposed approach. In Section 5.1.1, we show how we found the features to be removed in the feature selection step of our approach. Then, in Section 5.1.2, we report the experiments that helped us find the EDA threshold of our approach.
5.1.1. Feature Selection
In our first parameter-finding experiment, we perform a study aimed at evaluating the contribution of each instance feature to the classification process in the proposed approach. We do this by exploiting two different evaluation approaches based on the concepts of entropy previously discussed in Section 3.1. The basic and mutual entropies of each feature are shown in Figure 4.
³ http://www.cs.man.ac.uk/~pococka4/JavaMI.html
⁴ http://www.cs.man.ac.uk/~pococka4/MIToolbox.html
Figure 4: Basic and Mutual Entropy. Panels: (a) Basic entropy, GC dataset (features 1-20); (b) Mutual entropy, GC dataset (features 1-20); (c) Basic entropy, ACA dataset (features 1-14); (d) Mutual entropy, ACA dataset (features 1-14). The y-axis of each panel ranges from min to max.
The results shown in Figure 4 indicate that, although several features present a high level of entropy (i.e., a low level of instance characterization, since the entropy increases as the data becomes equally probable), they have a positive contribution in a mutual relation with other features (the number of mutual relations is represented by the horizontal lines in the feature bars). Considering that all values of entropy have been normalized in the [0,1] range (y-axis), that high values of Basic Entropy indicate high levels of uncertainty, and that high values of Mutual Entropy indicate large reductions of uncertainty, we can make the following considerations:
(i) The Basic Entropy results, reported in Figure 4.a and Figure 4.c, show that many features present a high level of Basic Entropy (we considered as relevant a value of Basic Entropy above two thirds of the interval): features 1, 3, 6, 7, 8, 11, and 19 in the GC dataset, and features 2, 10, and 12 in the ACA dataset.
(ii) The Mutual Entropy results, reported in Figure 4.b and Figure 4.d, show whether there are features with a high level of Basic Entropy for which there exists a Mutual Entropy with other features that reduces their uncertainty (we considered as relevant a value of Mutual Entropy above one quarter of the interval). In our case, no feature presents such a status, since the only features with a relevant value of Mutual Entropy are features 12 and 15 of the GC dataset, and there are no relevant features in the ACA dataset.
(iii) Such a scenario leads us to exclude from the model definition process all the features with a high level of Basic Entropy, i.e., features 1, 3, 6, 7, 8, 11, and 19 of the GC dataset, and features 2, 10, and 12 of the ACA dataset. It should be noted that the high level of uncertainty reported by the Basic Entropy can be determined by two factors: either the information gathered by the system is inadequate or the nature of the information has a low relevance for the classification task.
Furthermore, it should be observed that the aforementioned process reduces the computational complexity: after the feature selection, we excluded from the model definition process 7,000 elements (feature values involved in the evaluation process), i.e., 35.00% of the total elements, from the GC dataset and 2,070 elements (21.43% of the total elements) from the ACA dataset, as reported in Table 5.
Table 5: Feature Selection Process
Dataset name   Dataset total features   Removed total features   Processed total features   Reduction percentage
GC             20,000                   7,000                    13,000                     35.00
ACA            9,660                    2,070                    7,590                      21.43
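The selection rule and the resulting reduction can be illustrated with a short Python sketch. It assumes that the Basic Entropy of each feature has already been normalized to the [0,1] range and applies only the two-thirds cutoff described above (the Mutual Entropy check is not modeled); the function names and the toy entropy values are hypothetical.

```python
def select_features(normalized_entropy, cutoff=2.0 / 3.0):
    """Split 1-based feature indices into (kept, removed) by an entropy cutoff."""
    kept = [i for i, h in enumerate(normalized_entropy, start=1) if h <= cutoff]
    removed = [i for i, h in enumerate(normalized_entropy, start=1) if h > cutoff]
    return kept, removed


def reduction(n_instances, n_features, n_removed):
    """Return (total, removed, processed, removed percentage) of feature values."""
    total = n_instances * n_features
    removed = n_instances * n_removed
    return total, removed, total - removed, 100.0 * removed / total


# Toy entropies for three features, purely illustrative.
kept, removed = select_features([0.9, 0.1, 0.7])

# Element counts using the dataset sizes and the numbers of removed features
# stated in the text (GC: 1,000 x 20, 7 removed; ACA: 690 x 14, 3 removed).
gc_counts = reduction(1000, 20, 7)
aca_counts = reduction(690, 14, 3)
```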
5.1.2. Finding the Optimal EDA Threshold
According to the formalization of our approach in Algorithm 3, we need to define an optimal threshold $\Theta$, which can be considered a function of the hyper-plane that classifies the samples in $\hat{T}'$ as reliable or unreliable. Such an operation was performed by testing all the possible values, as shown in Figure 5. The tests were stopped as soon as the measured accuracy did not improve further, and the results showed that the optimal threshold $\Theta$ (i.e., the one related to the maximum value of Accuracy) was 3 for the GC dataset (Accuracy 70.30%) and 5 for the ACA dataset (Accuracy 67.20%).
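A threshold search with this stopping rule can be sketched as follows. The sketch is illustrative only: `tune_threshold` and `accuracy_for` are hypothetical names, the candidate thresholds are assumed to be tried in the order given, and the accuracy values in the usage lines are invented for the example rather than taken from Figure 5.

```python
def tune_threshold(accuracy_for, candidates):
    """Try thresholds in order and stop at the first one that does not improve."""
    best_theta, best_acc = None, float("-inf")
    for theta in candidates:
        acc = accuracy_for(theta)
        if acc > best_acc:
            best_theta, best_acc = theta, acc
        else:
            break  # accuracy did not improve further
    return best_theta, best_acc


# Invented accuracies for illustration; a real run would evaluate the
# classifier on validation data for each threshold.
toy_accuracy = {1: 0.60, 2: 0.65, 3: 0.70, 4: 0.68, 5: 0.72}
best_theta, best_acc = tune_threshold(toy_accuracy.get, [1, 2, 3, 4, 5])
```

Note that stopping at the first non-improving value is a greedy choice: in the toy data the search stops at threshold 4 and never sees the higher value at threshold 5, so an exhaustive sweep may be preferable when the accuracy curve is not unimodal.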
5.2. Results
The experimental results are divided into three parts: (i) the effect of feature selection on the proposed approach; (ii) performance evaluation on public datasets; and (iii) performance under different levels of class unbalance. We discuss these experiments in detail in the following subsections.
5.2.1. The Effect of Feature Selection
We first report the results of comparing the proposed approach with and without feature selection, a dataset preprocessing step of our approach discussed in Section 3.1. Figure 6 shows that removing the features detected through the process performed in Section 5.1.1 presents a twofold advantage. First, it reduces the computational complexity, since fewer elements are involved in the evaluation process (as reported in Table 5). Second, it improves the performance in terms of all the metrics taken into account. Figure 6 highlights in particular the jump in Sensitivity for both datasets (0.88 to 0.92 in the GC dataset, and 0.70 to 0.86 in the ACA dataset), showing that the proposed approach eliminates noise in the samples of the default class, improving their classification.
Figure 5: Entropy Difference Approach Tuning. Panels: Accuracy (y-axis) versus threshold (x-axis) for the GC dataset (thresholds 1-20) and for the ACA dataset (thresholds 1-14).
5.2.2. Real-World Credit Scoring
We show the results considering different metrics that compare our proposed approach against the competitors in Figures 7 and 8. These figures show that our approach has promising results compared with the other baselines, even without any knowledge about default cases in its training step. The leftmost part of Figure 7 shows that our approach achieved the best accuracy for the most unbalanced dataset (GC), but not for the slightly unbalanced dataset (ACA). However, the rightmost part of Figure 7 shows that the proposed approach had the best default detection (sensitivity) for both datasets, with an almost perfect detection of GC default cases. The performance differences between datasets happen because of the complexity of the samples in the ACA dataset, which are composed of different features from the ones in the GC dataset. The best baseline, namely a deep naive Bayes classifier [68] (DFWNB), succeeded only for the most balanced dataset (ACA), highlighting the fact that it performs an efficient credit scoring only when it has sufficient samples of both classes for training.
Figure 6: Proposed EDA approach metrics before and after the feature selection process. Values read as before → after:
GC dataset: Accuracy 0.68 → 0.69; Sensitivity 0.88 → 0.92; F-score 0.79 → 0.80; AUC 0.87 → 0.89.
ACA dataset: Accuracy 0.60 → 0.67; Sensitivity 0.70 → 0.86; F-score 0.61 → 0.70; AUC 0.68 → 0.76.
Figure 8 shows in its leftmost part that the F-score of our approach for the GC dataset is the best. The AUC of the proposed approach (rightmost part of Figure 8) is also the best for the GC dataset. All the other baselines (RF, HNB, DFWNB, CFWNB) had poor performance in this scenario, even considering the more balanced dataset (ACA). A special case worth mentioning is the low performance of the RF classifier, an ensemble of decision trees that is expected to work better in this real-world scenario. Such findings allow us to conclude that, at most, the baselines can have competitive performance against our approach only when balanced classes are available for training. We further show in the next subsection how our approach works better when fewer default cases are available in the training data.
Figure 7: Accuracy and Sensitivity Performance. Values read as GC / ACA:
Accuracy: RF 0.69 / 0.72; HNB 0.70 / 0.74; DFWNB 0.71 / 0.84; CFWNB 0.57 / 0.78; EDA 0.73 / 0.67.
Sensitivity: RF 0.69 / 0.72; HNB 0.70 / 0.74; DFWNB 0.71 / 0.84; CFWNB 0.57 / 0.78; EDA 0.99 / 0.86.
Figure 8: F-score and AUC Performance. Values read as GC / ACA:
F-score: RF 0.57 / 0.71; HNB 0.64 / 0.74; DFWNB 0.72 / 0.84; CFWNB 0.58 / 0.78; EDA 0.84 / 0.70.
AUC: RF 0.71 / 0.83; HNB 0.66 / 0.80; DFWNB 0.76 / 0.91; CFWNB 0.70 / 0.86; EDA 0.89 / 0.76.
5.2.3. Performance with Different Levels of Unbalance
In addition to the experiments discussed before, we tested our approach and the competitors on the GC dataset (the most unbalanced one) with different unbalance levels. To this end, we reduced the 300 original unreliable cases that compose the GC dataset according to five different levels of unbalance. In more detail, we used 50, 100, 150, 200, and 300 (original dataset) unreliable cases, joining them with the 700 reliable cases already present in the GC dataset. This creates new datasets with 750, 800, 850, 900 and 1,000 (original dataset) samples, respectively. The unreliable cases thus correspond, respectively, to 6.66%, 12.50%, 17.64%, 22.22% and 30.00% of the samples in these datasets. For the experiments, we split the resulting datasets into training and test sets according to the 10-fold cross-validation criterion, with our approach being the only one that does not consider default cases for training. Figure 9 shows the results of such experiments.
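The dataset sizes and unbalance levels used in this experiment follow from simple arithmetic, sketched below in Python (the function name is hypothetical). Note that the percentages quoted in the text, such as 6.66% and 17.64%, are truncated to two decimals, whereas the sketch keeps the full floating-point value.

```python
def unbalance_levels(n_reliable, unreliable_counts):
    """Return (unreliable count, dataset size, unreliable share in %) per level."""
    levels = []
    for n_unreliable in unreliable_counts:
        total = n_reliable + n_unreliable
        levels.append((n_unreliable, total, 100.0 * n_unreliable / total))
    return levels


# 700 reliable GC cases joined with 50, 100, 150, 200 and 300 unreliable cases.
levels = unbalance_levels(700, [50, 100, 150, 200, 300])
totals = [total for _, total, _ in levels]
shares = [share for _, _, share in levels]
```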
The results shown in Figure 9 highlight the proactive nature of our approach. Since the default cases are not considered in the training set, the imbalanced nature of the problem does not influence our training. All the other approaches are affected when less default training data is present, so they reach an accuracy comparable to or better than ours only when more default training data are available, as can be seen in the first row of Figure 9 (from 17.64% to 30%). However, the sensitivities of these approaches remain low, as the training data is still unbalanced with respect to the default cases, while our approach keeps an almost perfect sensitivity in all unbalanced scenarios, as can be seen in the second row of Figure 9. The F-score of our approach, a recommended measure for unbalanced environments, also highlights its proactive nature, as it outperforms all the baselines in all unbalanced scenarios, as can be seen in the third row of Figure 9. Finally, the fact that our approach outperforms the baselines in 17 out of 20 experiments further supports its application in the unbalanced environment of credit scoring.
As in the previous experiments, DFWNB was the best competitor in this experiment. However, it is noticeably biased when high levels of unbalance come into play, a scenario that is likely to happen in real-world credit scoring datasets. It outperformed our approach in only two experiments in this subsection, and was the best one in just one experiment (AUC of the 6.66% dataset, leftmost part of the fourth row in Figure 9). Nevertheless, given its good results, we believe that both approaches can be fused for a better credit scoring. This can be done, for example, by applying different weights to the decisions of these different classifiers.
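One possible reading of such a weighted fusion is a weighted vote over the labels produced by the individual classifiers, sketched below in Python. The paper does not specify a fusion scheme, so everything here is an assumption made for illustration: the function name, the weight values, and the conservative choice of resolving ties as unreliable.

```python
def weighted_vote(decisions, weights, positive="reliable", negative="unreliable"):
    """Combine label decisions with per-classifier weights.

    decisions: labels produced by each classifier for the same instance
    weights:   non-negative weight of each classifier (hypothetical values)
    """
    # Each classifier adds its weight for a positive vote and subtracts it otherwise.
    score = sum(w if d == positive else -w for d, w in zip(decisions, weights))
    return positive if score > 0 else negative


# Hypothetical example: one classifier votes reliable with weight 0.6,
# the other votes unreliable with weight 0.4.
fused = weighted_vote(["reliable", "unreliable"], [0.6, 0.4])
tied = weighted_vote(["reliable", "unreliable"], [0.5, 0.5])
```

In practice the weights would have to be chosen on validation data, for example according to each classifier's measured performance at the expected level of unbalance.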
Figure 9: Performance with Different Levels of Unbalance (x-axis: GC dataset unreliable cases percentage). Values read at the 6.66 / 12.50 / 17.64 / 22.22 / 30.00 levels:
Accuracy: RF 0.93 / 0.87 / 0.82 / 0.77 / 0.69; HNB 0.92 / 0.88 / 0.80 / 0.75 / 0.70; DFWNB 0.87 / 0.82 / 0.77 / 0.75 / 0.71; CFWNB 0.34 / 0.74 / 0.39 / 0.45 / 0.57; EDA 0.94 / 0.90 / 0.34 / 0.81 / 0.73.
Sensitivity: RF 0.93 / 0.87 / 0.82 / 0.77 / 0.69; HNB 0.02 / 0.86 / 0.80 / 0.75 / 0.70; DFWNB 0.87 / 0.82 / 0.77 / 0.75 / 0.71; CFWNB 0.34 / 0.74 / 0.39 / 0.45 / 0.57; EDA 0.99 / 0.99 / 0.99 / 0.99 / 0.99.
F-score: RF 0.96 / 0.93 / 0.90 / 0.87 / 0.57; HNB 0.03 / 0.80 / 0.73 / 0.70 / 0.64; DFWNB 0.88 / 0.82 / 0.78 / 0.75 / 0.72; CFWNB 0.45 / 0.74 / 0.41 / 0.46 / 0.58; EDA 0.97 / 0.94 / 0.91 / 0.89 / 0.84.
AUC: RF 0.62 / 0.65 / 0.69 / 0.69 / 0.71; HNB 0.67 / 0.60 / 0.65 / 0.65 / 0.66; DFWNB 0.73 / 0.74 / 0.76 / 0.76 / 0.76; CFWNB 0.67 / 0.80 / 0.72 / 0.71 / 0.70; EDA 0.69 / 0.69 / 0.86 / 0.89 / 0.89.
6. Conclusions and Future Work
Credit Scoring machine learning techniques play a crucial role in many financial contexts (e.g., personal loans and insurance policies), since they are used by financial operators to evaluate the potential risks of lending, thereby reducing the losses due to unreliable users. However, several issues arise in such an application, such as the data imbalance problem in datasets, where the number of unreliable cases is much smaller than the number of reliable cases, and the cold-start problem, where there is scarcity or absence of previous non-reliable cases. These issues can seriously affect machine learning approaches aimed at classifying new instances in the Credit Scoring environment.
This paper proposes a novel Credit Scoring approach that exploits entropy-based criteria in order to build a model able to classify a new instance without knowledge of past non-reliable instances. Our approach works by comparing the entropy behavior of existing reliable samples before and after adding an instance under investigation. In this way, it can operate in a proactive manner, facing the cold-start and data imbalance problems that reduce the effectiveness of the canonical Credit Scoring approaches. The experimental results underline two main aspects of our approach: on the one hand, it has competitive performance compared to existing classifiers when the training set is composed of slightly unbalanced (or almost balanced) classes; on the other hand, it is able to outperform its competitors, especially when the training process is characterized by an unbalanced distribution of training data. This last aspect represents an important result, since it shows the capability of the proposed approach to operate in scenarios where the canonical machine learning approaches are not able to achieve optimal performance. This is especially true in the typical contexts of Credit Scoring, where an unbalanced distribution of data is usually present. Even without totally replacing the canonical Credit Scoring approaches, our approach offers the possibility to overcome the cold-start issue, together with the capability to manage the unbalanced distribution of data, and can be used jointly with existing approaches, resulting in an effective hybrid model.
According to the previous considerations, one direction of future work is to evaluate the advantages and disadvantages of including the default cases in the model definition process, as well as to evaluate our approach in heterogeneous scenarios that involve different types of financial data, such as those generated by an electronic commerce environment. A final goal is then to define a novel approach (hybrid or based only on the proposed approach) able to operate effectively in all possible scenarios.
Acknowledgments
The research performed in this paper has been supported by the "Bando Aiuti per progetti di Ricerca e Sviluppo - POR FESR 2014-2020 - Asse 1, Azione 1.1.3. Project IntelliCredit: AI-powered digital lending platform". We would also like to thank the authors of the Deep Feature Weighted Naive Bayes approach, who kindly shared their source code with us.
References
[1] J. Morrison, Introduction to survival analysis in business, The Journal of
Business Forecasting 23 (1) (2004) 18.
[2] L. J. Mester, et al., What's the point of credit scoring?, Business review 3 (1997) 3–16.
[3] W. E. Henley, Statistical aspects of credit scoring., Ph.D. thesis, Open
University (1994).
[4] W. Henley, et al., Construction of a k-nearest-neighbour credit-scoring system, IMA Journal of Management Mathematics 8 (4) (1997) 305–321.
[5] A. Fensterstock, Credit scoring and the next step, Business Credit 107 (3)
(2005) 46–49.
[6] J. Brill, The importance of credit scoring models in improving cash flow
and collections, Business Credit 100 (1) (1998) 16–17.
[7] A. D. Pozzolo, O. Caelen, Y. L. Borgne, S. Waterschoot, G. Bontempi, Learned lessons in credit card fraud detection from a practitioner perspective, Expert Syst. Appl. 41 (10) (2014) 4915–4928. doi:10.1016/j.eswa.2014.02.026.
[8] G. E. Batista, R. C. Prati, M. C. Monard, A study of the behavior of
several methods for balancing machine learning training data, ACM Sigkdd
Explorations Newsletter 6 (1) (2004) 20–29.
[9] N. Japkowicz, S. Stephen, The class imbalance problem: A systematic
study, Intelligent Data Analysis 6 (5) (2002) 429–449.
[10] S. Lessmann, B. Baesens, H. Seow, L. C. Thomas, Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research, European Journal of Operational Research 247 (1) (2015) 124–136.
[11] I. Brown, C. Mues, An experimental comparison of classification algorithms
for imbalanced credit scoring data sets, Expert Syst. Appl. 39 (3) (2012)
3446–3453. doi:10.1016/j.eswa.2011.09.033.
[12] S. Bhattacharyya, S. Jha, K. K. Tharakunnel, J. C. Westland, Data mining
for credit card fraud: A comparative study, Decision Support Systems
50 (3) (2011) 602–613. doi:10.1016/j.dss.2010.08.008.
URL http://dx.doi.org/10.1016/j.dss.2010.08.008
[13] R. Saia, S. Carta, An entropy based algorithm for credit scoring, in: A. M. Tjoa, L. D. Xu, M. Raffai, N. M. Novak (Eds.), Research and Practical Issues of Enterprise Information Systems - 10th IFIP WG 8.9 Working Conference, CONFENIS 2016, Vienna, Austria, December 13-14, 2016, Proceedings, Vol. 268 of Lecture Notes in Business Information Processing, 2016, pp. 263–276. doi:10.1007/978-3-319-49944-4_20.
URL http://dx.doi.org/10.1007/978-3-319-49944-4_20
[14] D. J. Hand, W. E. Henley, Statistical classification methods in consumer
credit scoring: a review, Journal of the Royal Statistical Society: Series A
(Statistics in Society) 160 (3) (1997) 523–541.
[15] M. Doumpos, C. Zopounidis, Credit scoring, in: Multicriteria Analysis in
Finance, Springer, 2014, pp. 43–59.
[16] R. Saia, S. Carta, Introducing a vector space model to perform a proactive
credit scoring, in: International Joint Conference on Knowledge Discovery,
Knowledge Engineering, and Knowledge Management, Springer, 2016, pp.
125–148.
[17] R. Saia, S. Carta, D. R. Recupero, G. Fenu, M. Saia, A discretized enriched
technique to enhance machine learning performance in credit scoring, in:
KDIR, 2019.
[18] R. Saia, S. Carta, G. Fenu, A wavelet-based data analysis to credit scor-
ing, in: Proceedings of the 2nd International Conference on Digital Signal
Processing, ACM, 2018, pp. 176–180.
[19] R. Saia, S. Carta, A fourier spectral pattern analysis to design credit scoring
models, in: Proceedings of the 1st International Conference on Internet of
Things and Machine Learning, ACM, 2017, p. 18.
[20] R. Saia, S. Carta, A linear-dependence-based approach to design proactive
credit scoring models, in: KDIR, 2016, pp. 111–120.
[21] V. Ceronmani Sharmila, K. Kumar R, S. R, S. D, H. R, Credit card fraud
detection using anomaly techniques, in: International Conference on In-
novations in Information and Communication Technology (ICIICT), 2019,
pp. 1–6.
[22] F. Fang, Y. Chen, A new approach for credit scoring by directly maximizing
the Kolmogorov–Smirnov statistic, Computational Statistics & Data Analysis
133 (2019) 180–194.
[23] X. Zhang, Y. Yang, Z. Zhou, A novel credit scoring model based on op-
timized random forest, in: IEEE Annual Computing and Communication
Workshop and Conference (CCWC), 2018, pp. 60–65.
[24] S. Maldonado, G. Peters, R. Weber, Credit scoring using three-way
decisions with probabilistic rough sets, Information Sciences.
doi:https://doi.org/10.1016/j.ins.2018.08.001.
URL http://www.sciencedirect.com/science/article/pii/S0020025518306078
[25] B. Zhu, W. Yang, H. Wang, Y. Yuan, A hybrid deep learning model for
consumer credit scoring, in: International Conference on Artificial Intelligence
and Big Data (ICAIBD), 2018, pp. 205–208. doi:10.1109/ICAIBD.2018.8396195.
[26] Y. Tian, Z. Yong, J. Luo, A new approach for reject inference in credit
scoring using kernel-free fuzzy quadratic surface support vector machines,
Applied Soft Computing 73 (2018) 96–105.
[27] V. Neagoe, A. Ciotec, G. Cucu, Deep convolutional neural networks versus
multilayer perceptron for financial prediction, in: International Conference
on Communications (COMM), 2018, pp. 201–206.
[28] S. Ali, K. A. Smith, On learning algorithm selection for classification, Appl.
Soft Comput. 6 (2) (2006) 119–138. doi:10.1016/j.asoc.2004.12.002.
[29] D. J. Hand, Measuring classifier performance: a coherent alternative to
the area under the ROC curve, Machine Learning 77 (1) (2009) 103–123.
doi:10.1007/s10994-009-5119-5.
[30] T.-S. Lee, I.-F. Chen, A two-stage hybrid credit scoring model using arti-
ficial neural networks and multivariate adaptive regression splines, Expert
Systems with Applications 28 (4) (2005) 743–752.
[31] G. Wang, J. Hao, J. Ma, H. Jiang, A comparative assessment of ensemble
learning for credit scoring, Expert Syst. Appl. 38 (1) (2011) 223–230. doi:
10.1016/j.eswa.2010.06.048.
[32] N.-C. Hsieh, Hybrid mining approach in the design of credit scoring models,
Expert Systems with Applications 28 (4) (2005) 655–665.
[33] J. López, S. Maldonado, Profit-based credit scoring based on robust optimization
and feature selection, Information Sciences 500 (2019) 190–202.
[34] S. Guo, H. He, X. Huang, A multi-stage self-adaptive classifier ensemble
model with application in credit scoring, IEEE Access 7 (2019) 78549–
78559.
[35] H. Zhang, H. He, W. Zhang, Classifier selection and clustering with fuzzy
assignment in ensemble model for credit scoring, Neurocomputing 316
(2018) 210–221.
[36] X. Feng, Z. Xiao, B. Zhong, J. Qiu, Y. Dong, Dynamic ensemble classifi-
cation for credit scoring using soft probability, Applied Soft Computing 65
(2018) 139–151. doi:https://doi.org/10.1016/j.asoc.2018.01.021.
URL http://www.sciencedirect.com/science/article/pii/S1568494618300279
[37] D. Tripathi, D. R. Edla, V. Kuppili, A. Bablani, R. Dharavath, Credit
scoring model based on weighted voting and cluster based feature selection,
Procedia Computer Science 132 (2018) 22–31, International Conference
on Computational Intelligence and Data Science.
[38] R. Vedala, B. R. Kumar, An application of naive Bayes classification for
credit scoring in e-lending platform, in: International Conference on Data
Science Engineering (ICDSE), 2012, pp. 81–84. doi:10.1109/ICDSE.2012.6282321.
[39] D. Sewwandi, K. Perera, S. Sandaruwan, O. Lakchani, A. Nugaliyadde,
S. Thelijjagoda, Linguistic features based personality recognition using social
media data, in: 2017 6th National Conference on Technology and Management
(NCTM), 2017, pp. 63–68. doi:10.1109/NCTM.2017.7872829.
[40] X. Sun, B. Liu, J. Cao, J. Luo, X. Shen, Who am I? Personality detection
based on deep learning for texts, in: IEEE International Conference on
Communications (ICC), 2018, pp. 1–6.
[41] R. F. López, J. M. Ramon-Jeronimo, Modelling credit risk with scarce
default data: on the suitability of cooperative bootstrapped strategies for
small low-default portfolios, JORS 65 (3) (2014) 416–434. doi:10.1057/jors.2013.119.
URL http://dx.doi.org/10.1057/jors.2013.119
[42] G. Garibotto, P. Murrieri, A. Capra, S. D. Muro, U. Petillo, F. Flammini,
M. Esposito, C. Pragliola, G. D. Leo, R. Lengu, N. Mazzino, A. Paolillo,
M. D’Urso, R. Vertucci, F. Narducci, S. Ricciardi, A. Casanova, G. Fenu,
M. D. Mizio, M. Savastano, M. D. Capua, A. Ferone, White paper on in-
dustrial applications of computer vision and pattern recognition, in: ICIAP
(2), Vol. 8157 of Lecture Notes in Computer Science, Springer, 2013, pp.
721–730.
[43] A. Chatterjee, A. Segev, Data manipulation in heterogeneous databases,
ACM SIGMOD Record 20 (4) (1991) 64–68.
[44] B. Lika, K. Kolomvatsos, S. Hadjiefthymiades, Facing the cold start prob-
lem in recommender systems, Expert Syst. Appl. 41 (4) (2014) 2065–2073.
doi:10.1016/j.eswa.2013.09.005.
URL http://dx.doi.org/10.1016/j.eswa.2013.09.005
[45] L. H. Son, Dealing with the new user cold-start problem in recommender
systems: A comparative review, Inf. Syst. 58 (2016) 87–104. doi:10.1016/j.is.2014.10.001.
URL http://dx.doi.org/10.1016/j.is.2014.10.001
[46] I. Fernández-Tobías, P. Tomeo, I. Cantador, T. D. Noia, E. D. Sciascio, Accuracy
and diversity in cross-domain recommendations for cold-start users
with positive-only feedback, in: S. Sen, W. Geyer, J. Freyne, P. Castells
(Eds.), Proceedings of the 10th ACM Conference on Recommender Sys-
tems, Boston, MA, USA, September 15-19, 2016, ACM, 2016, pp. 119–122.
doi:10.1145/2959100.2959175.
URL http://doi.acm.org/10.1145/2959100.2959175
[47] J. Attenberg, F. J. Provost, Inactive learning?: difficulties employing active
learning in practice, SIGKDD Explorations 12 (2) (2010) 36–41. doi:
10.1145/1964897.1964906.
URL http://doi.acm.org/10.1145/1964897.1964906
[48] V. Thanuja, B. Venkateswarlu, G. Anjaneyulu, Applications of data mining
in customer relationship management, Journal of Computer and Mathematical
Sciences 2 (3) (2011) 399–580.
[49] H. He, E. A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl.
Data Eng. 21 (9) (2009) 1263–1284. doi:10.1109/TKDE.2008.239.
[50] V. Vinciotti, D. J. Hand, Scorecard construction with unbalanced class
sizes, Journal of Iranian Statistical Society 2 (2) (2003) 189–205.
[51] A. I. Marqués, V. García, J. S. Sánchez, On the suitability of resampling
techniques for the class imbalance problem in credit scoring, JORS 64 (7)
(2013) 1060–1070. doi:10.1057/jors.2012.120.
URL http://dx.doi.org/10.1057/jors.2012.120
[52] S. F. Crone, S. Finlay, Instance sampling in credit scoring: An empirical
study of sample size and balancing, International Journal of Forecasting
28 (1) (2012) 224–238.
[53] L. Jiang, C. Qiu, C. Li, A novel minority cloning technique for cost-sensitive
learning, International Journal of Pattern Recognition and Artificial Intel-
ligence 29 (04) (2015) 1551004.
[54] L. Jiang, C. Li, S. Wang, Cost-sensitive Bayesian network classifiers, Pattern
Recognition Letters 45 (2014) 211–216.
[55] B. Tang, H. He, GIR-based ensemble sampling approaches for imbalanced
learning, Pattern Recognition 71 (2017) 306–319.
[56] X. Yang, Q. Kuang, W. Zhang, G. Zhang, AMDO: An over-sampling technique
for multi-class imbalanced problems, IEEE Transactions on Knowledge
and Data Engineering 30 (9) (2018) 1672–1685.
[57] J. Zhang, V. S. Sheng, Q. Li, J. Wu, X. Wu, Consensus algorithms for
biased labeling in crowdsourcing, Information Sciences 382–383 (2017)
254–273.
[58] S. Vluymans, A. Fernández, Y. Saeys, C. Cornelis, F. Herrera, Dynamic
affinity-based classification of multi-class imbalanced data with one-versus-
one decomposition: a fuzzy rough set approach, Knowledge and Informa-
tion Systems 56 (1) (2018) 55–84.
[59] Z. Zhang, B. Krawczyk, S. Garcia, A. Rosales-Perez, F. Herrera, Empowering
one-vs-one decomposition with ensemble learning for multi-class imbalanced
data, Knowledge-Based Systems 106 (2016) 251–263.
[60] Y. Liu, M. Schumann, Data mining feature selection for credit scoring
models, Journal of the Operational Research Society 56 (9) (2005) 1099–
1108.
[61] J. R. Lent, M. Lent, E. R. Meeks, Y. Cai, T. J. Coltrell, D. W. Dowhan,
Method and apparatus for real time on line credit approval, US Patent
6,405,181 (Jun. 11 2002).
[62] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large
clusters, Commun. ACM 51 (1) (2008) 107–113. doi:10.1145/1327452.1327492.
URL http://doi.acm.org/10.1145/1327452.1327492
[63] A. D. Pozzolo, O. Caelen, R. A. Johnson, G. Bontempi, Calibrating prob-
ability with undersampling for unbalanced classification, in: IEEE Sym-
posium Series on Computational Intelligence, SSCI 2015, Cape Town,
South Africa, December 7-10, 2015, IEEE, 2015, pp. 159–166. doi:
10.1109/SSCI.2015.33.
URL http://dx.doi.org/10.1109/SSCI.2015.33
[64] D. M. W. Powers, Evaluation: From precision, recall and F-measure to
ROC, informedness, markedness & correlation, Journal of Machine Learning
Technologies 2 (1) (2011) 37–63.
[65] D. Faraggi, B. Reiser, Estimation of the area under the ROC curve, Statistics
in Medicine 21 (20) (2002) 3093–3106.
[66] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32.
[67] L. Jiang, H. Zhang, Z. Cai, A novel Bayes model: Hidden naive Bayes,
IEEE Transactions on Knowledge and Data Engineering 21 (10) (2009)
1361–1371. doi:10.1109/TKDE.2008.234.
[68] L. Jiang, C. Li, S. Wang, L. Zhang, Deep feature weighting for naive Bayes
and its application to text classification, Engineering Applications of Arti-
ficial Intelligence 52 (2016) 26–39.
[69] L. Jiang, L. Zhang, C. Li, J. Wu, A correlation-based feature weighting filter
for naive Bayes, IEEE Transactions on Knowledge and Data Engineering
31 (2) (2019) 201–213.