A Wavelet-based Data Analysis to Credit Scoring

Roberto Saia, Salvatore Carta, Gianni Fenu

Department of Mathematics and Computer Science

University of Cagliari, Via Ospedale 72 - 09124 Cagliari, Italy

Email: {roberto.saia, salvatore, fenu}@unica.it

Abstract—The dramatic growth in consumer credit has made the methods based on human intervention, aimed at assessing the potential solvency of loan applicants, ineffective. For this reason, the development of approaches able to automate this operation represents today an active and important research area named Credit Scoring. In this scenario, the design of effective approaches is a hard challenge, due to a series of well-known problems such as data imbalance, data heterogeneity, and the cold start. The Centroid Wavelet-based Approach proposed in this paper faces these issues by moving the data analysis from its canonical domain to a new time-frequency one, where the analysis is performed through three different metrics of similarity. Its main objective is to achieve a better characterization of the loan applicants on the basis of the information previously gathered by the Credit Scoring system. The performed experiments show that such an approach outperforms the state-of-the-art solutions.

Index Terms—business intelligence, credit scoring, pattern mining, data processing, wavelets, classification, metrics

I. INTRODUCTION

Credit Scoring approaches aim to evaluate user reliability in several contexts, such as those related to loan applications (from now on named instances). They play an increasingly crucial role in an age dominated by consumer credit, since the amount of money lost by financial operators due to loans fully or partially not repaid depends on their effectiveness.

A Credit Scoring approach classifies each new instance as reliable or unreliable by exploiting an evaluation model defined on the basis of the previous instances. Such approaches can be considered statistical methods aimed at evaluating the probability that a new instance will lead to the full or partial non-repayment of a loan [1]. Defining effective Credit Scoring approaches is not an easy task, due to a series of well-known problems.

The idea around which this paper revolves is mainly based on the exploitation of the Discrete Wavelet Transformation (DWT) [2], which is used to move the data analysis into a non-canonical time-frequency domain. In such a domain the analysis is performed through three different metrics of similarity, aimed at better characterizing the classes of data involved in the Credit Scoring process. The contributions of this paper are as follows:

- definition of the time series to be used as input of the DWT process, built on the basis of the previous instances;

- conversion of the instance time series into the frequency-time domain by using the DWT process;

- formalization of the Centroid Wavelet-based Approach (CWA) algorithm, able to classify the new instances.

The paper is organized as follows: Section II provides an overview of the Credit Scoring scenario; Section III introduces the formal notation adopted in this paper; Section IV describes our approach; Section V provides details about the performed experiments and their results; Section VI draws some conclusions and points to further directions for research.

II. BACKGROUND AND RELATED WORK

An ideal Credit Scoring system should be able to evaluate each new instance correctly, by classifying it as reliable or unreliable on the basis of the information available in the previous instances.

The literature proposes a considerable number of classification techniques aimed at performing such a task [3], as well as many studies focused on the evaluation of their performance [4], on the optimal tuning of their parameters [5], and on the most suitable metrics of evaluation [6]. On the basis of the results offered by the Credit Scoring techniques it is possible to predict when an application (e.g., for a loan) potentially leads towards a risk of partial or total non-repayment [7].

Regardless of the adopted technique, there are a number of problems that complicate such tasks. The imbalanced class distribution of the data is the most important of them: the previous instances collected by a Credit Scoring system to train its evaluation model contain a large number of reliable cases compared to the number of unreliable ones. This reduces the effectiveness of Credit Scoring techniques [8].

Another problem to face is data heterogeneity, which in the literature is described as the incompatibility among similar features, resulting in the same data being represented differently in different datasets [9].

The cold start problem instead happens when the previous instances are not representative of both classes of information (reliable and unreliable), preventing the definition of an effective evaluation model [10].

The proposed approach is mainly based on the Discrete Wavelet Transformation (DWT) process [11]. Such a process exploits wavelets, which in the literature are usually employed to reduce the size or the noise of data (e.g., in image compression and filtering tasks). Wavelets are mathematical functions that work by decomposing the input data into different frequencies at different scales. The input of a DWT process is usually a time series [12], i.e., a sequence of values obtained by measuring the variation over time of a specific type of data (e.g., voltage, temperature, etc.). The output is a new representation of the data in a frequency-time domain (a representation in terms of both frequency and time). In our case, the time series used as input of the DWT process are the values assumed by the instance features.

The frequency-time domain offers some interesting advantages, the most important of which is the Multi-Resolution Analysis operated by the DWT, which allows us to observe the data at different levels of resolution [13], with the possibility of obtaining an approximated or detailed view of them. Our approach exploits this in order to obtain a better characterization of the reliable and unreliable instances.

Equation 1 shows the formalization of the Continuous Wavelet Transform (CWT), where Ψ(t) is the mother wavelet (i.e., a continuous function in both the time and frequency domains) and * denotes the complex conjugate.
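The equation itself is not reproduced in this copy; the standard form of the CWT, which is presumably what Equation 1 reported, is the following, where x(t) is the input signal, a the scale, and b the translation parameter:

```latex
X_w(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} x(t)\, \Psi^{*}\!\left(\frac{t-b}{a}\right) dt
```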

Considering that, for several reasons (e.g., the computational load), it is not possible to perform the data analysis by using all the wavelet coefficients, a common approach is to use a discrete subset of the upper half-plane, chosen so that the original data can be rebuilt from the corresponding wavelet coefficients. Such a discrete subset is composed of the points obtained by discretizing the scale and translation parameters, and on this basis we can formalize the child wavelets (Equation 2).
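The discretization and the child wavelets are also missing from this copy; the standard formalization, which Equation 2 presumably followed, takes the points (a^m, n·b·a^m) with integers m and n, scale a > 1, and translation step b > 0, and defines the child wavelets as:

```latex
\Psi_{m,n}(t) = a^{-m/2}\, \Psi\!\left(a^{-m} t - n b\right), \qquad m, n \in \mathbb{Z}
```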

Compressing the wavelet (i.e., using small scales) is equivalent to using high frequencies (since the scale is proportional to 1/frequency) and allows us to observe the rapidly changing details of the data (detailed data view). Conversely, stretching the wavelet (i.e., using large scales) is equivalent to using low frequencies and provides an overview of the data (approximated data view).

A number of functions can be used as the mother wavelet (e.g., Haar, Daubechies, Symlets, Meyer, Coiflets, etc.), but for the objectives of this paper we take into account only the Haar wavelet [13]. It is a sequence of rescaled square-shaped functions which represent a wavelet family, based on the mother Ψ and father φ (scaling) functions shown in Equation 3.
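Equation 3 is not reproduced in this copy; the standard definitions of the Haar mother and father (scaling) functions, which it presumably reported, are:

```latex
\Psi(t) =
\begin{cases}
 1  & 0 \le t < \tfrac{1}{2} \\
-1  & \tfrac{1}{2} \le t < 1 \\
 0  & \text{otherwise}
\end{cases}
\qquad
\varphi(t) =
\begin{cases}
 1 & 0 \le t < 1 \\
 0 & \text{otherwise}
\end{cases}
```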

The proposed Centroid Wavelet-based Approach compares the instances in the new time-frequency domain through three metrics of similarity, aimed at evaluating different aspects of the instances.

III. FORMAL NOTATION

Given a set I of classified instances, where each instance i ∈ I is composed of a set F of features, we denote as I+ ⊆ I the subset of reliable instances, as I- ⊆ I the subset of unreliable ones, and as C = {reliable, unreliable} the set of possible instance classifications. It should be noted that an instance can belong to only one class c ∈ C. We also denote the set of unclassified instances to evaluate, as well as the same instances after the classification process. Finally, we denote as TS the time series used as input of the DWT process and as O the set of its outputs.

IV. PROPOSED APPROACH

The proposed approach has been implemented through the three steps listed and explained in the following:

- Input Data Definition: definition of the time series to use in the DWT process, made by using the sequence of values assumed by each single feature of an instance;

- Output Data Generation: generation of the new data representation in the frequency-time domain, by processing the time series through the DWT process;

- Instance Classification: formalization of an algorithm based on our Centroid Wavelet-based Approach (CWA), able to classify a new instance as reliable or unreliable on the basis of three different metrics of similarity.

A. Input Data Definition

The first step is aimed at preparing the time series to use as input of the DWT process. It is performed by using the sequence of values assumed by the features of each instance; we consider the features of all instances, i.e., those in the subset I+ of previous reliable instances, those in the subset I- of previous unreliable instances, and those in the set of instances to evaluate.
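As an illustration only, a minimal Java sketch of how such time series could be built from the feature values of the instances (the class and method names, and the array-based data layout, are our assumptions and not the paper's code):

```java
import java.util.Arrays;

// Hypothetical helper: for each instance (a row of feature values), the time
// series used as DWT input is the ordered sequence of its feature values.
public final class InputDataDefinition {

    public static double[][] buildTimeSeries(double[][] instances) {
        double[][] series = new double[instances.length][];
        for (int k = 0; k < instances.length; k++) {
            // One time series per instance: the values of its features, in order.
            series[k] = Arrays.copyOf(instances[k], instances[k].length);
        }
        return series;
    }
}
```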

Figure 1. Evaluation Criterion

B. Output Data Generation

In the second step we perform the DWT process, using as input the time series related to both the previous instances (I+ and I-) and the instances to evaluate.

Our Credit Scoring approach exploits two wavelet properties. The first one is dimensionality reduction: the DWT process reduces the dimensionality of a time series by performing an orthonormal transformation that still allows us to recover the original data; this can be exploited in order to reduce the computational load. The second one is multiresolution analysis: the DWT process allows us to analyze the data from an approximated or detailed point of view; this can be exploited in order to define a model of representation of the instances that is more or less affected by data variation.

The proposed approach exploits these two properties by transforming the original instance data through the Haar wavelet, preferring the N/2 approximation coefficients in order to obtain a more stable model, able to face the data heterogeneity issue. The result of each instance transformation is the set O.
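The paper performs the transformations with the JWave library; only as an illustration, a minimal sketch of a single-level Haar DWT step (pair-wise averages and differences; the 1/√2 orthonormal scaling and the even-length assumption are ours) could look as follows:

```java
// Illustrative single-level Haar DWT of a time series of even length N:
// the first N/2 outputs are the approximation coefficients (pair-wise averages),
// the last N/2 outputs are the detail coefficients (pair-wise differences).
public static double[] haarStep(double[] ts) {
    int half = ts.length / 2;
    double[] out = new double[ts.length];
    double norm = Math.sqrt(2.0); // orthonormal scaling (assumed)
    for (int k = 0; k < half; k++) {
        out[k]        = (ts[2 * k] + ts[2 * k + 1]) / norm; // approximation part
        out[half + k] = (ts[2 * k] - ts[2 * k + 1]) / norm; // detail part
    }
    return out;
}
```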

C. Instance Classification

Each unevaluated instance is classified after a comparison process performed between it and all the instances in the training set (i.e., the reliable ones in the subset I+ and the unreliable ones in the subset I-). Such a comparison is performed in terms of the three metrics described in detail later, i.e., the Cosine Similarity (Ө), the Features Weighted Sequence Similarity (Φ), and the Normalized Magnitude Similarity (μ). Given that the radius ρ is a value experimentally defined in Section V, we classify a new instance by adopting the following criteria:

- first we define a centroid + (see Figure 1), using as coordinates of the x and y axes, respectively, max(Ө)-ρ and max(Φ)-ρ, where max(Ө) stands for the maximum value of the Cosine Similarity and max(Φ) stands for the maximum value of the Features Weighted Sequence Similarity, both calculated between the instance to evaluate and all the instances in the training set;

- the classification is then based on the type and magnitude of the instances bounded by the circular area of radius ρ centered in + (the circular area of Figure 1, where the size of each case represents the Normalized Magnitude Similarity μ of the instance);

- a new instance is classified as reliable if the weight (in terms of μ) of the reliable instances within the radius ρ is greater than that of the unreliable ones; otherwise it is classified as unreliable.

Figure 1 shows a case where the instance under evaluation is classified as unreliable, since the sum of the weights of the three unreliable instances within the radius is greater than that of the two reliable ones. Following a prudential criterion, we also classify a new instance as unreliable when the sum of the weights of the reliable instances within the radius is equal to that of the unreliable ones.

1) Metrics

This section describes the three metrics used in our approach.

Cosine Similarity: The Cosine Similarity (Ө) metric measures the similarity between two non-zero vectors v1 and v2. More formally, given two vectors v1 and v2 of attributes, it is expressed in terms of their dot product and magnitudes, as shown in Equation 4. We normalize the result into the range [0,1], where 0 indicates two completely different vectors, 1 indicates two equal vectors, and the intermediate values indicate different levels of similarity between the two vectors.
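Equation 4 is not reproduced in this copy; the canonical cosine similarity, which it presumably reported, is the following (the rescaling into [0,1] shown alongside is our assumption about how the normalization mentioned in the text could be obtained):

```latex
\Theta(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert},
\qquad
\Theta_{[0,1]} = \frac{1 + \Theta(v_1, v_2)}{2}
```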

Features Weighted Sequence Similarity: The Features Weighted Sequence Similarity (Φ) is not a canonical metric, since it was defined in the context of this paper in order to evaluate the similarity between instances in terms of the weighted sequence of the features that compose them. The idea behind this metric is that similar instances present similar sequences of features, i.e., if we sort the features of two instances on the basis of their values, the sequences of their indexes will be similar in terms of cosine similarity Ө. More formally, given two instances i(1) and i(2), we calculate Φ as shown in Equation 5, where TS(1) and TS(2) are the time series of the instances to compare and the function idx returns the sorted time series TS in terms of the former element indexes (i.e., the indexes of the TS elements before sorting).
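Equation 5 is not reproduced in this copy; based on the description above, a plausible reconstruction of it is the cosine similarity computed between the two index sequences:

```latex
\Phi\left(i^{(1)}, i^{(2)}\right) = \Theta\!\left(idx\!\left(TS^{(1)}\right),\; idx\!\left(TS^{(2)}\right)\right)
```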

Normalized Magnitude Similarity: The Normalized Magnitude Similarity (μ) is also a non-canonical metric, defined in the context of this paper with the aim of evaluating the difference in terms of magnitude between two instances i(1) and i(2). It is measured by taking into account their DWT outputs O(1) and O(2), and it is calculated as shown in Equation 6, where max(Δ) is the maximum value assumed by Δ over all the instances in the training set I.
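Equation 6 is also missing from this copy; one plausible form consistent with the description, in which Δ denotes the absolute difference between the magnitudes of the two DWT outputs, is:

```latex
\Delta\left(i^{(1)}, i^{(2)}\right) = \left|\, \lVert O^{(1)} \rVert - \lVert O^{(2)} \rVert \,\right|,
\qquad
\mu\left(i^{(1)}, i^{(2)}\right) = 1 - \frac{\Delta\left(i^{(1)}, i^{(2)}\right)}{max(\Delta)}
```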

2) Algorithm

The classification procedure (Algorithm 1) takes as input the set of previous instances I, an instance to evaluate, and the radius value ρ, and returns as output a boolean value class that indicates the classification of the instance (i.e., true=reliable or false=unreliable).

Algorithm 1. Classification Algorithm
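The pseudocode of Algorithm 1 is not reproduced in this copy; the following self-contained Java sketch illustrates the decision rule described above under our assumptions (the class and method names, the concrete forms of the three metrics, and the array-based data layout are ours, not the paper's implementation):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical sketch of the CWA decision rule: the centroid is placed at
// (max(theta)-rho, max(phi)-rho) and the mu-weights of the training instances
// falling inside the circle of radius rho decide the class of the new instance.
public final class CwaSketch {

    // Cosine similarity rescaled into [0,1] (assumed normalization of Equation 4).
    static double cosineSim(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int k = 0; k < a.length; k++) { dot += a[k] * b[k]; na += a[k] * a[k]; nb += b[k] * b[k]; }
        return (1.0 + dot / (Math.sqrt(na) * Math.sqrt(nb))) / 2.0;
    }

    // Assumed form of Equation 5: cosine similarity between the index sequences
    // obtained by sorting the two series by value.
    static double fwss(double[] a, double[] b) { return cosineSim(idx(a), idx(b)); }

    static double[] idx(double[] ts) {
        Integer[] order = IntStream.range(0, ts.length).boxed().toArray(Integer[]::new);
        Arrays.sort(order, (p, q) -> Double.compare(ts[p], ts[q]));
        double[] out = new double[ts.length];
        for (int k = 0; k < ts.length; k++) out[k] = order[k];
        return out;
    }

    // Assumed form of Equation 6: magnitude difference between two DWT outputs.
    static double magDelta(double[] a, double[] b) {
        double na = 0, nb = 0;
        for (double v : a) na += v * v;
        for (double v : b) nb += v * v;
        return Math.abs(Math.sqrt(na) - Math.sqrt(nb));
    }

    public static boolean classify(double[][] trainDwt, boolean[] trainReliable,
                                   double[] newDwt, double rho) {
        int n = trainDwt.length;
        double[] theta = new double[n], phi = new double[n];
        double maxTheta = 0, maxPhi = 0, maxDelta = 1e-12; // guard against zero
        for (int k = 0; k < n; k++) {
            theta[k] = cosineSim(newDwt, trainDwt[k]);
            phi[k]   = fwss(newDwt, trainDwt[k]);
            maxTheta = Math.max(maxTheta, theta[k]);
            maxPhi   = Math.max(maxPhi, phi[k]);
            maxDelta = Math.max(maxDelta, magDelta(newDwt, trainDwt[k]));
        }
        double cx = maxTheta - rho, cy = maxPhi - rho; // centroid coordinates
        double reliableWeight = 0, unreliableWeight = 0;
        for (int k = 0; k < n; k++) {
            double dx = theta[k] - cx, dy = phi[k] - cy;
            if (Math.sqrt(dx * dx + dy * dy) <= rho) { // instance inside the circle
                double mu = 1.0 - magDelta(newDwt, trainDwt[k]) / maxDelta; // mu weight
                if (trainReliable[k]) reliableWeight += mu; else unreliableWeight += mu;
            }
        }
        // Prudential criterion: ties are classified as unreliable.
        return reliableWeight > unreliableWeight;
    }
}
```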

V. EXPERIMENTS

Our approach has been developed in Java, using the JWave1 library to perform the Discrete Wavelet Transformations and the WEKA2 library to implement the state-of-the-art competitor used to evaluate its performance (i.e., Random Forests).

A. Datasets

The real-world datasets used to evaluate our approach are freely downloadable from the UCI Repository of Machine Learning Databases3. They represent three benchmarks in the Credit Scoring field, allowing us to evaluate our approach in different data scenarios, both in terms of number of instances and of class balancing.

German Credit Data (GCD): This dataset contains the classification of people as reliable or unreliable in terms of credit risk. We used the numerical version, which is composed of 1,000 instances: 700 classified as reliable (70.0%) and 300 classified as unreliable (30.0%).

1https://github.com/cscheiblich/JWave/

2http://www.cs.waikato.ac.nz/ml/weka/

3ftp://ftp.ics.uci.edu/pub/machine-learning-databases/statlog

Each instance is defined by 24 features and a binary class variable (reliable or unreliable).

Australian Credit Approval (ACA): This dataset contains credit card applications classified as reliable or unreliable on the basis of their final outcome. We used the numerical version, which is composed of 690 instances: 307 classified as reliable (44.5%) and 383 classified as unreliable (55.5%). Each instance is defined by 14 features and a binary class variable (reliable or unreliable).

Japanese Credit Screening (JCS): This dataset contains instances of people classified as reliable (granted credit) or unreliable (credit not granted). We used the numerical version, which is composed of 125 instances: 85 classified as reliable (68.0%) and 40 classified as unreliable (32.0%). Each instance is defined by 16 features and a binary class variable (reliable or unreliable).

B. Strategy

All the experiments have been performed by adopting the k-fold cross-validation criterion, with k=10, in order to reduce the impact of data dependency and improve the reliability of the obtained results. In more detail, each dataset has been divided into k subsets, and each subset has been used in turn as test set, while the other k-1 subsets have been used as training set. The final result is given by the average of all the obtained results.

The final results have been verified for the existence of a statistically significant difference between them, by using independent-samples two-tailed Student's t-tests (p<0.05).

Given that Algorithm 1 needs as input a radius value ρ, it has been experimentally determined for each dataset by testing a large range of values in the context of the training set I, choosing as final value the one that leads to the best performance in terms of F-measure. The results indicate as optimal ρ values 0.967 for the GCD dataset, 0.849 for the ACA dataset, and 0.789 for the JCS dataset, since these values maximize the F-measure.

C. Competitor

Although the literature indicates Random Forests as one of the best performing approaches for Credit Scoring [4], we carried out a preliminary study aimed at comparing the AUC performance of ten binary classification approaches (i.e., Naive Bayes, Logistic Regression, Multilayer Perceptron, Random Tree, Decision Tree, LogitBoost, SGD, Voted Perceptron, K-nearest neighbors, and Random Forests), adopting the same cross-validation criterion used for the other experiments. The results indicate that Random Forests outperforms all the other approaches (GCD=0.79, ACA=0.93, JCS=0.97), so it is the only one we compare against.

We have also performed a series of experiments aimed at optimizing the Random Forests performance. After configuring the maximum depth of the tree parameter as unlimited, we tested different values of the number of randomly chosen attributes (nrca) parameter. In order to avoid overfitting, the tuning process has been performed by using both the training and the test sets, and also in this case we adopted the same cross-validation criterion used for the other experiments. The results indicate as optimal nrca values 22 for the GCD dataset, 7 for the ACA dataset, and 15 for the JCS dataset, since these values maximize the F-measure.

D. Results

The analysis of the results shown in Figure 2, where the performance of our Centroid Wavelet-based Approach (CWA) is compared to that of its competitor Random Forests (RF) in terms of F-measure and Area Under the ROC Curve (AUC), leads to the following considerations.

Figure 2.a shows that our CWA approach outperforms its competitor RF in terms of F-measure on all the datasets, regardless of their size and their level of imbalance. This indicates its capability to classify the instances correctly, both with regard to the number of performed classifications and to the number of classifications that should have been made. Such a result underlines two aspects: the first is the better performance achieved, and the second is its consistency. In fact, our approach outperforms RF on all the datasets and its level of performance does not vary much (both in terms of quality and range), differently from its competitor.

Figure 2.b shows that our CWA approach reaches AUC performance similar to (i.e., JCS dataset) or higher than (i.e., GCD and ACA datasets) that of its competitor RF, regardless of the size of the data and their level of imbalance. The AUC metric measures the effectiveness of the evaluation model, and the results indicate that our model outperforms or matches that of its competitor on all datasets. It should also be noted that it obtains its best performance in a typical real-world scenario (i.e., the GCD dataset), characterized by many instances and a high degree of imbalance.

Additional considerations on the dimensionality reduction: In our DWT approach we have exploited only one of the two wavelet properties previously introduced (i.e., the multiresolution analysis), in order to mitigate the data heterogeneity issue through the pair-wise average and directed distances operations made by the Haar wavelet function. Now we want to introduce the possibility of exploiting the second property (i.e., the dimensionality reduction) in order to reduce the computational complexity without a significant performance decay. Indeed, we can obtain a substantial curtailment of the processed elements (i.e., |F|∙|I|) by taking into account only the pair-wise average part of the Haar wavelet function output (small differences in the values are to be attributed to the transformations needed to adapt the data to this operation).

The results in terms of AUC performance show a reduction of the processed elements of 48.0%, 46.0%, and 47.0% (respectively, in the GCD, ACA, and JCS datasets), without any significant performance decay; in some cases we even obtained better results.

Figure 2. F-measure and AUC Performance

It should be added that we obtain a similar result by using the directed distances part of the output instead of the pair-wise average one.

VI. CONCLUSIONS AND FUTURE WORK

The Credit Scoring approach proposed in this paper is based on a threefold assessment of similarity carried out in the frequency-time domain offered by the Discrete Wavelet Transformation process. The data analysis performed in this non-canonical domain through three metrics of similarity has led to a better capability to characterize the user instances in their correct class (i.e., reliable or unreliable). Experimental results proved the effectiveness of such an approach on three datasets, where it outperforms its state-of-the-art competitor in typical real-world scenarios characterized by a considerable number of instances, regardless of their data distribution.

Future work will be oriented to experimenting with additional metrics of similarity, as well as with other wavelet functions in the Discrete Wavelet Transformation process, with the aim of further improving the effectiveness of the classification model. Another interesting direction would be the experimentation of the proposed approach in scenarios other than Credit Scoring.

ACKNOWLEDGMENT

This research is partially funded by Regione Sardegna under project Next generation Open Mobile Apps Development (NOMAD), Pacchetti Integrati di Agevolazione (PIA) - Industria Artigianato e Servizi - Annualità 2013.

REFERENCES

[1] W. E. Henley, “Statistical aspects of credit scoring,” Ph.D. dissertation, Open University, 1994.

[2] P. Chaovalit, A. Gangopadhyay, G. Karabatis, and Z. Chen, “Discrete wavelet transform-based time series analysis and mining,” ACM Comput. Surv., vol. 43, no. 2, pp. 6:1–6:37, 2011.

[3] M. Doumpos and C. Zopounidis, “Credit scoring,” in Multicriteria Analysis in Finance. Springer, 2014, pp. 43–59.

[4] S. Lessmann, B. Baesens, H. Seow, and L. C. Thomas, “Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research,” European Journal of Operational Research, vol. 247, no. 1, pp. 124–136, 2015.

[5] S. Ali and K. A. Smith, “On learning algorithm selection for classification,” Appl. Soft Comput., vol. 6, no. 2, pp. 119–138, 2006.

[6] D. J. Hand, “Measuring classifier performance: a coherent alternative to the area under the ROC curve,” Machine Learning, vol. 77, no. 1, pp. 103–123, 2009.

[7] M. Siami, Z. Hajimohammadi et al., “Credit scoring in banks and financial institutions via data mining techniques: A literature review,” Journal of AI and Data Mining, vol. 1, no. 2, pp. 119–129, 2013.

[8] N. Japkowicz and S. Stephen, “The class imbalance problem: A systematic study,” Intell. Data Anal., vol. 6, no. 5, pp. 429–449, 2002.

[9] A. Chatterjee and A. Segev, “Data manipulation in heterogeneous databases,” ACM SIGMOD Record, vol. 20, no. 4, pp. 64–68, 1991.

[10] J. Attenberg and F. J. Provost, “Inactive learning?: difficulties employing active learning in practice,” SIGKDD Explorations, vol. 12, no. 2, pp. 36–41, 2010.

[11] D. B. Percival and A. T. Walden, Wavelet Methods for Time Series Analysis. Cambridge University Press, 2006, vol. 4.

[12] R. Agrawal, C. Faloutsos, and A. N. Swami, “Efficient similarity search in sequence databases,” in Foundations of Data Organization and Algorithms, 4th International Conference, FODO’93, Chicago, Illinois, USA, October 13-15, 1993, Proceedings, ser. Lecture Notes in Computer Science, D. B. Lomet, Ed., vol. 730. Springer, 1993, pp. 69–84.

[13] S. G. Mallat, “A theory for multiresolution signal decomposition: the wavelet representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 674–693, 1989.