All content in this area was uploaded by Nikolay Nikitin on Sep 06, 2018
Evolutionary ensemble approach for behavioral credit scoring
Nikolay O. Nikitin1, Anna V. Kalyuzhnaya1, Klavdiya Bochenina1, Alexander A.
Kudryashov1, Amir Uteuov1, Ivan Derevitskii1, Alexander V. Boukhanovsky1
1ITMO University, 49 Kronverksky Pr. St. Petersburg, 197101, Russian Federation
Abstract. This paper examines the quality of scoring models that can be achieved
using not only application form data but also behavioral data extracted from
transactional datasets. Several model types and different ensemble configurations
were analyzed in a set of experiments. Another aim of the research is to demonstrate
the effectiveness of evolutionary optimization of the ensemble structure and to use
it to increase the quality of default prediction. Example results are presented for
models of borrower default prediction trained on a set of features (purchase amount,
location, merchant category) extracted from a transactional dataset of bank customers.
Keywords: Credit scoring, credit risk modeling, financial behavior, ensemble
modeling, evolutionary algorithms
1. Introduction
Scoring tasks and the associated scoring models vary widely depending on the application
area and objectives. For example, application form-based scoring is used by lenders to
decide which credit applicants are good or bad. Collection scoring techniques are used
to segment defaulted borrowers in order to optimize debt recovery, and the profit
scoring approach is used to estimate the profit on a specific credit product.
In this work, we consider the scoring prediction problem for behavioral data in several
aspects. First, a set of experiments was conducted to determine the potential quality
of default prediction using different types of scoring models on the behavioral
dataset. Then, the impact of the evolutionary approach on the quality of an ensemble of
different models, achieved by optimizing its structure, was analyzed in comparison
with the un-optimized ensemble.
The paper continues in Section 2 with a review of works in the same domain. In Section 3
we introduce the problem statement and the approaches to the scoring task. Section 4
describes the dataset used as a case study and presents the conducted experiments. In
Section 5 we summarize the results; the conclusion and future ways of improving the
scoring model are given in Section 6.
2. Related work
The credit scoring problem is usually considered within the framework of supervised
learning. Thus, all common machine learning methods are used to deal with it: Bayesian
methods, logistic regression, neural networks, k-nearest neighbors, etc. (reviews of
popular existing methods for credit scoring problems are available in the literature).
A good and yet relatively simple way to improve the predictive power of a machine
learning model is to use ensemble methods. The key idea of this approach is to train
different estimators (possibly on different regions of the input space) and then
combine their predictions. A comparison of three common ensemble techniques (bagging,
boosting and stacking) on credit scoring datasets reported stacking and bagging on
decision trees as the two best ensemble techniques. The Kaggle platform for machine
learning competitions published a practical review of ensemble methods, illustrated on
real-world problems.
The current trend appears to be the enrichment of primary application features with
information about the dynamics of financial and social behavior extracted from bank
transactional databases and open data sources. According to some studies, the use of
transactional data allows the quality of scoring prediction to be increased significantly.
It is worth mentioning that the vast majority of studies on credit scoring models focus
on the resulting quality of classification and do not study in detail the possible
effect of optimizing the structure of the model ensemble. In contrast, in this paper we
aim to investigate how good the prediction for behavioral data can be and how much the
ensemble structure and parameters can be improved by evolutionary optimization to
increase the reliability of the scoring prediction.
3. Problem statement and approaches for behavioral scoring
A widely used approach to making a credit-granting (underwriting) decision is
application form-based scoring. It is based on demographic and static features such as
age, gender, education, employment history, financial statements and credit history.
Application form data make it possible to build a sufficiently effective scoring model,
but this approach is not possible in some cases. For example, a pre-approved credit card
offer to debit card clients can be based only on a limited behavioral dataset that the
bank can extract from the transactional history of a customer.
3.1 Predictive models for scoring task
The credit default prediction result is binary, so two-class classification algorithms
are potentially applicable to this task. The set of behavioral characteristics of every
client can be used as predictor variables, and the default flag as the response
variable. Several methods were chosen to build an ensemble: a K-nearest neighbors
classifier, classifiers with linear (LDA) and quadratic (QDA) separating surfaces, a
feed-forward neural network with a single hidden layer, an XGBoost-based predictive
model and a Random Forest classifier.
An ensemble approach to the scoring model makes it possible to combine the advantages of
different classifiers and to improve the quality of default prediction. The probability
vectors obtained at the output of each model included in the ensemble are used as the
input of a metamodel, constructed using an algorithm that maximizes the quality metric
and generates the optimal set of ensemble weights. Since the total number of model
combinations is 2^7, we implement the evolutionary algorithm using the DEoptim package,
which performs fast multidimensional optimization in the weight space using the
differential evolution algorithm.
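The weight search described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: it is a pure-Python DE/rand/1/bin loop rather than the DEoptim package, it assumes the metamodel is a weighted average of the model probabilities, and the function names (`auc`, `blend`, `optimize_weights`), the control parameters and the toy data are all hypothetical.

```python
import random

def auc(labels, scores):
    """Share of (bad, good) pairs where the bad profile gets the higher score; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def blend(weights, model_probs):
    """Weighted average of the per-model probability vectors (assumed metamodel)."""
    total = sum(weights) or 1.0  # avoid division by zero for an all-zero weight vector
    n = len(model_probs[0])
    return [sum(w * probs[i] for w, probs in zip(weights, model_probs)) / total
            for i in range(n)]

def optimize_weights(labels, model_probs, pop_size=20, generations=30,
                     f=0.8, cr=0.9, seed=0):
    """Differential evolution (rand/1/bin) over weights in [0, 1], maximizing AUC."""
    rng = random.Random(seed)
    dim = len(model_probs)
    fitness = lambda w: auc(labels, blend(w, model_probs))
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    pop[0] = [1.0] * dim  # start from the equal-weight blend as one candidate
    scores = [fitness(w) for w in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate comes from the mutant
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == forced:
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    trial.append(min(1.0, max(0.0, v)))  # clip to the weight bounds
                else:
                    trial.append(pop[i][d])
            s = fitness(trial)
            if s >= scores[i]:  # greedy selection: keep the trial if it is no worse
                pop[i], scores[i] = trial, s
    best = max(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

# Toy data: 20 defaults among 100 profiles, three models with different noise levels.
data_rng = random.Random(1)
labels = [1] * 20 + [0] * 80
model_probs = [
    [min(1.0, max(0.0, 0.3 + 0.4 * y + data_rng.gauss(0, sigma))) for y in labels]
    for sigma in (0.2, 0.3, 0.6)
]
equal_auc = auc(labels, blend([1.0, 1.0, 1.0], model_probs))
best_weights, best_auc = optimize_weights(labels, model_probs)
print(round(equal_auc, 3), round(best_auc, 3), [round(w, 2) for w in best_weights])
```

Because the equal-weight blend is placed in the initial population and selection never accepts a worse candidate, the optimized AUC on the fitting data cannot fall below that of the simple blend; whether the gain carries over to held-out data has to be checked separately.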
3.2 Metrics for quality estimation of classification-based models
The standard accuracy metric is not suitable for the scoring task because the sample is
highly unbalanced, with many “good” profiles (>97%) and a small number of “bad” ones.
Therefore, AUC, a threshold-independent metric equal to the area under the receiver
operating characteristic (ROC) curve, which describes the diagnostic ability of a binary
classifier, was selected to compare models.
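A small worked example may help to show why accuracy is misleading here. The sketch below assumes a toy sample with 3% “bad” profiles and compares the accuracy of a classifier that labels everyone “good” with the AUC of a constant score; the function names are hypothetical, and the AUC is computed with the standard rank-sum (Mann-Whitney) formula.

```python
def accuracy(labels, predictions):
    """Share of profiles whose predicted class equals the true class."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formula, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels = [1] * 3 + [0] * 97        # 1 = "bad" (default), 0 = "good"
always_good = [0] * 100            # a classifier that never predicts default
constant_scores = [0.0] * 100      # the same classifier expressed as a score

print(accuracy(labels, always_good))      # 0.97
print(roc_auc(labels, constant_scores))   # 0.5
```

The useless classifier reaches 97% accuracy, while its AUC of 0.5 correctly signals that it cannot rank defaulters above non-defaulters.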
The Kolmogorov-Smirnov (KS) statistic allows the value of the threshold to be estimated
by comparing the probability distributions of the original and predicted datasets. The
probability that corresponds to the maximum of the KS coefficient can be chosen as
optimal in general.
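As an illustration of how a threshold can be read off the KS statistic, the sketch below uses the formulation common in credit scoring: the largest gap between the empirical cumulative distributions of predicted scores for good and bad profiles. Whether the authors computed KS in exactly this way is an assumption; the function name `ks_threshold` and the toy scores are hypothetical.

```python
def ks_threshold(labels, scores):
    """Return (ks, threshold): the largest gap between the empirical CDFs of the
    scores of good (0) and bad (1) profiles, and the score at which it occurs."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    pairs = sorted(zip(scores, labels))
    seen_bad = seen_good = 0
    best_ks, best_t = 0.0, None
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        # Consume every observation tied at score t before comparing the CDFs.
        while i < len(pairs) and pairs[i][0] == t:
            seen_bad += pairs[i][1]
            seen_good += 1 - pairs[i][1]
            i += 1
        gap = abs(seen_good / n_good - seen_bad / n_bad)
        if gap > best_ks:
            best_ks, best_t = gap, t
    return best_ks, best_t

labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.7, 0.9]
print(ks_threshold(labels, scores))  # (0.8, 0.4)
```

In this toy case, flagging every profile with a predicted probability above 0.4 separates the two classes best: 80% of good profiles fall at or below the threshold and none of the bad ones do.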
4. Case study
4.1 Transactional dataset
To run the experiments with different configurations of the scoring model, a fully
depersonalized transactional dataset was used. We obtained it for research purposes from
one of the major banks in Russia. The dataset contains details of more than 10M
anonymized transactions made by cardholders before they applied for a credit card and
were accepted by the bank’s application underwriting procedure. The transactions cover
the period from January 1, 2014 to December 31, 2016.
Each entity in the dataset is assigned an indicator variable of default, which
corresponds to a payment delinquency of 90 or more days. The delinquency rate for the
profiles in this dataset is 3.02%.
The parameters of transactions included in the dataset and the summary of behavioral
profile parameters that can be used as predictor variables in scoring models are
presented in Table 1.
Table 1. Variables from the transactional dataset

Transaction parameters | Behavioral profile parameters
IDs of client and contract, date of the transaction, date of contract signing | The numbers of actual and closed contracts
Amount of transaction (in roubles), the location of the terminal used for operation (if known), transaction type (payment/cash ...), merchant category code (if known) | Common frequency and quantitative characteristics of transactions; merchant category-specific characteristics of transactions
Address of payment terminal | Spatial-based characteristics of transactions
Binary flag of default |
These data make it possible to describe the profiles of bank clients as a set of derived
parameters that characterize their financial behavior pattern and are obtained from the
structure of their transactions. Also, the date variable can be used to take
macroeconomic variability into account.
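The derivation of a behavioral profile from raw transactions can be sketched as a simple aggregation per client. The example below is illustrative only: the field names (`client_id`, `amount`, `mcc`, `type`, `city`), the chosen features and the sample records are hypothetical and are not the variables of the bank dataset.

```python
from collections import defaultdict

def build_profiles(transactions):
    """Aggregate raw transactions into one dictionary of features per client."""
    grouped = defaultdict(list)
    for tx in transactions:
        grouped[tx["client_id"]].append(tx)

    profiles = {}
    for client_id, txs in grouped.items():
        total = sum(tx["amount"] for tx in txs)
        mcc_totals = defaultdict(float)
        for tx in txs:
            mcc_totals[tx["mcc"]] += tx["amount"]
        profiles[client_id] = {
            # frequency and quantitative characteristics
            "tx_count": len(txs),
            "total_amount": total,
            "mean_amount": total / len(txs),
            "cash_share": sum(tx["type"] == "cash" for tx in txs) / len(txs),
            # merchant category-specific characteristic
            "top_mcc": max(mcc_totals, key=mcc_totals.get),
            # spatial characteristic
            "distinct_cities": len({tx["city"] for tx in txs}),
        }
    return profiles

transactions = [
    {"client_id": 1, "amount": 500.0, "mcc": 5411, "type": "payment", "city": "A"},
    {"client_id": 1, "amount": 1500.0, "mcc": 5812, "type": "payment", "city": "B"},
    {"client_id": 1, "amount": 2000.0, "mcc": 6011, "type": "cash", "city": "A"},
    {"client_id": 2, "amount": 300.0, "mcc": 5411, "type": "payment", "city": "A"},
]
profiles = build_profiles(transactions)
print(profiles[1])
```

Each resulting dictionary plays the role of one row of predictor variables, to which the default flag is attached as the response.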
Since some profile variables have a lognormal distribution, a logarithmic transformation
was applied to values bounded on one side, followed by scaling to the [0,1] range.
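A minimal sketch of this preprocessing step is given below. It assumes non-negative inputs and uses log(1 + x) so that zero values stay defined; the exact form of the logarithm used in the study is not stated, so this choice and the function name `log_minmax` are assumptions.

```python
import math

def log_minmax(values):
    """Apply log(1 + x) to non-negative values, then rescale linearly to [0, 1]."""
    logged = [math.log1p(v) for v in values]
    lo, hi = min(logged), max(logged)
    if hi == lo:  # constant column: nothing to scale
        return [0.0 for _ in logged]
    return [(x - lo) / (hi - lo) for x in logged]

amounts = [0, 9, 99, 999]
print([round(x, 3) for x in log_minmax(amounts)])  # [0.0, 0.333, 0.667, 1.0]
```

After the transformation the four amounts, which span three orders of magnitude, are evenly spaced on the unit interval, which is the intended effect for heavy-tailed variables.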
4.2 Evolutionary ensemble model
The comparison of performance for scoring models is presented in Fig. 1.
Fig. 1. The AUC and KS performance metrics for scoring models
The probability at which the Kolmogorov-Smirnov coefficient reaches its maximum can be
interpreted as the optimal probability threshold for each model.
The ensemble of these models can have different configurations, and it is not necessary
to include all models in the ensemble. To measure the effect of every new model in the
ensemble, we conducted a set of experiments and compared the quality of the scoring
prediction for evolutionary-optimized ensembles with different structures (from 2 to 7
models, with a separate optimization procedure for every ensemble size). Logit
regression was chosen as the base model. The structure of the ensemble is presented in
Fig. 2a, and the summary plot of the results is shown in Fig. 2b.
Fig. 2. a) Structure of the ensemble of heterogeneous scoring models
b) Dependence of the quality metrics AUC and KS on the number of models in the optimized
configuration of the ensemble
It can be seen that the overall quality of the scoring prediction increases with the
number of models used, but the contribution of individual models is not equal: for
example, the neural network model does not enhance the ensemble quality.
The set of predictors of the ensemble scoring models includes variables with different
predictive power. Redundant variables make the development of an interpretable scorecard
difficult and can cause overfitting. Therefore, an evolutionary approach based on the
kofnGA algorithm was used to find an optimal subset of input variables. The results are
presented in Fig. 3.
Fig. 3. The convergence of the evolutionary algorithm with an AUC-based fitness function
Convergence is achieved in 4 generations with the experimentally determined settings: a
population size of 50 individuals, a mutation probability of 0.7 and a variable subset
size of 14. The optimal variable set contains 8 variables from the “MCC” group,
4 financial parameters and 2 geo parameters.
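The idea of searching for a fixed-size subset of variables with a genetic algorithm can be sketched as follows. This is not the kofnGA implementation: the operators (tournament selection, crossover by sampling from the union of the parents, single-swap mutation, elitism) are one simple way to keep the subset size fixed, the mutation probability is applied per individual, and the fitness function is a toy stand-in for the AUC of a model trained on the chosen variables. All names are hypothetical.

```python
import random

def select_subset(n, k, fitness, pop_size=50, generations=30,
                  mutation_prob=0.7, seed=0):
    """Search for a size-k subset of range(n) that maximizes fitness(subset)."""
    rng = random.Random(seed)
    pop = [frozenset(rng.sample(range(n), k)) for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        children = [best]  # elitism: the best subset survives unchanged
        while len(children) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # Crossover: draw k indices from the union, so the size stays fixed.
            child = set(rng.sample(sorted(p1 | p2), k))
            # Mutation: swap one chosen index for an index outside the subset.
            if rng.random() < mutation_prob:
                child.remove(rng.choice(sorted(child)))
                child.add(rng.choice(sorted(set(range(n)) - child)))
            children.append(frozenset(child))
        pop = children
        best = max(pop, key=fitness)
    return sorted(best)

# Toy stand-in for an AUC-based fitness: each variable has a fixed "usefulness".
toy_rng = random.Random(42)
usefulness = [toy_rng.random() for _ in range(30)]
fitness = lambda subset: sum(usefulness[i] for i in subset)

chosen = select_subset(n=30, k=14, fitness=fitness)
print(chosen, round(fitness(chosen), 3))
```

In a real experiment the fitness call would train and validate a scoring model on the selected columns, which makes each evaluation expensive; that cost is one reason to keep the population and the number of generations small.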
5. Results and discussions
The experimental results can be interpreted as confirmation of the effectiveness of the
evolutionary ensemble optimization approach. The summary of model results obtained with
10-fold cross-validation and a 70/30 train/test ratio is presented in Table 2.
Table 2. Summary of scoring models performance
It can be seen that the best single models provide similar quality of scoring
predictions. The simple “blended” ensemble with equal weights for every model does not
improve the final quality, but evolutionary optimization slightly increases the quality
of the scoring prediction. A limitation of the case study is the restricted access to
additional data (such as application forms), which is why the predictive ability of the
applied models cannot be fully realized.
6. Conclusion
The obtained results confirm that evolutionary-controlled ensembling of scoring models
increases the quality of default prediction. Nevertheless, it can also be seen that the
optimized ensemble only slightly improves on the result of the best single model
(XGBoost) and, moreover, that the results of all individual models are relatively close
to each other. This leads us to the conclusion that further improvement of the developed
model can be achieved by taking additional behavioral and non-behavioral factors into
account to raise the current quality threshold.
Acknowledgments. This research is financially supported by the Russian Science
Foundation, Agreement №17-71-30029, with co-financing of Bank Saint Petersburg.
References
1. Abdou H.A., Pointon J. Credit Scoring, Statistical Techniques and Evaluation Criteria:
A Review of the Literature // Intell. Syst. Accounting, Financ. Manag. 2011. Vol. 18,
№ 2–3. P. 59–88.
2. Ha S.H. Behavioral assessment of recoverable credit of retailer’s customers // Inf. Sci.
(Ny). 2010. Vol. 180, № 19. P. 3703–3717.
3. Serrano-Cinca C., Gutiérrez-Nieto B. The use of profit scoring as an alternative to credit
scoring systems in peer-to-peer (P2P) lending // Decis. Support Syst. 2016. Vol. 89. P.
4. Lessmann S. et al. Benchmarking state-of-the-art classification algorithms for credit
scoring: An update of research // Eur. J. Oper. Res. 2015. Vol. 247, № 1. P. 124–136.
5. Wang G. et al. A comparative assessment of ensemble learning for credit scoring //
Expert Syst. Appl. 2011. Vol. 38, № 1. P. 223–230.
6. Kaggle Ensembling Guide [Electronic resource].
7. Westley K., Theodore I. Transaction Scoring: Where Risk Meets Opportunity
8. Mullen K.M. et al. DEoptim: An R package for global optimization by differential
evolution. 2009.
9. Wolters M.A. A genetic algorithm for selection of fixed-size subsets with application
to design problems // J. Stat. Softw. 2015. Vol. 68, № 1. P. 1–18.