
Evolutionary Ensemble Approach for Behavioral Credit Scoring


Nikolay O. Nikitin¹, Anna V. Kalyuzhnaya¹, Klavdiya Bochenina¹, Alexander A. Kudryashov¹,
Amir Uteuov¹, Ivan Derevitskii¹, Alexander V. Boukhanovsky¹
¹ ITMO University, 49 Kronverksky Pr., St. Petersburg, 197101, Russian Federation
Abstract. This paper examines the quality that scoring models can achieve when they use
not only application form data but also behavioral data extracted from transactional
datasets. Several model types and different ensemble configurations were analyzed in a
set of experiments. Another aim of the research is to demonstrate the effectiveness of
evolutionary optimization of the ensemble structure and to use it to increase the quality
of default prediction. The results are illustrated with models for borrower default
prediction trained on a set of features (purchase amount, location, merchant category)
extracted from a transactional dataset of bank customers.
Keywords: Credit scoring, credit risk modeling, financial behavior, ensemble
modeling, evolutionary algorithms
1. Introduction
Scoring tasks and the associated scoring models vary widely depending on the application
area and objectives. For example, application form-based scoring [1] is used by lenders to
decide which credit applicants are good or bad. Collection scoring techniques [2] are
used to segment defaulted borrowers in order to optimize debt recovery, and the profit
scoring approach [3] is used to estimate the profit on a specific credit product.
In this work, we consider the scoring prediction problem for behavioral data from several
angles. First, a set of experiments was conducted to determine the potential quality of
default prediction using different types of scoring models on the behavioral dataset.
Then, we analyzed how much an evolutionary approach can improve the quality of an
ensemble of different models by optimizing its structure, in comparison with the
un-optimized ensemble.
The paper is organized as follows. Section 2 reviews work in the same domain. Section 3
introduces the problem statement and the approaches to the scoring task. Section 4
describes the dataset used as a case study and presents the experiments. Section 5
summarizes the results, and Section 6 gives the conclusion and outlines ways of further
improving the scoring model.
2. Related work
The credit scoring problem is usually considered within the framework of supervised
learning. Thus, all common machine learning methods are used to deal with it: Bayesian
methods, logistic regression, neural networks, k-nearest neighbors, etc. (a review of
popular existing methods for credit scoring problems can be found in [4]).
A good and yet relatively simple way to improve the predictive power of a machine
learning model is to use ensemble methods. The key idea of this approach is to train
different estimators (possibly on different regions of the input space) and then combine
their predictions. In [5] the authors compared three common ensemble techniques on
credit scoring datasets: bagging, boosting and stacking. Stacking and bagging on decision
trees were reported as the two best ensemble techniques. The Kaggle platform for machine
learning competitions published a practical review of ensemble methods, illustrated on
real-world problems [6].
The current trend appears to be the enrichment of primary application features with
information about the dynamics of financial and social behavior extracted from bank
transactional databases and open data sources. According to some studies, the involvement
of transactional data allows the quality of scoring prediction to be increased
significantly [7].
It is worth mentioning that the vast majority of studies on credit scoring models focus
on the resulting quality of classification and do not study in detail the possible effect
of optimizing the structure of the model ensemble. In contrast, in this paper we aim to
investigate how good the prediction for behavioral data can be and how much the ensemble
structure and parameters can be improved by evolutionary optimization to increase the
reliability of the scoring prediction.
3. Problem statement and approaches for behavioral scoring
The widely used approach to making a credit-granting (underwriting) decision is
application form-based scoring. It is based on demographic and static features such as
age, gender, education, employment history, financial statement and credit history.
Application form data makes it possible to create a sufficiently effective scoring model
[16], but this approach is not possible in some cases. For example, a pre-approved credit
card offer to debit card clients can be based only on a limited behavioral dataset that
the bank can extract from the transactional history of a customer.
3.1 Predictive models for scoring task
The result of credit default prediction is binary, so two-class classification algorithms
are potentially applicable to this task. The set of behavioral characteristics of every
client can be used as predictor variables, and the default flag as the response variable.
Several methods were chosen to build an ensemble: a K-nearest neighbors classifier,
classifiers with linear (LDA) and quadratic (QDA) separating surfaces, a feed-forward
neural network with a single hidden layer, an XGBoost-based predictive model, and a
Random Forest classifier.
An ensemble approach to the scoring model makes it possible to combine the advantages of
different classifiers and to improve the quality of default prediction. The probability
vectors obtained at the output of each model included in the ensemble are used as the
input of a metamodel, constructed using an algorithm that maximizes the quality metric
and generates the optimal set of ensemble weights. The total number of model combinations
is 2^7, so we implement the evolutionary algorithm using the DEoptim [8] package, which
performs fast multidimensional optimization in the weight space using differential
evolution.
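The weight search described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation (the paper uses the R package DEoptim); it is a plain-Python DE/rand/1/bin loop written for illustration only. The weighted-sum metamodel, the [0, 1] weight bounds, the control parameters `f` and `cr`, the synthetic data and all function names (`auc`, `blend`, `optimize_weights`) are assumptions.

```python
import random

def auc(labels, scores):
    """Pairwise AUC: share of (bad, good) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def blend(weights, model_probs):
    """Weighted sum of the per-model default probabilities for every client."""
    return [sum(w * p for w, p in zip(weights, row)) for row in model_probs]

def optimize_weights(labels, model_probs, pop_size=15, generations=20,
                     f=0.8, cr=0.9, seed=0):
    """DE/rand/1/bin search over ensemble weights in [0, 1], maximizing AUC."""
    rng = random.Random(seed)
    dim = len(model_probs[0])
    fitness = lambda w: auc(labels, blend(w, model_probs))
    # The first individual is the equal-weight blend, so the greedy selection
    # below guarantees the result is never worse than simple blending.
    pop = [[1.0 / dim] * dim]
    pop += [[rng.random() for _ in range(dim)] for _ in range(pop_size - 1)]
    scores = [fitness(w) for w in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    trial.append(min(1.0, max(0.0, v)))  # clip to the bounds
                else:
                    trial.append(pop[i][d])
            trial_score = fitness(trial)
            if trial_score >= scores[i]:  # keep the trial only if it is not worse
                pop[i], scores[i] = trial, trial_score
    best = max(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

# Synthetic illustration: 100 clients, 20% "bad", three hypothetical models
# (informative, weakly informative, pure noise).
rng = random.Random(1)
labels = [1 if i % 5 == 0 else 0 for i in range(100)]
clip = lambda v: min(1.0, max(0.0, v))
model_probs = [[clip(0.3 * y + rng.gauss(0.3, 0.15)),
                clip(0.1 * y + rng.gauss(0.4, 0.2)),
                rng.random()] for y in labels]

weights, best_auc = optimize_weights(labels, model_probs)
equal_auc = auc(labels, blend([1.0 / 3] * 3, model_probs))
print("optimized weights:", [round(w, 3) for w in weights])
print("AUC equal weights:", round(equal_auc, 4), "optimized:", round(best_auc, 4))
```

Seeding the population with the equal-weight vector is a design choice of this sketch: it makes the comparison between the optimized ensemble and the simple "blended" ensemble one-sided by construction.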
3.2 Metrics for quality estimation of classification-based models
The standard accuracy metric is not suitable for the scoring task because the sample is
highly unbalanced, with many “good” profiles (>97%) and a small number of “bad” ones.
Therefore, the threshold-independent AUC metric (the area under the receiver operating
characteristic (ROC) curve, which describes the diagnostic ability of a binary
classifier) was selected to compare models.
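The point about accuracy can be checked with a short sketch. The function name `roc_auc` and the toy sample are assumptions made for illustration; the AUC is computed through the rank-sum (Mann-Whitney) formulation, with tied scores receiving their average rank.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random "bad" profile outranks a random "good" one.

    labels: 1 = default ("bad"), 0 = no default ("good").
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both classes in the sample")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# A useless model that gives every client the same score on a 97/3 sample.
labels = [0] * 97 + [1] * 3
constant_scores = [0.0] * 100
predicted = [0] * 100  # "everyone is good"
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
print(accuracy)                          # 0.97
print(roc_auc(labels, constant_scores))  # 0.5
```

The constant model reaches 97% accuracy while its AUC stays at the chance level of 0.5, which is why accuracy is uninformative on such a sample.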
The Kolmogorov-Smirnov (KS) statistic allows the threshold value to be estimated by
comparing the probability distributions of the original and predicted datasets. The
probability that corresponds to the maximum of the KS coefficient can be chosen as
optimal in the general case.
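A minimal sketch of the threshold choice follows. It assumes the reading of KS that is usual in credit scoring: the largest gap between the cumulative score distributions of "good" and "bad" profiles. The function name `ks_statistic` and the toy scores are hypothetical.

```python
def ks_statistic(labels, scores):
    """Return (ks, threshold).

    ks is the largest gap between the empirical score CDFs of the two classes;
    threshold is the score at which that gap occurs (scores above it are
    treated as "bad"). labels: 1 = default, 0 = no default.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("KS needs both classes in the sample")
    pairs = sorted(zip(scores, labels))
    best_ks, best_thr = 0.0, None
    seen_pos = seen_neg = 0
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        # Consume every profile that shares this score before measuring the gap.
        while i < len(pairs) and pairs[i][0] == s:
            if pairs[i][1] == 1:
                seen_pos += 1
            else:
                seen_neg += 1
            i += 1
        gap = abs(seen_neg / n_neg - seen_pos / n_pos)
        if gap > best_ks:
            best_ks, best_thr = gap, s
    return best_ks, best_thr

labels = [0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.1, 0.2, 0.3, 0.4, 0.7, 0.9]
ks, thr = ks_statistic(labels, scores)
print(ks, thr)  # 0.75 0.2
```

In this toy sample three of the four "good" profiles and none of the "bad" ones score at or below 0.2, so the gap between the two distributions peaks there.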
4. Case study
4.1 Transactional dataset
To run the experiments with different configurations of the scoring model, a fully
depersonalized transactional dataset was used. We obtained it for research purposes from
one of the major banks in Russia. The dataset contains details of more than 10M
anonymized transactions that were made by cardholders before they applied for a credit
card and were accepted by the bank’s application underwriting procedure. The transactions
cover the period from January 1, 2014 to December 31, 2016.
Each entity in the dataset is assigned an indicator variable of default, which
corresponds to payment delinquency of 90 or more days. The delinquency rate for the
profiles in this dataset is 3.02%.
The transaction parameters included in the dataset and the summary of behavioral profile
parameters that can be used as predictor variables in scoring models are presented in
Table 1.
Table 1. Variables from the transactional dataset
Transactional variables | Behavioral profile parameters
Attributes of [...] IDs of client and contract, date of the transaction, date of contract signing | The numbers of actual and closed contracts
Amount of transaction (in roubles), the location of [...] the terminal used for operation (if known), transaction type (payment/cash [...], Merchant Category Code (if known) | Common frequency and quantitative characteristics of transactions; merchant category-specific characteristics of [...]
Address of payment terminal [...] | Spatial-based characteristics of transactions
Binary flag of default |
This data makes it possible to identify the profiles of bank clients as sets of derived
parameters that characterize their financial behavior patterns, obtained from the
structure of transactions. Also, the date variable can be used to take macroeconomic
variability into account.
Since some profile variables have a lognormal distribution, a logarithmic transformation
for one-side-restricted values and additional scaling to the [0,1] range were applied.
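The preprocessing step can be sketched as follows. The text does not state which logarithm variant was used; this example assumes `log(1 + x)` so that zero values stay finite, followed by min-max scaling. The function name `log_minmax` and the sample amounts are hypothetical.

```python
import math

def log_minmax(values):
    """Apply log(1 + x) to non-negative values, then rescale the result to [0, 1]."""
    if any(v < 0 for v in values):
        raise ValueError("the log transform here assumes values bounded below by 0")
    logged = [math.log1p(v) for v in values]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # A constant variable carries no information; map it to 0 everywhere.
        return [0.0 for _ in logged]
    return [(x - lo) / (hi - lo) for x in logged]

# Amounts spanning two orders of magnitude end up evenly spaced after the transform.
scaled = log_minmax([0, 9, 99])
print([round(v, 3) for v in scaled])  # [0.0, 0.5, 1.0]
```

The log step compresses the long right tail of lognormally distributed variables, so the subsequent scaling is not dominated by a few very large values.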
4.2 Evolutionary ensemble model
The comparison of performance for scoring models is presented in Fig. 1.
Fig. 1. The AUC and KS performance metrics for scoring models
The probability at which the Kolmogorov-Smirnov coefficient reaches its maximum can be
interpreted as the optimal probability threshold for each model.
The ensemble of these models can have different configurations, and it is not necessary
to include all models in the ensemble. To measure the effect of every new model in the
ensemble, we conducted a set of experiments and compared the quality of the scoring
prediction for evolutionary-optimized ensembles with different structures (from 2 to 7
models, with a separate optimization procedure for every ensemble size). Logit regression
was chosen as the base model. The structure of the ensemble is presented in Fig. 2a; the
summary plot of the results is shown in Fig. 2b.
Fig. 2. a) Structure of the ensemble of heterogeneous scoring models;
b) Dependence of the AUC and KS quality metrics on the number of models in the optimized
configuration of the ensemble
It can be seen that the overall scoring quality increases with the number of models used,
but the contribution of individual models is not equal: for example, the neural network
model does not enhance the ensemble quality.
The set of predictors of the ensemble scoring models includes variables with different
predictive power. Redundant variables make the development of an interpretable scorecard
difficult and can cause overfitting. Therefore, an evolutionary approach based on the
kofnGA algorithm [9] was used to select an optimal subset of input variables. The results
are presented in Fig. 3.
Fig. 3. The convergence of the evolutionary algorithm with an AUC-based fitness function
Convergence is achieved in 4 generations with the experimentally determined settings: a
population size of 50 individuals, a mutation probability of 0.7 and a variable subset
size of 14. The optimal variable set contains 8 variables from the “MCC” group, 4
financial parameters and 2 geo parameters.
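The idea of a fixed-size subset search can be sketched in a few lines. This is not the kofnGA implementation; it is a simplified genetic algorithm written for illustration. The tournament selection, the union-based crossover, the per-individual reading of the mutation probability, the total of 30 candidate variables and the toy fitness function (a sum of hypothetical per-variable scores instead of the AUC of a trained model) are all assumptions.

```python
import random

def subset_ga(n, k, fitness, pop_size=50, generations=4, mut_prob=0.7, seed=0):
    """Search for a k-element subset of range(n) that maximizes fitness(subset)."""
    rng = random.Random(seed)
    pop = [tuple(sorted(rng.sample(range(n), k))) for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        children = [best]  # elitism: the best subset survives unchanged
        while len(children) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # Crossover: draw k distinct variables from the union of both parents.
            child = rng.sample(sorted(set(p1) | set(p2)), k)
            # Mutation: with probability mut_prob swap one member for an outsider.
            if rng.random() < mut_prob:
                outsiders = [v for v in range(n) if v not in child]
                if outsiders:
                    child[rng.randrange(k)] = rng.choice(outsiders)
            children.append(tuple(sorted(child)))
        pop = children
        best = max(pop, key=fitness)
        history.append(fitness(best))
    return best, history

# Hypothetical per-variable usefulness scores for 30 candidate variables.
usefulness = [((7 * i) % 30) / 30 for i in range(30)]
best, history = subset_ga(n=30, k=14,
                          fitness=lambda s: sum(usefulness[v] for v in s))
print("selected variables:", best)
print("best fitness per generation:", [round(h, 3) for h in history])
```

Because the best subset is carried over unchanged, the best fitness per generation never decreases, which is the kind of curve a convergence plot of such a search shows.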
5. Results and discussions
The experimental results can be interpreted as a confirmation of the effectiveness of the
evolutionary ensemble optimization approach. The summary of model results with 10-fold
cross-validation and a 70-30 train/test ratio is presented in Table 2.
Table 2. Summary of scoring models performance
Training sample
Validation sample
It can be seen that the best single models provide similar quality of scoring
predictions. The simple “blended” ensemble with equal weights for every model cannot
improve the final quality, but evolutionary optimization makes it possible to increase
the quality of the scoring prediction slightly. A limitation of the case study is the
restricted access to additional data (such as application forms), which is why the
predictive ability of the applied models cannot be fully revealed.
6. Conclusion
The obtained results confirm that evolutionary-controlled ensembling of scoring models
makes it possible to increase the quality of default prediction. Nevertheless, it can
also be seen that the optimized ensemble only slightly improves on the result of the best
single model (XGBoost) and, moreover, the results of all individual models are relatively
close to each other. This leads us to the conclusion that further improvement of the
developed model can be achieved by taking additional behavioral and non-behavioral
factors into account in order to raise the current quality threshold.
This research is financially supported by The Russian Science Foundation, Agreement
№17-71-30029 with co-financing of Bank Saint Petersburg.
References
1. Abdou H.A., Pointon J. Credit Scoring, Statistical Techniques and Evaluation Criteria:
A Review of the Literature // Intell. Syst. Accounting, Financ. Manag. 2011. Vol. 18,
№ 2–3. P. 59–88.
2. Ha S.H. Behavioral assessment of recoverable credit of retailer’s customers // Inf. Sci.
(Ny). 2010. Vol. 180, № 19. P. 3703–3717.
3. Serrano-Cinca C., Gutiérrez-Nieto B. The use of profit scoring as an alternative to credit
scoring systems in peer-to-peer (P2P) lending // Decis. Support Syst. 2016. Vol. 89. P.
4. Lessmann S. et al. Benchmarking state-of-the-art classification algorithms for credit
scoring: An update of research // Eur. J. Oper. Res. 2015. Vol. 247, № 1. P. 124–136.
5. Wang G. et al. A comparative assessment of ensemble learning for credit scoring //
Expert Syst. Appl. 2011. Vol. 38, № 1. P. 223–230.
6. Kaggle Ensembling Guide [Electronic resource].
7. Westley K., Theodore I. Transaction Scoring: Where Risk Meets Opportunity
[Electronic resource].
8. Mullen K.M. et al. DEoptim: An R package for global optimization by differential
evolution. 2009.
9. Wolters M.A. A genetic algorithm for selection of fixed-size subsets with application
to design problems // J. Stat. Softw. 2015. Vol. 68, № 1. P. 1–18.