Study of Data Mining Techniques used for Financial Data Analysis

Abhijit A. Sawant and P. M. Chawan
Department of Computer Technology, VJTI, Mumbai, India
International Journal of Engineering Science and Innovative Technology (IJESIT), ISSN: 2319-5967, Volume 2, Issue 3, May 2013
Abstract: This paper describes different data mining techniques used in financial data analysis. Financial data analysis is carried out in many financial institutions for accurate analysis of consumer data, in order to distinguish likely defaulters from valid customers. Different data mining techniques can be used for this purpose, and the information obtained can then be used for decision making. In this paper we study loan default risk analysis, types of credit scoring, and data mining techniques such as Bayes classification, decision trees, boosting, bagging, the random forest algorithm and other techniques.
Index Terms: Data mining, Bayes classification, decision tree, boosting, bagging, random forest algorithm
I. INTRODUCTION
In the banking sector and other such lending organizations, the accurate assessment of consumers is of utmost importance. Credit loans and finances carry a risk of being defaulted. These loans involve large amounts of capital, and their non-recovery can lead to major losses for the financial institution. Therefore, accurate assessment of the risk involved is a crucial matter for banks and other such organizations. It is important not only to minimize the risk in granting credit but also to minimize errors in declining valid customers, which protects banks from lawsuits. Increasing demand for consumer credit has intensified competition in the credit industry. In India, an increasing number of people are opting for loans for houses and cars, and there is also a large demand for credit cards. Such credits can be defaulted upon and have a great impact on the economy of the country. Thus, assessing loan default risk is important for the economy. Credit assessment was earlier done by analysts using statistical and mathematical methods. Data mining techniques have since gained popularity because of their ability to discover practical knowledge from databases and transform it into useful information.
II. LOAN DEFAULT RISK ANALYSIS
Loan default risk assessment is one of the crucial issues that financial institutions, particularly banks, are faced with, and determining the effective variables is one of the critical parts of this type of study. Credit scoring is a widely used technique that helps financial institutions evaluate the likelihood that a credit applicant will default on a financial obligation and decide whether to grant credit or not. Precise judgment of the creditworthiness of applicants allows financial institutions to increase the volume of granted credit while minimizing possible losses. Credit scoring models are statistical models which have been widely used to predict the default risk of individuals or companies. These are multivariate models which take as input the main economic and financial indicators of a company or the characteristics of an individual, such as age, income and marital status, and assign each a weight that reflects its relative importance in predicting default. The result is an index of creditworthiness expressed as a numerical score, which measures the borrower's probability of default. The initial credit scoring models were devised in the 1930s by authors such as Fisher and Durand. The goal of a credit scoring model is to classify credit applicants into two classes: the "good credit" class, which is likely to repay the financial obligation, and the "bad credit" class, which should be denied credit due to the high probability of defaulting on the financial obligation. The classification is contingent on characteristics of the borrower (such as age, education level, occupation, marital status and income), the repayment performance on previous loans and the type of loan. These models are also applicable to small businesses, since these may be regarded as extensions of an individual customer.
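For illustration, the following is a minimal sketch (not from the paper) of such a scoring model, implemented as a logistic regression over a few hypothetical applicant attributes; the feature names, data and cut-off are assumptions made only for this example:

```python
# Minimal credit-scoring sketch: a logistic regression that weights applicant
# attributes and outputs an estimated probability of default.
# Feature names and sample data are hypothetical, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Columns: age, annual income (thousands), is_married (0/1), years_employed
X = np.array([
    [25,  30, 0,  1],
    [45,  80, 1, 20],
    [35,  50, 1,  8],
    [22,  20, 0,  0],
    [52, 120, 1, 25],
    [30,  25, 0,  2],
])
y = np.array([1, 0, 0, 1, 0, 1])  # 1 = defaulted, 0 = repaid

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

applicant = np.array([[28, 40, 0, 3]])
p_default = model.predict_proba(applicant)[0, 1]
print(f"Estimated probability of default: {p_default:.2f}")
# A lender would compare this score against a cut-off to grant or deny credit.
```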
Types of Scoring
Based on the research of Paleologo, Elisseeff, and Antonini, there are different kinds of scoring:
1) Application Scoring:
This kind of scoring consists of estimating the creditworthiness of a new applicant who applies for credit. It estimates financial risk with respect to the social, demographic and financial conditions of a new applicant in order to decide whether to grant credit or not.
2) Behavioral Scoring:
It is similar to application scoring, with the difference that it involves existing customers, so the lender has some evidence about the borrower's behavior that supports dynamic management of the portfolio.
3) Collection Scoring:
Collection scoring classifies customers into different groups according to their level of insolvency. In other words, it separates the customers who need more decisive actions from those who do not require immediate attention. These models are used for the management of delinquent customers from the first signs of delinquency.
4) Fraud Detection:
Fraud detection models categorize applicants according to the probability that an applicant is fraudulent.
III. DATA MINING TECHNIQUES
Data mining, a field at the intersection of computer science and statistics, is the process that attempts to discover patterns in large data sets. It utilizes methods from artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. The most commonly used data mining techniques are as follows:
A. Bayes Classification
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions, and it is particularly suited to problems where the dimensionality of the inputs is high. A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Classification is a form of data analysis which can be used to extract models describing important data classes. Classification predicts categorical labels (discrete values). Data classification is a two-step process. In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. In the second step, the model is used for classification. First, the predictive accuracy of the model is estimated. The accuracy of a model on a given test set is the percentage of test set samples that are correctly classified by the model. If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is not known.
Bayesian Algorithm:
1. Order the nodes according to their topological order.
2. Initialize the importance function Pr0(X\E), the desired number of samples m, the updating interval l, and the score arrays for every node.
3. k ← 0, T ← Ø
4. for i ← 1 to m do
5.   if (i mod l == 0) then
6.     k ← k + 1
7.     update the importance function Prk(X\E) based on T
     end if
8.   si ← generate a sample according to Prk(X\E)
9.   T ← T ∪ {si}
10.  calculate Score(si, Pr(X\E, e), Prk(X\E)) and add it to the corresponding entry of every score array according to the instantiated states.
   end for
11. Normalize the score arrays for every node.
The major disadvantage of this model is that its predictive accuracy is highly dependent on the independence assumption. An advantage of the method is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification.
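As a concrete illustration of the two-step process described above, here is a minimal sketch (not from the paper) using a Gaussian naive Bayes classifier on hypothetical, synthetic data:

```python
# Minimal naive Bayes sketch: build a model on labelled tuples, estimate its
# accuracy on a held-out test set, then classify a new, unlabelled tuple.
# The features (age, income, loan amount) and labels are hypothetical.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # age, income, loan amount (scaled)
y = (X[:, 1] - X[:, 2] < 0).astype(int)       # 1 = default, 0 = repay (synthetic rule)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB()                          # assumes feature independence per class
model.fit(X_train, y_train)                   # step 1: build the model
accuracy = model.score(X_test, y_test)        # step 2: estimate predictive accuracy
print(f"Test accuracy: {accuracy:.2f}")

new_applicant = np.array([[0.1, -0.5, 0.8]])
print("Predicted class:", model.predict(new_applicant)[0])
```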
B. Decision Tree
A classification (decision) tree is a tree-like graph of decisions and their possible consequences. The topmost node in the tree is the root node, at which a decision is to be taken. Each inner node performs a test on an attribute or input variable. Specifically, each branch of the tree corresponds to a classification question, and the leaves of the tree
are partitions of the dataset with their classification. Decision tree algorithms follow very similar processes when they build trees. These algorithms look at all possible distinguishing questions that could break up the original training dataset into segments that are nearly homogeneous with respect to the different classes being predicted. Some decision tree algorithms may use heuristics in order to pick the questions, or even pick them at random. The advantage of this method is that it is a white-box model and so is simple to understand and explain, but its limitation is that it cannot be generalized into a single designed structure for all contexts.
A decision tree is a mapping from observations about an item to conclusions about its target value, used as a predictive model in data mining and machine learning. For such tree models, other descriptive names are classification tree (discrete target) or regression tree (continuous target). In these tree structures, the leaf nodes represent classifications, the inner nodes represent the current predictive attributes, and the branches represent conjunctions of attributes that lead to the final classifications. Popular decision tree algorithms include ID3, C4.5 (an extension of the ID3 algorithm) and CART.
Fig 1- Decision Tree
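The short sketch below (hypothetical data and feature names, not from the paper) shows how such a tree can be trained and then printed as human-readable rules, which is what makes it a white-box model:

```python
# Minimal decision tree sketch: fit a CART-style tree on hypothetical applicant
# data and print the learned rules as human-readable if/else tests.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: age, annual income (thousands), existing loans
X = np.array([
    [25, 30, 2], [45, 80, 0], [35, 50, 1], [22, 20, 3],
    [52, 120, 0], [30, 25, 2], [40, 60, 1], [28, 35, 2],
])
y = np.array([1, 0, 0, 1, 0, 1, 0, 1])  # 1 = default, 0 = repay

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The tree can be printed as a readable set of decision rules.
print(export_text(tree, feature_names=["age", "income", "existing_loans"]))
print("Prediction for a new applicant:", tree.predict([[33, 45, 1]])[0])
```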
C. Boosting
The concept of boosting applies to the area of predictive data mining: multiple models or classifiers are generated (for prediction or classification), and weights are derived to combine the predictions from those models into a single prediction or predicted classification.
A simple boosting algorithm works like this: start by applying some method to the learning data, where each observation is assigned an equal weight. Compute the predicted classifications, and apply weights to the observations in the learning sample that are inversely proportional to the accuracy of the classification. In other words, assign greater weight to the observations that were difficult to classify (where the misclassification rate was high), and lower weight to those that were easy to classify (where the misclassification rate was low). Boosting generates a sequence of classifiers, where each consecutive classifier in the sequence is an "expert" in classifying the observations that were not well classified by those preceding it. During deployment (for prediction or classification of new cases), the predictions from the different classifiers can then be combined (e.g., via voting or some weighted voting procedure) to derive a single best prediction or classification.
The most popular boosting algorithm is AdaBoost. AdaBoost, short for Adaptive Boosting, is a machine learning algorithm formulated by Yoav Freund and Robert Schapire. It is a meta-algorithm and can be used in conjunction with many other learning algorithms to improve their performance.
The Boosting Algorithm AdaBoost
Given: (x1, y1), ..., (xm, ym); xi ∈ X, yi ∈ {-1, +1}
Initialize weights D1(i) = 1/m
For t = 1, ..., T:
1. Call WeakLearn, which returns the weak classifier ht : X → {-1, +1} with minimum error with respect to the distribution Dt;
2. Choose αt ∈ R;
3. Update
   Dt+1(i) = Dt(i) exp(-αt yi ht(xi)) / Zt,
   where Zt is a normalization factor chosen so that Dt+1 is a distribution.
Output the strong classifier:
H(x) = sign(Σt=1..T αt ht(x))
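A minimal runnable sketch of this procedure, assuming decision stumps as the weak learners and synthetic data, could look as follows (the choice alpha_t = 0.5 * ln((1 - eps_t) / eps_t) is the common one, not prescribed above):

```python
# Minimal AdaBoost sketch with decision stumps as weak learners.
# Follows the update rule D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # labels in {-1, +1}, synthetic

T = 20
m = len(y)
D = np.full(m, 1.0 / m)                        # initial weights D_1(i) = 1/m
stumps, alphas = [], []

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=D)           # weak learner w.r.t. distribution D_t
    pred = stump.predict(X)
    err = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)   # weighted error eps_t
    alpha = 0.5 * np.log((1 - err) / err)      # a common choice for alpha_t
    D = D * np.exp(-alpha * y * pred)
    D /= D.sum()                               # normalize so D_{t+1} is a distribution
    stumps.append(stump)
    alphas.append(alpha)

# Strong classifier: H(x) = sign(sum_t alpha_t * h_t(x))
H = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("Training accuracy:", np.mean(H == y))
```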
D. Bagging
Bagging (bootstrap aggregating) was proposed by Leo Breiman in 1994 to improve classification by combining the classifications of randomly generated training sets.
Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.
Given a standard training set D of size n, bagging generates m new training sets Di, each of size n′, by sampling from D uniformly and with replacement. Because of sampling with replacement, some observations may be repeated in each Di. If n′ = n, then for large n the set Di is expected to contain the fraction (1 - 1/e) of the unique examples of D, the rest being duplicates. This kind of sample is known as a bootstrap sample. The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or by voting (for classification).
Bagging leads to improvements for unstable procedures, which include, for example, neural networks, classification and regression trees, and subset selection in linear regression. On the other hand, it can mildly degrade the performance of stable methods such as k-nearest neighbours.
The Bagging Algorithm
Training phase
1. Initialize the parameters:
   D = Ø, the ensemble;
   L, the number of classifiers to train.
2. For k = 1, ..., L:
   Take a bootstrap sample Sk from the training set Z.
   Build a classifier Dk using Sk as the training set.
   Add the classifier to the current ensemble, D = D ∪ Dk.
3. Return D.
Classification phase
4. Run D1, ..., DL on the input x.
5. The class with the maximum number of votes is chosen as the label for x.
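A minimal sketch of the same idea using a scikit-learn bagging ensemble on synthetic data (the parameter values are illustrative assumptions):

```python
# Minimal bagging sketch: L bootstrap samples drawn with replacement, one base
# classifier (a decision tree by default) per sample, majority vote at prediction.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)        # synthetic, non-linear labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(
    n_estimators=25,        # L, the number of classifiers to train
    bootstrap=True,         # sample the training set with replacement
    random_state=0,
)
bag.fit(X_train, y_train)
print("Bagging test accuracy:", bag.score(X_test, y_test))
```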
E. Random forest Algorithm
Random forests are an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees. The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler, and "Random Forests" is their trademark. The term came from random decision forests, first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's bagging idea with the random selection of features, introduced independently by Ho and by Amit and Geman, in order to construct a collection of decision trees with controlled variation. The selection of a random subset of features is an example of the random subspace method, which, in Ho's formulation, is a way to implement the stochastic discrimination proposed by Eugene Kleinberg. The introduction of random forests proper was first made in a paper by Leo Breiman, which describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. It is better to think of random forests as a framework rather than as a particular model. The framework consists of several interchangeable parts which can be mixed and matched to create a large number of particular models, all built around the same central theme. Constructing a model in this framework requires making several choices (a sketch follows the list below):
1. The shape of the decision to use in each node.
2. The type of predictor to use in each leaf.
3. The splitting objective to optimize in each node.
4. The method for injecting randomness into the trees.
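As a minimal sketch, the scikit-learn random forest below fixes those choices to common defaults (axis-aligned threshold splits, class predictions in the leaves, an impurity-based splitting criterion, bootstrap sampling plus random feature subsets); the data is synthetic:

```python
# Minimal random forest sketch: many randomized trees, class decided by the
# mode (majority) of the individual trees' outputs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + X[:, 1] ** 2 - X[:, 2] > 0).astype(int)   # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,        # number of trees in the forest
    max_features="sqrt",     # random subset of features tried at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print("Random forest test accuracy:", forest.score(X_test, y_test))
print("Feature importances:", np.round(forest.feature_importances_, 3))
```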
F. Other Techniques
I] The Back Propagation Algorithm
The back propagation algorithm (Rumelhart and McClelland, 1986) is used in layered feed-forward ANNs. This means that the artificial neurons are organized in layers and send their signals "forward", and then the errors are propagated backwards. The network receives inputs through neurons in the input layer, and the output of the network is given by the neurons in an output layer. There may be one or more intermediate hidden layers. The back propagation algorithm uses supervised learning, which means that we provide the algorithm with examples of the inputs and outputs we want the network to compute, and the error (the difference between actual and expected results) is then calculated. The idea of the back propagation algorithm is to reduce this error until the ANN learns the training data. The training begins with random weights, and the goal is to adjust them so that the error will be minimal.
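A minimal numpy sketch of back propagation for a small feed-forward network trained on a toy task (XOR) is given below; the layer sizes, learning rate and epoch count are illustrative assumptions:

```python
# Minimal back propagation sketch: a 2-4-1 feed-forward network trained on XOR.
# Signals flow forward; errors are propagated backward to adjust the weights.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)           # XOR targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(size=(2, 4))                              # input -> hidden weights
W2 = rng.normal(size=(4, 1))                              # hidden -> output weights
lr = 1.0                                                  # learning rate

for epoch in range(5000):
    # Forward pass
    h = sigmoid(X @ W1)                                   # hidden activations
    out = sigmoid(h @ W2)                                 # network output
    # Backward pass: propagate the error and compute gradients
    err = out - y                                         # output error
    grad_out = err * out * (1 - out)                      # sigmoid derivative at output
    grad_h = (grad_out @ W2.T) * h * (1 - h)              # error pushed back to hidden layer
    # Adjust the weights to reduce the error
    W2 -= lr * h.T @ grad_out
    W1 -= lr * X.T @ grad_h

print(np.round(sigmoid(sigmoid(X @ W1) @ W2), 2))          # should approach [0, 1, 1, 0]
```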
II] Genetic Algorithm:
Genetic Algorithms (GA) were developed by Holland in the 1970s. They incorporate Darwinian evolutionary theory with sexual reproduction. A GA is a stochastic search algorithm modeled on the process of natural selection, which underlies biological evolution. GAs have been successfully applied in many search, optimization, and machine learning problems. A GA proceeds in an iterative manner by generating new populations of strings from old ones. Every string is the encoded (binary, real, etc.) version of a candidate solution. An evaluation function associates a fitness measure with every string, indicating its fitness for the problem. Standard GAs apply genetic operators such as selection, crossover and mutation to an initially random population in order to compute a whole generation of new strings.
Fig 2- Genetic Algorithm Flowchart
Selection deals with the probabilistic survival of the fittest, in that more fit chromosomes are chosen to survive, where fitness is a comparable measure of how well a chromosome solves the problem at hand.
Crossover takes individual chromosomes from the population P and combines them to form new ones.
Mutation alters the new solutions so as to add stochasticity to the search for better solutions.
In general, the main motivation for using GAs in the discovery of high-level prediction rules is that they perform a global search and cope better with attribute interaction than the greedy rule induction algorithms often used in data mining.
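The sketch below illustrates these three operators on a toy problem (maximizing the number of 1-bits in a binary string); the population size, rates and fitness function are assumptions made only for this example:

```python
# Minimal genetic algorithm sketch: selection, crossover and mutation on
# binary strings, maximizing a toy fitness (the number of 1-bits).
import numpy as np

rng = np.random.default_rng(4)
POP, LENGTH, GENERATIONS, MUT_RATE = 30, 20, 50, 0.02

def fitness(pop):
    return pop.sum(axis=1)                       # count of 1-bits per chromosome

pop = rng.integers(0, 2, size=(POP, LENGTH))     # initial random population

for gen in range(GENERATIONS):
    fit = fitness(pop)
    # Selection: fitness-proportional ("survival of the fittest") sampling
    probs = fit / fit.sum()
    parents = pop[rng.choice(POP, size=POP, p=probs)]
    # Crossover: combine pairs of parent chromosomes at a random cut point
    children = parents.copy()
    for i in range(0, POP - 1, 2):
        cut = rng.integers(1, LENGTH)
        children[i, cut:], children[i + 1, cut:] = (
            parents[i + 1, cut:].copy(), parents[i, cut:].copy())
    # Mutation: flip random bits to add stochasticity to the search
    flips = rng.random(children.shape) < MUT_RATE
    pop = np.where(flips, 1 - children, children)

print("Best fitness found:", fitness(pop).max(), "of a possible", LENGTH)
```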
III] Particle Swarm Optimization:
Investigation and analysis of biological colonies demonstrated that the intelligence generated from their complex activities can provide efficient solutions for specific optimization problems. Inspired by the social behavior of animals such as fish schooling and bird flocking, Kennedy and Eberhart designed Particle Swarm Optimization (PSO) in 1995. The basic PSO model consists of a swarm of particles moving in a d-dimensional search space. The direction and distance of each particle in this hyper-dimensional space are determined by its fitness and velocity. In general, the fitness is primarily related to the optimization objective, and the velocity is updated according to a sophisticated rule.
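A minimal sketch of the basic PSO position and velocity update on a toy objective (minimizing the sphere function) follows; the inertia and acceleration coefficients are common textbook values, not values from the paper:

```python
# Minimal particle swarm optimization sketch: particles move through a
# d-dimensional space, steered by their own best position and the swarm's best.
import numpy as np

rng = np.random.default_rng(5)
N, D, ITERS = 30, 5, 100                       # particles, dimensions, iterations
W, C1, C2 = 0.7, 1.5, 1.5                      # inertia and acceleration coefficients

def objective(x):                              # toy fitness: sphere function
    return np.sum(x ** 2, axis=-1)

pos = rng.uniform(-5, 5, size=(N, D))
vel = np.zeros((N, D))
pbest = pos.copy()                             # each particle's best position so far
gbest = pos[np.argmin(objective(pos))]         # best position found by the swarm

for _ in range(ITERS):
    r1, r2 = rng.random((N, D)), rng.random((N, D))
    vel = W * vel + C1 * r1 * (pbest - pos) + C2 * r2 * (gbest - pos)
    pos = pos + vel
    better = objective(pos) < objective(pbest)
    pbest[better] = pos[better]
    gbest = pbest[np.argmin(objective(pbest))]

print("Best objective value found:", objective(gbest))
```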
Artificial neural networks (ANNs) are thus non-linear statistical models inspired by the functioning of the human brain. They are powerful tools for modeling unknown data relationships. Artificial neural networks are able to recognize complex patterns between input and output variables and then predict the outcome of new, independent input data.
IV] Support Vector Machine:
The support vector machine (SVM) is a classification technique. The method involves three elements: a score formula, which is a linear combination of the features selected for the classification problem; an objective function, which considers both training and test samples to optimize the classification of new data; and an optimizing algorithm for determining the optimal parameters of the training-sample objective function.
The advantages of the method are that, in the nonparametric case, SVM requires no data structure assumptions such as normal distribution and continuity, it can perform a nonlinear mapping from the original input space into a high-dimensional feature space, and it is capable of handling both continuous and categorical predictors. The weaknesses of the method are that it is difficult to interpret unless the features are interpretable, and standard formulations do not allow the specification of business constraints.
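A minimal sketch of an SVM with a nonlinear (RBF) kernel, which implicitly maps the inputs into a high-dimensional feature space, on synthetic data:

```python
# Minimal support vector machine sketch: an RBF-kernel SVM implicitly maps the
# inputs into a high-dimensional feature space and separates the classes there.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 2))
y = (np.hypot(X[:, 0], X[:, 1]) > 1.0).astype(int)   # non-linearly separable labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("SVM test accuracy:", svm.score(X_test, y_test))
```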
IV. CONCLUSION
A system using data mining for loan default risk analysis enables a bank to reduce the manual errors involved in this assessment. Decision trees are preferred by banks because they are a white-box model: the discrimination made by decision trees is explicit, and people can easily understand how they work. This enables banks and other financial institutions to give an account of why an applicant was accepted or rejected. Boosting has already increased the efficiency of decision trees. The assessment of risk will enable banks to increase profit and can result in a reduction of interest rates.
REFERENCES
[1] Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", 2nd edition.
[2] Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education, Sixth Impression, 2009.
[3] Abbas Keramati, Niloofar Yousefi, "A Proposed Classification of Data Mining Techniques in Credit Scoring", International Conference on Industrial Engineering and Operations Management, 2011.
[4] Boris Kovalerchuk, Evgenii Vityaev, "Data Mining for Financial Applications", 2002.
[5] Defu Zhang, Xiyue Zhou, Stephen C. H. Leung, Jiemin Zheng, "Vertical bagging decision trees model for credit scoring", Expert Systems with Applications 37 (2010) 7838-7843.
[6] Raymond Anderson, "Credit risk assessment: Enterprise-credit frameworks".
[7] Girisha Kumar Jha, "Artificial Neural Networks".
[8] Hossein Falaki, "AdaBoost Algorithm".
[9] Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan, "Mining Very Large Databases", IEEE, 1999.
[10] http://en.wikipedia.org.
AUTHOR BIOGRAPHY
Abhijit A. Sawant is currently pursuing the second year of his M. Tech in Computer Engineering at Veermata Jijabai Technological Institute (V.J.T.I.), Matunga, Mumbai (India). He received his Bachelor's degree in Information Technology from Padmabhushan Vasantdada Patil Pratishan's College of Engineering (P.V.P.P.C.O.E.), Sion, Mumbai (India) in 2010. He has published 2 papers in international journals to date. His areas of interest are software engineering, databases, data warehousing and data mining.
Pramila M. Chawan is currently working as an Associate Professor in the Computer Technology Department of Veermata Jijabai Technological Institute (V.J.T.I.), Matunga, Mumbai (India). She received her Bachelor's degree in Computer Engineering from V.J.T.I., Mumbai University (India) in 1991 and her Master's degree in Computer Engineering from V.J.T.I., Mumbai University (India) in 1997. She has 20 years of academic experience and has taught computer-related subjects at both undergraduate and postgraduate levels. Her areas of interest are software engineering, software project management, management information systems, advanced computer architecture and operating systems. She has published 12 papers in national conferences and 7 papers in international conferences and symposiums. She also has 40 international journal publications to her credit. She has guided 35 M. Tech. projects and 85 B. Tech. projects.