Conference Paper

An Approach for Predicting Employee Churn by Using Data Mining


Ibrahim Onuralp Yiğit
Turk Telekom Labs
Turk Telekom
Istanbul, Turkey
Hamed Shourabizadeh
Department of Industrial Engineering
Ozyegin University
Istanbul, Turkey
Abstract: Employee churn prediction, which is closely related to
customer churn prediction, is a major issue for companies. Despite
the importance of the issue, it has received little attention in the
literature. In this study, we apply well-known classification methods,
namely Decision Tree, Logistic Regression, SVM, KNN, Random Forest,
and Naive Bayes, to HR data. We then analyze the results by
calculating their accuracy, precision, recall, and F-measure values.
Moreover, we apply a feature selection method to the data and compare
the results with the previous ones. The results can help companies
predict their employees' churn status and consequently reduce their
human resource costs.
Keywords: Employee Churn Prediction, Data Analysis, Feature
Selection, Data Mining, Classification
An employee may decide to join or leave an organization for
several reasons, for instance, work environment, workplace,
gender equity, and pay equity. Others may consider personal
reasons such as relocation due to family, maternity, health, or
conflict with managers or colleagues in a team. Employee churn is
a big issue for organizations, especially when trained, technical,
and key employees leave for a better opportunity at a competitor.
Replacing a trained employee requires time and effort and results
in financial loss. Therefore, we use current and past employee
data to analyze the common grounds for employee attrition.
Employee churn prediction helps in identifying and solving the
issues that result in attrition. We can use this information for
possible retention of current employees.
In this study, we implement some of the well-known data
classification techniques, namely Decision Tree, Naive Bayes,
Logistic Regression, Support Vector Machine (SVM), K-Nearest
Neighbor (KNN), and Random Forest, on the Human Resources (HR)
Employee Attrition dataset provided by IBM. The dataset includes
1470 records with 34 features, both categorical and numeric.
Before applying the methods, we calculate the correlation between
the features in order to avoid using highly correlated features.
We then analyze the results of these methods by their accuracy,
precision, recall, and F-measure values and identify the method
with the best performance. Finally, we apply a feature selection
method to select the most important features of the dataset, run
the above-mentioned classification methods on the dataset with the
reduced number of features, and compare the results of the methods
with and without feature selection.
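The correlation check mentioned above can be illustrated with a minimal sketch. It assumes Pearson correlation and a cut-off of 0.9; the paper states neither the correlation measure nor a threshold, and the column names and values below are hypothetical toy data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.9):
    """Keep a column only if it is not strongly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: job_level rises almost linearly with monthly_income.
columns = {
    "monthly_income": [2000, 3500, 5000, 8000, 12000, 15000],
    "job_level": [1, 1, 2, 3, 4, 5],
    "distance_from_home": [10, 2, 25, 7, 1, 18],
}
kept = drop_correlated(columns)
print(kept)
```

Which member of a correlated pair is kept depends on column order here; in practice that choice would usually be made on domain grounds.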
Churn prediction, particularly customer churn prediction, has
attracted considerable attention from researchers. For instance,
Verbeke et al. propose a profit-centric performance measure that
calculates the maximum profit that can be generated by including
the optimal fraction of customers with the highest predicted
probabilities to churn in a retention campaign [1]. Coussement
and Van den Poel studied the problem of optimizing the performance
of a decision support system for churn prediction [2]. They
examined the effect of textual information on the churn prediction
method and found that adding unstructured, textual information to
a conventional churn prediction model resulted in a significant
increase in predictive performance. In a similar study, Wei and
Chiu propose predicting the churn of telecommunication customers
by analyzing the customers' call details [3]. Coussement and Van
den Poel apply the SVM method to predict customer churn [4]. Their
study shows that support vector machines exhibit good
generalization performance when applied to noisy marketing data.
Burez and Van den Poel study class imbalance in customer churn
prediction [5]. The results of their study show that
under-sampling can lead to improved prediction accuracy.
In another study, Tsai and Chen use association rules to select
important features and then apply neural networks and Decision
Tree to predict customer churn in a telecommunication company [6].
Like us, they use four performance measures to analyze their
results: accuracy, precision, recall, and F-measure. Coussement
et al. use Generalized Additive Models (GAM), which allow the
model to fit complex non-linear relationships in the data [7].
There are also other studies that apply well-known data mining
techniques to predict customer churn. Huang et al. propose new
features for customer churn prediction and apply seven prediction
techniques: Logistic Regression, Linear Classification, Naive
Bayes, Decision Tree, Multilayer Perceptron Neural Networks,
Support Vector Machines, and the Evolutionary Data Mining
Algorithm [8].
Churn prediction analysis is frequently studied in the literature.
In contrast, few studies consider employee churn prediction and
analysis. Saradhi and Palshikar study employee churn prediction by
applying Naive Bayes, Logistic Regression, Decision Tree, and
Random Forest methods [9]. In another study, Khare et al. propose
an attrition risk equation based on Logistic Regression to predict
employee churn [10]. The last study, to the best of our knowledge,
is Kane-Sellers's study of the professional sales force of a
Fortune 500 North American industrial automation manufacturer
[11]. The main method used by Kane-Sellers is Logistic Regression.
Employee churn is closely related to customer churn but not
identical, and in some companies the costs related to employee
churn are even higher than those of customer churn, so this field
needs further attention from researchers. Therefore, as distinct
from the above studies, we not only compare several classification
methods but also apply a feature selection method to the employee
churn problem.
Churn analysis and prediction are widely studied for customer
churn, since it is a well-known issue that results in revenue
loss. Employee churn is a similar problem for organizations;
however, predicting employee churn is rather more complex than
predicting customer churn. Employee churn leads to issues such as
the effort and time needed for replacement and retraining,
financial loss, customer dissatisfaction, and many more.
Therefore, the key to the smooth running of an organization is to
retain its trained workforce.
Employee churn can be categorized into two types: (1) voluntary,
those who leave for their own reasons, and (2) involuntary, those
who are released from their services by the organization.
Companies usually focus on voluntary churn, where an employee
leaves either for a better opportunity in terms of pay, benefits,
work environment, etc., or due to negative reasons at the present
organization, such as conflict with supervisors, lack of
opportunities for promotion, lack of interesting work, and many
more. In this study, we also focus on predicting voluntary churn.
We apply a wide range of data mining techniques, from simple ones
such as Naive Bayes, linear regression, and nearest neighbors to
more complex ones such as SVM, Random Forests, and other ensemble
methods. We address the problem with the following procedure for
employee data analysis and churn prediction:
1) Analyze the employee dataset, which consists of current
and past employee records
2) Clean the dataset, handle missing information, and
derive new features if required
3) Select the features in the employee data that are
suitable for churn prediction
4) Try several classification methods and report the most
suitable ones by comparing accuracy, precision, recall,
and F-measure on the test data
5) Apply a feature selection method and select the
features that are most useful for predicting employee
churn
6) Build the classification model
7) Predict churning employees using the model
8) Decide on retention strategies
A. Data Selection & Cleaning
The first step in our approach is data selection and cleaning. In
this step, we use the HR Employee Attrition dataset provided by
IBM to predict employee attrition. The dataset contains employee
information such as demographics, experience, skills, nature of
work or unit, position, etc. The goal of this step is to identify
and select the features of the employee data that are most
suitable for our analysis. There were 34 features in total, some
of which were not useful or had the same value for all records.
For instance, all employees were over 18 years old. As another
example, an employee ID or name may not be important, and we can
discard such features from the data. After removing those
unnecessary features, we had 30 features. Table I shows the
features of the data and their data types.
Table I. Features of the data and their data types
No  Feature                      Data Type
1   Age                          Numeric
2   Business Travel              Categorical
3   Daily Rate                   Numeric
4   Department                   Categorical
5   Distance From Home           Numeric
6   Education                    Categorical
7   Education Field              Categorical
8   Gender                       Categorical
9   Environment Satisfaction     Categorical
10  Hourly Rate                  Numeric
11  Job Involvement              Categorical
12  Job Level                    Categorical
13  Job Role                     Categorical
14  Job Satisfaction             Categorical
15  Marital Status               Categorical
16  Monthly Income               Numeric
17  Monthly Rate                 Numeric
18  Number of Companies Worked   Numeric
19  Over Time                    Categorical
20  Percent Salary Hike          Numeric
21  Performance Rating           Categorical
22  Relationship Satisfaction    Categorical
23  Stock Option Level           Categorical
24  Total Working Years          Numeric
25  Training Times Last Year     Numeric
26  Work Life Balance            Categorical
27  Years At Company             Numeric
28  Years In Current Role        Numeric
29  Years Since Last Promotion   Numeric
30  Years With Current Manager   Numeric
Data preprocessing is one of the key steps in our approach, since
clean data can give very good results even with simple algorithms.
We may need additional attributes that are not directly observed
in the employee data but can be derived from it. Moreover, we may
encounter missing data, which can be handled with several
imputation techniques.
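As an illustration of the cleaning step, the sketch below drops identifier columns and columns with a single value, then fills missing entries. The median/mode imputation rule, the field names, and the records are assumptions made for the example; the paper does not say which imputation technique, if any, was needed.

```python
from statistics import median, mode

def clean(records, id_fields=("employee_id",)):
    """Drop identifier and constant columns, then impute missing values (None)."""
    fields = [f for f in records[0] if f not in id_fields]
    # A column with one distinct non-missing value cannot separate the classes.
    fields = [f for f in fields
              if len({r[f] for r in records if r[f] is not None}) > 1]
    cleaned = [{f: r[f] for f in fields} for r in records]
    for f in fields:
        present = [r[f] for r in cleaned if r[f] is not None]
        numeric = all(isinstance(v, (int, float)) for v in present)
        # Assumed rule: median for numeric columns, most frequent value otherwise.
        fill = median(present) if numeric else mode(present)
        for r in cleaned:
            if r[f] is None:
                r[f] = fill
    return cleaned

# Hypothetical records; None marks a missing value.
records = [
    {"employee_id": 1, "over_18": "Y", "age": 41, "department": "Sales"},
    {"employee_id": 2, "over_18": "Y", "age": None, "department": "Research"},
    {"employee_id": 3, "over_18": "Y", "age": 29, "department": "Sales"},
    {"employee_id": 4, "over_18": "Y", "age": 35, "department": None},
]
cleaned = clean(records)
for row in cleaned:
    print(row)
```

The input records are left unchanged, so the raw data stays available for later inspection.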
B. Comparison of Classification Methods
In this section, we compare the classification methods to
understand which one is best suited to distinguishing churners
from non-churners. We want to find out how accurate each
classification algorithm is by measuring accuracy, precision,
recall, and F-measure on the test set.
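For reference, the four measures can be computed from the confusion counts of a binary classifier as in the following sketch, with churn as the positive class. The counts in the example are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-measure from confusion counts.

    tp: churners predicted as churners      fp: non-churners predicted as churners
    fn: churners predicted as non-churners  tn: non-churners predicted as non-churners
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure

# Hypothetical counts for a test set with 77 churners and 593 non-churners.
acc, prec, rec, f1 = classification_metrics(tp=30, fp=10, fn=47, tn=583)
print(f"accuracy={acc:.3f} precision={prec:.2f} recall={rec:.2f} F={f1:.2f}")
```

On an imbalanced test set such as the one in Table II, accuracy alone can mislead: labeling every employee as a non-churner would already give 593/670, about 0.885, with zero recall, which is why precision, recall, and F-measure are reported alongside it.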
Evaluating an algorithm on the same data on which it was trained
gives an overly optimistic estimate of its performance and hides
overfitting. To obtain a fair estimate, we split the dataset into
train and test parts. We train our churn prediction model on the
train dataset, which consists of 54% of the records, and validate
the model on the test dataset, which consists of the remaining
46%. Table II gives detailed information about the distribution of
the datasets after splitting them into two parts.
Table II. Distribution of the datasets after splitting
Dataset  Churn  Non-Churn  Total  Churn Rate
train    160    640        800    0.20
test     77     593        670    0.11
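The shares and churn rates in Table II follow directly from the counts, as this short check shows. Only the counts come from the table; the function and variable names are illustrative.

```python
# Counts taken from Table II.
split = {
    "train": {"churn": 160, "non_churn": 640},
    "test": {"churn": 77, "non_churn": 593},
}

def summarize(split):
    """Size, share of all records, and churn rate for each part of the split."""
    total_records = sum(p["churn"] + p["non_churn"] for p in split.values())
    summary = {}
    for name, part in split.items():
        size = part["churn"] + part["non_churn"]
        summary[name] = {
            "size": size,
            "share": size / total_records,
            "churn_rate": part["churn"] / size,
        }
    return summary

summary = summarize(split)
for name, row in summary.items():
    print(f"{name}: {row['size']} records, {row['share']:.1%} of the data, "
          f"churn rate {row['churn_rate']:.3f}")
```

The two parts have different churn rates (0.20 against about 0.11), so the split does not appear to be stratified by class. A stratified split would give both parts a rate close to 237/1470, about 0.16.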
We start with simple binary classification methods for employee
churn prediction: Decision Tree, Naive Bayes, and nearest
neighbors. Then we try more complex methods such as Support Vector
Machines (SVM), Logistic Regression, and Random Forests. We train
our churn prediction models on the available labeled data and
compare the results of the different methods on the test data.
The output of a basic binary predictor is in the form of churn or
no churn.
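To make the idea of a basic binary predictor concrete, here is a minimal sketch of one of the simple methods, k-nearest neighbors. It assumes Euclidean distance, k = 3, and features already scaled to comparable ranges; the two features and all values are hypothetical, and the sketch is not the configuration used in the study.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical scaled features: (overtime flag, years since last promotion / 10).
train_X = [(1, 0.8), (1, 0.6), (1, 0.9), (0, 0.1), (0, 0.2), (0, 0.0)]
train_y = ["churn", "churn", "churn", "no churn", "no churn", "no churn"]

prediction_a = knn_predict(train_X, train_y, (1, 0.7))
prediction_b = knn_predict(train_X, train_y, (0, 0.3))
print(prediction_a, prediction_b)
```

The output is a hard label for each employee, which is exactly the churn or no churn form described above.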
Figure 1 demonstrates the comparison of classification
methods in terms of accuracy metric on the test dataset.
Figure 1. Comparison of Classification Methods According to Accuracy
Figure 2 shows the comparison of classification methods
in accordance with precision metric on the test dataset.
Figure 2. Comparison of Classification Methods According to Precision
The comparison of classification methods according to
recall metric is given in Figure 3.
Figure 3. Comparison of Classification Methods According to Recall
These figures indicate that SVM gives better results than the
other classification methods in terms of accuracy, precision, and
F-measure, which is the harmonic mean of precision and recall.
C. Feature Selection
Feature selection techniques build many prediction models on
different feature subsets of the train dataset and determine which
features are and are not relevant for building a reliable and
accurate model. We apply a feature selection method and select the
features that are most useful for predicting employee churn. In
this study, we use a popular feature selection method called
Recursive Feature Elimination (RFE) [12]. After applying RFE and
removing redundant features, the new feature set consists of the
following 15 features: Education, Education Field, Environment
Satisfaction, Gender, Job Involvement, Job Level, Job Role, Job
Satisfaction, Marital Status, Over-Time, Performance Rating,
Relationship Satisfaction, Stock Option Level, Years Since Last
Promotion, Years With Current Manager.
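The elimination loop of RFE can be sketched as follows: fit a model, drop the feature with the smallest absolute weight, and repeat. This is a sketch under stated assumptions rather than the procedure used in the study. The base estimator is a small gradient-descent logistic regression chosen to keep the example self-contained (reference [12] describes an SVM-based variant), and the data, feature names, and hyperparameters are synthetic.

```python
import math
import random

def fit_logistic(X, y, epochs=200, lr=0.5):
    """Logistic regression by batch gradient descent; returns the feature weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            # Clamp z so that exp() cannot overflow.
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            for j in range(d):
                grad_w[j] += (p - yi) * xi[j]
            grad_b += p - yi
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w

def rfe(X, y, names, n_keep):
    """Refit, drop the feature with the smallest |weight|, repeat until n_keep remain."""
    active = list(range(len(names)))
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        weights = fit_logistic(X_active, y)
        weakest = min(range(len(active)), key=lambda k: abs(weights[k]))
        del active[weakest]
    return [names[j] for j in active]

# Synthetic data: the label depends only on the first and third feature.
random.seed(0)
names = ["over_time", "hourly_rate", "job_satisfaction", "daily_rate"]
X = [[random.uniform(-1, 1) for _ in names] for _ in range(200)]
y = [1 if row[0] - row[2] > 0 else 0 for row in X]

selected = rfe(X, y, names, n_keep=2)
print(selected)
```

Ranking by weight magnitude is only meaningful when the features are on comparable scales, so categorical features would need to be encoded and numeric ones scaled before a loop like this is applied to real HR data.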
The classification methods were run on the test dataset both
before and after applying RFE. The results are given in Table III
and Table IV, respectively. The new feature set increases accuracy
and precision for most classification methods. The results also
show, as before, that SVM is the best method in terms of accuracy,
precision, and F-measure.
Table III. Results of the classification methods before applying RFE
Method               Accuracy  Precision  Recall  F-measure
Decision Tree        0.765     0.31       0.36    0.33
Naive Bayes          0.791     0.40       0.59    0.48
Logistic Regression  0.871     0.74       0.32    0.44
SVM                  0.857     0.75       0.17    0.28
KNN                  0.844     0.58       0.11    0.18
Random Forest        0.850     0.64       0.14    0.24
Table IV. Results of the classification methods after applying RFE
Method               Accuracy  Precision  Recall  F-measure
Decision Tree        0.788     0.35       0.37    0.36
Naive Bayes          0.856     0.58       0.38    0.46
Logistic Regression  0.871     0.74       0.31    0.44
SVM                  0.897     0.98       0.37    0.53
KNN                  0.841     0.53       0.12    0.19
Random Forest        0.854     0.65       0.21    0.31
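The effect of feature selection on accuracy can be read off the two result tables with a short calculation. The sketch assumes that the first table holds the results before RFE and the second the results after it, following the order in which the text introduces them; the values are copied from the tables.

```python
# (accuracy, precision, recall, F-measure) per method, copied from the tables.
# Assumption: the first table is "before RFE" and the second is "after RFE".
before_rfe = {
    "Decision Tree": (0.765, 0.31, 0.36, 0.33),
    "Naive Bayes": (0.791, 0.40, 0.59, 0.48),
    "Logistic Regression": (0.871, 0.74, 0.32, 0.44),
    "SVM": (0.857, 0.75, 0.17, 0.28),
    "KNN": (0.844, 0.58, 0.11, 0.18),
    "Random Forest": (0.850, 0.64, 0.14, 0.24),
}
after_rfe = {
    "Decision Tree": (0.788, 0.35, 0.37, 0.36),
    "Naive Bayes": (0.856, 0.58, 0.38, 0.46),
    "Logistic Regression": (0.871, 0.74, 0.31, 0.44),
    "SVM": (0.897, 0.98, 0.37, 0.53),
    "KNN": (0.841, 0.53, 0.12, 0.19),
    "Random Forest": (0.854, 0.65, 0.21, 0.31),
}

def accuracy_change(before, after):
    """Change in accuracy per method, rounded to the precision of the tables."""
    return {m: round(after[m][0] - before[m][0], 3) for m in before}

changes = accuracy_change(before_rfe, after_rfe)
improved = sum(1 for delta in changes.values() if delta > 0)
best_after = max(after_rfe, key=lambda m: after_rfe[m][0])

for method, delta in changes.items():
    print(f"{method}: {delta:+.3f}")
print(f"{improved} of {len(changes)} methods improve; best after RFE: {best_after}")
```

Under that assumption, accuracy rises for four of the six methods, is unchanged for Logistic Regression, and drops slightly for KNN.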
Employee churn results in financial, time, and effort losses for
organizations. It is a big issue, since a trained and experienced
employee is hard and costly to replace. We seek to analyze past
and current employee data to predict future churners and learn the
causes of employee turnover. The results of this study demonstrate
that data mining algorithms can be used to build reliable and
accurate predictive models for employee churn. The problem of
churn prediction is not just to distinguish churners from
non-churners. By using exploratory data analysis and data mining
techniques, we can predict the churn probability of each employee
and assign each a score on which to base retention strategies. As
a future direction, we plan to build a comprehensive and universal
model that organizations can use for the betterment of their
employees, cost effectiveness, and future prospects.
[1] W. Verbeke, K. Dejaeger, D. Martens, J. Hur, and B. Baesens, “New
insights into churn prediction in the telecommunication sector: A
profit driven data mining approach,” European Journal of Operational
Research, vol. 218, no. 1, pp. 211–229, 2012.
[2] K. Coussement and D. Van den Poel, “Integrating the voice of customers
through call center emails into a decision support system for churn
prediction,” Information & Management, vol. 45, no. 3, pp. 164–174,
[3] C.-P. Wei and I.-T. Chiu, “Turning telecommunications call details
to churn prediction: a data mining approach,” Expert Systems with
Applications, vol. 23, no. 2, pp. 103–112, 2002.
[4] K. Coussement and D. Van den Poel, “Churn prediction in subscription
services: An application of support vector machines while comparing
two parameter-selection techniques,” Expert Systems with Applications,
vol. 34, no. 1, pp. 313–327, 2008.
[5] J. Burez and D. Van den Poel, “Handling class imbalance in customer
churn prediction,” Expert Systems with Applications, vol. 36, no. 3,
pp. 4626–4636, 2009.
[6] C.-F. Tsai and M.-Y. Chen, “Variable selection by association rules for
customer churn prediction of multimedia on demand,” Expert Systems
with Applications, vol. 37, no. 3, pp. 2006–2015, 2010.
[7] K. Coussement, D. F. Benoit, and D. Van den Poel, “Improved marketing
decision making in a customer churn prediction context using
generalized additive models,” Expert Systems with Applications, vol. 37,
no. 3, pp. 2132–2143, 2010.
[8] B. Huang, M. T. Kechadi, and B. Buckley, “Customer churn prediction
in telecommunications,” Expert Systems with Applications, vol. 39,
no. 1, pp. 1414–1425, 2012.
[9] V. V. Saradhi and G. K. Palshikar, “Employee churn prediction,” Expert
Systems with Applications, vol. 38, no. 3, pp. 1999–2006, 2011.
[10] R. Khare, D. Kaloya, C. K. Choudhary, and G. Gupta, “Employee
attrition risk assessment using logistic regression analysis,”
[11] M. L. Kane-Sellers, Predictive models of employee voluntary turnover
in a North American professional sales force using data-mining
analysis. PhD thesis, Texas A&M University, 2007.
[12] X. Lin, F. Yang, L. Zhou, P. Yin, H. Kong, W. Xing, X. Lu, L. Jia,
Q. Wang, and G. Xu, “A support vector machine-recursive feature
elimination feature selection method based on artificial contrast
variables and mutual information,” Journal of Chromatography B,
vol. 910, pp. 149–155, 2012.
... Sisodia et al. (2017) As causas que podem influenciar o churning, estão relacionadas com: ambiente de trabalho, trabalho desempenhado, equidade de género, equidade salarial, razões pessoais, como realocação devido à família, maternidade, saúde, conflito com chefias ou colegas de equipa. Yigit & Shourabizadeh (2017) ...
Full-text available
Apesar da pertinência da operacionalização do conceito de churning de recursos humanos, este ainda é um tema pouco desenvolvido, com pouca literatura e estudos empíricos. É neste sentido que surge o interesse pelo estudo desta temática,permitindo contribuir para o desenvolvimento de um tema de grande complexidade, assim como contribuir quer para o aumento da literatura, quer para o aumento de estudosempíricos. Objetivos/ Métodos: Este artigo tem como objetivo analisar, através de uma abordagem qualitativa, quais as principais causas de churning de recursos humanos. O estudo segue uma abordagem qualitativa com recurso à análise da literatura internacional e à realização de 20 entrevistas semi-estruturadas como instrumentos de recolha de dados.Como forma de tratamento dos dados, recorreu-se à análise de conteúdo que permitiu selecionar as principais variáveis em estudo. Resultados: Através dos dados obtidos foi possível definir como principais causas de churning: o baixo salário; a falta de progressão na carreira; a falta de desenvolvimento individual; os horários rígidos; a fraca liderança; a concorrência; a localidade; o mau ambiente de trabalho; a fraca cultura organizacional; a falta de promoção; a falta de reconhecimento; a falta de disponibilidade; a dificuldade na conciliação trabalho-família e a falta de motivação. Conclusões: Como forma de minimizar a ocorrência de churning, propõe-se a implementação de medidas estratégicas por parte das organizações de forma a ir de encontro com as necessidades e as expectativas dos seus trabalhadores, com o intuito de que estes se sintam satisfeitos e motivados na organização e com o seu trabalho, afastando a decisão de sair da organização.
... Decision Tree, Linear Regression, Support Vector Machine, k-Nearest Neighbours, Random Forest, Naïve Bayesian Classification Yiğit et al. [42] No eXtreme Gradient Boosting ...
Full-text available
Employee categorisation differentiates valuable employees as eighty per cent of profit comes from twenty per cent of employees. Also, retention of all employees is quite challenging and incur a cost. Previous studies have focused on employee churn analysis using various machine learning algorithms but have missed the categorisation of an employee based on accomplishments. This paper provides an approach of categorising employees to quantify the importance of the employees using multi-criteria decision making (MCDM) techniques, i.e., criteria importance through inter-criteria correlation (CRITIC) to assign relative weights to employee accomplishments and fuzzy Measurement Alternatives and Ranking according to the Compromise Solution (MARCOS) method to divide employees into three categories. Followed by executing churn analysis of each category of employees and original dataset using machine learning algorithms to investigate the importance of employee categorisation. CatBoost, Support Vector Machine, Decision Tree, Random Forest and XGradient Boost algorithms have been used to analyse the categorised and non-categorised dataset on the accuracy, precision, recall and Mathew's Correlation Coefficient (MCC) to derive the best suitable algorithm for the used dataset. CatBoost algorithm showed the best results regarding performance measurements for categorised employees are better than all employee datasets.
... Bahsedilen 16 çalışmada kullanılan veri kümelerini üç farklı grupta incelenmiştir. Erişime açık veri kümesi kullanan çalışmalar; [6]- [10], erişime kapalı veri kümesi kullanan çalışmalar; [1], [2], [11]- [18], bunların dışında veri kümesinin kaynagını bahsetmeyen bir çalışma vardır. Açık veri kaynaklarıyla yapılmış çalışmalar kanıtlanması ve geliştirilmesi daha rahat olacagı için bu bildiri açık veri kümesi ile yapılıp, bildirinin yöntem kısmındaki tüm materyaller açık kaynak kod olarak sunulmuştur. ...
Conference Paper
Full-text available
Employees are one of the most critical elements of companies. Unexpected employee turnover causes a huge cost for companies. The new recruitment process not only consumes money and time, but it also takes time for newly hired employees to contribute effectively. In this study, we did an employee churn analysis that predicts whether the employees will leave their current company. Within the study's scope, we have trained standard and sequential models, then compared the models' successes. In the end, we have created an ensemble model from successful models. This study, which is carried out with sample Kaggle data, can be used as a preliminary for studies to be done with real employee data.
Employees are one of the most important resources of a company. The churn of valuable employees significantly affects a company’s performance. The design of systems that predict employee churn is critical importance for companies. At this point, machine learning algorithms offer important opportunities for the diagnosis of employee churn. Nowadays, traditional classification algorithms have been replaced by deep learning models. In this study, firstly, a Convolutional Neural Network (CNN) model was applied on a numerical dataset for employee churn prediction in retailing. Later, because the data loss is too much in data transformations, a new hybrid extended convolutional decision tree model (ECDT) was proposed by improving the CNN algorithm. Finally, a novel model (ECDT‐GRID) was developed by applying grid search optimization to improve the classification accuracy of ECDT. Numerical results showed that the developed ECDT‐GRID model outperformed the CNN and ECDT models and basic classification algorithms in terms of classification accuracy, and this model provided an efficient methodology for prediction of employee churn.
Full-text available
Background Psychosocial risks, also present in educational processes, are stress factors particularly critical in state-schools, affecting the efficacy, stress, and job satisfaction of the teachers. This study proposes an intelligent algorithm to improve the prediction of psychosocial risk, as a tool for the generation of health and risk prevention assistance programs. Methods The proposed approach, Physical Surface Tension-Neural Net (PST-NN), applied the theory of superficial tension in liquids to an artificial neural network (ANN), in order to model four risk levels (low, medium, high and very high psychosocial risk). The model was trained and tested using the results of tests for measurement of the psychosocial risk levels of 5,443 teachers. Psychosocial, and also physiological and musculoskeletal symptoms, factors were included as inputs of the model. The classification efficiency of the PST-NN approach was evaluated by using the sensitivity, specificity, accuracy and ROC curve metrics, and compared against other techniques as the Decision Tree model, Naïve Bayes, ANN, Support Vector Machines, Robust Linear Regression and the Logistic Regression Model. Results The modification of the ANN model, by the adaptation of a layer that includes concepts related to the theory of physical surface tension, improved the separation of the subjects according to the risk level group, as a function of the mass and perimeter outputs. Indeed, the PST-NN model showed better performance to classify psychosocial risk level on state-school teachers than the linear, probabilistic and logistic models included in this study, obtaining an average accuracy value of 97.31%. Conclusions The introduction of physical models, such as the physical surface tension, can improve the classification performance of ANN. Particularly, the PST-NN model can be used to predict and classify psychosocial risk levels among state-school teachers at work. 
This model could help to early identification of psychosocial risk and to the development of programs to prevent it.
Full-text available
The present empirical article aims to present a theoretical-methodological model of human resource churning, elaborated based on the main results of interviews, where the main dimensions were selected: causes of churning, human resource churning and strategic measures for human resource retention, contributing to the development of a theme still underdeveloped and unexplored in Portugal. This research had as a general objective, to analyze the relationship between constructs that compose human resource churning. That allowed the formulation of the research question: "What are the main causes of human resource churning and its mitigation measures?" In order to answer the general objective and the starting question, two specific objectives were defined: to analyze what the main causes of human resource churning are and what are the main measures adopted by organizations in order to minimize human resource churning. Through these, it was possible to verify that churning is related to the costs resulting from voluntary departures, indicating the need to implement organizational policies for the retention of human resources.
Full-text available
Customer churn prediction models aim to indicate the customers with the highest propensity to attrite, allowing to improve the efficiency of customer retention campaigns and to reduce the costs associated with churn. Although cost reduction is their prime objective, churn prediction models are typically evaluated using statistically based performance measures, resulting in suboptimal model selection. Therefore, in the first part of this paper, a novel, profit centric performance measure is developed, by calculating the maximum profit that can be generated by including the optimal fraction of customers with the highest predicted probabilities to attrite in a retention campaign. The novel measure selects the optimal model and fraction of customers to include, yielding a significant increase in profits compared to statistical measures.In the second part an extensive benchmarking experiment is conducted, evaluating various classification techniques applied on eleven real-life data sets from telecom operators worldwide by using both the profit centric and statistically based performance measures. The experimental results show that a small number of variables suffices to predict churn with high accuracy, and that oversampling generally does not improve the performance significantly. Finally, a large group of classifiers is found to yield comparable performance.
This paper presents a new set of features for land-line customer churn prediction, including 2 six-month Henley segmentation, precise 4-month call details, line information, bill and payment information, account information, demographic profiles, service orders, complain information, etc. Then the seven prediction techniques (Logistic Regressions, Linear Classifications, Naive Bayes, Decision Trees, Multilayer Perceptron Neural Networks, Support Vector Machines and the Evolutionary Data Mining Algorithm) are applied in customer churn as predictors, based on the new features. Finally, the comparative experiments were carried out to evaluate the new feature set and the seven modelling techniques for customer churn prediction. The experimental results show that the new features with the six modelling techniques are more effective than the existing ones for customer churn prediction in the telecommunication service field.
Filtering the discriminative metabolites from high dimension metabolome data is very important in metabolomics study. Support vector machine-recursive feature elimination (SVM-RFE) is an efficient feature selection technique and has shown promising applications in the analysis of the metabolome data. SVM-RFE measures the weights of the features according to the support vectors, noise and non-informative variables in the high dimension data may affect the hyper-plane of the SVM learning model. Hence we proposed a mutual information (MI)-SVM-RFE method which filters out noise and non-informative variables by means of artificial variables and MI, then conducts SVM-RFE to select the most discriminative features. A serum metabolomics data set from patients with chronic hepatitis B, cirrhosis and hepatocellular carcinoma analyzed by liquid chromatography-mass spectrometry (LC-MS) was used to demonstrate the validation of our method. An accuracy of 74.33±2.98% to distinguish among three liver diseases was obtained, better than 72.00±4.15% from the original SVM-RFE. Thirty-four ion features were defined to distinguish among the control and 3 liver diseases, 17 of them were identified.
CRM gains increasing importance due to intensive competition and saturated markets. With the purpose of retaining customers, academics as well as practitioners find it crucial to build a churn prediction model that is as accurate as possible. This study applies support vector machines in a newspaper subscription context in order to construct a churn model with a higher predictive performance. Moreover, a comparison is made between two parameter-selection techniques, needed to implement support vector machines. Both techniques are based on grid search and cross-validation. Afterwards, the predictive performance of both kinds of support vector machine models is benchmarked to logistic regression and random forests. Our study shows that support vector machines show good generalization performance when applied to noisy marketing data. Nevertheless, the parameter optimization procedure plays an important role in the predictive performance. We show that only when the optimal parameter-selection procedure is applied, support vector machines outperform traditional logistic regression, whereas random forests outperform both kinds of support vector machines. As a substantive contribution, an overview of the most important churn drivers is given. Unlike ample research, monetary value and frequency do not play an important role in explaining churn in this subscription-services application. Even though most important churn predictors belong to the category of variables describing the subscription, the influence of several client/company-interaction variables cannot be neglected.
We studied the problem of optimizing the performance of a DSS for churn prediction. In particular, we investigated the beneficial effect of adding the voice of customers through call center emails – i.e. textual information – to a churn-prediction system that only uses traditional marketing information. We found that adding unstructured, textual information into a conventional churn-prediction model resulted in a significant increase in predictive performance. From a managerial point of view, this integrated framework helps marketing-decision makers to better identify customers most prone to switch. Consequently, their customer retention campaigns can be targeted more effectively because the prediction method is better at detecting those customers who are likely to leave.
As deregulation, new technologies, and new competitors open up the mobile telecommunications industry, churn prediction and management have become of great concern to mobile service providers. A mobile service provider wishing to retain its subscribers needs to be able to predict which of them may be at risk of changing services and to make those subscribers the focus of customer retention efforts. In response to the limitations of existing churn-prediction systems and the unavailability of customer demographics in the mobile telecommunications provider investigated, we propose, design, and experimentally evaluate a churn-prediction technique that predicts churning from subscriber contractual information and call pattern changes extracted from call details. This proposed technique is capable of identifying potential churners at the contract level for a specific prediction time-period. In addition, the proposed technique incorporates the multi-classifier class-combiner approach to address the challenge of a highly skewed class distribution between churners and non-churners. The empirical evaluation results suggest that the proposed call-behavior-based churn-prediction technique exhibits satisfactory predictive effectiveness when more recent call details are employed for the churn prediction model construction. Furthermore, the proposed technique is able to demonstrate satisfactory or reasonable predictive power within the one-month interval between model construction and churn prediction. Using a previous demographics-based churn-prediction system as a reference, the lift factors attained by our proposed technique appear largely satisfactory.
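The lift factor mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common definition of lift (churn rate among the top-scored fraction of subscribers divided by the overall churn rate); the abstract does not state which definition or cut-off the authors used, and the function name `lift_at` and its parameters are hypothetical.

```python
def lift_at(labels, scores, fraction=0.1, positive=1):
    """Lift of a ranking: churn rate in the top `fraction` of scores
    divided by the churn rate in the whole population.

    Assumed definition; `fraction` is an illustrative cut-off, not one
    taken from the cited study.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n = len(labels)
    if n == 0 or n != len(scores):
        raise ValueError("labels and scores must be non-empty and equal in length")

    base_rate = sum(1 for y in labels if y == positive) / n
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no churners")

    # Rank subscribers from highest to lowest predicted churn score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(round(n * fraction)))
    top_rate = sum(1 for _, y in ranked[:k] if y == positive) / k
    return top_rate / base_rate


# Toy data: 10 subscribers, 2 churners, both ranked at the top.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.95, 0.10, 0.20, 0.90, 0.30, 0.15, 0.25, 0.05, 0.40, 0.35]
print(lift_at(toy_labels, toy_scores, fraction=0.2))
```

A lift above 1 means the model concentrates churners in the targeted group more than random selection would; a lift of 1 means no improvement over random targeting.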
Customer churn is often a rare event in service industries, but of great interest and great value. Until recently, however, class imbalance has not received much attention in the context of data mining [Weiss, G. M. (2004). Mining with rarity: A unifying framework. SIGKDD Explorations, 6(1), 7–19]. In this study, we investigate how we can better handle class imbalance in churn prediction. Using more appropriate evaluation metrics (AUC, lift), we investigated the increase in performance of sampling (both random and advanced under-sampling) and two specific modelling techniques (gradient boosting and weighted random forests) compared to some standard modelling techniques.
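Two of the ideas named in this abstract, random under-sampling and AUC, can be sketched in a few lines. This is an illustrative sketch only: the 1:1 balancing ratio, the function names, and the pairwise AUC formulation are assumptions made here, not details given by the cited study, and the advanced under-sampling, gradient boosting, and weighted random forest techniques are not shown.

```python
import random


def random_undersample(rows, labels, majority=0, seed=0):
    """Keep every minority example and an equally sized random subset
    of the majority class (1:1 ratio is an assumption of this sketch)."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    keep = rng.sample(majority_idx, min(len(minority_idx), len(majority_idx)))
    chosen = sorted(minority_idx + keep)
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


def auc(labels, scores, positive=1):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen positive is scored above a randomly chosen negative
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 8 non-churners (label 0) and 2 churners (label 1).
toy_rows = list(range(10))
toy_labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_undersample(toy_rows, toy_labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Unlike accuracy, AUC does not depend on a classification threshold or on the class ratio, which is why it is a more suitable measure when churners are rare.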
Multimedia on demand (MOD) is an interactive system that provides a number of value-added services in addition to traditional TV services, such as video on demand and interactive online learning. This opens a new marketing and managerial problem for the telecommunication industry: retaining valuable MOD customers. Data mining techniques, such as neural networks and decision trees, have been widely applied to develop customer churn prediction models in the domain of mobile telecommunication. However, much related work focuses on developing the prediction models per se. Few studies consider the pre-processing step during data mining, whose aim is to filter out unrepresentative data or information. This paper presents the important processes of developing MOD customer churn prediction models by data mining techniques. These comprise the pre-processing stage for selecting important variables by association rules, which have not been applied before; the model construction stage by neural networks (NN) and decision trees (DT), which are widely adopted in the literature; and four evaluation measures, namely prediction accuracy, precision, recall, and F-measure, which have not all been considered when examining model performance. The source data are based on one telecommunication company providing the MOD services in Taiwan, and the experimental results show that using association rules allows the DT and NN models to provide better prediction performance over a chosen validation dataset. In particular, the DT model performs better than the NN model. Moreover, some useful and important rules in the DT model, which show the factors affecting a high proportion of customer churn, are also discussed for marketing and managerial purposes.
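The four evaluation measures named in this abstract (accuracy, precision, recall, and F-measure) follow directly from the confusion matrix of a binary classifier. The sketch below uses the standard definitions, with the F-measure taken as the harmonic mean of precision and recall (F1); the function names and the convention of returning 0.0 for an undefined ratio are assumptions of this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the churn class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluation_measures(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary prediction.

    Ratios with a zero denominator are reported as 0.0 (a convention
    chosen for this sketch).
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Toy example: 3 actual churners, 5 non-churners.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(evaluation_measures(toy_true, toy_pred))
```

Reporting precision and recall alongside accuracy matters for churn data, since a model that predicts "no churn" for everyone can reach high accuracy while finding no churners at all.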
Customer churn is a notorious problem for most industries, as loss of a customer affects revenues and brand image and acquiring new customers is difficult. Reliable predictive models for customer churn could be useful in devising customer retention plans. We survey and compare some major machine learning techniques that have been used to build predictive customer churn models. Employee churn (or attrition), closely related but not identical to customer churn, is similarly painful for an organization, leading to disruptions, customer dissatisfaction, and time and effort lost in finding and training replacements. We present a case study that we carried out for building and comparing predictive employee churn models. We also propose a simple value model for employees that can be used to identify how many of the churned employees were “valuable”. This work has the potential for designing better employee retention plans and improving employee satisfaction.
Nowadays, companies are investing in a well-considered CRM strategy. One of the cornerstones in CRM is customer churn prediction, where one tries to predict whether or not a customer will leave the company. This study focuses on how to better support marketing decision makers in identifying risky customers by using Generalized Additive Models (GAM). Compared to Logistic Regression, GAM relaxes the linearity constraint which allows for complex non-linear fits to the data. The contributions to the literature are three-fold: (i) it is shown that GAM is able to improve marketing decision making by better identifying risky customers; (ii) it is shown that GAM increases the interpretability of the churn model by visualizing the non-linear relationships with customer churn identifying a quasi-exponential, a U, an inverted U or a complex trend and (iii) marketing managers are able to significantly increase business value by applying GAM in this churn prediction context.