Utilizing Customers’ Purchase and Contract Renewal
Details to Predict Defection in the Cloud Software
Niken Prasasti1,2, Katsutoshi Kanamori1, and Hayato Ohwada1
1Department of Industrial Administration, Tokyo University of Science, Japan
2Graduate School of Business and Management, Bandung Institute of Technology, Indonesia
Abstract. This paper addresses the task of predicting customer defection in the growing
cloud software market. Starting from the company’s original unstructured data, we propose
a procedure that first identifies the real defection condition, that is, whether a customer
is defecting from the company or merely stops using the current product in order to upgrade
or downgrade it, and second produces a new feature that measures customer loyalty, obtained
by compiling each customer’s purchasing and renewing activity. Using the result, we
investigate the important variables for classifying defecting customers with random forest
and build the prediction model with a decision tree. The final results indicate that
defecting customers are mainly characterized by their loyalty and their total payment.
Keywords: customer defection, cloud software industry, machine learning, decision tree,
random forest
1 Introduction
Recently, with the growth in the number of internet users, there has been a clear trend
toward the use of cloud software. The cloud software market is said to be growing at a 36%
compound annual growth rate through 2016, as reported in [1]. This growth is supported by
the convenience of cloud software, which can be used anywhere as long as the user’s device
is connected to the internet. Widely used examples of cloud software include web-based file
hosting, social networking, office applications, and security software.
In particular, predicting customer defection is important for a fast-growing business with
a contractual model in order to improve marketing decision making. Defection refers to a
customer’s decision to stop using the service or product provided by the company. Defection
prediction has been a concern in both research and industry, as it is one of the important
measures for retaining customers [2]. To make predictions, most companies collect useful
customer data and use it to create a predictive model of defection with predictive analytics
methods such as data mining and machine learning.
We are concerned with the defection management problem in the cloud software industry. In
this paper, our case is a security software company. Although we are able to obtain customer
data from the company’s e-commerce site, predicting customer defection is not a simple task.
First, the data features are limited and contain only several customer attributes, unlike
previous works on defection prediction that use typical customer demographics, call logs,
and usage details. Second, it is barely possible to gain more customer information by
approaching each customer directly, since the company has a very large number of customers.
Third, the available data simply contains records of customers’ activity in opting in
(continuing) and opting out (defecting) from one product, while in reality some customers
opt out in order to upgrade or downgrade their product. Fourth, with the market growing
rapidly, managing customer defection is an important issue.
This paper aims to tackle the problems mentioned above in managing customer defection in the
case company. Our approach is as follows. First, we provide an algorithm whose purpose is to
detect which customers are truly defecting from the company and which are not. Second, we
produce a new feature from the available data that can be used as a measurement of customer
loyalty. Third, using random forest, we analyze the most important variables that contribute
to classifying defecting customers. In addition, we model customer defection using a decision
tree in order to obtain a visual, interpretable result that is useful for the company as the
end user.
The remainder of this paper is organized as follows. Section 2 reviews former works related
to defection prediction. Section 3 defines the data used in the study. Section 4 presents the
data preparation procedures, and Section 5 presents the machine learning procedures for
analyzing the important variables and predicting customer defection. The results of the
experiments are provided in Section 6. Finally, the lessons from predicting customer
defection are provided in the conclusion in Section 7.
2 Related Works
Over recent years, predicting customer defection has received increasing attention from
researchers. Studies focus on the search for the methods and features that are most effective
in predicting defection. The most common methods used for defection prediction are decision
tree, regression, Naïve Bayes, and neural network. Most of the former works focus on customer
defection in the telecommunication industry.
Predicting customer defection involves the search for and identification of defection
indicators. Assuming that changes in call patterns may appear as defection warning signals,
[3] used call details to extract features that describe the changes in customers’ calling
patterns. The features are then used as the input to a decision tree to build a classifier.
Using the same method, decision tree, [4] discovered that the most significant differentiators
between defecting and retained customers are age, tenure, gender, billing amount, number of
payments, call duration, and number of information changes. The findings were obtained using
several groups of features: customer demographics, billing information, service status, and
service change log. Other useful features were explored by [5], using data containing customer
complaints and service interactions with the operator to predict defection. They also compared
the predictive performance of neural network, decision tree, and regression.
Previously, in [6] we reviewed the applicability of machine learning techniques to predicting
customer defection using several commonly used techniques such as decision tree, random
forest, neural network, and support vector machine. As a complement, in [7] the result of
predicting customer defection is applied to the calculation of customer lifetime value, given
the strong relation between customer defection/retention and the prediction of customer
lifetime value.
As mentioned above, most of the former works rely on customer demographics, customer service
logs, usage details, complaint data, billing and payment data, and so on. A relatively
under-investigated source of input for predicting customer defection is the original purchase
and renewal data of customers in a contract-based company. The reason is that such data is
often unstructured and hard to analyze.
Although our data are limited in the number of features and their structure is complicated,
in this paper we provide a new series of practices for managing the customer defection
problem. We propose an algorithm to identify real customer defection in order to make the
later prediction more reliable, and to produce one new feature from the available data that
can serve as a measurement of customer loyalty. Moreover, we use the results to analyze which
variables are important in classifying defecting customers and to build the customer defection
prediction model.
3 Data Set
The basic problem of predicting customer defection is to find a good model that can predict
customer defection in the case company. A quality model can be constructed only if quality
data is available. In this paper, we make use of two types of data, which are later compared
to determine the better predictor of customer defection:
The purchase and auto-renewal data: contains six-year records (from 2007 to 2013) of customer
activity in purchasing and renewing products. It includes the contract ID of the customer, the
latest renewal status flag, the latest date of the auto-renewal contract, the total number of
purchases and renewals, the product base that the customer purchased, the total payment made
by the customer, the validity period of the product, the status of using the optional service
or not, the type of customer (personal or company), and the status of e-mail delivery.
The web log data: six-month log files (from January to June 2013) containing the total
payment made by the customer, the status of using the optional service or not, the type of
customer (personal or company), the type of operating system on the customer’s device, the
type of browser the customer uses, the number of website page views, the number of website
visits, the number of product views and cart views, and the number of orders the customer has
made.
The data was originally used to record the details of the “opting-in” and “opting-out”
activity of each customer after they receive the auto-renewal notification e-mail. However,
some customers opt out from one service of their product and instead opt in for another
service. Therefore, new features should be extracted from the original purchase and
auto-renewal data for predicting defection in the company.
4 Data Preparation
As one of the original contributions of this research, data preparation plays an important
role in differentiating this study from former studies. The main purpose of the data
preparation is to determine whether the original data is useful for building the customer
defection prediction model and, if it is, which features can be extracted from it for the
machine learning process.
Table 1. The original table on the e-commerce site
Table 1 illustrates the original content of the table, which contains historical records of
customer activity collected from the company’s e-commerce site, with the features previously
mentioned in Section 3 for each type of data.
It contains CONTRACT_ID, which is the ID number of the purchase or renewal that a customer
makes. When one customer performs several actions, whether purchasing a new product or
renewing the contract of their current product, the data is recorded by the e-commerce site
under the same CONTRACT_ID. Thus, if we use the original data from the site without preparing
it, the prediction model will not be reliable, since the site only records data per activity.
It does not provide a summary for each customer that shows whether at a given moment they are
truly defecting from the company or merely defecting from their current product.
To overcome the problem, we first detect the actual defection by deriving the CLASS feature
as the actual defection flag attribute of each customer. Second, we produce the UPDATE_COUNT
feature as a new measurement of customer loyalty, which defines the length of the period a
customer has stayed with the company by accumulating their purchasing and renewing frequency.
Eq. 1 and 2 generally describe how we detect the actual defection and calculate
UPDATE_COUNT.
In Eq. 2, UPDATE_COUNT is calculated by summing the renewal frequency of each purchase with
the total number of purchases excluding the first purchase. Since the data is going to be
used for predicting customer defection, we subtract 1 from the total length.
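The preparation step described above can be sketched in Python as follows. This is a minimal
illustration under stated assumptions, not the procedure used in the study: the record layout
(`contract_id`, `seq`, `renewals`, `opted_out`) and the function name are hypothetical, a
customer is flagged as truly defecting only when the latest purchase under the same
CONTRACT_ID has been opted out of, and the subtraction of 1 is read as the exclusion of the
first purchase.

```python
from collections import defaultdict


def prepare_features(records):
    """Summarize per-activity records into one row per CONTRACT_ID.

    Each record is a dict with hypothetical keys:
      contract_id : customer contract identifier
      seq         : order of the purchase under that contract (1 = first)
      renewals    : number of renewals made on that purchase
      opted_out   : True if the customer opted out of that purchase
    """
    by_contract = defaultdict(list)
    for rec in records:
        by_contract[rec["contract_id"]].append(rec)

    summary = {}
    for contract_id, purchases in by_contract.items():
        purchases.sort(key=lambda r: r["seq"])
        # Assumption: only an opt-out on the latest purchase is a real defection;
        # an opt-out followed by another purchase is an upgrade or downgrade.
        defecting = purchases[-1]["opted_out"]
        # Assumed reading of Eq. 2: renewals on every purchase plus the number
        # of purchases excluding the first one.
        update_count = sum(p["renewals"] for p in purchases) + (len(purchases) - 1)
        summary[contract_id] = {"UPDATE_COUNT": update_count, "CLASS": defecting}
    return summary


# Made-up records for two customers.
records = [
    {"contract_id": "A", "seq": 1, "renewals": 2, "opted_out": True},   # switched product
    {"contract_id": "A", "seq": 2, "renewals": 1, "opted_out": False},  # still active
    {"contract_id": "B", "seq": 1, "renewals": 0, "opted_out": True},   # left the company
]
features = prepare_features(records)
print(features)
```

In this toy input, customer A opted out of the first product but purchased another one, so
the sketch does not count A as defecting, whereas customer B has no later activity and is
flagged as defecting.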
These results can be used for prediction. The final features from the purchase and
auto-renewal data that will be used for prediction are as follows: UPDATE_COUNT, the total
payment made by the customer (CC_PRODUCT_PRICE), the status of using the optional service or
not (OPT_FLAG), the type of customer, personal or company (ORG_FLAG), the status of e-mail
delivery (MAIL_STATUS), and the actual defection flag (CLASS).
From the web log data, the final features that will be used are as follows: UPDATE_COUNT,
CC_PRODUCT_PRICE, OPT_FLAG, ORG_FLAG, MAIL_STATUS, the operating system used on the device
(OS), the type of browser used (BROWSER), the number of page views (PAGE_VIEW), product views
(PRODUCT_VIEW), cart views (CART_VIEW), website visit frequency (VISIT), the number of orders
the customer has made (ORDER), and CLASS.
5 Machine Learning Process and Evaluation Criteria
Machine learning procedures are executed in the form of classifiers using two algorithms:
C4.5 decision tree and random forest. The advantage of classification using a decision tree
is that it is easily interpreted and intuitively understandable. Moreover, it is able to make
predictions on very large data sets. The decision tree algorithm aims to select the best
feature to split a node based on a statistical measure. The widely used decision tree
algorithm ID3 uses information gain to select the attribute that will categorize the samples
into individual classes [8]. However, ID3 does not allow attributes with continuous values,
and its information gain measure is biased toward attributes with many values. As the
successor of ID3, the C4.5 algorithm overcomes these problems by creating thresholds to
handle continuous attributes and by normalizing the information gain to avoid the bias.
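The difference between the two split criteria can be illustrated with a short Python sketch.
It is a minimal illustration on made-up toy data, not the implementation used in the study;
the function and variable names are hypothetical, and the threshold search that C4.5 uses for
continuous attributes is omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting on a categorical attribute (ID3)."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalized by the split information (C4.5)."""
    split_info = entropy(values)  # entropy of the attribute's own value distribution
    if split_info == 0:
        return 0.0
    return information_gain(values, labels) / split_info


# Toy data (made up): a unique ID per customer versus a two-valued flag.
labels = ["defect", "defect", "stay", "stay"]
cust_id = ["c1", "c2", "c3", "c4"]
opt_flag = ["no", "no", "yes", "yes"]

print(information_gain(cust_id, labels), gain_ratio(cust_id, labels))
print(information_gain(opt_flag, labels), gain_ratio(opt_flag, labels))
```

Both attributes separate the toy classes perfectly and therefore have the same information
gain, but the gain ratio of the ID-like attribute is lower because it is divided by a larger
split information. This is the bias toward many-valued attributes that the normalization is
meant to reduce.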
Random forest is an ensemble of bagged, unpruned decision trees with a randomized selection
of features at each split; it outputs the class that is the majority of the classes output by
the individual trees [10]. The bagging process makes it possible for random forest to improve
prediction accuracy over a single decision tree. Moreover, random forest excels at
characterizing and exploiting structure in high-dimensional data for the purpose of
classification and prediction. However, the model produced by random forest can be difficult
to interpret. One key feature of the random forest learning algorithm that is used in this
paper is its novel variable importance measure.
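The two ingredients named above, bagging and majority voting, can be sketched in a few lines
of Python. This is a deliberately simplified illustration and not a full random forest: each
"tree" is a hypothetical one-split stump on a single randomly chosen feature, whereas a real
random forest grows unpruned trees and picks the best of a random subset of features at every
split. All names and the toy data are assumptions made for the example.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bagging step)."""
    return [rng.choice(rows) for _ in rows]


def train_stump(rows, n_features, rng):
    """Fit a one-split stand-in for a tree on one randomly chosen feature."""
    f = rng.randrange(n_features)                      # randomized feature selection
    threshold = sum(x[f] for x, _ in rows) / len(rows)  # split at the sample mean
    left = [y for x, y in rows if x[f] < threshold]
    right = [y for x, y in rows if x[f] >= threshold]
    overall = majority([y for _, y in rows])
    left_label = majority(left) if left else overall
    right_label = majority(right) if right else overall
    return lambda x: left_label if x[f] < threshold else right_label


def train_forest(rows, n_features, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(rows, rng), n_features, rng)
            for _ in range(n_trees)]


def forest_predict(trees, x):
    """Output the class predicted by the majority of the individual trees."""
    return majority([tree(x) for tree in trees])


# Made-up rows: (UPDATE_COUNT, CC_PRODUCT_PRICE) with a class label.
rows = [((1, 500), "defect"), ((2, 900), "defect"), ((1, 800), "defect"),
        ((5, 700), "stay"), ((8, 300), "stay"), ((6, 400), "stay")]
trees = train_forest(rows, n_features=2)
print(forest_predict(trees, (1, 600)))
```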
As a necessary step to make sure the model is well generated, the performance of the machine
learning techniques should be evaluated. In order to assess the classification performance,
the following criteria are calculated: accuracy, recall, precision, and F-measure. Each
criterion is calculated from the confusion matrix shown in Table 2 as follows.
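The overall accuracy is measured by the proportion of the total number of predictions that
were correct, calculated by
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Precision or positive predictive value is calculated by
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Recall or true positive rate is calculated by
\[ \text{Recall} = \frac{TP}{TP + FN} \]
F-measure or F-score is calculated by
\[ F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false
positives, and false negatives in the confusion matrix.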
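As a worked illustration, the four criteria can be computed directly from the cells of a
confusion matrix. The Python sketch below uses the standard definitions, with TP, FP, FN, and
TN standing for true positives, false positives, false negatives, and true negatives; the
function name and the counts are made up for the example and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # guard against empty denominators
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical counts: 80 defectors found, 10 missed, 20 false alarms, 90 correct rejections.
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(metrics)
```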
Table 2. Confusion matrix
Table 3. Number of examples of the initial data set (rows: purchase and renewal data; web log data)
Table 3 gives the total number of examples available in the initial data sets. All examples
are used in building the prediction model with the C4.5 decision tree algorithm, with 10-fold
cross-validation for the data splitting to ensure that every instance from the original data
set has the same chance of appearing in the training and testing sets. However, for analyzing
important variables with random forest, due to its limitation in handling large data, only a
subset of the purchase and auto-renewal data samples with the same distribution as the initial
data set is used.
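The two sampling schemes mentioned above can be sketched in Python as follows. This is an
illustrative sketch only: the function names, the seed, and the toy sizes are assumptions,
and the study itself does not specify how the folds or the subset were drawn beyond the
description in the text.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k nearly equal folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def stratified_subsample(labels, fraction, seed=0):
    """Pick the same fraction of indices from every class to keep the class distribution."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    chosen = []
    for idxs in by_class.values():
        chosen.extend(rng.sample(idxs, round(len(idxs) * fraction)))
    return sorted(chosen)


splits = list(k_fold_indices(25, k=10))
labels = ["stay"] * 80 + ["defect"] * 20               # made-up class distribution
subset = stratified_subsample(labels, fraction=0.1)
print(len(splits), len(subset))
```

Each instance appears in exactly one test fold and in the training set of every other fold,
which is the property the cross-validation step relies on.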
6 Experimental Results
6.1 Measuring variable importance using random forest
In prediction, it is critical to understand the importance of the variables that provide the
predictive accuracy. In this paper, using the variable importance algorithm in random forest,
we obtained the mean decrease in accuracy of each variable. The mean decrease in accuracy for
a variable is the normalized difference between the classification accuracy for the out-of-bag
data when the data for that variable is included as observed, and the classification accuracy
for the out-of-bag data when the values of the variable in the out-of-bag data have been
randomly permuted [11]. Higher values of mean decrease in accuracy indicate variables that are
more important to the classification. Table 4 gives the number of samples used by random
forest to obtain the importance of each variable for each customer segment.
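The mean decrease in accuracy can be approximated with a permutation test. The Python sketch
below is a simplified illustration, not the random forest implementation used in the study:
it assumes a fixed hypothetical classifier and a small made-up held-out set in place of the
out-of-bag data, and it omits the normalization step.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, column, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one column of the held-out data."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        # Rebuild every row with the permuted value in the chosen column.
        X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
        drops.append(baseline - accuracy(model, X_perm, y))
    return sum(drops) / n_repeats


# Hypothetical model: predicts defection when UPDATE_COUNT (column 0) is below 3.
model = lambda row: row[0] < 3
X = [[1, 500], [2, 900], [5, 700], [8, 300], [1, 800], [6, 400]]  # made-up rows
y = [True, True, False, False, True, False]

imp_update_count = permutation_importance(model, X, y, column=0)
imp_price = permutation_importance(model, X, y, column=1)
print(imp_update_count, imp_price)
```

Because the hypothetical model ignores the second column, permuting it leaves the accuracy
unchanged and its importance is zero, while permuting the column the model relies on lowers
the accuracy.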
Table 4. The number of samples used in obtaining the variable importance using random forest (rows: purchase and renewal data; web log data)
Fig. 1. Variable importance obtained using the purchase and auto renewal data
For each of the three customer segments, UPDATE_COUNT was identified as the most important
variable for classification using the purchase and auto-renewal data (Fig. 1). Even though we
cannot say that variables identified as “important” are right or wrong, the results for random
forest coincide closely with expectations based on an understanding of customer loyalty: the
more loyal the customer, as described by their period of staying, the lower their probability
of defecting.
Fig. 2 shows the variable importance obtained using the web log data. Similar to the previous
result, in the Low Price customer segment UPDATE_COUNT is the most important variable for
prediction. On the other hand, for the Middle Price and High Price customer segments, the
total payment that each customer has made (CC_PRODUCT_PRICE) appears to be the most important
variable. Overall, the two variables identified as most important are consistent across all
data sets: UPDATE_COUNT and CC_PRODUCT_PRICE.
Fig. 2. Variable importance for each customer segment obtained using the web log data
6.2 Prediction model using C4.5 decision tree
As mentioned previously, one advantage of using a decision tree classifier is the convenience
of interpreting the results. R supports the process of interpretation by providing tree
visualization and tree rules. For the company or other end users, the decision tree result
makes it easier to decide the next action for retaining customers based on the defection
prediction. We show an example of the visualization of customer defection prediction for the
Low Price customer segment using the purchase and auto-renewal data (Fig. 3), and examples of
the rules for defecting customers as follows:
Rule number: 7 [RIHAN_FLAG=true cover=137787 (35%) prob=0.98]
Rule number: 25 [RIHAN_FLAG=true cover=15697 (4%) prob=0.88]
Rule number: 13 [RIHAN_FLAG=true cover=70863 (18%) prob=0.84]
The rule at node 7 explains that about 35% of customers, those with UPDATE_COUNT less than
2.5 and a payment (CC_PRODUCT_PRICE) of less than 4,722 JPY, have a 98% probability of
defecting. We can see from both the visualization and the rules that the decision tree
obtained a model that uses UPDATE_COUNT and the total payment (CC_PRODUCT_PRICE) as the most
powerful predictors. The same holds for all customer segments when we use the purchase and
auto-renewal data. Using the web log data (Fig. 4), the status of e-mail delivery appears as
one of the three predictors that determine the predictive accuracy.
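Read as code, the rule at node 7 amounts to a simple threshold test. The Python sketch below
encodes only the conditions stated in the text for this rule; the function name and the idea
of returning the reported probability are illustrative assumptions, and the remaining rules
are not reproduced because their conditions are not listed here.

```python
def rule_7(update_count, cc_product_price):
    """Return the defection probability reported for rule 7, or None if it does not apply."""
    if update_count < 2.5 and cc_product_price < 4722:  # thresholds quoted in the text (JPY)
        return 0.98
    return None


# Hypothetical customers: (UPDATE_COUNT, CC_PRODUCT_PRICE in JPY).
customers = {"new_low_spender": (1, 3000), "long_term": (6, 3000)}
flagged = {name: rule_7(*attrs) for name, attrs in customers.items()}
print(flagged)
```

A rule in this form could be applied to the customer table to shortlist contracts for a
retention campaign.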
Fig. 3. The visualization of tree on Low Price customer segment based on purchase and auto-renewal data
Fig. 4. The visualization of tree on Low Price customer segment based on web log data
Table 5. Predictive accuracy of C4.5 decision tree on the purchase and auto-renewal data
Table 6. Predictive accuracy of C4.5 decision tree on the web-log data
The performance evaluation of the model for predicting defection using the C4.5 decision tree
algorithm is shown in Tables 5 and 6. The minimum number of objects is set to 40 and the
complexity of the tree is set to 0.005. The results indicate that using the purchase and
auto-renewal data we can obtain a better prediction model of customer defection, and it can be
concluded that the new features we acquired in the data preparation are useful in predicting
customer defection in the case company.
To summarize and clarify how our methods apply to the case company’s problems, we consider
several questions and the answers that follow from the results:
1. Which customers are defecting?
If “defecting” were defined directly from the original data sets, the answer to
this question would simply be the number of ‘opt-out’ records that appear in a
database query, which is not factual. In contrast, our method can clearly
determine that customers whose renewal record is zero are actually defecting
from the company.
2. What variables characterize the defecting customers?
Defecting customers are mostly characterized by the loyalty attribute, that is, the
length of the period they have stayed with the company, measured by the number
of their purchase and auto-renewal activities. In addition, the total payment
the customer has made also indicates their likelihood to defect.
3. Can we decide what strategy can be used to prevent customers from defecting?
Based on the previous question and answer, we can decide what strategy to consider
for each customer segment to prevent customers from defecting. For instance, since
customer loyalty is the main characterizing variable, the company can allocate more
marketing campaigns to customers with low loyalty.
One key activity of customer defection management is the process of predicting customer
defection. This paper presents several procedures that contribute to tackling some novel
problems in the defection prediction task. The following steps have been accomplished: we
provided an algorithm that helps decide which customers are truly defecting from the company;
produced one new feature from the available data that can be considered a measurement of
customer loyalty; identified, using machine learning techniques, which variables are important
for classifying defecting customers; and finally built a prediction model of customer
defection using both the purchase and auto-renewal data and the web log data.
This paper does not capture the dynamics of customer activity and characteristics. Thus,
future work will seek to integrate machine learning techniques with a more dynamic approach,
such as agent-based modeling and simulation. Agent-based modeling can provide a computational
model for simulating interactions between customers from the micro level to the macro level.
In addition, machine learning can provide the predictive accuracy of customer behavior that
will be useful for validating the agent-based model.
References
1. L. Columbus, "Predicting Enterprise Cloud Computing Growth," Forbes, 4 September 2013.
[Online]. Available: http://www.forbes.com/sites/louiscolumbus/2013/09/04/predicting-enterprise-cloud-computing-growth/.
[Accessed 20 July 2014].
2. B. Huang, M.-T. Kechadi and B. Buckley, "Customer Churn Prediction for Broadband
Internet Services," in LNCS 5691, 2009.
3. C. Wei and I. Chiu, "Turning telecommunications call detail to churn prediction: A data
mining approach," Expert Systems with Applications, vol. 23, pp. 103-112, 2002.
4. S.-Y. Hung, D. Yen and H. Wang, "Applying data mining to telecom churn management,"
Expert Systems with Applications, vol. 31, pp. 515-524, 2006.
5. J. Hadden, A. Tiwari, R. Roy and D. Ruta, "Churn Prediction: Does Technology Matter,"
International Journal of Intelligent Systems and Technologies, vol. 1, 2006.
6. N. Prasasti and H. Ohwada, "Applicability of Machine-Learning Techniques in Predicting
Customer Defection," in International Symposium on Technology Management and Emerging
Technologies (ISTMET 2014), 2014.
7. N. Prasasti, M. Okada, K. Kanamori and H. Ohwada, "Customer Lifetime Value and Defection
Possibility Prediction Model using Machine Learning: An Application to a cloud-based
Software Company," in Lecture Notes in Computer Science 8398, Springer Publisher, 2013.
8. J. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, pp. 81-106, 1986.
9. Y. Xiong, D. Syzmanski and D. Kihara, "Characterization and Prediction of Human Protein-
Protein Interaction," in Biological Data Mining and Its Applications in Healthcare, 2014, pp.
10. L. Breiman, "Random Forests," Machine Learning, vol. 45, pp. 5-32, 2001.
11. D. R. Cutler, T. C. Edwards, K. H. Beard, A. Cutler, K. T. Hess, J. Gibson and J. J. Lawler,
"Random Forests for Classification in Ecology," Ecology, vol. 88(11), pp. 2783-2792, 2007.