Article
Encrypting and Preserving Sensitive Attributes in
Customer Churn Data Using Novel Dragonfly Based
Pseudonymizer Approach
Kalyan Nagaraj *, Sharvani GS and Amulyashree Sridhar
Department of Computer Science and Engineering, RV College of Engineering, Bangalore 560059, India
*Correspondence: kalyan1991n@gmail.com
Received: 27 July 2019; Accepted: 28 August 2019; Published: 31 August 2019


Abstract:
With miscellaneous information accessible in public repositories, consumer data forms the knowledge base for anticipating client preferences. For instance, subscriber details are inspected in the telecommunication sector to ascertain growth, customer engagement and imminent opportunities for advancement of services. Among such parameters, the churn rate is substantial for scrutinizing migrating consumers. However, predicting churn carries a prevalent risk of invading sensitive subscriber information. Hence, it is worth safeguarding sensitive details prior to customer-churn assessment. A dual approach is adopted, based on the dragonfly and pseudonymizer algorithms, to secure the lucidity of customer data. This twofold approach ensures sensitive attributes are protected prior to churn analysis. The exactitude of this method is investigated by comparing the performance of conventional privacy preserving models against the current model. Furthermore, churn detection is substantiated prior to and post data preservation for detecting information loss. It was found that the privacy based feature selection method secured sensitive attributes effectively as compared to traditional approaches. Moreover, information loss estimated prior to and post security concealment identified the random forest classifier as the superlative churn detection model, with an enhanced accuracy of 94.3% and minimal data forfeiture of 0.32%. Likewise, this approach can be adopted in several domains to shield vulnerable information prior to data modeling.
Keywords: customer data; churn analysis; privacy; ensemble approach; data mining
1. Introduction
Advancement in technology has gathered immense data from several sectors including healthcare, retail, finance and telecommunication. Quantifiable information is captured from consumers to gain valuable insights in each of these sectors [1]. Furthermore, the augmented usage of mobile spectrums has paved the way for tracing the activities and interests of consumers via numerous e-commerce apps [2]. Among the multiple sectors tracking consumer data, the telecommunication domain is a major arena that has accustomed to several developments ranging from wired to wireless mode, envisioning a digital evolution [3]. Such progressive improvements have generated data in video and voice formats, making massive quantities of customer data accessible to telecom operators. As ratification, the Telecom Regulatory Authority of India (TRAI) released a report for the month of January 2019 reflecting about 1022.58 million active wireless subscribers in the country [4]. With 5G wireless technologies being appraised as the future generation spectrum, this number is expected to upsurge further [5]. Extracting useful patterns from such colossal data is the ultimate motive of telecom service providers, for understanding customer behavioral trends. Likewise, these patterns also aid in designing personalized services for targeted patrons based on their preceding choices [6]. These preferences further upscale the revenue of a product by identifying high-value clients. Another substantial parameter assessed by telecom players is churn rate. Churn rate often impacts the marketability of a telecom product.
Churn rate ascertains the extent of subscribers a telecom operator loses to its competitors in a timely manner [7]. Despite gaining new customers, telecom providers are suffering due to churn loss, as it is a well-known fact that retaining old consumers is an easier task than attracting new ones [8]. Hence, churn prediction and customer retention are preliminary requirements for handling churn [9]. Over the years, several methods have been devised for detecting customer churn. However, predictions accomplished through mathematical models have gained superior performance in identifying churn [10]. These models are devised by examining the behavior of customer attributes towards the assessment of churn. Even though these techniques have been widely accepted for predicting churn rate, there is a downside associated with such predictions.
Analyzing churn features from consumer data invades sensitive information of subscribers. Consumer privacy is sacrificed to decipher the ‘value’ within data to improve digital marketing and, in turn, the revenue [11]. Furthermore, such data may be shared with third party vendors for recognizing the ever growing interests of consumers. Despite the existence of TRAI recommendations on data protection for voice and SMS services, it is difficult to implement them in real time, as data is maintained concurrently with telecom operators and subscribers [12]. Hence, it is advisable to capture only relevant personal details with the consent of consumers prior to data analysis [13]. The same principle can be applied to preserve information prior to churn analysis. This twofold technique ensures data is secured at a primal level prior to domain analysis.
Before preserving sensitive information, it is important to scrutinize the attributes which are elusive in the dataset. In this direction, feature selection techniques are being adopted for identifying these attributes. Feature selection identifies the subset of features which are relevant and independent of other attributes in the data [14]. Once sensitive attributes have been identified after feature selection, it is crucial to preserve this information prior to data modeling. In this context, privacy preserving data mining (PPDM) techniques have gathered immense popularity over the years for their proficiency in securing data. PPDM approaches preserve data integrity by converting the vulnerable features into an intermediate format which is not discernible by malicious users [15,16]. Some of the popular PPDM techniques implemented for ensuring privacy include k-anonymity, l-diversity, t-closeness, ε-differential privacy and personalized privacy. Each of these techniques employs a different phenomenon to preserve vulnerable information. Even though k-anonymization is an effective approach, it ignores sensitive attributes; l-diversity considers sensitive attributes, yet the distribution of these features is often ignored by the algorithm [17]. Coming to t-closeness, correlation among identifiers decreases as the t-value increases due to the distribution of attributes. ε-differential and personalized privacy are difficult to implement in real time, with dependency on the user for the knowledgebase. Hence, there is a need to devise an enhanced PPDM model which is capable of preserving vulnerable information without compromising on data quality, privacy measures and simplicity of the approach [18].
In this direction, the current study adopts a twofold approach for securing sensitive information prior to churn prediction. Initially, the dragonfly algorithm is applied for optimizing profound attributes from churn data. Furthermore, privacy preservation is employed on the identified features using the pseudonymization approach for avoiding privacy breach. Consequently, the performance of this novel twofold approach is compared with conventional methods to determine information camouflage. Once sensitive details are secured from the churn dataset, data mining models are employed to detect the occurrence of churn among consumers. These models are furthermore analyzed to discover information loss prior to and post the ensemble approach. Such a twofold technique ensures data is obscured without disclosing the context.
2. Literature Survey
2.1. Churn Detection via Mathematical Modeling
As mentioned previously, churn rate estimation is crucial for telecom providers to assess the customer satisfaction index. Churn assessment via mathematical models has been adopted extensively over the years to predict customer migration. Numerous studies have estimated churn rate using linear and non-linear mathematical models. Some of the significant studies are discussed in Table 1.
Despite the existence of such outstanding classifiers, there is a need to scrutinize these models to ensure that no susceptible information is leaked. Thereby, the next section elaborates on the importance of privacy preserving techniques towards information conservation.
2.2. PPDM Approaches towards Information Security
This section introduces the implications of several PPDM approaches towards the masquerade of sensitive data. Several PPDM techniques have been designed to mitigate security attacks on data, including anonymization, randomization and perturbation [18]. Some of the predominant variants of anonymization (i.e., k-anonymity, l-diversity, t-closeness) and other techniques are discussed in Table 2.
Extensive literature survey in churn prediction and privacy preservation has discovered the possibility of employing PPDM techniques towards securing sensitive churn attributes. This twofold approach of combining information security with churn detection is unique in preserving sensitive data attributes prior to modeling.
Table 1. List of substantial churn detection studies.

Study | Mathematical Model/s Adopted | Substantial Outcomes | Limitations/Future Scope
[19] | Linear regression | The study achieved a 95% confidence interval in detecting customer retention | -
[20] | Logit regression, boosting, decision trees, neural network | The non-linear neural network estimated customer dissatisfaction better than the other classifiers | -
[21] | Support Vector Machine (SVM), neural network, logit regression | SVM outperformed other classifiers in detecting customer churn in the online insurance domain | Optimization of the SVM kernel parameters could further uplift the predictive performance
[22] | Random forest, regression forest, logistic & linear regression | Results indicated that random and regression forest models outperformed with a better fit compared to linear techniques | -
[23] | AdaBoost, SVM | Three variants of the AdaBoost classifier (Real, Gentle, Meta) predicted churn customers from a credit debt database better than SVM | -
[24] | SVM, artificial neural network, naïve Bayes, logistic regression, decision tree | SVM performed enhanced customer churn detection compared to other classifiers | -
[25] | Improved balanced random forest (IBRF), decision trees, neural networks, class-weighted core SVM (CWC-SVM) | IBRF performed better churn prediction on a real bank dataset compared to other classifiers | -
[26] | Partial least squares (PLS) classifier | The model outperforms traditional classifiers in determining key attributes and churn customers | -
[27] | Genetic programming, AdaBoost | Genetic program based AdaBoosting evaluates churn customers with a better area under curve metric of 0.89 | -
[28] | Random forest, Particle Swarm Optimization (PSO) | PSO is used to remove data imbalance while random forest is implemented to detect churn on the reduced dataset; the model results in enhanced churn prediction | -
[29] | Naïve Bayes, Bayesian network & C4.5 | Feature selection implemented by naïve Bayes and Bayesian network resulted in improved customer churn prediction | Overfitting of minor class instances may result in false predictions; hence, balancing the data is to be performed extensively
[30] | Decision tree, K-nearest neighbor, artificial neural network, SVM | The hybrid model generated from the classifiers resulted in 95% accuracy for detecting churn from Iran mobile company data | -
[31] | Rough set theory | Rules are extracted for churn detection; the rough set based genetic algorithm predicted churn with higher efficacy | -
[32] | Bagging, boosting | Among multiple classifiers compared for churn detection, the bagging and boosting ensembles performed better prediction | -
[33] | Dynamic behavior models | Spatio-temporal financial behavioral patterns are known to influence churn behavior | Possibility of bias in the data may affect the predictive performances
[34] | Naïve Bayes | Customer churn prediction is detected from publicly available datasets using the naïve Bayes classifier | The current approach can be implemented for detecting bias and outliers effectively
[35] | Decision tree, gradient boosted machine tree, extreme gradient boost and random forest | The extreme gradient boosting model resulted in an area under curve (AUC) value of 89%, indicating an enhanced churn classification rate compared to other classifiers | -
Table 2. Significance of PPDM approaches towards data security.

PPDM Technique | Features of the Technique | Studies Based on the Technique
k-anonymity | Masks the data by suppressing the vulnerable instances and generalizing them similar to the other (k−1) records | [36–44]
l-diversity | Extension of k-anonymity which reduces the granularity of sensitive information to uphold privacy | [17,45–48]
t-closeness | Extension of l-diversity that reduces granularity by considering the distribution of sensitive data | [49–53]
Randomization | Randomizes the data instances based on their properties, resulting in distorted data aggregates | [54–57]
Perturbation | Noise is added to the data to ensure that sensitive information is not disclosed | [58–61]
Pseudonymization | Reversible pseudonyms replace sensitive information in the data to avoid data theft | [62–64]
3. Materials and Methods
This section highlights the methodology adopted for churn detection accompanied by security
concealment of susceptible information.
3.1. Churn Data Collection
From massive population of customer indicators in telecom sector, a suitable sample is to be
identified to analyze the vulnerable patterns. For this reason, publically available churn dataset is
selected from web repository Kaggle. This dataset is designed to detect customer churn based on
three classes of attributes. The first class identifies the services a customer is accustomed to, second
class of attributes identifies the information of financial status towards bill payment and third class
discusses about the demographic indications for a customer. Spanned across these three classes are
20 data features for predicting churn behavior. This dataset is downloaded in .csv file format for further
processing [65].
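As a minimal sketch, the dataset can be loaded as follows; the file name is the one commonly distributed with this Kaggle dataset and is an assumption here, as the paper does not state it.

```r
# Sketch: load the Kaggle churn data (file name assumed, not given in the paper)
churn <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", stringsAsFactors = TRUE)
dim(churn)          # expected: 7043 customers x 21 columns (20 attributes + Churn)
table(churn$Churn)  # expected: ~1869 churn ("Yes") vs. 5174 non-churn ("No")
```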
3.2. Initial Model Development
Mathematical models are developed to predict the influence of data attributes in churn analysis. Conventional algorithms like logistic regression, Support Vector Machine (SVM), naïve Bayes, bagging, boosting and random forest classifiers are employed on the data using modules available in the R programming language. A 10-fold cross validation technique is adopted for validating the outcomes from every learner.
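A minimal sketch of this phase is given below, assuming the churn data frame from Section 3.1; the fold construction and the choice to drop the customer identifier and incomplete rows are illustrative assumptions, not steps stated by the paper.

```r
# Sketch: 10-fold cross validation over two of the conventional learners
# (bagging, boosting, naive Bayes and SVM follow the same pattern via the
# 'adabag', 'naivebayes' and 'e1071' packages named in Tables 3 and 6)
library(randomForest)

set.seed(42)
dat   <- na.omit(subset(churn, select = -customerID))  # drop identifier + NAs (assumed)
folds <- sample(rep(1:10, length.out = nrow(dat)))     # 10-fold assignment

cv_accuracy <- function(fit, pred) {
  sapply(1:10, function(k) {
    train <- dat[folds != k, ]
    test  <- dat[folds == k, ]
    mean(pred(fit(train), test) == test$Churn)         # accuracy on held-out fold
  })
}

acc_rf <- cv_accuracy(
  function(tr) randomForest(Churn ~ ., data = tr),
  function(m, te) predict(m, te))

acc_glm <- cv_accuracy(
  function(tr) glm(Churn ~ ., data = tr, family = binomial),
  function(m, te) ifelse(predict(m, te, type = "response") > 0.5, "Yes", "No"))

round(c(randomForest = mean(acc_rf), logistic = mean(acc_glm)), 3)
```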
3.3. Assessment of Performance of Models
Of the learners implemented in the previous step, the best performing model is ascertained using statistical parameters including the True Positive Rate (TPR), accuracy, F-measure and Root Mean Square Error (RMSE). The formulae of these metrics are defined below.
A true positive is counted when a mathematical model predicts a churn instance accurately, while a false negative reflects the prediction of a churn case incorrectly as non-churn by a model. TPR is the ratio of true positive churn instances in the dataset to the summation of true positive and false negative churn cases.

$$\mathrm{TPR} = \frac{\text{True positive instances of churn}}{\text{True positive instances of churn} + \text{False negative instances of churn}} \tag{1}$$

Accuracy is the ratio of correctly predicted churn and non-churn instances from a mathematical model to the total churn and non-churn instances in the dataset. The higher the value, the better the performance of a classifier.

$$\mathrm{Accuracy} = \frac{\text{Correctly predicted instances of churn and non-churn}}{\text{Total instances of churn and non-churn in the dataset}} \tag{2}$$

Root mean square error (RMSE) is the standard deviation of the churn customers predicted by a mathematical model compared to the actual churn customers in the dataset. The metric ranges from 0 to 1; the lower the value, the better the performance of a model.

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N}\left(\text{Predicted churn instance}_i - \text{Actual churn instance}_i\right)^{2}}{N}} \tag{3}$$

where N is the total number of churn and non-churn instances in the dataset.
F-measure is defined as the harmonic mean of recall and precision. Recall indicates the proportion of relevant churn customers identified from the churn dataset, while precision specifies the proportion of identified customers who turned out to be churned.

$$F\text{-measure} = 2 \times \frac{\text{recall} \times \text{precision}}{\text{recall} + \text{precision}} \tag{4}$$
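These four metrics translate directly into R; the sketch below assumes "Yes"/"No" churn labels and is provided for illustration.

```r
# Sketch: Equations (1)-(4) as helper functions over predicted vs. actual labels
tpr <- function(pred, actual)                       # Equation (1)
  sum(pred == "Yes" & actual == "Yes") / sum(actual == "Yes")

accuracy <- function(pred, actual) mean(pred == actual)   # Equation (2)

rmse <- function(pred, actual)                      # Equation (3), 0/1 encoding
  sqrt(mean((as.numeric(pred == "Yes") - as.numeric(actual == "Yes"))^2))

f_measure <- function(pred, actual) {               # Equation (4)
  precision <- sum(pred == "Yes" & actual == "Yes") / sum(pred == "Yes")
  recall    <- tpr(pred, actual)
  2 * recall * precision / (recall + precision)
}
```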
3.4. Preserving Sensitive Churn Attributes Using Dragonfly Based Pseudonymization
Applying mathematical models as in the previous steps without securing the vulnerable attributes would result in an intrinsic security threat. Hence, a twofold technique is adopted to uphold privacy prior to churn prediction. This dual procedure involves optimal feature selection followed by privacy preservation.
Feature selection plays a significant role in this study owing to two fundamental motives. Primarily, a data mining model cannot be designed accurately for a dataset having as many as 20 features, as there is a high probability of dependency among attributes [66]. Another vital intention for adopting feature selection is to identify vulnerable attributes (if any) prior to data modeling.
In this context, the dragonfly algorithm is employed on the dataset for attribute selection. This algorithm is selected amongst others because of its preeminence in solving optimization problems, specifically feature selection [67]. The dragonfly algorithm is conceptualized based on the dual behavioral modes of dragonfly insects. They remain either in static (in case of hunting for prey) or dynamic (in case of migration) mode. This indistinguishable behavior of dragonflies in prey exploitation and predator avoidance is modeled using five constraints based on the properties of separation, alignment, cohesion, food attraction and enemy distraction. These parameters together indicate the behavior of a dragonfly. The positions of these flies in the search space are updated using two vectors, the position vector (Q) and the step vector (ΔQ), respectively [68].
All these parameters are represented mathematically as follows:
a. Separation (S_i): It is defined as the difference between the current position of an individual dragonfly (Z) and the ith position of the neighboring individual (Z_i), summated across the total number of neighbors (K) of a dragonfly;

$$S_i = \sum_{i=1}^{K} \left(Z - Z_i\right) \tag{5}$$

b. Alignment (A_i): It is defined as the sum total of the neighbors' velocities (V_k) with reference to all the neighbors (K);

$$A_i = \frac{\sum_{i=1}^{K} V_k}{K} \tag{6}$$

c. Cohesion (C_i): It is demarcated as the ratio of the sum total of the neighbors' ith positions (Z_i) of a dragonfly to all the neighbors (K), from which the current position of the individual (Z) fly is subtracted;

$$C_i = \frac{\sum_{i=1}^{K} Z_i}{K} - Z \tag{7}$$

d. Attraction towards prey/food source (P_i): It is the distance calculated as the difference between the position of the prey (Z+) and the current position of the individual (Z);

$$P_i = Z^{+} - Z \tag{8}$$

e. Avoidance against enemy (E_i): It is the distance calculated as the difference between the position of the enemy (Z−) and the current position of the individual (Z).

$$E_i = Z^{-} - Z \tag{9}$$
The step vector (ΔQ) is further calculated by summating these five parameters, each multiplied by its individual weight. This metric is used to define the locus of dragonflies in the search space across iterations, denoted by:

$$\Delta Q_{t+1} = \left(a A_i + s S_i + c C_i + p P_i + e E_i\right) + w \Delta Q_t \tag{10}$$

Here A_i, S_i, C_i, P_i and E_i denote the alignment, separation, cohesion, prey and enemy coefficients for the ith individual dragonfly. Correspondingly, a, s, c, p and e denote the weights of the alignment, separation, cohesion, prey and enemy parameters for the ith individual. These parameters balance the behavior of prey attraction and predator avoidance. Here, w represents the inertia weight and t represents the number of iterations. As the algorithm recapitulates, convergence is assured as the weights of the dragonfly constraints are altered adaptively. Concurrently, the position vector is derived from the step vector as follows:

$$Q_{t+1} = Q_t + \Delta Q_{t+1} \tag{11}$$

Here, t denotes the present iteration. Based on these changes, the dragonflies optimize their flying paths as well. Furthermore, dragonflies adopt a random walk (Lévy flight) behavior to fly around the search space, ensuring randomness, stochastic behavior and optimum exploration. In this condition, the positions of dragonflies are updated as mentioned below:

$$Q_{t+1} = Q_t + \text{Lévy}(d) \times Q_t \tag{12}$$

Here d denotes the dimension of the position vector of the dragonflies.
$$\text{Lévy}(q) = 0.01 \times \frac{m_1 \times \sigma}{|m_2|^{1/\beta}} \tag{13}$$

The Lévy flight, Lévy(q), defines two parameters m_1 and m_2, denoting two random numbers in the range [0, 1], while β is a constant. Furthermore, σ is calculated as follows:

$$\sigma = \left(\frac{\Gamma(1+\beta) \times \sin\left(\frac{\pi\beta}{2}\right)}{\Gamma\left(\frac{1+\beta}{2}\right) \times \beta \times 2^{\left(\frac{\beta-1}{2}\right)}}\right)^{1/\beta} \tag{14}$$

Here, Γ(x) = (x − 1)!
Based on these estimates, the dragonfly algorithm is implemented on the churn dataset. Prior to feature selection, the dataset is partitioned into training (70%) and test (30%) sets. The training data is used initially for feature selection, while the test data is applied on the results derived from feature selection to validate the correctness of the procedure adopted. The dragonfly algorithm comprises the following steps:
(i) Initialization of parameters: Initially, all the five basic parameters are defined randomly to update the positions of dragonflies in the search space. Each position corresponds to one feature in the churn dataset which needs to be optimized iteratively.
(ii) Deriving the fitness function: After the parameters are initialized, the positions of the dragonflies are updated based on a pre-defined fitness function F. The fitness function is defined in this study based on objective criteria. The objective ensures the features are minimized iteratively without compromising on the predictive capabilities of the selected features towards churn detection. This objective is considered in the defined fitness function adopted from [68], using a weight factor w_j to ensure that maximum predictive capability is maintained after feature selection. The fitness function F is defined below. Here, Pred represents the predictive capability of the data features, w_j is the weight factor which ranges between [0, 1], L_a is the length of the attributes selected in the feature space, while L_s is the sum of all the churn data attributes;

$$F = \max\left(\mathrm{Pred} + w_j\left(1 - \frac{L_a}{L_s}\right)\right) \tag{15}$$
(iii) Conditions for termination: Once the fitness function F fails to update the neighbors' parameters for an individual dragonfly, the algorithm terminates on reaching the best feature space. If the best feature space is not found, the algorithm terminates on reaching the maximum limit of iterations.
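As an illustration of these steps, a minimal sketch using the ‘metaheuristicOpt’ dependency (named in Section 4.3) follows; the binarization threshold, population size, weight factor and out-of-bag scoring are assumptions made here, since the paper does not publish its exact wiring.

```r
# Sketch: dragonfly-based feature selection via metaheuristicOpt's "DA" solver,
# maximizing the Equation (15) fitness (illustrative assumptions throughout)
library(metaheuristicOpt)
library(randomForest)

n_feat <- ncol(dat) - 1        # candidate attributes; assumes Churn is the last column
w_j    <- 0.8                  # weight factor in [0, 1], value assumed

fitness <- function(x) {
  mask <- x > 0.5                                  # continuous position -> feature subset
  if (!any(mask)) return(0)                        # empty subsets score zero
  sub  <- dat[, c(which(mask), n_feat + 1)]
  rf   <- randomForest(Churn ~ ., data = sub, ntree = 50)
  pred <- 1 - rf$err.rate[50, "OOB"]               # out-of-bag accuracy as Pred
  pred + w_j * (1 - sum(mask) / n_feat)            # Equation (15)
}

res <- metaOpt(fitness, optimType = "MAX", algorithm = "DA",
               numVar = n_feat, rangeVar = matrix(c(0, 1), nrow = 2),
               control = list(numPopulation = 40, maxIter = 500))
head(names(dat), n_feat)[res$result > 0.5]   # retained churn attributes
```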
The soundness of the features derived from the dragonfly algorithm is further validated using a wrapper-based random forest classifier, the Boruta algorithm, available in the R programming language. The algorithm shuffles independent attributes in the dataset to ensure correlated instances are separated. It is followed by building a random forest model for the merged data consisting of the original attributes. Comparisons are further done to ensure that variables having a higher importance score are selected as significant attributes from the data [69]. The Boruta algorithm is selected because of its ability in choosing uncorrelated, pertinent data features. Hence, this confirmatory approach ensures relevant churn attributes are selected prior to privacy analysis.
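The confirmatory check is a short call in practice; a sketch with the ‘Boruta’ package is shown below (run cap and seed assumed).

```r
# Sketch: confirm the dragonfly-selected attributes with the Boruta wrapper
library(Boruta)
set.seed(7)
bor <- Boruta(Churn ~ ., data = dat, maxRuns = 1000)  # iterates until attributes are decided
print(bor)                          # confirmed vs. rejected attributes
getSelectedAttributes(bor)          # names of significant churn features
plot(bor, las = 2, cex.axis = 0.6)  # importance scores, as in Figure 3
```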
From the attributes recognized in the previous step, vulnerable features need to be preserved prior to churn analysis. For this reason, privacy preserving algorithms are to be tested. These algorithms must ensure that preserved attributes are not re-disclosed after protection (i.e., no re-disclosure) and must also ensure that the original instance of data is recoverable at the time of need (i.e., re-availability). Techniques like k-anonymization prevent re-disclosure; however, if reiterated, information can be recovered, leading to a risk of privacy breach. Hence such techniques cannot be adopted while securing sensitive information from churn data. Thereby, enhanced privacy preservation algorithms are to be adopted for supporting re-availability and avoiding re-disclosure of the data. One such privacy mechanism to conserve data is pseudonymization [70]. Pseudonymization is accomplished by replacing vulnerable attributes with consistent, reversible data such that information is made re-available and not re-disclosed after replacement. This technique would be suitable for protecting churn attributes.
Pseudonymization is applied by replacing vulnerable churn identifiers with pseudonyms or aliases so that the original customer data is converted to an intermediate reversible format. These pseudonyms are assigned randomly and uniquely to each customer instance, avoiding duplicate cases. Pseudonymization is performed in the R programming language for the features derived from the dragonfly algorithm. Furthermore, to avoid attacks due to knowledge of the original data (i.e., background knowledge attacks), the pseudonyms are encrypted using unique hash indices. If the pseudonyms need to be swapped back to the original data, decryption is performed. Thereby, encryption enabled pseudonymization ensures that data breach via background knowledge attacks is prohibited [71]. Additionally, the performance of the pseudonymization approach is compared with other privacy preserving models to assess the efficiency of data preservation.
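A minimal sketch of such hash-backed pseudonymization is given below; the paper performs this step with the ‘synergetr’ package, whose interface is not reproduced here, so the helper below (its name, salt and lookup-table design) is a hypothetical stand-in built on the ‘digest’ package.

```r
# Sketch: keyed-hash pseudonyms with a secret lookup table for re-availability
library(digest)

pseudonymize <- function(values, salt) {
  uniq   <- unique(as.character(values))
  pseudo <- vapply(uniq, function(v) digest(paste0(salt, v), algo = "sha256"),
                   character(1))      # salt thwarts background-knowledge attacks
  list(data  = unname(pseudo[as.character(values)]),
       table = pseudo)                # kept secret; maps pseudonyms back to values
}

p <- pseudonymize(dat$TotalCharges, salt = "s3cr3t")   # one vulnerable attribute
head(p$data)                                           # preserved column
# re-availability: recover originals through the secret table when required
original <- names(p$table)[match(p$data, p$table)]
```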
3.5. Development of Models Post Privacy Analysis
Once the sensitive information is secured using the integrated dragonfly based pseudonymization approach, mathematical models are once again developed for detecting churn cases. The previously employed conventional algorithms are applied once again on the preserved churn dataset. The churn rate is analyzed from these models for the unpreserved relevant churn attributes, as performed previously.
3.6. Detection of Information Loss Prior and Post Privacy Preservation
It is important to ensure that any privacy preservation approach is not associated with a loss of appropriate information from the original data. For this reason, the performance of the churn prediction algorithms is assessed recursively using statistical techniques to ensure that there is no significant statistical difference in the prediction abilities of the classifiers before and after privacy preservation. For this reason, the research hypothesis is framed with the following assumptions:

Null Hypothesis (H0). There is no statistical difference in the performance of the churn detection models prior to and post privacy analysis.

Alternative Hypothesis (H1). There is a substantial statistical difference in the performance of the churn detection models prior to and post privacy analysis.

Prior to research formulation, it is assumed that H0 is valid. Furthermore, this convention is validated by performing Student’s t-test between the pre- and post-security development models.
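A sketch of this check follows; the sampling scheme mirrors Section 4.6 (500-instance samples, alpha = 0.05), while the model objects and repetition count are illustrative assumptions.

```r
# Sketch: Student's t-test on F-measure scores of pre- vs. post-preservation models
f_samples <- function(model, data, reps = 30, size = 500) {
  replicate(reps, {
    s <- data[sample(nrow(data), size), ]          # random 500-instance sample
    f_measure(predict(model, s), s$Churn)          # helper from the Section 3.3 sketch
  })
}
# rf_pre / rf_post: random forests fit before and after preservation (assumed objects)
t.test(f_samples(rf_pre, test_pre), f_samples(rf_post, test_post),
       conf.level = 0.95)    # reject H0 when the p-value falls below 0.05
```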
4. Results
This section highlights the significant outcomes derived from churn detection on the telecom customer dataset.
4.1. Processing of Churn Dataset
The customer churn data downloaded from Kaggle is analyzed to identify instances of customer churn vs. non-churn. Three classes of attributes are responsible for detecting customer churn, based on accustomed services (i.e., internet, phone, streaming content access, online support and device protection), account details (i.e., payment mode, monthly bills, contract mode, paperless bills and total billing) and the demographic details of a customer (i.e., age, gender, dependents and partners). These classes comprise 20 attributes in the dataset which collectively predict customer churn. The dataset altogether includes 7043 instances with 1869 occurrences of customer churn, while the remaining are loyal customers. The churn prediction approach adopted in this work is shown in Figure 1. Furthermore, the distribution of churn attributes in the dataset is shown in Figure 2.
Figure 1. Overview of privacy imposed churn detection approach.
Figure 2. The distribution of attributes in the churn dataset; here red indicates churn customers while blue indicates non-churn customers. The figure is derived from Weka software [72].
4.2. Initial Model Development Phase
Once the data is collected, churn prediction is performed initially on all the data attributes. Several data mining models including logistic regression, naïve Bayes, SVM, bagging, boosting and random forest classifiers are adopted for churn detection. Each of these models is developed in the R programming language using dependencies available on the platform. Statistical parameters like accuracy and F-measure suggested that the random forest classifier outperformed the other classifiers in terms of churn detection. The performance of these classifiers is shown in Table 3.
Table 3. The performance of models in the initial phase of churn detection.

Sl. No | Classifier | R Language Dependency | True Positive Rate | Accuracy | RMSE | F-Measure
1. | Logistic Regression | glm | 0.887 | 0.793 | 0.398 | 0.793
2. | Naïve Bayes | naivebayes | 0.893 | 0.791 | 0.334 | 0.801
3. | SVM | e1071 | 0.910 | 0.839 | 0.298 | 0.835
4. | Bagging | adabag | 0.912 | 0.860 | 0.263 | 0.866
5. | Boosting | adabag | 0.934 | 0.889 | 0.191 | 0.905
6. | Random Forest | randomForest | 0.997 | 0.956 | 0.112 | 0.956
However, these models exposed certain sensitive information, including demographic particulars, while detecting customer churn. Hence, feature selection is to be performed to identify the susceptible churn attributes which need to be conserved prior to data modeling.
4.3. Feature Selection and Attribute Preservation Using Dragonfly Based Pseudonymizer
To identify and preserve subtle churn attributes, a dual approach is adopted using feature selection and privacy preservation techniques. Feature selection is implemented using the dragonfly algorithm by subjecting the data attributes randomly as the initial population, using the ‘metaheuristicOpt’ dependency in the R programming language [73]. The fitness function F is evaluated such that the feature set is minimized by retaining superlative attributes. Hence, the algorithm performs optimization by feature minimization. In this case, Schwefel’s function is used for defining the objective.
The algorithm works by recognizing the best feature as the food source and the worst feature as the enemy for a dragonfly. Furthermore, the neighboring dragonflies are evaluated for each dragonfly based on the five parameters, i.e., S_i, A_i, C_i, P_i and E_i, respectively. These neighbors are updated iteratively based on a radius parameter which increases in a linear fashion. The weights updated in turn help in updating the positions of the remaining dragonflies. This procedure is iterated until germane churn features are identified from 500 iterations in random orientation. The threshold limit is set to 0.75 (i.e., features above 75% significance w.r.t. churn rate are selected) to ignore less pertinent attributes.
Results revealed eight features as significant in churn detection based on the estimated values of the fitness function F. The data features ranked as per F are shown in Table 4. The significant attributes identified from the table include contract, total charges, tenure, tech support, monthly charges, online backup, online security and internet service, respectively.
Additionally, these features are also affirmed by the wrapper embedded learner Boruta, based on their importance scores in the R programming language. The algorithm is iterated 732 times to derive eight significant features based on their importance scores. These attributes are visualized as a plot in Figure 3, revealing their importance scores. The higher the importance score, the greater the impact of a data feature towards churn rate. The plot illustrates features equivalent to the dragonfly computation by suggesting alike features as significant churn attributes. However, the order of their importance varies across both computations.
Figure 3. The distribution of churn features based on importance score in Boruta algorithm.
Table 4. Churn features arranged as per their relevance via the fitness function.

Sl. No | Feature Name | Feature Description | Feature Category | Fitness Value
1. | Contract | Denotes the contract period of the customer: monthly, yearly or two years | Account information | 0.9356
2. | Tenure | Indicates the number of months a customer has been a patron of the service provider | Account information | 0.9174
3. | Total charges | Indicates the total charges to be paid by the customer | Account information | 0.9043
4. | Monthly charges | Indicates the monthly charges to be paid by the customer | Account information | 0.8859
5. | Tech support | Indicates if the customer has technical support or not, based on the internet service accustomed | Customer services | 0.8533
6. | Online security | Indicates if the customer has online security or not, based on the internet service accustomed | Customer services | 0.8476
7. | Internet service | Indicates the internet service provider of the customer, which can be either fiber optic, DSL or none of these | Customer services | 0.7971
8. | Online backup | Indicates if the customer has online backup or not, based on the internet service accustomed | Customer services | 0.8044
9. | Payment method | Denotes the type of payment method: automatic bank transfer, automatic credit card, electronic check or mailed check | Account information | 0.7433
10. | Streaming TV | Denotes if the customer has the service for streaming television or not | Customer services | 0.7239
11. | Paperless billing | Denotes if the customer has the service for paperless billing or not | Account information | 0.7009
12. | Streaming movies | Indicates if the customer has the service for streaming movies or not | Customer services | 0.6955
13. | Multiple lines | Indicates if the customer has the service for multiple lines or not | Customer services | 0.5487
14. | Senior Citizen | Indicates if the customer is a senior citizen or not | Demographic details | 0.5321
15. | Partner | Denotes whether the customer has a partner or not | Demographic details | 0.5093
16. | Phone Service | Indicates if the customer has phone services or not | Customer services | 0.5005
17. | Dependents | Indicates if the customer has any dependents or not | Demographic details | 0.4799
18. | Device protection | Denotes if the customer has device protection or not | Customer services | 0.4588
19. | Gender | Indicates if the customer is male or female | Demographic details | 0.3566
20. | Customer ID | A unique identifier given to each customer | Demographic details | 0.2967
Of these eight distinctive churn attributes, characteristics like tenure, contract, monthly charges and total charges reveal sensitive information about churn customers. Hence, these attributes need to be preserved before model development. However, preservation must also ensure that the sensitive information can be regenerated when required. For this purpose, the pseudonymization technique is employed on the selected attributes using the ‘synergetr’ package in the R programming language [74]. Pseudonymization is a privacy preservation technique which replaces the vulnerable information with a pseudonym. The pseudonym (J_i) is an identifier that prevents disclosure of data attributes. Pseudonyms are defined for each vulnerable attribute V_1, V_2, V_3 and V_4 in the dataset, so that the original data is preserved. Furthermore, these pseudonyms are encrypted using hash indices to avoid privacy loss due to background knowledge attacks. To derive the original data from an encrypted pseudonym, a decryption key is formulated.
This twofold approach of feature selection and privacy preservation is named “Dragonfly based pseudonymization”. The algorithm is iterated over a finite bound of 10,000 random iterations to infer the concealment of churn information. However, if there is no vulnerable attribute present in the data at that instance, the algorithm iterates until its maximum limit and terminates.
The ensemble algorithm adopted in this approach is shown in Algorithm 1.
Algorithm 1 Dragonfly based pseudonymizer
1. Define the initial values of the dragonfly population (P) for the churn data, denoting the boundary limits for the maximum number of iterations (n)
2. Define the positions of the dragonflies (Y_i such that i = 1, 2, ..., n)
3. Define the step vector ΔY_i such that i = 1, 2, ..., n
4. while the termination condition is not reached do
5.   Calculate the fitness function F for every dragonfly position
6.   Identify the food source and enemy
7.   Update the values of w, s, a, c, p and e
8.   Calculate S_i, A_i, C_i, P_i and E_i using Equations (5)–(9)
9.   Update the status of neighboring dragonflies
10.  if at least one neighboring dragonfly exists
11.    Update the step vector using (10)
12.    Update the position vector using (11)
13.  else
14.    Update the position vector using (12)
15.  Discover if the new positions computed satisfy the boundary conditions to bring back dragonflies
16.  Generate the best optimized solution O
17.  Input the solution O to the pseudonymizer function
18.  Define the length of the pseudonym J_i for each vulnerable attribute V_a such that a = 1, 2, ..., n
19.  Eliminate duplicate pseudonyms
20.  Encrypt with relevant pseudonyms all data instances of vulnerable attributes
21.  Repeat until pseudonyms are generated for all vulnerable attributes
22.  Replace the vulnerable attributes V_a with pseudonyms J_i
23.  Reiterate until all sensitive information in V_a is preserved
24.  Produce the final preserved data instances for churn prediction
25.  Decrypt the pseudonyms by removing the aliases to view the original churn dataset
4.4. Performance Analysis of Dual Approach
The performance of the privacy-endured twofold model is estimated by comparison with other PPDM models. For this purpose, the susceptible features selected by the dragonfly algorithm are subjected to privacy preservation employing the randomization, anonymization and perturbation algorithms on the R programming platform. The performance of all the models towards feature conservancy is listed in Table 5.
Table 5. Comparison of privacy preservation by pseudonymization and other PPDM models.

Sl. No | Features to Be Preserved | Iterations | Pseudonymization | Anonymization | Randomization | Perturbation
1. | Tenure | 1000 | Preserved | Preserved | Preserved | Not preserved
2. | Contract | 1000 | Preserved | Preserved | Not preserved | Preserved
3. | Monthly charges | 1000 | Preserved | Not preserved | Not preserved | Preserved
4. | Total charges | 1000 | Preserved | Not preserved | Preserved | Preserved
As observed from the table, the pseudonymization technique performs with better stability than the other algorithms by securing all the key attributes over 1000 random iterations.
4.5. Model Re-Development Phase
The churn dataset derived after the privacy enabled approach, with eight essential features of which four are preserved attributes, is taken as input for the model re-development phase. The algorithms used in the initial model development phase are employed at this point for assessing the churn rate among customers. These models are developed in the R programming language as in the previous phase. The results of model development in the re-development phase are listed in Table 6.
Table 6. Performance of models in the re-development phase after privacy preservation.

Sl. No | Classifier | R Language Dependency | True Positive Rate | Accuracy | RMSE | F-Measure
1 | Logistic Regression | glm | 0.887 | 0.788 | 0.398 | 0.793
2 | Naïve Bayes | naivebayes | 0.893 | 0.780 | 0.334 | 0.801
3 | SVM | e1071 | 0.910 | 0.828 | 0.298 | 0.835
4 | Bagging | adabag | 0.912 | 0.858 | 0.263 | 0.866
5 | Boosting | adabag | 0.934 | 0.873 | 0.191 | 0.905
6 | Random Forest | randomForest | 0.997 | 0.943 | 0.112 | 0.956
The table displays the random forest algorithm as the best performing classifier with enhanced accuracy. The remaining classifiers are similarly ranked based on their abilities in detecting churn rate.
4.6. Estimating Differences in Models Based on Hypothesis
The predictive performance of the models in the initial phase is compared with that of the models in the re-development phase to adjudicate the amount of information loss. For this reason, a t-test is performed on the churn models from the initial and re-development phases. The cutoff value alpha is assumed to be 0.05 for estimating the statistical difference between the two categories of learners. Initially, the churn data instances are randomly reshuffled and split into training (70%) and test (30%) datasets. Mathematical models are generated on the training data as per the previous iterations and the F metric is estimated for all the models. Furthermore, the F value is evaluated on the test data to avoid bias. From such a random churn population, a sample of 500 data instances is extracted, having 200 churn and 300 non-churn customer cases. The F metric is computed on this sample set and the t-test score is evaluated for estimating the validity of H0 w.r.t. the alpha value.
Furthermore, the p-value is estimated for both categories of models, as shown in Table 7. These p-values are compared with the alpha value to estimate statistical differences among the learners. As observed from the table, the p-values are found to be lower than the alpha value (i.e., 0.05).
Table 7. Statistical comparison between initial and re-development models.

Group 1 (Initial Models) | Group 2 (Re-Development Models) | Sample Data Size | p-Value | t-Score
Logistic regression | Logistic regression | 500 | 0.04 | 188.86
Naïve Bayes | Naïve Bayes | 500 | 0.04 | 198.67
SVM | SVM | 500 | 0.03 | 165.23
Bagging | Bagging | 500 | 0.03 | 154.89
Boosting | Boosting | 500 | 0.02 | 99.35
Random Forest | Random Forest | 500 | 0.01 | 66.95
Hence, there is a significant difference in the performance of the models between the initial and re-development phases. Thereby, the research hypothesis H0 is found to be invalid. Alternatively, H1 is accepted in this case, confirming a difference in information between the two categories of models. These differences in the models indicate the presence of information loss.
4.7. Estimating Information Loss Between Initial and Re-Development Models
The previous step ensured that there is inherent information loss in the models. Hence it is worthwhile to estimate its value. Ideally, there should not be any information loss before and after information preservation. However, information loss is observed in this case as the data attributes are modified. Modification can happen either due to deletion or addition of information, leading to a loss of original data. Hence, information loss is defined over the absolute values of churn data instances, ignoring the sign:

$$\mathrm{Information\ Loss\ (\%)} = \frac{|M_{ic} - M_{rc}|}{\text{Total churn instances in the dataset}} \times 100 \tag{16}$$

Here, M_ic indicates the churn predicted by the initial models and M_rc indicates the churn predicted by the re-development phase models.
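Equation (16) reduces to a one-line helper; the sketch below reproduces the random forest row of Table 8.

```r
# Sketch: Equation (16) over detected churn counts (1869 churn instances in total)
info_loss <- function(m_ic, m_rc, total_churn = 1869)
  abs(m_ic - m_rc) / total_churn * 100

info_loss(1823, 1829)   # ~0.32, matching the random forest row of Table 8
```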
The tabulation of information loss for the models in both phases is reflected in Table 8. As observed from the table, minimal information loss is observed for the random forest classifier. This result is on par with the other computations, as the same classifier achieved the best churn detection in both phases of model development. Hence, minimal information loss is associated with the model that performs the best churn detection.
Table 8. Information loss analysis in the two phases of model development.

Sl. No | Classifier | Iterations | Total Churn Instances in the Dataset | Data Instances Detecting Churn in Initial Model Development | Data Instances Detecting Churn After Model Re-Development | Information Loss after Dual Approach (%)
1. | Logistic Regression | 900 | 1869 | 1203 | 1109 | 5.02
2. | Naïve Bayes | 900 | 1869 | 1301 | 1245 | 2.99
3. | SVM | 900 | 1869 | 1432 | 1397 | 1.87
4. | Bagging | 900 | 1869 | 1645 | 1657 | 0.64
5. | Boosting | 900 | 1869 | 1793 | 1804 | 0.58
6. | Random Forest | 900 | 1869 | 1823 | 1829 | 0.32
5. Discussion
The current study is designed to analyze the impact of privacy preservation on churn rate prediction. To emphasize its importance, a customer churn dataset is initially retrieved. Based on the analysis, four key findings are derived from this study: (i) a twofold approach designed for feature selection and privacy preservation using the dragonfly based pseudonymization algorithm; (ii) a set of relevant features which detect customer churn; (iii) a set of vulnerable churn attributes which are to be preserved prior to churn detection and (iv) churn detection using mathematical models.
In the context of customer churn detection, the features identified in the current study indicate the significance of attribute selection techniques prior to model development. If the model development phase precedes feature selection, there is no assurance that pertinent features are utilized to identify churn. There is a high likelihood that predictions are amplified incidentally due to repetitive iterations. To eliminate such cases of bias, the study performs model development prior to feature selection, followed by model development post feature selection and privacy preservation. This twofold model development phase ensures that churn is detected without any feature dependency. Furthermore, information loss due to data sanctuary is detected by analyzing the models developed in both phases. This twofold approach seems appropriate for safeguarding the interests of consumers against viable security threats. Even though several protocols are available to preserve information to date, the current data mining approach helps in predicting future possibilities of churn instances based on the existing statistics from mathematical models [30].
However, the study has two major limitations. The first limitation is due to the contemplation of a solitary dataset for concealment conservation in churn. Outcomes derived from individual data analysis are often explicit to that information; these outcomes may not be relevant on a different churn dataset. Hence, global analysis of multidimensional churn datasets would help in validating the outcomes derived after privacy preservation. Such dynamic predictive approaches will provide insights into large-scale security loopholes at the data level. The second limitation is due to the small set of vulnerable churn features in the dataset. If the sensitive information were increased 10- or 100-fold, it is not guaranteed that the current approach would perform with the same efficacy. The algorithm needs to be tested and fine-tuned on datasets with different feature dimensionalities.
The data driven implementation adopted in this study can be utilized in developing a global privacy impenetrable decision support system for telecom operators and subscribers. To accomplish this undertaking, further investigation is required in subsequent stages. As a practical direction, one can consider the relevant features in churn detection for optimization of parameters in a distributed mode, which aids in eliminating privacy ambiguities. Henceforth, several nascent security threats would be filtered out prior to the data analysis workflow. An analogous approach can be employed by developing multidimensional privacy preserving models on the data features using centralized and distributed connectivity to provide encryption of multiple formats of subtle information.
Author Contributions:
The conceptualization of work is done by K.N. and S.G.; methodology, validation and
original draft preparation is done by K.N. and A.S.; Reviewing, editing and supervision is done by S.G.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflict of interest.
References
1.
Diaz, F.; Gamon, M.; Hofman, J.M.; Kıcıman, E.; Rothschild, D. Online and Social Media Data as an Imperfect
Continuous Panel Survey. PLoS ONE 2016,11, e014506. [CrossRef] [PubMed]
2.
Tomlinson, M.; Solomon, W.; Singh, Y.; Doherty, T.; Chopra, M.; Ijumba, P.; Tsai, A.C.; Jackson, D. The use of
mobile phones as a data collection tool: A report from a household survey in South Africa. BMC Med. Inf.
Decis. Mak. 2009,9, 1–8. [CrossRef] [PubMed]
3.
McDonald, C. Big Data Opportunities for Telecommunications. Available online: https://mapr.com/blog/big-
data-opportunities-telecommunications/(accessed on 11 January 2019).
4.
Telecom Regulatory Authority of India Highlights of Telecom Subscription Data as on 31 January 2019.
Available online: https://main.trai.gov.in/sites/default/files/PR_No.22of2019.pdf (accessed on 21 February
2019).
5.
Albreem, M.A.M. 5G wireless communication systems: Vision and challenges. In Proceedings of the 2015
International Conference on Computer, Communications, and Control Technology (I4CT), Kuching, SWK,
Malaysia, 21–23 April 2015; pp. 493–497.
Information 2019,10, 274 18 of 21
6. Weiss, G.M. Data Mining in Telecommunications. In Data Mining and Knowledge Discovery Handbook; Springer: Boston, MA, USA, 2005; pp. 1189–1201.
7. Berson, A.; Smith, S.; Thearling, K. Building Data Mining Applications for CRM; McGraw-Hill Professional: New York, NY, USA, 1999.
8. Lu, H.; Lin, J.C.-C. Predicting customer behavior in the market-space: A study of Rayport and Sviokla's framework. Inf. Manag. 2002, 40, 1–10. [CrossRef]
9. Mendoza, L.E.; Marius, A.; Pérez, M.; Grimán, A.C. Critical success factors for a customer relationship management strategy. Inf. Softw. Technol. 2007, 49, 913–945. [CrossRef]
10. Hung, S.-Y.; Yen, D.C.; Wang, H.-Y. Applying data mining to telecom churn management. Expert Syst. Appl. 2006, 31, 515–524. [CrossRef]
11. Penders, J. Privacy in (mobile) Telecommunications Services. Ethics Inf. Technol. 2004, 6, 247–260. [CrossRef]
12. Agarwal, S.; Aulakh, G. TRAI Recommendations on Data Privacy Raises Eyebrows. Available online: https://economictimes.indiatimes.com/industry/telecom/telecom-policy/trai-recommendations-on-data-privacy-raises-eyebrows/articleshow/65033263.cms (accessed on 21 March 2019).
13. Hauer, B. Data and Information Leakage Prevention Within the Scope of Information Security. IEEE Access 2015, 3, 2554–2565. [CrossRef]
14. Blum, A.L.; Langley, P. Selection of relevant features and examples in machine learning. Artif. Intell. 1997, 97, 245–271. [CrossRef]
15. Lindell, Y.; Pinkas, B. Privacy Preserving Data Mining. In Proceedings of the 20th Annual International Cryptology Conference on Advances in Cryptology, Santa Barbara, CA, USA, 20–24 August 2000; pp. 36–54.
16. Clifton, C.; Kantarcioğlu, M.; Doan, A.; Schadow, G.; Vaidya, J.; Elmagarmid, A.; Suciu, D. Privacy-preserving data integration and sharing. In Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'04), Paris, France, 13 June 2004; pp. 19–26.
17. Machanavajjhala, A.; Gehrke, J.; Kifer, D.; Venkitasubramaniam, M. L-diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA, 3–7 April 2006; p. 24.
18. Mendes, R.; Vilela, J.P. Privacy-Preserving Data Mining: Methods, Metrics, and Applications. IEEE Access 2017, 5, 10562–10582. [CrossRef]
19. Karp, A.H. Using Logistic Regression to Predict Customer Retention. 1998. Available online: https://www.lexjansen.com/nesug/nesug98/solu/p095.pdf (accessed on 16 August 2019).
20. Mozer, M.C.; Wolniewicz, R.; Grimes, D.B.; Johnson, E.; Kaushansky, H. Predicting Subscriber Dissatisfaction and Improving Retention in the Wireless Telecommunications Industry. IEEE Trans. Neural Netw. 2000, 11, 690–696. [CrossRef]
21. Hur, Y.; Lim, S. Customer Churning Prediction Using Support Vector Machines in Online Auto Insurance Service; Lecture Notes in Computer Science; Springer: Berlin, Germany, 2005; pp. 928–933.
22. Larivière, B.; Van den Poel, D. Predicting customer retention and profitability by using random forests and regression forests techniques. Expert Syst. Appl. 2005, 29, 472–484. [CrossRef]
23. Shao, J.; Li, X.; Liu, W. The Application of AdaBoost in Customer Churn Prediction. In Proceedings of the 2007 International Conference on Service Systems and Service Management, Chengdu, China, 9–11 June 2007; pp. 1–6.
24. Zhao, J.; Dang, X.-H. Bank Customer Churn Prediction Based on Support Vector Machine: Taking a Commercial Bank's VIP Customer Churn as the Example. In Proceedings of the 2008 4th International Conference on Wireless Communications, Networking and Mobile Computing, Dalian, China, 12–17 October 2008; pp. 1–4.
25. Xie, Y.; Li, X.; Ngai, E.W.T.; Ying, W. Customer churn prediction using improved balanced random forests. Expert Syst. Appl. 2009, 36, 5445–5449. [CrossRef]
26. Lee, H.; Lee, Y.; Cho, H.; Im, K.; Kim, Y.S. Mining churning behaviors and developing retention strategies based on a partial least squares (PLS) model. Decis. Support Syst. 2011, 52, 207–216. [CrossRef]
27. Idris, A.; Khan, A.; Lee, Y.S. Genetic Programming and Adaboosting based churn prediction for Telecom. In Proceedings of the 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Seoul, Korea, 14–17 October 2012; pp. 1328–1332.
28. Idris, A.; Rizwan, M.; Khan, A. Churn prediction in telecom using Random Forest and PSO based data balancing in combination with various feature selection strategies. Comput. Electr. Eng. 2012, 38, 1808–1819. [CrossRef]
29. Kirui, C.; Hong, L.; Cheruiyot, W.; Kirui, H. Predicting Customer Churn in Mobile Telephony Industry Using Probabilistic Classifiers in Data Mining. Int. J. Comput. Sci. Issues 2013, 10, 165–172.
30. Keramati, A.; Jafari-Marandi, R.; Aliannejadi, M.; Ahmadian, I.; Mozaffari, M.; Abbasi, U. Improved churn prediction in telecommunication industry using data mining techniques. Appl. Soft Comput. 2014, 24, 994–1012. [CrossRef]
31. Amin, A.; Shehzad, S.; Khan, C.; Ali, I.; Anwar, S. Churn Prediction in Telecommunication Industry Using Rough Set Approach. New Trends Comput. Collect. Intell. 2015, 572, 83–95.
32. Khodabandehlou, S.; Rahman, M.Z. Comparison of supervised machine learning techniques for customer churn prediction based on analysis of customer behavior. J. Syst. Inf. Technol. 2017, 19, 65–93. [CrossRef]
33. Erdem, K.; Dong, X.; Suhara, Y.; Balcisoy, S.; Bozkaya, B.; Pentland, A.S. Behavioral attributes and financial churn prediction. EPJ Data Sci. 2018, 7, 1–18.
34. Amin, A.; Al-Obeidat, F.; Shah, B.; Adnan, A.; Loo, J.; Anwar, S. Customer churn prediction in telecommunication industry using data certainty. J. Bus. Res. 2019, 94, 290–301. [CrossRef]
35. Ahmad, A.K.; Jafar, A.; Aljoumaa, K. Customer churn prediction in telecom using machine learning in big data platform. J. Big Data 2019, 6, 1–24. [CrossRef]
36. Samarati, P.; Sweeney, L. Generalizing Data to Provide Anonymity when Disclosing Information. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Seattle, WA, USA, 1–4 June 1998; p. 188.
37. Sweeney, L. Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 2002, 10, 571–588. [CrossRef]
38. Xu, J.; Wang, W.; Pei, J.; Wang, X.; Shi, B.; Fu, A.W.-C. Utility-based anonymization using local recoding. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 785–790.
39. Cormode, G.; Srivastava, D.; Yu, T.; Zhang, Q. Anonymizing bipartite graph data using safe groupings. Proc. VLDB Endow. 2008, 1, 833–844. [CrossRef]
40. Muntés-Mulero, V.; Nin, J. Privacy and anonymization for very large datasets. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, Hong Kong, China, 2–6 November 2009; pp. 2117–2118.
41. Masoumzadeh, A.; Joshi, J. Preserving Structural Properties in Edge-Perturbing Anonymization Techniques for Social Networks. IEEE Trans. Dependable Secur. Comput. 2012, 9, 877–889. [CrossRef]
42. El Emam, K.; Rodgers, S.; Malin, B. Anonymising and sharing individual patient data. BMJ 2015, 350, h1139. [CrossRef] [PubMed]
43. Goswami, P.; Madan, S. Privacy preserving data publishing and data anonymization approaches: A review. In Proceedings of the 2017 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, 5–6 May 2017; pp. 139–142.
44. Bild, R.; Kuhn, K.A.; Prasser, F. SafePub: A Truthful Data Anonymization Algorithm With Strong Privacy Guarantees. Proc. Priv. Enhancing Technol. 2018, 1, 67–87. [CrossRef]
45. Liu, F.; Hua, K.A.; Cai, Y. Query l-diversity in Location-Based Services. In Proceedings of the 2009 Tenth International Conference on Mobile Data Management: Systems, Services and Middleware, Taipei, Taiwan, 18–20 May 2009; pp. 436–442.
46. Das, D.; Bhattacharyya, D.K. Decomposition+: Improving ℓ-Diversity for Multiple Sensitive Attributes. Adv. Comput. Sci. Inf. Technol. Comput. Sci. Eng. 2012, 85, 403–412.
47. Kern, M. Anonymity: A Formalization of Privacy-l-Diversity. Netw. Archit. Serv. 2013, 49–56. [CrossRef]
48. Mehta, B.B.; Rao, U.P. Improved l-Diversity: Scalable Anonymization Approach for Privacy Preserving Big Data Publishing. J. King Saud Univ. Comput. Inf. Sci. 2019, in press. [CrossRef]
49. Li, N.; Li, T.; Venkatasubramanian, S. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering, Istanbul, Turkey, 16–20 April 2007; pp. 106–115.
50. Liang, H.; Yuan, H. On the Complexity of t-Closeness Anonymization and Related Problems. Database Syst. Adv. Appl. 2013, 7825, 331–345.
51. Domingo-Ferrer, J.; Soria-Comas, J. From t-Closeness to Differential Privacy and Vice Versa in Data Anonymization. Knowl. Based Syst. 2015, 74, 151–158. [CrossRef]
52. Soria-Comas, J.; Domingo-Ferrer, J.; Sánchez, D.; Martínez, S. t-Closeness through microaggregation: Strict privacy with enhanced utility preservation. In Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering (ICDE), Helsinki, Finland, 16–20 May 2016; pp. 1464–1465.
53. Kumar, P.M.V. T-Closeness Integrated L-Diversity Slicing for Privacy Preserving Data Publishing. J. Comput. Theor. Nanosci. 2018, 15, 106–110. [CrossRef]
54. Evfimievski, A. Randomization in privacy preserving data mining. ACM SIGKDD Explor. Newsl. 2002, 4, 43–48. [CrossRef]
55. Aggarwal, C.C.; Yu, P.S. A Survey of Randomization Methods for Privacy-Preserving Data Mining. Adv. Database Syst. 2008, 34, 137–156.
56. Szűcs, G. Random Response Forest for Privacy-Preserving Classification. J. Comput. Eng. 2013, 2013, 397096. [CrossRef]
57. Batmaz, Z.; Polat, H. Randomization-based Privacy-preserving Frameworks for Collaborative Filtering. Procedia Comput. Sci. 2016, 96, 33–42. [CrossRef]
58. Kargupta, H.; Datta, S.; Wang, Q.; Sivakumar, K. Random-data perturbation techniques and privacy-preserving data mining. Knowl. Inf. Syst. 2005, 7, 387–414. [CrossRef]
59. Liu, L.; Kantarcioglu, M.; Thuraisingham, B. The Applicability of the Perturbation Model-based Privacy Preserving Data Mining for Real-world Data. In Proceedings of the 6th IEEE International Conference on Data Mining, Hong Kong, China, 18–22 December 2006; pp. 507–512.
60. Shah, A.; Gulati, R. Evaluating applicability of perturbation techniques for privacy preserving data mining by descriptive statistics. In Proceedings of the 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, India, 21–24 September 2016; pp. 607–613.
61. Upadhyay, S.; Sharma, C.; Sharma, P.; Bharadwaj, P.; Seeja, K.R. Privacy preserving data mining with 3-D rotation transformation. J. King Saud Univ. Comput. Inf. Sci. 2018, 30, 524–530. [CrossRef]
62. Kotschy, W. The New General Data Protection Regulation—Is There Sufficient Pay-Off for Taking the Trouble to Anonymize or Pseudonymize Data? Available online: https://fpf.org/wp-content/uploads/2016/11/Kotschy-paper-on-pseudonymisation.pdf (accessed on 18 August 2019).
63. Stalla-Bourdillon, S.; Knight, A. Anonymous Data v. Personal Data—A False Debate: An EU Perspective on Anonymization, Pseudonymization and Personal Data. Wis. Int. Law J. 2017, 34, 284–322.
64. Neumann, G.K.; Grace, P.; Burns, D.; Surridge, M. Pseudonymization risk analysis in distributed systems. J. Internet Serv. Appl. 2019, 10, 1–16. [CrossRef]
65. Telco Customer Churn Dataset. Available online: https://www.kaggle.com/blastchar/telco-customer-churn (accessed on 23 January 2019).
66. Tuv, E.; Borisov, A.; Runger, G.; Torkkola, K. Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination. J. Mach. Learn. Res. 2009, 10, 1341–1366.
67. Mafarja, M.; Heidari, A.A.; Faris, H.; Mirjalili, S.; Aljarah, I. Dragonfly Algorithm: Theory, Literature Review, and Application in Feature Selection. Nat. Inspired Optim. 2019, 811, 47–67.
68. Mirjalili, S. Dragonfly algorithm: A new meta-heuristic optimization technique for solving single-objective, discrete, and multi-objective problems. Neural Comput. Appl. 2016, 27, 1053–1073. [CrossRef]
69. Kursa, M.B.; Rudnicki, W.R. Feature Selection with the Boruta Package. J. Stat. Softw. 2010, 36, 1–13. [CrossRef]
70. Biskup, J.; Flegel, U. Transaction-Based Pseudonyms in Audit Data for Privacy Respecting Intrusion Detection. In Proceedings of the Third International Workshop on Recent Advances in Intrusion Detection, London, UK, 2–4 October 2000; pp. 28–48.
71. Privacy-Preserving Storage and Access of Medical Data through Pseudonymization and Encryption. Available online: https://www.xylem-technologies.com/2011/09/privacy-preserving-storage-and-access-of-medical-data-through-pseudonymization-and-encryption/ (accessed on 19 August 2019).
72. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA Data Mining Software: An Update. SIGKDD Explor. 2009, 11, 10–18. [CrossRef]
73. Riza, L.S.; Nugroho, E.P. metaheuristicOpt: Metaheuristic for Optimization. Available online: https://cran.r-project.org/web/packages/metaheuristicOpt/metaheuristicOpt.pdf (accessed on 21 April 2019).
74. An R Package to Generate Synthetic Data with Realistic Empirical Probability Distributions. Available online: https://github.com/avirkki/synergetr (accessed on 23 May 2019).
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).