(IJARAI) International Journal of Advanced Research in Artificial Intelligence,
Vol. 4, No.10, 2015
40 | P a g e
www.ijarai.thesai.org
Application of K-Means Algorithm for Efficient
Customer Segmentation: A Strategy for Targeted
Customer Services
Chinedu Pascal Ezenkwu, Simeon Ozuomba, Constance Kalu
Electrical/Electronics & Computer Engineering Department, University of Uyo, Uyo, Akwa Ibom State, Nigeria
Abstract—The emergence of many business competitors has engendered severe rivalries among competing businesses in gaining new customers and retaining old ones. In view of this, the need for exceptional customer service becomes pertinent, regardless of the size of the business. Furthermore,
the ability of any business to understand each of its customers’
needs will earn it greater leverage in providing targeted customer
services and developing customised marketing programs for the
customers. This understanding can be possible through
systematic customer segmentation. Each segment comprises
customers who share similar market characteristics. The ideas of
Big data and machine learning have fuelled a terrific adoption of
an automated approach to customer segmentation in preference
to traditional market analyses that are often inefficient especially
when the number of customers is too large. In this paper, the k-
Means clustering algorithm is applied for this purpose. A
MATLAB program of the k-Means algorithm was developed (available in the appendix) and trained using a z-score-normalised two-feature dataset of 100 training patterns
acquired from a retail business. The features are the average
amount of goods purchased by customer per month and the
average number of customer visits per month. From the dataset,
four customer clusters or segments were identified with 95%
accuracy, and they were labeled: High-Buyers-Regular-Visitors
(HBRV), High-Buyers-Irregular-Visitors (HBIV), Low-Buyers-
Regular-Visitors (LBRV) and Low-Buyers-Irregular-Visitors
(LBIV).
Keywords—machine learning; data mining; big data; customer segmentation; MATLAB; k-Means algorithm; customer service; clustering; extrapolation
I. INTRODUCTION
Over the years, the increase in competition amongst
businesses and the availability of large historical data
repositories have prompted the widespread applications of
data mining techniques in uncovering valuable and strategic
information buried in organisations’ databases. Data mining is
the process of extracting meaningful information from a
dataset and presenting it in a human understandable format for
the purpose of decision support. The data mining techniques
intersect areas such as statistics, artificial intelligence,
machine learning and database systems. The applications of data mining include, but are not limited to, bioinformatics, weather forecasting, fraud detection, financial analysis and customer
segmentation. The thrust of this paper is to identify customer
segments in a retail business using a data mining approach.
Customer segmentation is the subdivision of a business
customer base into groups called customer segments such that
each customer segment consists of customers who share
similar market characteristics. This segmentation is based on
factors that can directly or indirectly influence market or
business such as products preferences or expectations,
locations, behaviours and so on. The importance of customer
segmentation include, inter alia, the ability of a business to
customise market programs that will be suitable for each of its
customer segments; business decision support in terms of risky situations such as credit relationships with its customers; identification of products associated with each segment and how to manage the forces of demand and supply; unravelling
some latent dependencies and associations amongst
customers, amongst products, or between customers and
products which the business may not be aware of; ability to
predict customer defection, and which customers are most
likely to defect; and raising further market research questions
as well as providing directions to finding the solutions.
Clustering has proven efficient in discovering subtle but
tactical patterns or relationships buried within a repository of
unlabelled datasets. This form of learning is classified under
unsupervised learning. Clustering algorithms include the k-Means algorithm, hierarchical agglomerative clustering, the Self-Organising Map (SOM) and so on. These algorithms, without any
knowledge of the dataset beforehand, are capable of
identifying clusters therein by repeated comparisons of the
input patterns until the stable clusters in the training examples
are achieved based on the clustering criterion or criteria. Each
cluster contains data points that have very close similarities
but differ considerably from data points of other clusters.
Clustering has immense applications in pattern recognition, image analysis, bioinformatics and so on. In this
paper, the k-Means clustering algorithm has been applied in
customer segmentation. A MATLAB program (Appendix) of
the k-Means algorithm was developed, and the training was
realised using z-score normalised two-feature dataset of 100
training patterns acquired from a retail business. After several
iterations, four stable clusters or customer segments were
identified. The two features considered in the clustering are
the average amount of goods purchased by customer per
month and the average number of customer visits per month.
From the dataset, four customer clusters or segments were
identified and labelled thus: High-Buyers-Regular-Visitors
(HBRV), High-Buyers-Irregular-Visitors (HBIV), Low-
Buyers-Regular-Visitors (LBRV) and Low-Buyers-Irregular-
Visitors (LBIV). Furthermore, for any input pattern that was
not in the training set, its cluster can be correctly extrapolated by normalising it and computing its similarity to the centroid associated with each of the clusters. It is then assigned to the cluster with which it has the closest similarity.
II. LITERATURE REVIEW
A. Customer Segmentation
Over the years, the commercial world has become more competitive; as such, organizations have to satisfy the needs and wants of their customers, attract new customers, and hence enhance their businesses [1]. Identifying and satisfying the needs and wants of each customer in a business is a very complex task, because customers differ in their needs, wants, demography, geography, tastes and preferences, behaviours and so on. As such, it is wrong practice to treat all the customers equally in business. This
challenge has motivated the adoption of the idea of customer
segmentation or market segmentation, in which the customers
are subdivided into smaller groups or segments wherein
members of each segment show similar market behaviours or
characteristics. According to [2], customer segmentation is a
strategy of dividing the market into homogenous groups. [3] posits that "the purpose of segmentation is the concentration of marketing energy and force on subdivision (or market segment) to gain a competitive advantage within the segment. It is analogous to the military principle of concentration of force to overwhelm the enemy." Customer or market
segmentation includes geographic segmentation, demographic
segmentation, media segmentation, price segmentation,
psychographic or lifestyle segmentation, distribution
segmentation and time segmentation [3].
B. Big Data
Recently, research in Big data has gained momentum. [4] defines Big data as "the word describing the large volume of both structured and unstructured data, which cannot be analyzed using traditional techniques and algorithm."
According to [5], "the amount of data in our world has been exploding. Companies capture trillions of bytes of information about their customers, suppliers, and operations, and millions of networked sensors are being embedded in the physical world in devices such as mobile phones and automobiles, sensing, creating, and communicating data." Big data has
demonstrated the capacity to improve predictions, save
money, boost efficiency and enhance decision-making in
fields as disparate as traffic control, weather forecasting,
disaster prevention, finance, fraud control, business
transaction, national security, education, and health care [6].
Big data is mainly characterised by three V's, namely volume, variety and velocity. Two further V's, veracity and value, are often added, making five V's [4]. Volume refers to the vast
amount of data in Zettabytes or Brontobytes being generated
per minute; velocity refers to speed at which new data is
created or the speed at which existing data moves around;
variety refers to different types of data; veracity describes the
degree of messiness or trustworthiness of data; and value
refers to the worth of information that can be mined from
data. The last V, value is what makes Big data and data
mining interesting to businesses and organisations.
C. Clustering and k-Means Algorithm
According to [7], clustering is the unsupervised
classification of patterns (observations, data items, or feature
vectors) into groups (clusters). [8] opined that clustering
algorithms generate clusters having similarity between data
objects based on some characteristics. Clustering is extensively used in many areas such as pattern recognition, computer science, medicine and machine learning. [6] states that
"formally, the cluster structure is represented as a set of subsets C = C1, ..., Ck of S, such that S = C1 ∪ C2 ∪ ... ∪ Ck and Ci ∩ Cj = ∅ for i ≠ j. Consequently, instances in S belong to exactly one and only one subset". Clustering algorithms have been classified
into hierarchical and partitional clustering algorithms.
Hierarchical clustering algorithms create clusters based on some hierarchy; they rest on the idea of objects being more related to nearby objects than to objects farther away [6]. Hierarchical clustering can be top-down or bottom-up: the top-down approach is referred to as divisive, while the bottom-up approach is known as agglomerative. Partitional clustering algorithms create various partitions and then evaluate them by some criterion. The k-Means algorithm is one of the most popular partitional clustering algorithms [4]. It is a centroid-based algorithm in which each data point is placed in exactly one of K non-overlapping clusters, with K selected before the algorithm is run.
The k-Means algorithm works thus: given a set of d-dimensional training input vectors {x1, x2, ..., xn}, the k-Means clustering algorithm partitions the n training examples into k sets of data points or clusters S = {S1, S2, ..., Sk}, where k ≤ n, such that the within-cluster sum of squares is minimised. That is,

  arg min_S Σ_{i=1}^{k} Σ_{x ∈ Si} ||x − μi||²   (1)

where μi is the centroid or mean of the data points in cluster Si.
Generic k-Means clustering algorithm:
1) Decide on the number of clusters, k.
2) Initialise the k cluster centroids.
3) Assign the n data points to their nearest clusters.
4) Update the centroid of each cluster using the data points therein.
5) Repeat steps 3 and 4 until the changes in the positions of the centroids are zero.
III. METHODOLOGY
The data used in this paper was collected from a mega
retail business outfit that has many branches in Akwa Ibom
state, Nigeria. The dataset consists of 2 attributes and 100
tuples, representing 100 selected customers. The two attributes
include the average amount of goods purchased by a customer per month and the average number of customer visits per month. In this paper, four steps were adopted in realising an accurate result: feature normalisation, followed by centroid initialisation, assignment and updating, the latter three being the major generic steps of the k-Means algorithm.
A. Feature normalisation
This is a data preparation stage. Feature normalisation
helps to adjust all the data elements to a common scale in
order to improve the performance of the clustering algorithm.
Each data point is converted to the range of -2 to +2.
Normalisation techniques include Min-max, decimal scaling
and z-score. The z-score normalisation technique was used to
normalise the features before running the k-Means algorithm
on the dataset. Equation (2) gives the formulae for
normalisation using the z-score technique.
  x' = (x − μf) / σf   (2)

where x' is the normalised value of x in feature vector f, μf is the mean of the feature vector f, and σf is the standard deviation of feature vector f.
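Equation (2) amounts to subtracting each feature's mean and dividing by its standard deviation. A minimal sketch in Python/numpy for illustration (the paper's code is MATLAB; note that MATLAB's `std` uses the sample formula, while numpy defaults to the population formula):

```python
import numpy as np

def z_score(X):
    # Normalise each feature (column) to zero mean and unit standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)
```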
B. Centroids Initialisation
The initial centroids or means were chosen. Figure 1 presents the initialisation of the cluster centres: four cluster centres, shown in different shapes, were selected using the Forgy method, in which k (in this case k = 4) data points are randomly selected as the cluster centroids.
Fig. 1. The initialization stage of k-Means algorithm
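The Forgy method described above can be sketched as follows; this is an illustrative Python/numpy version, and `forgy_init` is an assumed helper name:

```python
import numpy as np

def forgy_init(X, k, seed=None):
    # Forgy method: choose k distinct data points at random as the initial centroids
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx]
```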
C. Assignment Stage
In the assignment stage, each data point is assigned to the
cluster whose centroid yields the least within cluster sum of
squares compared with the other clusters. That is, the squared Euclidean norm of each data point from each of the current centroids is computed, and the data point is then assigned membership of the cluster that gives the minimum squared Euclidean norm.
This is expressed mathematically in equation (3):

  Si(t) = { xp : ||xp − mi(t)||² ≤ ||xp − mj(t)||² for all j, 1 ≤ j ≤ k }   (3)

where each data point xp is assigned to exactly one cluster or set Si(t) at iteration t, and mi(t) is the centroid of cluster i at iteration t.
D. Updating Stage
After each iteration, a new centroid is computed for each cluster as the mean of all the data points present in the cluster, as shown in equation (4):

  mi(t+1) = (1 / |Si(t)|) Σ_{xj ∈ Si(t)} xj   (4)

where mi(t+1) is the updated centroid.
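Equation (4), the centroid update, in the same illustrative Python/numpy style:

```python
import numpy as np

def update_centroids(X, labels, k):
    # New centroid of cluster i = mean of the data points currently assigned to it
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])
```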
Fig. 2 presents the positions of the centroids and the updated assignment of their cluster members after the 30th iteration. Each cluster member assumes the same shape as its cluster centroid. Table I shows the changes in the cluster centroids from the initialisation stage (iteration 0) to the 5th iteration.
Fig. 2. Positions of the centroids and their cluster members after the 30th
iteration
TABLE I. INITIALISATION AND UPDATING OF THE CLUSTER VECTORS OR CENTROIDS
IV. RESULTS AND DISCUSSION
The k-Means clustering algorithm converged after 100
iterations. That is, the cluster centroids became stable. Figure
3 shows the graph of the converged data points and centroids.
At this point, the k-Means algorithm had clustered almost all the data points correctly. The centroids or cluster vectors after convergence are:
Cluster Centre +: (-0.8325, 0.9574)
Cluster Centre *: (0.7403, -1.0926)
Cluster Centre O: (-0.8279, -0.7217)
Cluster Centre X: (0.8444, 0.8412)
Each of the clusters represents a customer segment. From
Figure 3, the data points at the right-hand top corner represent HBRV; the data points at the left-hand top corner represent HBIV; the data points at the right-hand lower corner represent LBRV; while those at the left-hand lower corner represent LBIV. This is clearly shown in Table II.
Initialised cluster centroids (iteration 0) and updated cluster centroids (iterations 1 to 5), Table I:

Iteration  Cluster Centre +    Cluster Centre *    Cluster Centre O    Cluster Centre X
0          (-0.0892, 1.3654)   (0.6541, -1.0856)   (-0.2131, -0.3669)  (-0.2131, -0.3669)
1          (0.5656, 1.0971)    (0.8733, -0.9508)   (-0.6306, -0.6728)  (-0.6306, -0.6728)
2          (0.5798, 1.0456)    (0.9976, -0.9639)   (-0.5466, -0.8295)  (-0.5466, -0.8295)
3          (0.5502, 1.0346)    (1.0376, -0.9348)   (-0.5600, -0.9284)  (-0.5600, -0.9284)
4          (0.5502, 1.0346)    (1.0376, -0.9348)   (-0.5641, -0.9557)  (-0.5641, -0.9557)
5          (0.5502, 1.0346)    (1.0376, -0.9348)   (-0.5901, -0.9894)  (-0.5901, -0.9894)
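Given the converged centroids and the cluster-to-segment mapping reported in this paper, a new customer pattern (after z-score normalisation with the training statistics) can be extrapolated to a segment by nearest-centroid lookup. A sketch, where `segment_of` is an assumed helper name:

```python
import numpy as np

# Converged cluster centroids reported in the paper, keyed by segment label
CENTROIDS = {
    "HBIV": np.array([-0.8325,  0.9574]),  # Cluster Centre +
    "LBRV": np.array([ 0.7403, -1.0926]),  # Cluster Centre *
    "LBIV": np.array([-0.8279, -0.7217]),  # Cluster Centre O
    "HBRV": np.array([ 0.8444,  0.8412]),  # Cluster Centre X
}

def segment_of(pattern):
    # Assign the normalised pattern to the segment with the nearest centroid
    return min(CENTROIDS, key=lambda s: float(np.sum((pattern - CENTROIDS[s]) ** 2)))
```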
TABLE II. DESCRIPTION OF EACH CLUSTER IN TERMS OF THE CUSTOMER SEGMENT

Customer segment   Cluster
HBIV               Cluster +
HBRV               Cluster X
LBIV               Cluster O
LBRV               Cluster *
V. PERFORMANCE EVALUATION
The purity measure was used to measure the extent to which each cluster contains a single class of data points. The purity of each cluster is computed with equation (5):

  Purity(Di) = max_j (p_ij)   (5)

where p_ij is the proportion of class-j data points in cluster i, or Di.
The total purity of the whole clustering, i.e. considering all the clusters, is given by equation (6):

  Purity_total = Σ_{i=1}^{k} (|Di| / |D|) × Purity(Di)   (6)

where |D| is the total number of data points being classified.
The confusion matrix is presented in Table III.

TABLE III. CONFUSION MATRIX

Cluster      HBIV   HBRV   LBRV   LBIV   Purity
Cluster +      21      1      0      0    0.954
Cluster X       0     28      0      0    1.000
Cluster O       2      0      1     24    0.889
Cluster *       0      0     22      1    0.957
Total          23     29     23     25    0.950
Since Purity_total = 0.95 (from the Total row of Table III), the clustering algorithm was 95% accurate in performing the customer segmentation.
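The purity figures can be checked numerically from the confusion matrix. A Python/numpy sketch; the LBIV column of the matrix below is inferred from the reported per-cluster purities and row/column totals, since that column is not legible here:

```python
import numpy as np

# Confusion matrix rows: clusters +, X, O, *; columns: HBIV, HBRV, LBRV, LBIV
confusion = np.array([
    [21,  1,  0,  0],   # Cluster +
    [ 0, 28,  0,  0],   # Cluster X
    [ 2,  0,  1, 24],   # Cluster O
    [ 0,  0, 22,  1],   # Cluster *
])

# Equation (5): purity of each cluster = largest class proportion within it
cluster_purity = confusion.max(axis=1) / confusion.sum(axis=1)

# Equation (6): total purity = cluster purities weighted by cluster sizes,
# which reduces to (sum of per-cluster maxima) / (total number of points)
total_purity = confusion.max(axis=1).sum() / confusion.sum()
```

With these counts the per-cluster purities round to 0.954, 1.000, 0.889 and 0.957, and the total purity is 0.95, matching the reported values.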
VI. CONCLUSIONS
This paper has presented a MATLAB implementation of the
k-Means clustering algorithm for customer segmentation
based on data collected from a mega business retail outfit that
has many branches in Akwa Ibom state, Nigeria. The
algorithm has a purity measure of 0.95 indicating 95%
accurate segmentation of the customers. Insight into the
business’s customer segmentation will avail it with the
following advantages: the ability of the business to customise
market programs that will be suitable for each of its customer
segments; business decision support in terms of risky situations such as credit relationships with its customers; identification of products associated with each segment and
how to manage the forces of demand and supply; unravelling
some latent dependencies and associations amongst
customers, amongst products, or between customers and
products which the business may not be aware of; ability to
predict customer defection and which customers are most
likely to defect; and raising further market research questions
as well as providing directions to finding the solutions.
Fig. 3. The centroids converge after the 100th iteration
REFERENCES
[1] P. Premkanth, "Market Segmentation and Its Impact on Customer Satisfaction with Especial Reference to Commercial Bank of Ceylon PLC," Global Journal of Management and Business Research, Global Journals Inc. (USA), vol. 12, issue 1, 2012. Print ISSN: 0975-5853.
[2] S. Goyat, "The basis of market segmentation: a critical review of literature," European Journal of Business and Management, vol. 3, no. 9, 2011. ISSN 2222-1905 (Paper), ISSN 2222-2839 (Online).
[3] J. W. Thomas, "Market Segmentation," 2007. Retrieved from www.decisionanalyst.com on 12 July 2015.
[4] T. Nelson Gnanaraj, K. Ramesh Kumar and N. Monica, "Survey on mining clusters using new k-mean algorithm from structured and unstructured data," International Journal of Advances in Computer Science and Technology, vol. 3, no. 2, 2007.
[5] McKinsey Global Institute, "Big data: The next frontier for innovation, competition, and productivity," 2011. Retrieved from www.mckinsey.com/mgi on 14 July 2015.
[6] J. Yan, "Big Data, Bigger Opportunities - Data.gov's roles: Promote, lead, contribute, and collaborate in the era of big data," 2013. Retrieved from http://www.meritalk.com/pdfs/bdx/bdx-whitepaper-090413.pdf on 14 July 2015.
[7] A. K. Jain, M. N. Murty and P. J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, vol. 31, no. 3, 1999.
[8] V. R. Patel and R. G. Mehta, "Impact of Outlier Removal and Normalization Approach in Modified k-Means Clustering Algorithm," IJCSI International Journal of Computer Science Issues, vol. 8, issue 5, no. 2, September 2011. ISSN (Online): 1694-0814.
[9] M. A. Bhuiyan and H. Hama, "Identification of Actors Drawn in Ukiyoe Pictures," Pattern Recognition, vol. 35, no. 1, pp. 93-102, 2002.
[10] S. O. Olatunji, M. Al-Ahmadi, M. Elshafei, and Y. A. Fallatah, "Saudi Arabia stock prices forecasting using artificial neural networks," pp. 81-86, 2011.
[11] Q. Wen, Z. Yang, Y. Song, and P. Jia, "Automatic stock decision support system based on box theory and SVM algorithm," Expert Systems with Applications, vol. 37, no. 2, pp. 1015-1022, Mar. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.eswa.2009.05.093
[12] P.-C. Chang, C.-Y. Fan, and J.-L. Lin, "Trend discovery in financial time series data using a case based fuzzy decision tree," Expert Systems with Applications, vol. 38, no. 5, pp. 6070-6080, May 2011. [Online]. Available: http://dx.doi.org/10.1016/j.eswa.2010.11.006
APPENDIX
clc;clf;close;clear all;
load CustData % Data file containing 100-by-2 training examples, X
%Normalisation and Selection of initial centroids
X = [(X(:,1)-mean(X(:,1)))/std(X(:,1)) (X(:,2)-mean(X(:,2)))/std(X(:,2))];
% Pick four distinct random indices for the initial centroids
i = randi(length(X)); j = i; k = i; l = i;
while j==i
    j = randi(length(X));
end
while k==i || k==j
    k = randi(length(X));
end
while l==i || l==j || l==k
    l = randi(length(X));
end
centr1 = X(i,:);centr2 = X(j,:); centr3 = X(k,:);centr4 = X(l,:);
%Initial plots of points and position of initial centroids
plot(X(:,1),X(:,2),'.k','MarkerSize',15)
hold on
plot(centr1(1),centr1(2),'+r','MarkerSize',18,'LineWidth',3)
plot(centr2(1),centr2(2),'*b','MarkerSize',18,'LineWidth',3)
plot(centr3(1),centr3(2),'Og','MarkerSize',18,'LineWidth',3)
plot(centr4(1),centr4(2),'Xm','MarkerSize',18,'LineWidth',3)
title('Initialisation of cluster centres')
xlabel('Normalised Average No of visits per month (X2)')
ylabel('Normalised Average Amount of Goods Purchased per month (X1)')
hold off;
%Iterations to update Centroids and assign clusters members
count = 1;
while count <= 100 % iterate until the 100th pass; the centroids stabilise within this
d1 = (X - [ones(length(X),1)*centr1(1) ones(length(X),1)*centr1(2)]).^2;
d2 = (X - [ones(length(X),1)*centr2(1) ones(length(X),1)*centr2(2)]).^2;
d3 = (X - [ones(length(X),1)*centr3(1) ones(length(X),1)*centr3(2)]).^2;
d4 = (X - [ones(length(X),1)*centr4(1) ones(length(X),1)*centr4(2)]).^2;
d11 = d1(:,1)+d1(:,2);
d22 = d2(:,1)+d2(:,2);
d33 = d3(:,1)+d3(:,2);
d44 = d4(:,1)+d4(:,2);
row1 = d11<d22 & d11<d33 & d11<d44;
row2 = d22<d11 & d22<d33 & d22<d44;
row3 = d33<d22 & d33<d11 & d33<d44;
row4 = d44<d22 & d44<d11 & d44<d33;
cluster1 = X(row1,:);
cluster2 = X(row2,:);
cluster3 = X(row3,:);
cluster4 = X(row4,:);
centr1 = [mean(cluster1(:,1)) mean(cluster1(:,2))];
centr2 = [mean(cluster2(:,1)) mean(cluster2(:,2))];
centr3 = [mean(cluster3(:,1)) mean(cluster3(:,2))];
centr4 = [mean(cluster4(:,1)) mean(cluster4(:,2))];
count = count + 1;
end
% Plot the final centroids positions and cluster data points
figure; hold on;
plot(cluster1(:,1),cluster1(:,2),'+r','MarkerSize',10)
plot(cluster2(:,1),cluster2(:,2),'*b','MarkerSize',10)
plot(cluster3(:,1),cluster3(:,2),'og','MarkerSize',10)
plot(cluster4(:,1),cluster4(:,2),'Xm','MarkerSize',10)
plot(centr1(1),centr1(2),'+r','MarkerSize',18,'LineWidth',3)
plot(centr2(1),centr2(2),'*b','MarkerSize',18,'LineWidth',3)
plot(centr3(1),centr3(2),'Og','MarkerSize',18,'LineWidth',3)
plot(centr4(1),centr4(2),'Xm','MarkerSize',18,'LineWidth',3)
plot([-2 0 2],[0 0 0],'-k')
plot([0 0 0],[-2 0 2],'-k')
title('100th Iteration')
xlabel('Normalised Average No of visits per month (X2)')
ylabel('Normalised Average Amount of Goods Purchased per month (X1)')
... Therefore, for a predictive QSAR model, the points of the test set must be similar to the points of the training set in terms of molecular representativeness in the descriptor space [47]. Besides, data clustering is the process of identifying groups in the dataset, known as clusters, based on some similarity measures [67,22]. Leonard and Roy [47] proposed that k-means clustering based division of training and test set as reliable method for QSAR modelling. ...
... Effective clustering maximizes intra-cluster similarities and minimizes inter-cluster similarities [84]. K-means clustering is a centroid based clustering [84], that capable of identifying clusters by repeated comparison of input patterns until the stable clusters obtained based on the clustering criteria [22]. In the k-means clustering, inter-cluster *IGA = information gain amount. ...
Article
Comedogenicity is a common adverse reaction to cosmetic ingredients that cause blackheads or pimples by blocking the pores, especially for acne-prone skin. Before animal testing was banned by European Commission in 2013, comedogenic potential of cosmetics were tested on rabbits. However, full replacement of animal tests by alternatives have not been yet possible. Therefore, there is a need for applying new approach methodologies. In this study, we aimed to develop a QSAR model to predict comedogenic potential of cosmetic ingredients by using different machine learning algorithms and types of molecular descriptors. The dataset consists of 121 cosmetic ingredients including such as fatty acids, fatty alcohols and their derivatives and pigments tested on rabbit ears was obtained from the literature. 4837 molecular descriptors were calculated via various software. Different machine learning classification algorithms were used in the modelling studies with WEKA software. The model performance was evaluated by using 10-fold cross validation. All models were compared by the means of classification accuracy, area under the ROC curve, area under the precision-recall curve, MCC, F score, kappa statistic, sensitivity, specificity and the best model was chosen accordingly. The QSAR modelling results for two models are promising for comedogenicity prediction. The random forest models by the means of Mold2 and alvaDesc descriptors gave the successful results with 85.87% and 84.87% accuracy for the cross-validated models and 75.86% and 79.31% accuracy for the test sets. In conclusion, this study is the first step in terms of comedogenicity prediction. In the near future, advances in silico modelling studies will provide us non-animal based alternative models by regarding animal rights and ethical issues for the safety evaluation of cosmetics.
... The k-means algorithm has been widely used in market segmentation. The review of different geo-marketing research revealed that k-means has been popular, especially in clustering applications in combination with GIS [67,74]. Azri argued that segmenting the marketing data could increase time efficiency and sales volume. ...
... He used a MATLAB implementation of the k-means clustering based on data collected from a mega business retail outfit that has many branches. The result of this study shows that the algorithm has a purity measure of 0.95, indicating 95% accurate segmentation of the customers [74]. Even though the k-means algorithm is widely used in geo-clustering tasks, however, this clustering method has significant disadvantages versus deep-learning algorithms. ...
Article
Full-text available
Spatial clustering is a fundamental instrument in modern geo-marketing. The complexity of handling of high-dimensional and geo-referenced data in the context of distribution networks imposes important challenges for marketers to catch the right customer segments with useful pattern similarities. The increasing availability of the geo-referenced data also places more pressure on the existing geo-marketing methods and makes it more difficult to detect hidden or non-linear relationships between the variables. In recent years, artificial neural networks have been established in different disciplines such as engineering, medical diagnosis, or finance, to solve complex problems due to their high performance and accuracy. The purpose of this paper is to perform a market segmentation by using unsupervised deep learning with self-organizing maps in the B2B industrial automation market across the United States. The results of this study demonstrate a high clustering performance (4 × 4 neurons) as well as a significant dimensionality reduction by using self-organizing maps. The high level of visualization of the maps out of the initially unorganized data set allows a comprehensive interpretation of the different clusters and patterns across space. The centroids of the clusters have been identified as footprints for assigning new marketing channels to ensure a better market coverage.
... Clustering algorithms have been classified into hierarchical and partitional clustering algorithms. Clusters are created based on top-down hierarchy referred to as divisive or bottom-up hierarchy referred to as agglomerative, whereas in partitional algorithms various partitions are created and evaluated based on some criterion [1][2][3][4][5][6] [13][14][15]. Behavioral segmentation is one of the methods which suggest clustering based on customer habits [1]. ...
Chapter
This paper addresses the analysis of mobile payment parking data for client segmentation. The transaction data transformation into client-specific attributes is performed from the company data set to achieve the goal. Two clustering algorithms – K-Means and DBScan – are compared for multiple data subsets. For the clustering result interpretation, decision tree representation is used. As a result, the most appropriate combination of the clustering algorithm, its parameters and attribute combination is determined.
Article
This study aims to identify telecom customer segments by utilizing machine learning and subsequently develop a web-based dashboard. The dashboard visualizes the cluster analysis based on demographics, behavior, and region features. The study applied analytic pipeline that involved five stages i.e. data generation, data pre-processing, data clustering, clusters analysis, and data visualization. Firstly, the customer’s dataset was generated using Faker Python package. Secondly was the pre-processing which includes the dimensionality reduction of the dataset using the PCA technique and finding the optimal number of clusters using the Elbow method. Unsupervised machine learning algorithm K-means was used to cluster the data, and these results were analyzed and labeled with labels and descriptions. Lastly, a dashboard was developed using Microsoft Power BI to visualize the clustering results in meaningful analysis. According to the results, four customer clusters were obtained. An interactive web-based dashboard called INSIGHT was developed to provide analysis of customer segments based on demographic, behavioral, and regional traits; and to devise customized query for deeper analysis. The correctness of the clustering results was evaluated and achieved a satisfactory Silhouette Score of 0.3853. Hence, the telecom could target their customers accurately based on their needs and preferences to increase service satisfaction.
Chapter
Customer segmentation has become a popular method for dividing a company's customers in order to retain them and profit from them. In this study, the customers of different organizations are classified on the basis of behavioral characteristics such as spending and income; taking behavioral aspects into consideration makes this approach more efficient than others. The k-means clustering algorithm is used for this classification, grouping customers by their behavioral characteristics. The resulting clusters help the company target individual customers and advertise content they are genuinely interested in through marketing campaigns and social media sites.
Conference Paper
At present, most people prefer to buy things online through e-commerce websites, considering advantages such as reduced travel cost, time savings and offers. Although these advantages pave the way for e-commerce growth, they are not sufficient on their own; the constant factor that supports the growth of an e-commerce company is marketing. Marketing plays a major role in promoting a product or idea in both offline and online business. If a product is available only in a particular region, direct marketing is efficient for making it a success; but shopping websites like Amazon, Flipkart, Snapdeal etc. cannot implement direct marketing because their reach is not limited to a particular region. In this case, affiliate marketing acts as an intermediary between consumers and e-commerce companies: a person uses their own marketing strategy to promote the company's products on social networks by becoming an affiliate. Affiliates are allocated a unique referral link, and if a person buys a product through that link, the company pays the affiliate a commission. In this paper, the K-Means clustering algorithm is implemented on a dataset in a PySpark environment to help affiliates reach the right customer with the right product.
Article
Among machine learning techniques, classification is useful for various business applications, but classification algorithms perform poorly on imbalanced data. In this study, we propose a classification technique with improved binary classification performance on both the minority and majority classes of imbalanced structured data. The proposed framework comprises three steps. First, a balanced training set is created via under-sampling. Each example is then converted into an image depicting a line graph. Finally, a Convolutional Neural Network (CNN) is trained on the images. In the experiments, we selected six datasets from the UCI Repository and applied the proposed framework to them. The proposed model achieved the best receiver operating characteristic (ROC) curve on all six datasets and the best Balanced Accuracy (BA) on five, demonstrating that combining under-sampling with CNNs is a viable approach to imbalanced structured-data classification.
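The first step of the framework above, balancing the training set by under-sampling, can be sketched in a few lines of pure Python (the data, function name and 50/50 target ratio are illustrative assumptions, not the paper's implementation):

```python
import random

def undersample(examples, labels, seed=0):
    """Randomly drop majority-class examples until every class
    is reduced to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n_min):  # keep n_min random examples per class
            balanced.append((x, y))
    rng.shuffle(balanced)
    return balanced

X = list(range(10))       # hypothetical example ids
y = [0] * 8 + [1] * 2     # 8 majority-class, 2 minority-class labels
balanced = undersample(X, y)
```

A known trade-off of this step is that discarded majority examples carry information the classifier never sees, which is why the framework compensates with the subsequent image-based CNN stage.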
Article
Maps of aquifer vulnerability to contamination form part of an early-warning system for avoiding the deterioration of groundwater quality. Weighted index overlay methods are commonly used to map aquifer vulnerability, but their disadvantages motivate alternative methods that introduce fewer a priori assumptions in parameter processing and allow a more precise interpretation of the final results. The objective of this research was to evaluate the vulnerability to contamination of groundwater in the Almendares-Vento karstic basin, Havana, Cuba, using data mining, and to compare the results with those obtained by applying the RISK method, a weighted index overlay method for studying karstic aquifers. The variables selected for this unsupervised classification technique were: aquifer lithology, topographic slope of the terrain, soil attenuation index for pollutants, fault density per km², and the presence of direct infiltration zones. The cluster analysis achieved greater spatial discrimination and better definition of areas with different degrees of vulnerability, demonstrating its high resolving power.
Article
Big data refers to large collections of data of various types that must be processed at much higher frequency. One of the most popular knowledge discovery approaches is to find frequent items in a transaction data set and derive association rules from them; pattern finding is one of the most computationally expensive steps on large data sets. Association rules express relationships between items, or sets of items, in large collections of data, and they play an important role in mining data for sequential patterns. Apriori is a classic algorithm for learning association rules, designed to operate on databases of transactions. It attempts to find the subsets that are common to at least a minimum number C of the item sets, using a "bottom-up" approach in which frequent subsets are extended one item at a time and groups of candidates are tested against the data; the algorithm terminates when no further successful extensions are found. In this paper we enhance the Apriori algorithm to reduce its complexity over large data sets: we first collect a variety of data, then integrate both structured and unstructured data using MapReduce to extract sequential patterns from the required data sets.
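The "bottom-up" candidate extension described above can be sketched as a minimal pure-Python Apriori (the basket data are hypothetical, and the candidate generation omits the usual subset-pruning refinement for brevity):

```python
def apriori(transactions, min_support):
    """Bottom-up frequent-itemset mining: frequent k-itemsets are
    extended one item at a time and the candidates are re-tested
    against the data; stops when no extension is frequent."""
    items = {i for t in transactions for i in t}
    freq = {}
    k, current = 1, [frozenset([i]) for i in sorted(items)]
    while current:
        counted = {}
        for cand in current:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:   # the "minimum number C" test
                counted[cand] = support
        freq.update(counted)
        # join frequent k-itemsets to build (k+1)-item candidates
        frequent = list(counted)
        current = {a | b for a in frequent for b in frequent
                   if len(a | b) == k + 1}
        k += 1
    return freq

baskets = [frozenset(t) for t in
           [{"bread", "milk"}, {"bread", "beer"},
            {"bread", "milk", "beer"}, {"milk"}]]
frequent = apriori(baskets, min_support=2)
```

On these four baskets the algorithm keeps all three single items, the pairs {bread, milk} and {bread, beer}, and terminates at size three because no 3-itemset reaches the support threshold.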
Article
Clustering focuses on pattern recognition for further organizational analysis: it finds groups of data objects such that objects within a group are similar to one another and dissimilar from objects in other groups. Pre-processing is important because of noisy data, errors, inconsistencies, outliers and missing variable values. Different pre-processing techniques, such as cleaning, outlier detection, data integration and transformation, can be carried out before clustering to achieve a successful analysis. Normalization is an important pre-processing step in data mining that standardizes the values of all variables from their dynamic ranges into a specific range, and because outliers can significantly affect data mining performance, outlier detection and removal is an important task in a wide variety of applications. k-Means is one of the best-known clustering algorithms, yet it suffers major shortcomings: the number of clusters and the seed values must be specified in advance, and it converges to local minima. This paper analyzes the performance of a modified k-Means clustering algorithm combined with data pre-processing, including a cleaning method, a normalization approach, and outlier detection with automatic initialization of seed values, on datasets from the UCI repository.
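The normalization-plus-outlier-removal pre-processing described here (and the z-score normalization used in the paper's own dataset) can be sketched as follows; the data values and the z-score threshold of 2 are illustrative assumptions:

```python
from statistics import mean, stdev

def zscore(values):
    """Standardize a feature column to zero mean and unit variance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def drop_outliers(values, threshold=3.0):
    """Discard points whose |z-score| exceeds the threshold,
    so extreme values cannot drag the cluster centroids."""
    z = zscore(values)
    return [v for v, zv in zip(values, z) if abs(zv) <= threshold]

# Hypothetical monthly purchase amounts with one gross outlier.
# A tighter threshold is used because a single extreme value also
# inflates the standard deviation on such a small sample.
purchases = [12, 15, 11, 14, 13, 300]
cleaned = drop_outliers(purchases, threshold=2.0)
```

After removal, the surviving values would be re-normalized before feeding them to k-Means, since the outlier distorted the original mean and standard deviation.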
Article
This article addresses the research question: what is the best method of consumer market segmentation? It deals with issues already discussed by researchers and identifies the research gap for further studies, focusing in detail on the definition and bases of market segmentation and the issues related to it. The paper highlights the knowledge gap and points a path for future research in market segmentation, which is at the heart of marketing nowadays.
Article
The stock market is a highly complex and dynamic system with noisy, non-stationary and chaotic data series, so it is widely acknowledged that modeling and forecasting stock price series is a challenging task. A significant amount of work has been done in this field, in which soft computing techniques have shown good performance. Most of these works fall into two categories: one predicts the future trend or price; the other constructs a decision support system that gives buy/sell signals. In this paper, we propose a new intelligent trading system based on oscillation box prediction, combining stock box theory and the support vector machine algorithm. Box theory holds that a successful stock purchase or sale generally occurs when the price effectively breaks out of the original oscillation box into a new one. In the system, two SVM estimators first forecast the upper and lower bounds of the price oscillation box; a trading strategy based on these two bound forecasts then makes the trading decisions. In the experiments, we test the system on different stock movement patterns, i.e. bull, bear and fluctuating markets, and investigate the training of the system and the choice of the time span of the price box. Experiments on 442 S&P 500 components show promising performance, and the system dramatically outperforms a buy-and-hold strategy.
Article
This paper presents the development of line-image keywords for identifying actors drawn in Japanese traditional paintings known as Ukiyoe pictures. The system is based on visual features of the face from the image database files and is organized as a set of classifiers whose outputs are integrated after a normalization step. Line profiles extracted from the pictures are approximated by Bézier curves, and a learning algorithm has been developed to obtain the control points with high accuracy. A new curve-matching method based on feature points, rather than corresponding points, can automatically fit a set of data points with piecewise geometrically continuous third-order Bézier curves. Finally, a new distance measure, the "apple-node distance", is introduced for similarity calculation in image retrieval systems, and the similarity between curves is computed on the basis of this distance. The effectiveness of the method has been confirmed through computer simulation, and it can be extended into a three-dimensional shape-analysis tool.
Article
In recent years, many attempts have been made to predict the movement of stock prices. However, these attempts could not produce an accurate and efficient stock trading system owing to the high dimensionality and non-stationary variation of stock prices within large historical databases. To solve this problem, this paper applies fuzzy logic as a data mining process to generate decision trees from a stock database containing historical information. The stock database has many attributes, and it is often impossible to develop a mathematical model to classify the data. This paper establishes a novel case-based fuzzy decision tree (CBFDT) model to identify the most important predictive attributes and extract a set of fuzzy decision rules that can be used to predict future time-series behavior. The fuzzy decision tree generated from the stock database is then converted into fuzzy rules that can be applied to decision-making about stock price movements based on current conditions. To demonstrate its effectiveness, the CBFDT model is experimentally compared with other approaches on the Standard & Poor's 500 (S&P 500) index and some of its component stocks. The overall performance of the CBFDT model is convincing, providing new implications for research on financial time-series data.
Puwanenthiren Premkanth. "Market Segmentation and Its Impact on Customer Satisfaction with Especial Reference to Commercial Bank of Ceylon PLC." Global Journal of Management and Business Research, Global Journals Inc. (USA), Vol. 12, Issue 1, 2012. Print ISSN: 0975-5853.
Sulekha Goyat. "The Basis of Market Segmentation: A Critical Review of Literature." European Journal of Business and Management, www.iiste.org, Vol. 3, No. 9, 2011. ISSN 2222-1905 (Paper), ISSN 2222-2839 (Online).
Jerry W. Thomas. "Market Segmentation." 2007. Retrieved from www.decisionanalyst.com on 12 July, 2015.
McKinsey Global Institute. "Big Data: The Next Frontier for Innovation, Competition, and Productivity." 2011. Retrieved from www.mckinsey.com/mgi on 14 July, 2015.