Click Stream Analysis in E-Commerce Websites-a Framework

978-1-5386-5257-2/18/$31.00 ©2018 IEEE
A. Vijaya Bharathi*1, Jyothi M. Rao*2, Amiya K. Tripathy#$3
* Computer Engineering, K.J. Somaiya College of Engineering, Mumbai, India
# Computer Engineering, Don Bosco Institute of Technology, Mumbai, India
$ School of Science, Edith Cowan University, Perth, Australia
1 vijaya.a@somaiya.edu, 2 jyothirao@somaiya.edu, 3 amiya@dbit.in
Abstract—The growth and proliferation of the Internet have
generated a revolution in retail practice. People nowadays prefer
virtual shopping over brick-and-mortar stores, so "customer
retention" is a vital issue in today's e-commerce market. To boost
customer loyalty, it is crucial for any e-commerce company to
understand online user behavior in depth and so strengthen the bond
with its "e-customers". Although click stream analysis has been
solving e-business problems, recommendation systems and digital
marketing are still far from perfect. In this article, different
pattern discovery methods are applied to identify navigation
patterns in weblogs and so better understand users' behavior on
e-commerce websites. An integrated approach of cognitive science
and data mining on click stream data could provide deeper insights
into customers' thinking patterns, perceptions and decision-making
styles, which could be utilized for effective customer retention.
Keywords—weblogs, pattern discovery, click stream analysis,
Markov model, cognitive model, web usage mining.
I. INTRODUCTION
Today's e-commerce applications are expected to fulfil the demands
of thousands of customers, and failing to do so can cause a huge
loss of revenue. Hence, the success of any online company depends
heavily on its potential to captivate visitors. A company can track
its customers' interactions through the so-called click stream
data, which is the principal source of information for adapting its
service to its customers. Click stream analysis may help these
organizations determine customer loyalty, improve marketing
strategies, measure the effectiveness of promotional campaigns,
provide more customized content to visitors, design an effective
website structure, etc. Hence, understanding users' behaviour in
Web applications has become necessary for e-commerce.
While the expectation for customer-level data analysis is high,
problems remain: customers still receive a significant amount of
advertising mail that does not interest them, and online
recommendations are still far from perfect. To build more accurate
consumer behaviour models, firms need to know their customers
better. This includes understanding customers' preferences and
behaviour through web history data.
Researchers have used various pattern discovery algorithms to
identify web usage patterns. A temporal logic model is used in [1]
as an alternative to data mining techniques for the evaluation of
structured weblogs. Complex user behavioural patterns were
identified by checking temporal logic formulas against a log model
developed using the SPOT libraries, with the aim of improving the
structure of a website. K-Nearest-Neighbour (KNN) has been used for
classification in a real-time recommendation system [2], and the
KNN algorithm has been adapted to classify frequent access patterns
[4]. Data mining techniques, namely association rule mining and
decision trees, were applied to click stream data to determine user
interests and product associations for effective recommendation
[5]. Interested users on the web were identified using Naïve Bayes
classification [7]. The Hidden Markov Model (HMM) has been used to
predict whether a user intends to buy something, based on the
appearance of the shopping-cart page in the session [9]. Markov
models are also used to create usage profiles so as to optimize the
site structure and reduce operational maintenance costs [10]. Based
on their navigation patterns, web users can be grouped by cognitive
style, and these groups can be used to model users in adaptive
websites for better organization of information [11].
II. BACKGROUND
Web Usage Mining is the discovery of useful patterns from weblog
data for a better understanding of web users. It reveals users'
behaviours and patterns, which can be useful for effective
management and construction of the site [13, 14]. The sources of
web usage data include proxy server logs, web server logs, browser
logs, user profiles, mouse clicks, user sessions, user queries,
registration data, cookies and any other data resulting from web
interactions. Web log files are the primary source of data and can
be collected from web servers, proxy servers and client browsers. A
sample raw log file entry is shown below.
2016-02-13 00:12:27 128.230.247.37 GET clothing 80 74.111.18.59 Mozilla/5.0+(iPad;+CPU+OS+9_2_1+like+Mac+OS+X)+AppleWebKit/601.1 http://group0.ist722.ischool.syr.edu/beats-pill-20-wireless-speaker 200 687
The web log entry contains the date, time, server IP, HTTP method,
URI query, server port, client IP, user agent, referrer, status and
time taken. The user agent field holds the client operating system
and browser information, whereas the referrer field holds the
source from which the user arrived. These log attributes provide
useful knowledge about the navigation behaviour of users [14].
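As an illustration, the sample entry can be split into the eleven fields listed above. The following is a minimal sketch, not the authors' tooling: it assumes the wrapped sample is a single whitespace-delimited record and that "+" stands for a space inside the user agent. The names FIELDS, SAMPLE and parse_entry are hypothetical.

```python
from datetime import datetime

# Field order as described in the text (names are illustrative).
FIELDS = ["date", "time", "server_ip", "method", "uri", "port",
          "client_ip", "user_agent", "referrer", "status", "time_taken"]

# The sample entry from the text, joined into one record (assumption).
SAMPLE = ("2016-02-13 00:12:27 128.230.247.37 GET clothing 80 74.111.18.59 "
          "Mozilla/5.0+(iPad;+CPU+OS+9_2_1+like+Mac+OS+X)+AppleWebKit/601.1 "
          "http://group0.ist722.ischool.syr.edu/beats-pill-20-wireless-speaker "
          "200 687")


def parse_entry(line):
    """Split one log record into a dict keyed by FIELDS."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    entry = dict(zip(FIELDS, values))
    # Combine date and time into a single timestamp for later sessionizing.
    entry["timestamp"] = datetime.strptime(
        entry["date"] + " " + entry["time"], "%Y-%m-%d %H:%M:%S")
    for key in ("port", "status", "time_taken"):
        entry[key] = int(entry[key])
    # Assumption: '+' encodes a space in the user agent string.
    entry["user_agent"] = entry["user_agent"].replace("+", " ")
    return entry


ENTRY = parse_entry(SAMPLE)
```

A record with a different number of fields raises an error rather than being silently misaligned, which is one simple way to flag defective log lines.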
2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA)
The data collected from a web server log is often defective and
unreliable [12]. Hence it needs pre-processing. This involves tasks
such as removing references to embedded objects (style files,
graphics or sound files) and removing data fields that may not
provide useful information for the analysis (e.g. the number of
bytes transferred or the version of the HTTP protocol used) [13].
In the simplest approach, every new IP address is considered a new
user. To identify unique users more accurately, a combination of
IP address and other information such as user agent and referrer
can be used. A series of pages viewed by a user during a particular
visit is known as a session [14]. Sessions can be identified by
using the timestamps of consecutive log entries. Click stream
analysis is performed using statistics, data mining or machine
learning algorithms [14]. The meaningful patterns are then analysed
using online analytical processing (OLAP) or visualization
techniques.
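Session identification from timestamps can be sketched as follows. This is an illustrative sketch under stated assumptions: a user is keyed by the pair (IP address, user agent), and a gap of more than 30 minutes between consecutive requests starts a new session. The 30-minute timeout and the name sessionize are assumptions, not values taken from the paper.

```python
from datetime import datetime, timedelta


def sessionize(entries, timeout=timedelta(minutes=30)):
    """Group (user_key, timestamp, page) records into per-user sessions.

    A new session starts when the gap since the user's previous request
    exceeds `timeout` (an assumed threshold).
    """
    sessions = {}   # user_key -> list of sessions, each a list of pages
    last_seen = {}  # user_key -> timestamp of the previous request
    for user, stamp, page in sorted(entries, key=lambda e: (e[0], e[1])):
        if user not in last_seen or stamp - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])
        sessions[user][-1].append(page)
        last_seen[user] = stamp
    return sessions


# Hypothetical records: two users, one of whom returns after a long gap.
T0 = datetime(2016, 2, 13, 0, 12, 27)
USER_A = ("74.111.18.59", "iPad")
USER_B = ("10.0.0.1", "Firefox")
RECORDS = [
    (USER_A, T0, "clothing"),
    (USER_A, T0 + timedelta(minutes=5), "footwear"),
    (USER_A, T0 + timedelta(minutes=50), "kids"),
    (USER_B, T0, "home"),
]
SESSIONS = sessionize(RECORDS)
```

Keying on IP address plus user agent follows the suggestion in the text; adding the referrer to the key would be a straightforward extension.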
III. PATTERN DISCOVERY APPROACHES
Some of the widely used classification/prediction
techniques are KNN, Decision tree, Markov model, Naïve
Bayes and cognitive model.
Using KNN
Identifying the interests of customers is necessary for an online
company to serve them better. The K-Nearest Neighbour (KNN)
algorithm compares a given test sample with the training samples
that are most similar to it [3, 4]. The category of the page
visited by a user can then be determined from the class of its
closest neighbours. KNN classifies a tuple based on its similarity,
or distance, to the stored training tuples [2, 3].
The Euclidean distance between a training tuple and a test tuple
can be derived as follows.
Let X_i be an input tuple with p attributes (x_{i1}, x_{i2}, ..., x_{ip}).
Let n be the number of input tuples (i = 1, 2, ..., n).
Let p be the number of features (j = 1, 2, ..., p).
The Euclidean distance between tuples X_i and X_t is
\[ d(X_i, X_t) = \sqrt{(x_{i1} - x_{t1})^2 + (x_{i2} - x_{t2})^2 + \cdots + (x_{ip} - x_{tp})^2} = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{tj})^2} \]
Let us consider the e-commerce site of an A-mart store whose click
stream is a vector of four attributes (user, source, page accessed,
category), with users represented by U1, U2, ..., U9 as shown in
Table 1. To determine the category of product purchased by user U3,
we compute the Euclidean distance between the vector U3 and all
other vectors. Consider the two tuples U1 = (U_{11}, U_{12}, U_{13})
and U3 = (U_{31}, U_{32}, U_{33}). From Table 1,
U1 = (direct, Amart/home/grocery, Home) and
U3 = (search engine, Amart/home/footwear/sport, footwear).
TABLE I. A-MART TRAINING TUPLES
For categorical attributes, the difference (U_{11} - U_{31}) can be
computed by simply comparing the value of the attribute in tuple U1
with the corresponding value in U3. If the values are the same, the
difference is taken to be zero (0); otherwise it is taken to be one
(1). So, for U_{11} and U_{31}, i.e. (direct, search engine), the
difference is 1; for U_{12} and U_{32}, i.e. (Amart/home,
Amart/footwear), the difference is 1; likewise, for U_{13} and
U_{33}, i.e. (home, footwear), the difference is 1. The distance
between U1 and U3 is therefore sqrt(1 + 1 + 1) = 1.732.
The same process is repeated with all other tuples U2, U4, ..., U9,
and the result is a list of tuples sorted by their Euclidean
distance to user U3, shown in Table 2. The nearest neighbour, U4
(distance 1.00), belongs to the Footwear class, so user U3 is
classified as having visited a footwear-related page. Similarly,
whether a visitor is seasonal or regular, or a weekend or night
visitor, can be determined to better understand users' behaviour.
This knowledge about users can be used for customized marketing.
TABLE II. DISTANCE TO USER U3
User | Class             | Distance to User U3
U4   | Footwear          | 1.00
U2   | Kids Apparel      | 1.414
U6   | Ladies Garments   | 1.414
U8   | Ladies Garments   | 1.414
U7   | Men’s Apparel     | 1.732
U1   | Home and personal | 1.732
U9   | Kids Apparel      | 1.732
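The ranking in Table 2 can be reproduced with a short sketch of the procedure described above. It assumes a 0/1 mismatch per categorical attribute and compares pages at section level, following the text's own comparison of (Amart/home, Amart/footwear); in particular, U3's page is treated as belonging to the section Amart/footwear. The section values and the names TRAINING, distance, rank_neighbours and classify are illustrative assumptions.

```python
import math
from collections import Counter

# Training tuples from Table 1 as (source, page section, category) -> class.
# Assumption: pages are reduced to their section before comparison.
TRAINING = {
    "U1": (("direct", "Amart/home", "home"), "Home, personal care"),
    "U2": (("search engine", "Amart/kids", "kids"), "Kids apparel"),
    "U4": (("direct", "Amart/footwear", "footwear"), "Footwear"),
    "U6": (("search engine", "Amart/ladies garments", "ladies"), "Ladies garments"),
    "U7": (("direct", "Amart/men's apparel", "men's apparel"), "Men's apparel"),
    "U8": (("search engine", "Amart/ladies garments", "ladies"), "Ladies garments"),
    "U9": (("direct", "Amart/kids", "kids"), "Kids apparel"),
}
U3 = ("search engine", "Amart/footwear", "footwear")


def distance(a, b):
    """Euclidean distance where each categorical mismatch contributes 1."""
    return math.sqrt(sum(0 if x == y else 1 for x, y in zip(a, b)))


def rank_neighbours(query, training):
    """Return (user, class, distance) triples sorted by distance to query."""
    rows = [(user, label, distance(features, query))
            for user, (features, label) in training.items()]
    return sorted(rows, key=lambda row: row[2])


def classify(query, training, k=1):
    """Predict the most common class among the k nearest neighbours."""
    nearest = rank_neighbours(query, training)[:k]
    return Counter(label for _, label, _ in nearest).most_common(1)[0][0]


RANKING = rank_neighbours(U3, TRAINING)
PREDICTED = classify(U3, TRAINING, k=1)
```

With k = 1 the prediction is simply the class of the closest tuple; a larger k would take a majority vote among several neighbours.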
Using Decision Tree
Any online company needs to know its potential customers in order
to optimize traffic and spend effectively on digital marketing. One
of the popular classification algorithms is the decision tree, in
which each non-leaf node denotes a test on an attribute, each
branch corresponds to an outcome of the test, and each leaf node
denotes a class prediction [5]. The information gain measure can be
used to select the test attribute at each node: the attribute with
the highest gain is chosen as the test attribute for the current
node [6].
To identify potential customers from a large volume of click stream
data, consider the set of attributes (session id, session time, no.
of pages accessed, method used) from Table 3. The basic idea is to
separate users who show purchase interest from those who simply
explore the site. Interested users generally spend a long time on
web pages and use the HTTP POST method when they register with a
web site. Uninterested users simply access many pages quickly to
browse contents [5, 6] and seldom use the POST method, because they
are not interested in registering.
TABLE III. A-MART TRAINING TUPLES
User id | Session id | Session time (mins) | Method used | No. of pages | Class
U1      | 1          | 10 (less)           | GET         | 8 (more)     | Casual
U2      | 2          | 25 (more)           | POST        | 10 (more)    | Potential
U3      | 3          | 30 (more)           | POST        | 6 (more)     | Potential
U4      | 4          | 14 (less)           | GET         | 4 (more)     | Casual
U5      | 5          | 12 (less)           | GET         | 9 (more)     | Casual
U6      | 6          | 25 (more)           | GET         | 10 (more)    | Potential
U7      | 7          | 27 (more)           | POST        | 12 (more)    | Potential
U8      | 888        | 35 (more)           | POST        | 15 (more)    | ?
TABLE I. A-MART TRAINING TUPLES
UId | Source        | Page accessed               | Category      | Class
U1  | direct        | Amart/home/grocery          | Home          | Home, personal care
U2  | search engine | Amart/kids                  | Kids          | Kids apparel
U4  | direct        | Amart/footwear              | Footwear      | Footwear
U6  | search engine | Amart/ladies garments/kurti | Ladies        | Ladies garments
U7  | direct        | Amart/men’s apparel/t-shirt | Men’s Apparel | Men’s apparel
U8  | search engine | Amart/ladies garments/      | Ladies        | Ladies garments
U9  | direct        | Amart/kids                  | Kids          | Kids apparel
U3  | search engine | Amart/home/footwear/sport   | Footwear      | ?
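The best splitting attribute is the one with the highest
information gain, calculated as
\[ \mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \]
\[ \mathrm{Info}(D) = -\sum_{i} p_i \log_2(p_i), \qquad \mathrm{Info}_A(D) = \sum_{j} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j) \]
Number of tuples belonging to the potential (yes) class = 4
Number of tuples belonging to the casual (no) class = 3
\[ \mathrm{Info}(D) = -\left(\tfrac{3}{7}\log_2\tfrac{3}{7} + \tfrac{4}{7}\log_2\tfrac{4}{7}\right) = 0.985 \]
For session time, both partitions are pure, so each has zero entropy
(taking 0 log 0 as 0):
\[ \mathrm{Info}_{session}(D) = \tfrac{3}{7}\,\mathrm{Info}(\text{session} < 25) + \tfrac{4}{7}\,\mathrm{Info}(\text{session} \ge 25) = \tfrac{3}{7}\left(-\tfrac{3}{3}\log_2\tfrac{3}{3}\right) + \tfrac{4}{7}\left(-\tfrac{4}{4}\log_2\tfrac{4}{4}\right) = 0 \]
\[ \mathrm{Gain}(session) = 0.985 - 0 = 0.985 \]
\[ \mathrm{Info}_{method}(D) = \tfrac{4}{7}\,\mathrm{Info}(\text{method} = G) + \tfrac{3}{7}\,\mathrm{Info}(\text{method} = P) = \tfrac{4}{7}\left(\tfrac{1}{4}\log_2 4 + \tfrac{3}{4}\log_2\tfrac{4}{3}\right) + \tfrac{3}{7}\left(-\tfrac{3}{3}\log_2\tfrac{3}{3}\right) = 0.464 + 0 = 0.464 \]
\[ \mathrm{Gain}(method) = 0.985 - 0.464 = 0.521 \]
\[ \mathrm{Info}_{pages}(D) = \tfrac{7}{7}\,\mathrm{Info}(\text{number} = more) + \tfrac{0}{7}\,\mathrm{Info}(\text{number} = less) = \tfrac{7}{7}\left(-\tfrac{4}{7}\log_2\tfrac{4}{7} - \tfrac{3}{7}\log_2\tfrac{3}{7}\right) + 0 = 0.985 \]
\[ \mathrm{Gain}(\text{number of pages}) = 0.985 - 0.985 = 0 \]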
Fig. 1. Decision tree generation
It is observed that the session time attribute has the highest
information gain (about 0.98), so it is chosen as the root test.
Users are classified as "Potential" or "Casual" based on the
parameters session time, method used (GET/POST) and number of pages
accessed. The decision tree generated is shown in Fig. 1.
From the decision tree generated, rules can be easily read off;
User U8, with a session time of 35 minutes, is classified as a
potential user. Similarly, a visitor can be classified as new or
returning with the help of an attribute 'frequency of visit' (the
difference between two timestamps).
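The information gain calculation can be checked with a small sketch that uses the seven labelled rows of Table 3, with the less/more labels as attribute values. The names ROWS, info, gain and one_level_tree are illustrative. A single split is enough here only because both session time branches happen to be pure in this data.

```python
import math
from collections import Counter

# Labelled sessions from Table 3: (session time, method, pages, class).
ROWS = [
    ("less", "GET", "more", "casual"),      # U1
    ("more", "POST", "more", "potential"),  # U2
    ("more", "POST", "more", "potential"),  # U3
    ("less", "GET", "more", "casual"),      # U4
    ("less", "GET", "more", "casual"),      # U5
    ("more", "GET", "more", "potential"),   # U6
    ("more", "POST", "more", "potential"),  # U7
]
ATTRIBUTES = {"session time": 0, "method": 1, "pages": 2}


def info(labels):
    """Entropy Info(D) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def partition(rows, column):
    """Group class labels by the value of the given attribute column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    return groups


def gain(rows, column):
    """Information gain Gain(A) = Info(D) - Info_A(D)."""
    groups = partition(rows, column)
    info_a = sum(len(g) / len(rows) * info(g) for g in groups.values())
    return info([row[-1] for row in rows]) - info_a


def one_level_tree(rows, column):
    """Map each attribute value to the majority class of its branch."""
    return {value: Counter(labels).most_common(1)[0][0]
            for value, labels in partition(rows, column).items()}


GAINS = {name: gain(ROWS, col) for name, col in ATTRIBUTES.items()}
BEST = max(GAINS, key=GAINS.get)
TREE = one_level_tree(ROWS, ATTRIBUTES[BEST])
U8 = ("more", "POST", "more")
U8_CLASS = TREE[U8[ATTRIBUTES[BEST]]]
```

On data where a branch is not pure, the same gain function would be applied recursively to the rows in that branch.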
Using Markov Model
Predicting a user's next page request on the World Wide Web is an
important problem. Different methods exist that look at the user's
page views and predict which page the user is likely to view next.
One such method is the Markov process, in which states represent
web pages and edges represent transition probabilities. A trained
Markov model can be used to predict the next state, given a set of
p previous states [9, 10]. A Markov model can be denoted by three
parameters <A, S, T>, where A represents all actions performed by
the user, S represents all possible states, and T is an |S| x |A|
Transition Probability Matrix (TPM), where T_ij represents the
probability of performing action j when the process is in state i.
TABLE IV. SAMPLE PAGE VIEWS
User | Page Views
U1   | p2 p3 p2 p1 p5
U2   | p2 p1 p3 p2 p1 p5
U3   | p1 p2 p5
U4   | p1 p2 p5 p2 p4
U5   | p1 p2 p1 p4
U6   | ?
The simplest Markov model predicts the next action by looking only
at the previous action performed by the user [9]. A Markov process
is represented as a directed graph in which every node denotes a
state corresponding to a page view, and edges labelled with
probabilities represent transitions between the connected states.
All transition probabilities are stored in a transition probability
matrix P of size n x n, where n is the number of states in the
model [10]. Consider the set of transactions presented in Table 4.
Fig. 2. Markov chain for web transactions
To build a Markov chain, add an initial state (S) at the start of
the chain and a final state (F) at the end. The probabilities
associated with the edges are obtained by counting the number of
times each transition occurs in the trails. The probability of
moving from the initial state S to the state representing page p1
is 7/23 (about 0.30), where 7 is the number of times that page p1
occurs and 23 is the total number of requests.
TABLE V. TRANSITION PROBABILITY MATRIX
   | P1  | P2   | P3    | P4    | P5
P1 | 0   | 0.43 | 0.14  | 0.14  | 0.29
P2 | 0.5 | 0    | 0.125 | 0.125 | 0.25
P3 | 0   | 1    | 0     | 0     | 0
P4 | 0   | 0    | 0     | 0     | 0
P5 | 0   | 0.25 | 0     | 0     | 0
Using the same process, the probability of moving from page p1 to
page p2 is 3/7 (0.43), where 3 is the number of times that p2
occurs immediately after p1 and 7 is the number of times p1 occurs.
Finally, the probability of moving from page p4 to the final state
F is 2/2 (1), where 2 is the number of trails in which p4 is the
last page and 2 is the number of times that p4 occurs.
The Markov chain generated from these transactions is depicted in
Fig. 2. Assume that a user U6 browsed through the sequence of page
views <p2 p5 p1 p3>. Looking at the matrix P in Table 5, there is a
100% probability that the user will view page p2 next. A problem
that can arise here is an ambiguous prediction: for example, a user
is equally likely to view page p3 or p4 after viewing page p1. The
prediction of the system will not be accurate in such cases [10].
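The counting procedure described above can be sketched as follows, using the five trails of Table 4. This is a minimal first-order sketch: transition_matrix, start_probabilities and predict_next are hypothetical names, and predict_next returns every page tied for the highest probability so that ambiguous cases are visible rather than hidden.

```python
from collections import Counter, defaultdict

# Page-view trails of users U1..U5 from Table 4.
TRAILS = [
    ["p2", "p3", "p2", "p1", "p5"],
    ["p2", "p1", "p3", "p2", "p1", "p5"],
    ["p1", "p2", "p5"],
    ["p1", "p2", "p5", "p2", "p4"],
    ["p1", "p2", "p1", "p4"],
]
FINAL = "F"  # artificial final state, as in the text


def transition_matrix(trails):
    """Estimate P[current][next] = count(current -> next) / count(current)."""
    visits = Counter()
    moves = defaultdict(Counter)
    for trail in trails:
        visits.update(trail)
        # Pair each page with its successor; the last page moves to FINAL.
        for current, following in zip(trail, trail[1:] + [FINAL]):
            moves[current][following] += 1
    return {page: {nxt: n / visits[page] for nxt, n in moves[page].items()}
            for page in visits}


def start_probabilities(trails):
    """Probability of S -> page as occurrences / total requests (per the text)."""
    visits = Counter(page for trail in trails for page in trail)
    total = sum(visits.values())
    return {page: n / total for page, n in visits.items()}


def predict_next(matrix, history):
    """Return the most likely next page(s) given only the last page viewed."""
    options = {p: pr for p, pr in matrix[history[-1]].items() if p != FINAL}
    if not options:
        return []
    best = max(options.values())
    return sorted(p for p, pr in options.items() if pr == best)


MATRIX = transition_matrix(TRAILS)
START = start_probabilities(TRAILS)
U6_NEXT = predict_next(MATRIX, ["p2", "p5", "p1", "p3"])
```

A higher-order model would condition on more than the last page, at the cost of a much larger state space.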
Using Naïve Bayes
Since the Naïve Bayes algorithm works well on large volumes of
data, it is applied here to identify the same pattern discussed
earlier using the decision tree. P(H | Q) represents the
probability that hypothesis H holds given the evidence Q. Taking
the attributes of our training data set (Table 3) as session time,
method used and number of pages accessed, P(H | Q) is the
probability that a session belongs to a potential user, given its
session time, method used and number of pages accessed [7, 8]. P(H)
is called the prior probability of H. To classify User U8 as a
potential or casual user, compute the conditional probabilities
from Table 3 as follows.
P(class = 'casual') = 3/7 = 0.429
P(class = 'potential') = 4/7 = 0.571
P(session time = 'more' | class = 'casual') = 0/3 = 0
P(session time = 'more' | class = 'potential') = 4/4 = 1
P(method used = 'post' | class = 'casual') = 0/3 = 0
P(method used = 'post' | class = 'potential') = 3/4 = 0.75
P(pages accessed = 'more' | class = 'casual') = 3/3 = 1
P(pages accessed = 'more' | class = 'potential') = 4/4 = 1
P(U8 | class = 'casual') = 0 * 0 * 1 = 0
P(U8 | class = 'potential') = 1 * 0.75 * 1 = 0.75
Since P(Ci | X) is proportional to P(X | Ci) P(Ci):
P(U8 | class = 'casual') P(class = 'casual') = 0 * 0.429 = 0
P(U8 | class = 'potential') P(class = 'potential') = 0.75 * 0.571 = 0.428 (maximum)
The maximum probability for User U8 is obtained with the
'potential' class. Hence U8 is predicted to be a potential customer
of the A-mart store.
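The classification of U8 can be sketched by counting priors and conditional frequencies directly from the seven labelled rows of Table 3. This sketch applies no smoothing, so an attribute value never seen with a class gives that class a score of zero. The names ROWS, class_scores and U8 are illustrative.

```python
from collections import Counter

# Labelled sessions from Table 3: (session time, method, pages, class).
ROWS = [
    ("less", "GET", "more", "casual"),      # U1
    ("more", "POST", "more", "potential"),  # U2
    ("more", "POST", "more", "potential"),  # U3
    ("less", "GET", "more", "casual"),      # U4
    ("less", "GET", "more", "casual"),      # U5
    ("more", "GET", "more", "potential"),   # U6
    ("more", "POST", "more", "potential"),  # U7
]


def class_scores(rows, query):
    """Return P(class) * product of P(attribute value | class) per class."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = Counter()
    for *features, label in rows:
        for index, value in enumerate(features):
            value_counts[(index, value, label)] += 1
    scores = {}
    for label, n in class_counts.items():
        score = n / len(rows)  # prior P(class)
        for index, value in enumerate(query):
            # Conditional frequency counted from the table (no smoothing).
            score *= value_counts[(index, value, label)] / n
        scores[label] = score
    return scores


U8 = ("more", "POST", "more")
SCORES = class_scores(ROWS, U8)
PREDICTED = max(SCORES, key=SCORES.get)
```

The scores are unnormalised; dividing each by their sum would give posterior probabilities, but the arg-max is the same either way.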
Using Cognitive Model
Cognitive science is an interdisciplinary approach to the
understanding of human behaviour, and it has direct application to
web usage mining. More recently, economists have applied such
concepts to explain consumers' behaviour [11]. Cognitive styles
describe the way users process and organize information. Previous
works have identified a number of dimensions along which users'
cognitive styles may differ (Chen and Macredie, 2002; Liu and
Ginther, 1999). Serialists and wholists, for example, have
different characteristics (Pask, 1976): wholists opt for a global
way of processing, while serialists prefer to understand step by
step. Researchers have proposed specific user interaction metrics
to inspect how users navigate (i.e., linearly or non-linearly)
based on the sequence of hyperlinks visited. Clustering techniques
can then be applied to determine users' navigation behaviour and
its relation to cognitive styles. To measure the linearity of user
interactions with the website, the following interaction metrics
are used.
Absolute Distance of Links (ADL): the total absolute distance
between the hyperlinks visited by a user, divided by the number of
links clicked.
Average Sequential Links (ASL): the number of sequential links
visited by the user, divided by the number of links clicked.
Average non-sequential Groups of Links (AGL): the number of
non-sequential links visited by the user, divided by the number of
links clicked.
The above metrics can be used to find whether the user followed a
linear or non-linear navigation path. For example, consider the
sequences of pages visited by users in Table 6. Session 1 contains
the page sequence A, B, C, D, E, where A is the first link from the
homepage and the links are numbered 1, 2, 3, 4, 5 for ease of
understanding. Its interaction metrics are calculated as
ADL = (|1-1| + |2-1| + |3-2| + |4-3| + |5-4|) / N = 4/5 = 0.8, where N is the total number of links clicked;
ASL = M / N = 5/5 = 1, where M is the number of sequential links visited;
AGL = B / N = 0/5 = 0, where B is the number of non-sequential links visited.
An ADL-based cognitive style ratio between 0 and 1.667 indicates a
linear approach to navigation, while a ratio between 3 and 4
indicates a non-linear approach. There is a link between the
cognitive style dimensions (i.e., Wholist–Intermediate–Analyst and
Verbal–Intermediate–Imager) and navigation style (i.e., linear and
non-linear) [11].
TABLE VI. SAMPLE PAGE VIEWS
Session Id | Transactions
1          | A -> B -> C -> D -> E
2          | A -> B -> C
3          | A -> B -> C -> E
4          | C -> D -> E
5          | C -> D -> E -> B
6          | C -> D -> A -> E
7          | D -> A -> B -> E
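The session 1 calculation can be sketched in code. The sketch rests on two assumptions read off the worked example rather than stated in the paper: distances are measured starting from link 1 (which gives the |1-1| term), and a link counts as sequential when it is at most one position away from the previously visited link. The names POSITION and interaction_metrics are hypothetical.

```python
# Links numbered by their position from the homepage, as in the text.
POSITION = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}


def interaction_metrics(pages):
    """Compute ADL, ASL and AGL for one session of visited pages."""
    links = [POSITION[page] for page in pages]
    n = len(links)
    previous = 1  # assumption: the first distance is measured from link 1
    total_distance = sequential = non_sequential = 0
    for link in links:
        step = abs(link - previous)
        total_distance += step
        if step <= 1:  # assumption: a step of 0 or 1 counts as sequential
            sequential += 1
        else:
            non_sequential += 1
        previous = link
    return {
        "ADL": total_distance / n,
        "ASL": sequential / n,
        "AGL": non_sequential / n,
    }


SESSION_1 = interaction_metrics(["A", "B", "C", "D", "E"])
```

Applying the same function to the other sessions in Table 6 gives one feature vector per session, which could then be fed to a clustering step; the values for those sessions depend on the two assumptions above.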
It is found that wholist users follow linear navigation behaviour.
Identifying users with specific cognitive and navigation styles
will ultimately help e-retail companies to derive new strategies
and methods to provide better service to their customers [11].
Discussion
From the detailed survey done on various pattern
discovery approaches, some useful patterns are identified and
listed in Table 7.
TABLE VII. PATTERNS IDENTIFIED AND MODELS
Parameters | Useful Patterns | Model Recommended
User ID, Session ID, Page View, Timestamp, Visit Duration | New User / Returning User, Potential / Casual Visitor, Evening Visitor / Week End Visitor, Seasonal Visitors | Classification
User ID, Session ID, Page View, Item Bought, Product ID, Product Category, Price, Quantity | Interested Product Category for Each User | Classification
User ID, Session ID, Age, Gender, Product Category, Item Bought, Number Of Items Bought | Most Interested Product by Certain Age Group / Gender | Classification / Clustering
User ID, Products Purchased, Category | Product Association to Users | Classification / Association Rules
User ID, Session ID, Item Bought, Price, Frequent Visitor, Category | Predict Buy / Not | Markov Model / Classification
User ID, Session ID, Page View, Mostly Visited Page | Link Prediction, Predict Next Click, Frequent Sequence Pages | Markov Model / Sequence Pattern Mining
User ID, Session ID, Page View | Linear / Nonlinear Path, User Cognitive Styles | Cognitive Model
It is also clear that applying cognitive science to web usage
mining has scope to provide deeper insights into consumer shopping
psychology. Customers' thinking patterns, perceptions and
decision-making styles can be interpreted with the help of
cognitive science and data mining as an integrated approach.
Depending on the type of personality, better customized marketing
can be done to captivate customers. Similarly, based on a
customer's cognition, more relevant product recommendations can be
given.
IV. PROPOSED IDEA
The proposed framework shown in Fig. 3 adopts the conceptual
framework of the cognitive architecture ACT-R (Adaptive Control of
Thought-Rational) along with data mining to model the cognitive
behaviours of online customers.
Fig. 3. Proposed methodology
The motivation behind this idea is the lack of studies of cognitive
processes in consumer models built from click stream data. Click
stream data can be used effectively to identify users' various
decision-making styles and perceptions. Such customer analytics
could help e-retailers with effective digital marketing and online
recommendations.
V. CONCLUSION
Customer retention is a critical current issue faced by online
retail companies. Online recommendation systems and digital
marketing are among the business problems that are still far from
solved. From this comprehensive study, it is observed that the
application of cognitive science in web usage mining is in its
infancy; further investigation could reveal more relevant
relationships between cognitive styles and the navigation behaviour
of users. Deeper insights into consumer psychology, gained through
navigation behaviour in weblogs, can help e-companies improve their
customer retention rate by providing more personalized marketing
and more relevant recommendations to users.
REFERENCES
[1] Sergio Hernandez, Pedro Alvarez, Javier Fabra, Joaquin Ezpeleta,
“Analysis of Users’ behaviour in structured e-Commerce Websites”,
IEEE Access, vol. 5, pp. 11941–11958, May 2017.
[2] D.A. Adeniyi, Z. Wei, Y. Yongquan, “Automated web usage data
mining and recommendation system using K-Nearest Neighbor (KNN)
classification method”, Applied Computing and Informatics, Vol. 12,
pp. 90–108, 2016.
[3] Manisha Kumari, “A Review of Classification in Web Usage Mining
using K-Nearest Neighbor”, Advances in Computational Sciences and
Technology, Vol. 10, No. 5, 2017.
[4] Suharjito, Diana and Herianto, “Implementation of Classification
Technique in web usage mining of banking company”, Proc. IEEE
seminar on Intelligent technology and its applications, pp. 211-218,
Jul. 2016.
[5] Yoon Ho Cho, Jae Kyeong Kim, Soung Hie Kim, “A personalized
recommender system based on web usage mining and decision tree
induction”, Expert Systems with Applications, Vol.23, No.3, pp. 329-
342, Oct 2002.
[6] Rianto, Lukito Edi Nugroho, P. Insap Santosa, “Pattern Discovery of
Indonesian Customers in an Online Shop: A Case of Fashion Online
Shop”, Proc. IEEE Conference on Information Tech, Computer, and
Electrical Engineering (ICITACEE), pp. 313-316, Oct. 2016.
[7] K. Santra, S. Jayasudha, “Classification of Web Log Data to Identify
Interested Users using Naïve Bayesian Classification”, IJCSI
International Journal of Computer Science Issues, Vol. 9, Issue 1, No
2, Jan. 2012.
[8] Mahdi Khosravi, Mohammad J. Tarokh, “Dynamic Mining of Users
Interest Navigation Patterns Using Naive Bayesian Method”, Proc.
IEEE International Conference on Intelligent Computer
Communication and Processing (ICCP), pp. 119-122, Aug. 2010.
[9] Chun-Jung Lin, Fan Wu, I-Han Chiu, “Using Hidden Markov Model
to Predict the Surfing User’s Intention of Cyber Purchase on the
Web”, Journal of Global Business Management, 2009.
[10] Alice Marques and Orlando Belo, “Discovering Student Web Usage
Profiles Using Markov Chains”, Portugal Electronic Journal of e-
Learning, Vol. 9 Issue 1, 2011.
[11] Marios Belk, Efi Papatheocharous, Panagiotis Germanakos, George
Samaras, “Modeling users on the World Wide Web based on cognitive
factors, navigation behaviour and clustering techniques”, The Journal
of Systems and Software, Vol. 86, Issue 12, pp. 2995-3012, 2013.
[12] P. Dhana Lakshmi, Dr. K. Ramani, Dr. B. Eswara Reddy, “The
Research of Preprocessing and Pattern Discovery Techniques on Web
Log files”, Proc. IEEE 6th International Conference on Advanced
Computing, pp. 138-145, Feb. 2016.
[13] J. Srivastava, Robert Cooley and Mukund Deshpande, “Web Usage
Mining: Discovery and Applications of Usage Patterns from web
data”, SIGKDD Explorations, Vol. 1, Issue 2, Jan. 2000.
[14] Liu, “Web Data Mining: Exploring Hyperlinks, Contents, and Usage
Data”, Data-Centric Systems and Applications, Springer, 2007.
... It can be achieved by defining the interests of the user. The document's substance also determines the user's interest that the user has read [1]. Clickstream data analysis can also select the people involved, help businesses recognize customer satisfaction, encourage productivity, and develop marketing campaigns by identifying consumer preferences. ...
... Clickstream data analysis can also select the people involved, help businesses recognize customer satisfaction, encourage productivity, and develop marketing campaigns by identifying consumer preferences. Clickstream data is the essential piece of information for firms looking to tailor their offerings to their clients [1]. It has a lot of value and may provide you with a lot of information about the people who visit your website. ...
... The page's escalation will be much larger if the client is interested in or needs to visit the page. The formulas below can be used to compute it [1]. ...
Article
The accelerated development of e-commerce has been a concern for business people. Business people should be able to gain customer interest in a variety of ways so that their companies can compete with others. Analyzing click-flow data will help organizations or firms assess customer loyalty, provide advertising privileges, and develop marketing strategies through user interests. By understanding consumer preferences, clickstream data analysis may be used to determine who is participating, assist companies in evaluating customer contentment, boost productivity, and design marketing strategies. This research was performed by defining experimental user interests using Dynamic Mining and Page Interest Estimation methods. The findings of this analysis, using three algorithms at the pattern discovery page, demonstrated that the Decision Tree method excelled in both methods. It indicated that the operational performance of the Decision Tree performed well in the assessment of user interests with two different approaches. The findings of this experiment can be used as a proposal for researching the field of web usage mining, collaborating with other approaches to achieve higher accuracy values.
... Understanding user behaviour and preferences requires a close examination of clickstream data. Businesses may enhance user experience, establish more effective marketing strategies, boost conversion rates, uncover patterns and trends in user behaviour, measure the success of initiatives, and assess the success of various online business initiatives by analysing clickstream data [1][2]. Although clickstream data analysis offers insightful information, it can be a difficult undertaking because data collection, cleansing, and storage are expensive, time-consuming, and complicated. ...
Article
Full-text available
The purpose of this research work is to provide an overview of setting up an Apache-based real-time clickstream data lifecycle for user behaviour analysis and marketing strategy improvement. It uses tools like Apache Kafka, Apache Spark, Amazon S3, AWS Glue Data Catalog, Hive Metastore, and Tableau to meet the challenges of data collecting, purification, and storage. The design offers rapid data processing and analysis using Spark, high-throughput and fault-tolerant data import with Kafka, and scalable storage in Amazon S3. Data retrieval, querying, and transformation are made easier by AWS Glue Data Catalogue and Hive Metastore, while Tableau offers interactive visualisations. Data management capabilities are improved by optional interaction with a data warehouse and data lake. The scalable architecture accommodates increased data quantities and user traffic, and a mathematical model derives useful insights from clickstream data.
... Understanding user behaviour and preferences requires a close examination of clickstream data. Businesses may enhance user experience, establish more effective marketing strategies, boost conversion rates, uncover patterns and trends in user behaviour, measure the success of initiatives, and assess the success of various online business initiatives by analysing clickstream data [1][2]. Although clickstream data analysis offers insightful information, it can be a difficult undertaking because data collection, cleansing, and storage are expensive, time-consuming, and complicated. ...
Article
Full-text available
The purpose of this research work is to provide an overview of setting up an Apache-based real-time clickstream data lifecycle for user behaviour analysis and marketing strategy improvement. It uses tools like Apache Kafka, Apache Spark, Amazon S3, AWS Glue Data Catalog, Hive Metastore, and Tableau to meet the challenges of data collecting, purification, and storage. The design offers rapid data processing and analysis using Spark, high-throughput and fault-tolerant data import with Kafka, and scalable storage in Amazon S3. Data retrieval, querying, and transformation are made easier by AWS Glue Data Catalogue and Hive Metastore, while Tableau offers interactive visualisations. Data management capabilities are improved by optional interaction with a data warehouse and data lake. The scalable architecture accommodates increased data quantities and user traffic, and a mathematical model derives useful insights from clickstream data.
... In this study [17], an e-commerce web application was studied and constructed, and the process of designing and developing a functioning e-commerce platform was discussed. In order to acquire insights into user behavior and optimize website performance, the authors of [18] presented a framework for clickstream analysis in e-commerce websites. ...
Article
This concise research paper provides an analysis of the latest trends in the e-commerce industry based on recent trend analysis studies. The rapid growth of e-commerce has triggered notable shifts in consumer behavior, technology adoption, and market dynamics. By examining recent research findings, this paper identifies and explores key trends that are currently shaping the e-commerce landscape. The analysis highlights the escalating significance of mobile commerce, with a substantial portion of online transactions now taking place via mobile devices. It delves into the implications for e-commerce platforms, emphasizing the necessity of mobile optimization and the delivery of seamless user experiences. Furthermore, the paper investigates the growing influence of social media on e-commerce, encompassing the surge in influencer marketing, user-generated content, and the integration of social commerce. It examines their impact on consumer purchasing decisions and brand engagement. Moreover, the analysis addresses the rise of omnichannel retailing, the emergence of new business models and marketplaces, and the integration of advanced technologies within e-commerce operations. It also touches upon challenges such as cybersecurity threats, logistics optimization, and sustainability concerns. By conducting an in-depth analysis of the latest trend analysis research, this paper offers valuable insights for e-commerce practitioners, policymakers, and researchers. It underscores the need to remain well-informed and adapt strategies to effectively capitalize on emerging opportunities within the dynamic e-commerce market.
Article
Full-text available
Background Evidence-based digital health tools allow clinicians to keep up with the expanding medical literature and provide safer and more accurate care. Understanding users’ online behavior in low-resource settings can inform programs that encourage the use of such tools. Our program collaborates with digital tool providers, including UpToDate, to facilitate free subscriptions for clinicians serving in low-resource settings globally. Objective We aimed to define segments of clinicians based on their usage patterns of UpToDate, describe the demographics of those segments, and relate the segments to self-reported professional climate measures. Methods We collected 12 months of clickstream data (a record of users’ clicks within the tool) as well as repeated surveys. We calculated the total number of sessions, time spent online, type of activity (navigating, reading, or account management), calendar period of use, percentage of days active online, and minutes of use per active day. We defined behavioral segments based on the distributions of these statistics and related them to survey data. Results We enrolled 1681 clinicians from 75 countries over a 9-week period. We based the following five behavioral segments on the length and intensity of use: short-term, light users (420/1681, 25%); short-term, heavy users (252/1681, 15%); long-term, heavy users (403/1681, 24%); long-term, light users (370/1681, 22%); and never-users (252/1681, 15%). Users spent a median of 5 hours using the tool over the year. On days when users logged on, they spent a median of 4.4 minutes online and an average of 71% of their time reading medical content as opposed to navigating or managing their account. Over half (773/1432, 54%) of the users actively used the tool for 48 weeks or more during the 52-week study period. The distribution of segments varied by age, with lighter and less use among those aged 35 years or older compared to that among younger users. 
The speciality of medicine had the heaviest use, and emergency medicine had the lightest use. Segments varied strongly by geographic region. As for professional climate, most respondents (1429/1681, 85%) reported that clinicians in their area would view the use of an online tool positively, and compared to those who reported other views, these respondents were less likely to be never-users (286/1681, 17% vs 387/1681, 23%) and more likely to be long-term users (655/1681, 39% vs 370/1681, 22%). Conclusions We believe that these behavioral segments can help inform the implementation of digital health tools, identify users who may need assistance, tailor training and messaging for users, and support research on digital health efforts. Methods for combining clickstream data with demographic and survey data have the potential to inform global health implementation. Our forthcoming analysis will use these methods to better elucidate what drives digital health tool use.
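Two of the usage statistics named in the Methods above (percentage of days active and minutes of use per active day) can be illustrated with a minimal sketch. The function name `usage_stats`, the `(day, minutes)` event layout, and the sample values are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from datetime import date

def usage_stats(events, period_days):
    """Summarise one user's activity.

    events: iterable of (day, minutes) pairs, one per session (assumed layout).
    period_days: length of the observation window in days.
    """
    # Add up the minutes of all sessions that fall on the same calendar day.
    minutes_by_day = defaultdict(float)
    for day, minutes in events:
        minutes_by_day[day] += minutes

    active_days = len(minutes_by_day)
    total_minutes = sum(minutes_by_day.values())
    return {
        "pct_days_active": 100 * active_days / period_days,
        # Guard against users with no recorded activity at all.
        "minutes_per_active_day": total_minutes / active_days if active_days else 0.0,
    }

# Hypothetical user: two sessions on 1 January, one on 3 January, 10-day window.
example_events = [
    (date(2020, 1, 1), 3.0),
    (date(2020, 1, 1), 2.0),
    (date(2020, 1, 3), 5.0),
]
stats = usage_stats(example_events, period_days=10)
```

Segments such as "short-term, light" could then be defined by thresholds on these values, but the cut-offs used in the study are not given here.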
Article
Full-text available
The use of e-commerce in companies or other types of business has supported them to develop and correspondingly cope with business pressures of high levels of competition. More consumer information can be gathered based on the interactive nature of e-commerce technology. In an e-commerce competition, all information relating to consumer behavior, such as the knowledge of the visitor interests in a product marketed by e-commerce, is of value to e-commerce players. Users can use the Web Usage Mining techniques to explore these interests. This study aimed to compare three classification algorithms by using the dynamic mining approach of user interest navigation pattern. The results of the study showed that the Decision Tree Classifier performed optimally in both the unbalanced data and independent or dependent data models.
Research
Full-text available
Web Usage Mining came to prominence with the proliferation of information sources available on the internet. With the help of Web Usage Mining we are able to extract useful information from the corpus. Classification is a widely used method in the pattern discovery phase of Web Usage Mining. Classification is a supervised learning technique; it can be used to design models that describe data classes, where a class attribute is required to construct the classifier. Classification is useful for categorizing objects that have similar functionality. Among the many algorithms used as classifiers, the k-Nearest Neighbour (KNN) method is a very simple, highly popular, efficient and effective algorithm. The output of a KNN classifier is a class membership. An object is classified on the basis of the votes of its neighbours: it is assigned to the class most common among its k nearest neighbours. This article provides a review of Web Usage Mining systems, its classification technique and the widely used KNN classifier. We also emphasize the advantages and disadvantages of the KNN algorithm.
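The majority-vote rule described in this abstract can be sketched in a few lines. This is an illustrative example only, not the reviewed authors' implementation; the session features (pages viewed, minutes on site) and the labels "browser" and "buyer" are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Assign `query` to the class most common among its k nearest neighbours."""
    # Order the labelled points by plain Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Each of the k closest points casts one vote for its own class label.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled sessions: (pages viewed, minutes on site) -> label.
labelled_sessions = [
    ((2, 1.0), "browser"),
    ((3, 1.5), "browser"),
    ((4, 2.0), "browser"),
    ((12, 9.0), "buyer"),
    ((15, 11.0), "buyer"),
    ((14, 10.0), "buyer"),
]

predicted = knn_classify(labelled_sessions, (13, 9.5), k=3)
```

An odd `k` avoids tied votes when there are two classes; with more classes a tie-breaking rule would be needed.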
Article
Full-text available
Online shopping is becoming more and more common in our daily lives. Understanding users’ interests and behaviour is essential in order to adapt e-commerce websites to customers’ requirements. The information about users’ behaviour is stored in the web server logs. The analysis of such information has focused on applying data mining techniques where a rather static characterization is used to model users’ behaviour and the sequence of the actions performed by them is not usually considered. Therefore, incorporating a view of the process followed by users during a session can be of great interest to identify more complex behavioural patterns. To address this issue, this paper proposes a linear-temporal logic model checking approach for the analysis of structured e-commerce web logs. By defining a common way of mapping log records according to the e-commerce structure, web logs can be easily converted into event logs where the behaviour of users is captured. Then, different predefined queries can be performed to identify different behavioural patterns that consider the different actions performed by a user during a session. Finally, the usefulness of the proposed approach has been studied by applying it to a real case study of a Spanish e-commerce website. The results have yielded interesting findings that have made it possible to propose some improvements in the website design with the aim of increasing its efficiency.
Conference Paper
Full-text available
The goal of this paper is to apply a classification technique in web usage mining for a bank company, to help the company identify web performance issues. Web usage mining consists of three phases: data preprocessing, pattern discovery, and pattern analysis. In the pattern discovery phase, we propose to use a classification technique, the k-nearest neighbor algorithm implemented with standardized Euclidean distance, to classify frequent access patterns. The results show that the k-nearest neighbor algorithm can be implemented in web usage mining and can help the company find interesting knowledge in the web server log.
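For readers unfamiliar with the standardized Euclidean distance mentioned above, a minimal sketch follows: each squared difference is divided by the variance of that feature, so features on large scales do not dominate. Using the population variance is an assumption made here, and the sample rows (hits, response time in ms) are hypothetical; the paper's exact variant is not stated in the abstract.

```python
import math
from statistics import pvariance

def feature_variances(rows):
    """Population variance of each feature column (an assumed choice)."""
    return [pvariance(column) for column in zip(*rows)]

def standardized_euclidean(x, y, variances):
    """Euclidean distance with each squared difference scaled by 1/variance.

    Assumes every variance is non-zero; a constant feature would need
    to be dropped or handled separately.
    """
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

# Hypothetical log-derived features: (hits, response time in ms).
rows = [(1, 100), (2, 300), (3, 200)]
variances = feature_variances(rows)
distance = standardized_euclidean(rows[0], rows[2], variances)
```

Without the scaling, the response-time column would contribute 10000 to the squared distance and the hit count only 4, so the second feature would effectively decide every neighbour ranking.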
Article
Full-text available
The major problem of many online web sites is the presentation of many choices to the client at a time; this usually results in a strenuous and time-consuming task of finding the right product or information on the site. In this work, we present a study of an automatic web usage data mining and recommendation system based on current user behavior through his/her click stream data on the newly developed Really Simple Syndication (RSS) reader website, in order to provide relevant information to the individual without explicitly asking for it. The K-Nearest-Neighbor (KNN) classification method has been trained to be used online and in real time to identify clients'/visitors' click stream data, match it to a particular user group and recommend a tailored browsing option that meets the needs of the specific user at a particular time. To achieve this, the web users' RSS address file was extracted, cleansed, formatted and grouped into meaningful sessions, and a data mart was developed. Our results show that the K-Nearest Neighbor classifier is transparent, consistent, straightforward, simple to understand, tends to possess desirable qualities and is easier to implement than most other machine learning techniques, specifically when there is little or no prior knowledge about the data distribution.
Article
Full-text available
Web Usage Mining (WUM) is the process of extracting knowledge from Web users' access data by exploiting Data Mining technologies. It can be used for different purposes such as personalization, system improvement and site modification. The study of interested web users provides valuable information for web designers to respond quickly to their individual needs. The main objective of this paper is to study the behavior of interested users instead of spending time on overall behavior. The existing model used an enhanced version of the decision tree algorithm C4.5. In this paper, we propose to use the Naive Bayesian Classification algorithm for classifying interested users, and we also present a comparison study of the enhanced C4.5 decision tree algorithm and the Naive Bayesian Classification algorithm for identifying interested users. The performance of this algorithm is measured for web log data with session-based timing, page visits, repeated user profiling, and page depth to the site length. Experimental results show that the performance metrics, i.e., the time taken and memory used to classify the web log files, are more efficient when compared to the existing C4.5 algorithm.
Article
Full-text available
Nowadays, Web based platforms are quite common in any university, supporting a very diversified set of applications and services. Ranging from personal management to student evaluation processes, Web based platforms are doing a great job providing a very flexible way of working, promoting student enrolment, and making access to academic information simple and universal. Students can do their regular tasks anywhere, anytime. Sooner or later, it was expected that organizations, and universities in particular, would begin to think and act towards better educational platforms, more user-friendly and effective, where students easily find what they are searching for on a specific topic or subject. Profiling is one of several techniques that we can use to discover what students usually do, by establishing their user navigation patterns on Web based platforms, and knowing better how they explore and search the pages of the sites that they visit. With these profiles, Web based platform administrators can personalize sites according to the preferences and behaviour of the students, promoting easy navigation functionalities and a better ability to respond to their needs. In this article we present the application of Markov chains in the establishment of such profiles for a target eLearning oriented Web site, presenting the system we implemented and its functionalities to do that, as well as describing the entire process of discovering student profiles on an eLearning Web based platform.
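As an illustration of how a Markov chain can capture navigation patterns, the sketch below estimates first-order transition probabilities from page sequences by counting which page follows which. It is a generic example, not the system described in the article; the page names and sessions are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate P(next page | current page) from observed page sequences."""
    counts = defaultdict(Counter)
    for pages in sessions:
        # Count every consecutive (current, next) pair within a session.
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    # Normalise each row of counts so the probabilities sum to 1.
    return {
        page: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for page, followers in counts.items()
    }

# Hypothetical eLearning sessions, one list of visited pages per session.
student_sessions = [
    ["home", "courses", "grades"],
    ["home", "courses", "forum"],
    ["home", "grades"],
    ["home", "courses", "grades"],
]
probs = transition_probabilities(student_sessions)
```

Here three of the four sessions move from "home" to "courses", so that transition gets probability 0.75. Pages that only ever end a session (such as "forum") have no outgoing row in this sketch.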
Book
The rapid growth of the Web in the last decade makes it the largest publicly accessible data source in the world. Web mining aims to discover useful information or knowledge from Web hyperlinks, page contents, and usage logs. Based on the primary kinds of data used in the mining process, Web mining tasks can be categorized into three main types: Web structure mining, Web content mining and Web usage mining. Web structure mining discovers knowledge from hyperlinks, which represent the structure of the Web. Web content mining extracts useful information/knowledge from Web page contents. Web usage mining mines user access patterns from usage logs, which record clicks made by every user. The goal of this book is to present these tasks, and their core mining algorithms. The book is intended to be a text with a comprehensive coverage, and yet, for each topic, sufficient details are given so that readers can gain a reasonably complete knowledge of its algorithms or techniques without referring to any external materials. Four of the chapters, structured data extraction, information integration, opinion mining, and Web usage mining, make this book unique. These topics are not covered by existing books, but yet they are essential to Web data mining. Traditional Web mining topics such as search, crawling and resource discovery, and link analysis are also covered in detail in this book.
Conference Paper
The amount of information displayed on an online shop could distract consumers' focus, which might result in customers leaving the website or canceling a transaction. Recommendations that provide customers with relevant products are required to make decisions easy across a wide variety of products. Relevancy of products for a customer can be achieved through user pattern discovery. This study aims to develop a user model to provide personalization on e-commerce based on customer groups. This study involved 100 participants consisting of 25 girls, 25 adult females, 25 boys, and 25 adult males. Data collection was performed using an e-commerce website, i.e. http://www.erstore.biz, to record users' activity. Data analysis was done using RapidMiner Studio 5 to provide a decision tree as a user model grouped by age and gender. This study found that male and female users among different age groups have different patterns in the recognizing process. Female users have a shorter process in the identification of age group than male users.
Conference Paper
The increase in online applications is leading to exponential growth of web content. Most business organizations are interested in knowing web user behavior to enhance their business. In this context, user navigation in static and dynamic web applications plays an important role in understanding users' interests. Static mining techniques may not be suitable as they are for dynamic web log files and decision making. Traditional web log preprocessing approaches and weblog usage patterns have limitations in analyzing the content relationship with the browsing history. This paper focuses on various static web log preprocessing and mining techniques and their limitations when applied to dynamic web mining.
Article
Predicting the intention of users on the internet is becoming more important for e-business. This paper is the first to apply the Hidden Markov Model, a stochastic tool used in information extraction, to predicting the behavior of users on the web. We collect the logs of web servers, clean the data and complete the paths that users traverse. Based on the HMM, we construct a specific model for web browsing that can predict in real time whether users have the intention to purchase. Related measures, such as speeding up operations, friendly guidance and other convenience features, can then take effect when a user is in a purchasing mode. The simulation shows that our model can predict the purchase intention of users with high accuracy.