
A Study of Path Completion Techniques in Web Usage Mining

Nirali Honest and Dr. Atul Patel
Smt. Chandaben Mohanbhai Patel Institute of Computer Applications
Charotar University of Science and Technology (CHARUSAT), Changa, India
niralihonest.mca@charusat.ac.in

Dr. Bankim Patel
Shrimad Rajchandra Institute of Management and Computer Application
Uka Tarsadia University, Bardoli, India
bankim_patel@srimca.edu.in
Abstract - Path completion is a critical and difficult task in the preprocessing phase of web usage mining. We adapt the data preprocessing phase to our goal of mining websites designed using a content management system (CMS). The data preprocessing phase includes data cleaning, user identification, session identification, formation of site structure and link details, path completion and event generation. This paper studies path completion by considering the different types of path generated when accessing a website designed using a CMS, and gives a novel algorithm to form the path.
Index Terms - Web usage, Preprocessing phase, Path completion.
I. INTRODUCTION
Web Usage Mining (WUM) extracts interesting and decisive information for a variety of people working in different domains. We focus on generating patterns for the administrator of a website that is designed using a CMS. We consider a university website and try to identify useful and meaningful information that can help the website administrator manage the site. We propose a new reactive approach that uses the web log data, the site structure and the academic calendar of the university to produce more specific behavior patterns for the University Website Access Domain (UWAD). G. Castellano, A. M. Fanelli and M. A. Torsello [6] designed LODAP (Log Data Preprocessor), which takes as input the log files of a website and outputs a database containing statistics about the pages visited by users and the identified user sessions, but it does not form sessions for the dynamic pages considered in our work.
This work is motivated by three observations: 1) WUM can be molded according to the specific mining goal. 2) There is no support for generating reports for particular events; one has to remember the interval of an event in order to generate its report. 3) Websites designed using a CMS follow a master page and content page concept, so each content page may be loaded dynamically into the master page. Such pages may not have unique page names; instead they are stored by page ID. Per-page frequency reports are not supported by certain tools, and where they are supported, the page name cannot be known when the page is identified only by its ID number.
The paper reviews the path completion techniques adopted by different authors and suggests a new approach for path formation. It is organized into six sections: Section I introduces the concept, Section II surveys the literature on path completion, Section III gives an overview of the data preprocessing phase, Section IV discusses the path completion technique, Section V discusses pattern discovery and analysis, and Section VI presents experimental results, followed by the conclusion and future work.
II. LITERATURE SURVEY
Chungsheng Zhang and Liyan Zhuang [1] observed that reconstructing accurate user sessions from logs is a challenging task because the HTTP protocol is stateless and connectionless.
Path completion is an important activity in the preprocessing phase, as many patterns can be discovered and analyzed once the complete and accurate path has been formed. Cyrus Shahabi, Amir M. Zarkesh, Jafar Adibi and Vishal Shah [3] use link sequence information to predict user links, and D. W. Cheung, B. Kao and J. W. Lee [4] analyze the web pages visited by users and perform topic spotting.
The most common methods for user session identification and path completion are the timeout, maximal forward reference and reference length methods. Many authors have implemented the path completion phase with different parameters. The method proposed by Cooley, Mobasher and Srivastava [2] assumes that the amount of time a user spends on a page depends on whether the page is an auxiliary page or a content page. Z. Chen, A. Fu, J. Tang and F. Tung [10][11] define each session as the set of pages from the first page in a request sequence to the final page before a backward reference is made. Yan Li, Boqin Feng and Qinjiao Mao [9] implement a path completion algorithm in three steps: 1) The incomplete access path is identified, and path combination is conducted. 2) The content and auxiliary pages are identified using the Maximal Forward References (MFR) and Reference Length (RL) algorithms. 3) The complete path for each user session is acquired using referrer information, and the reference length of some pages of this complete path is modified using the average reference length of auxiliary pages. G. Arumugam and S. Suguna [5] proposed the User Session Identification Algorithm (USIDALG), which contains two modules for the activities related to user identification and session identification.
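The session definition based on backward references can be illustrated with a short sketch. It is a simplified rendering of the general maximal forward reference idea, not the specific algorithms of [10] or [11]; the function name and the use of single-letter page labels are assumptions made for illustration.

```python
from typing import Hashable, List, Sequence


def maximal_forward_references(pages: Sequence[Hashable]) -> List[List[Hashable]]:
    """Split one user's request sequence into maximal forward references.

    A backward reference is a request for a page that is already on the
    current forward path. When one follows forward movement, the path
    built so far is emitted and then cut back to the revisited page.
    """
    results: List[List[Hashable]] = []
    path: List[Hashable] = []
    moving_forward = False
    for page in pages:
        if page in path:
            # Backward reference: emit only if the user had been moving forward.
            if moving_forward:
                results.append(list(path))
            # Drop everything after the revisited page.
            path = path[: path.index(page) + 1]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    # The last forward run has no backward reference to close it.
    if moving_forward and path:
        results.append(list(path))
    return results


if __name__ == "__main__":
    for mfr in maximal_forward_references(list("ABCDCBEGHGWAOUOV")):
        print("".join(mfr))
```

Each emitted list runs from the first page of the sequence to the last page reached before the user turned back, which is the notion of session described above.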
Cooley, Mobasher and Srivastava [2] and Z. Chen, A. Fu, J. Tang and F. Tung [10] suggest that processing l backward references among L logs has a time complexity of O(N/2 * l), where N is the number of pages on the server. The complete path generation process fails to generate a correct path when pages are referred from other servers, so the level of accuracy is reduced. In the approach of Cooley, Mobasher and Srivastava [2] there is no algorithm for generating a complete set of the User Session Sequences (USS). Yan Li, Boqin Feng and Qinjiao Mao [9] report that, using SbSfxminer and Absfxminer, the complete set of the USS is generated with a time complexity of $\sum_{i=1}^{|MFRS|} |MFRS_i|$. G. Arumugam and S. Suguna [5] analyze performance on three parameters: 1) generating a complete path, 2) the time complexity of generating the complete set of USS, and 3) the accuracy in generating a complete set of the USS. They obtain the time complexity for generating the complete USS from the formula O(l * log(n/2)) / (number of searches per second). None of these approaches considers pages designed using a CMS, so this area requires attention and is the focus of our work.
2015 IEEE International Conference on Computational Intelligence & Communication Technology
978-1-4799-6023-1/15 $31.00 © 2015 IEEE
DOI 10.1109/CICT.2015.64
III. AN OVERVIEW OF DATA PREPROCESSING FOR UWAD
This phase cleans and processes the data to make it available for analysis. Our preprocessing phase comprises the following steps: Data Cleaning, User Identification, Session Identification, Path Completion, Generating Site Structure (Site Map, Mapping Page Number and Name) and Generating Academic Events. A detailed description of the steps is given by Nirali Honest, Dr. Bankim Patel and Dr. Atul Patel [7][8]. Figure 1 shows the architecture of the data preprocessing phase for UWAD.
Fig. 1 Architecture of Preprocessing phase UWAD
The website structure considered in our work is based on a website designed using a Content Management System, which has two characteristics: 1) the pages are generated with unique identifiers, and 2) the pages may have logical names apart from their actual names. These characteristics are important to understand before carrying out path completion. The website structure we consider is shown in Figure 2.
Fig. 2 Website structure
IV. PATH COMPLETION
In the literature survey we found that authors have worked with static pages; in our work we consider websites designed with dynamic pages. Websites designed using a CMS do not have a unique page name for every page; instead, the pages have an ID by which their content can be retrieved. This makes path completion difficult and complex. In the preprocessing phase, after identifying sessions, we build the page name by reading the page ID and page name from XML files such as the RSS feed and the site map. During path completion we add missing pages to the session, remove duplicate pages in consecutive accesses within a given session, and map the names of pages to the page numbers. In addition, we add the concept of an event, taken from the academic calendar of the university environment. Adding events allows the web logs to be mined on a temporal basis, as users do not access the website in the same way all the time. Based on events, many patterns can be discovered and analyzed.
A. Definitions
Different paths are formed depending on what the user accesses on the website. Before attempting path completion, it is necessary to understand the types of path. The definitions of the various paths are listed below.
Definition of Path: A path p = {p1, p2, ..., pn}, where n is the number of pages traversed in a single session.
Definition of Simple path: A path p containing the value domainname/page.aspx?id=pagenumber.
Definition of ID and Name path: A path p containing the value domainname/page.aspx?id=pagenumber&name=title. This path is formed when the page of a tab is selected for the first time.
Definition of UI and ID path: A path p containing the value domainname/UIname/page.aspx?id=pagenumber. The website we consider has more than one user interface (UI), and the UI name is included in the path.
Definition of UI and key path: A path p containing the value domainname/UIname/page.aspx?key=number. This is used for a right tab, and key indicates the tab number.
Definition of UI and pOpen path: A path p containing the value domainname/page.aspx?id=pagenumber&pOpen=0. When a left tab has more than one subtab, it is referred to by pOpen; the value of pOpen refers to the inner tab that is open on the left tab.
Definition of resource path: A path p containing the value domainname/UIname/download/advertisements/b.pdf. A file ending in pdf or an image extension is a resource file.
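The path types defined above can be told apart from the shape of the logged URL. The following is a minimal sketch under stated assumptions: the input is the part of the URL after the domain name, query keys are compared case-insensitively, and the function name, the returned labels, the order of the checks and the list of image extensions are all hypothetical choices made for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed set of resource extensions; the text only says "pdf or an image".
RESOURCE_EXTENSIONS = (".pdf", ".jpg", ".jpeg", ".png", ".gif")


def classify_path(uri: str) -> str:
    """Return a label for the path type of a URI given without its domain name."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    # Lower-case the keys so that "ID" and "id" are treated alike.
    query = {k.lower(): v[0] for k, v in parse_qs(parts.query).items()}
    has_ui = len(segments) >= 2  # a folder in front of the page is taken as the UI name

    if parts.path.lower().endswith(RESOURCE_EXTENSIONS):
        return "resource"
    if "key" in query and has_ui:
        return "ui_key"
    if "popen" in query:
        return "ui_popen"
    if "id" in query and "name" in query:
        return "id_name"
    if "id" in query:
        return "ui_id" if has_ui else "simple"
    return "other"


if __name__ == "__main__":
    for uri in ("/page.aspx?id=12", "/UIname/page.aspx?key=2", "/UIname/download/advertisements/b.pdf"):
        print(uri, "->", classify_path(uri))
```

A URL that carries both a UI name and a name parameter is labelled as an ID and Name path here; that precedence is a design choice of the sketch, not something the definitions prescribe.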
Examples of the above definitions include the following paths,
B. Site Structure and Link Details
To map the page ID to the page name, we need to consider the site structure and store the link details of every link of the website. We do this by reading the RSS feed and the site map of the website. The files have the structure shown in Figure 3.
Fig. 3 Site map and RSS feed files snapshot
After the files are read and parsed, the link details and site structure are stored as shown in Figure 4.
Figure 4: a) Link details
Figure 4: b) Site structure
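As an illustration of how the ID-to-name mapping could be read from a feed, the sketch below parses a standard RSS 2.0 document and pairs the id query parameter of each item link with the item title. The actual file layout is the one in Figure 3 and is not reproduced here, so the sample feed, the example.edu URLs, the function name and the replacement of spaces by underscores are all assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Dict
from urllib.parse import parse_qs, urlsplit

# Hypothetical feed; element names follow the generic RSS 2.0 layout.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example University</title>
<item><title>About University</title><link>http://www.example.edu/UI/Content.aspx?ID=3</link></item>
<item><title>Academics</title><link>http://www.example.edu/UI/Content.aspx?ID=6</link></item>
<item><title>Home</title><link>http://www.example.edu/UI/Main.aspx</link></item>
</channel></rss>"""


def page_names_from_rss(rss_text: str) -> Dict[str, str]:
    """Map page ID (as a string) to page name, using the items of an RSS feed."""
    names: Dict[str, str] = {}
    root = ET.fromstring(rss_text)
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        query = {k.lower(): v[0] for k, v in parse_qs(urlsplit(link).query).items()}
        page_id = query.get("id")
        if page_id and title:
            # Underscores instead of spaces, so the name can be used inside a path.
            names[page_id] = title.replace(" ", "_")
    return names


if __name__ == "__main__":
    print(page_names_from_rss(SAMPLE_RSS))
```

Items whose link has no id parameter are skipped, since they give no page number to map.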
C. Construction of path
Construction of the path consists of the following steps:
1) Read user session Ui, where Ui is one of {U1, U2, ..., UN} and N is the total number of sessions.
2) Divide the first URL of the session into the pages accessed, pi = {p1, p2, ..., pn}, where n is the number of pages traversed in a single session.
3) Read pi and calculate the length of the page, checking it against the link name stored in the link details. Record the link name, UI name and level in the path.
4) Find the type of the path among the existing links; if the path is of a type other than a simple path, replace it with the actual name of the link.
5) Read the second URL in the session.
   IF the URL is the same as the first, do not add it to the path; read the next URL.
   IF the URL is different, record the link name, UI name, level and distance.
6) Compare the UI names. If they are the same, enter the page name in the path; otherwise calculate the nodes to be traversed and list the new pages in between the first and the second URL.
7) Repeat the above steps for all URLs in the given session.
8) Append the URLs of the given session into a single path.
9) Repeat the above steps for all sessions in the given file.
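The core of these steps (skipping consecutive duplicates, replacing IDs by names, and listing the pages in between two accesses) can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes the site structure is a tree given as a child-to-parent map, it ignores the UI name comparison and the level and distance bookkeeping, and every name, ID and page in it is hypothetical.

```python
from typing import Dict, List, Optional


def ancestors(page: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return [page, its parent, ..., root] according to the parent map."""
    chain = [page]
    while parent.get(chain[-1]) is not None:
        chain.append(parent[chain[-1]])
    return chain


def pages_between(src: str, dst: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Pages strictly between src and dst on the tree route through their common ancestor."""
    up, down = ancestors(src, parent), ancestors(dst, parent)
    common = next((p for p in up if p in down), None)
    if common is None:
        return []  # unrelated pages: nothing can be inferred
    route = up[: up.index(common) + 1] + list(reversed(down[: down.index(common)]))
    return route[1:-1]


def complete_path(session_ids: List[str], names: Dict[str, str],
                  parent: Dict[str, Optional[str]]) -> List[str]:
    """Build one path for a session given as a list of page IDs."""
    path: List[str] = []
    for page_id in session_ids:
        page = names.get(page_id, page_id)  # keep the raw ID if no name is known
        if path and path[-1] == page:
            continue  # same URL as the previous one: do not add it again
        if path:
            path.extend(pages_between(path[-1], page, parent))  # add missing pages
        path.append(page)
    return path


if __name__ == "__main__":
    names = {"1": "Home", "3": "About_University", "6": "Academics", "7": "Syllabi"}
    parent = {"Home": None, "About_University": "Home", "Academics": "Home", "Syllabi": "Academics"}
    print("|".join(complete_path(["1", "3", "3", "7"], names, parent)))
```

In the sample session the jump from About_University to Syllabi is not a direct link in the assumed tree, so Home and Academics are inserted in between.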
We consider two parameters: the time to build the path, and accuracy. The time to build a single path is
$P(T_i) = \sum_{i=1}^{n} (T_1 \cdot T_2 \cdot T_3 \cdot T_4) / n$
where
T1 = time to read each session
T2 = time to calculate level and distance
T3 = time to add or remove pages
T4 = time to search and map page ID and name
n = number of pages in the given session
The total time to build paths for the given user sessions is
$P(TT_i) = \sum_{i=1}^{n} P(T_i) / n$
where
P(Ti) = time to build a single path
n = number of paths
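The two formulas can be evaluated as in the sketch below. It rests on an interpretation that is an assumption: T1 to T4 are taken to be measured once per page, the product is formed per page as the formula states, and the result is averaged over the n pages; the total then averages the per-path values. The function names and the timings are hypothetical.

```python
from math import prod
from typing import Sequence, Tuple

PageTimes = Tuple[float, float, float, float]  # (T1, T2, T3, T4) for one page


def time_per_path(page_times: Sequence[PageTimes]) -> float:
    """P(T_i): sum over the pages of T1*T2*T3*T4, divided by the number of pages."""
    n = len(page_times)
    if n == 0:
        return 0.0
    return sum(prod(t) for t in page_times) / n


def total_time(path_times: Sequence[float]) -> float:
    """P(TT_i): sum of the per-path times, divided by the number of paths."""
    n = len(path_times)
    return sum(path_times) / n if n else 0.0


if __name__ == "__main__":
    first = time_per_path([(1.0, 2.0, 3.0, 4.0), (1.0, 1.0, 1.0, 1.0)])
    second = time_per_path([(2.0, 1.0, 1.0, 1.0)])
    print(first, second, total_time([first, second]))
```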
After the algorithm is applied, the resultant path contains all the missing pages, and every ID is replaced by its page name, so the paths are more legible when the pages are used for discovering patterns.
o Example of path completion after mapping of name
/CharUSATUI/MainWebsitePage2.aspx/
/CharUSATUI/Content.aspx?ID=3&name=About_University
/CharUSATUI/Content.aspx/ID=6&name=Academics
/CITC_UI/Content.aspx/ID=37
/CharUSATUI/Content.aspx/ID=6&name=Academics
/CharUSATUI/NewsAnnouncementDetail.aspx
o Example of pages after path completion
CharUSATUI/MainWebsitePage2.aspx|CharUSATUI/About_University|CharUSATUI/Academics/Syllabi|/CITC_UI/Syllabus_of_mechanical_Engineering|/CharUSATUI/Academics/Syllabi|/CharUSATUI/NewsAnnouncementDetail.aspx
D. Event data generation
An event can be anything, depending on the type of website. For online shopping, events can be a festival sale, an end-of-season sale, a brand-wise sale, etc. The notion of an event emphasizes that web users may visit and access the website differently during different periods. In our work we capture the events and mine the web logs particular to those events alongside the regular access to the website.
In any university, many events occur during a particular academic year. An event is a special occurrence of an operation that lasts for a finite time. In a university, academic events include Recruitment, Admission, Display of results, Announcement of a workshop, etc. During these academic events, access to the website differs from the regular access. In this paper we therefore show how academic events are specified, so that the patterns of access during these events can be analyzed and compared with the normal access to the website. A variety of details are stored for generating academic events, such as the academic year, name of the institute, name of the event, and the interval of the event, i.e. its start and end dates. This is the last phase of data preprocessing.
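A possible representation of the stored event details, and of the lookup that decides whether an access falls inside an event, is sketched below. The record fields follow the details listed above; the class name, the function name, the sample events and their dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class AcademicEvent:
    academic_year: str
    institute: str
    name: str
    start: date  # first day of the event interval
    end: date    # last day of the event interval (inclusive)


def events_on(day: date, events: List[AcademicEvent]) -> List[str]:
    """Names of the events whose interval contains the given day.

    An empty list means the access counts as regular access.
    """
    return [e.name for e in events if e.start <= day <= e.end]


if __name__ == "__main__":
    calendar = [
        AcademicEvent("2013-14", "Example Institute", "Admission", date(2013, 6, 1), date(2013, 7, 15)),
        AcademicEvent("2013-14", "Example Institute", "Recruitment", date(2013, 7, 10), date(2013, 7, 20)),
    ]
    print(events_on(date(2013, 7, 12), calendar))
    print(events_on(date(2013, 9, 1), calendar))
```

Returning a list rather than a single name allows overlapping events, so a log entry can be counted under every event that was running on that day.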
V. PATTERN DISCOVERY AND PATTERN ANALYSIS
A. Pattern Discovery
After the data preprocessing phase is complete, the next phase is pattern discovery. The purpose of the pattern discovery phase is to produce meaningful patterns from the data stored after cleaning and reforming. Figure 5 shows the process of pattern discovery. The access patterns that can be discovered for UWAD are as follows:
• Operating systems and browsers used by the user while accessing the website.
• Access to the website
  o Hourly, daily, monthly, yearly (regular, or event based: admission, recruitment, etc.)
• Time spent by the user on the website.
• User navigation
  o First page accessed by the user, last page accessed by the user, all the pages accessed, frequency of pages, order of pages accessed.
Fig 5. Pattern Discovery phase of UWAD
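Several of the access patterns listed above reduce to simple counts over completed sessions. The sketch below shows one way to compute per-page frequency, first and last pages, and hourly access counts; it assumes each session is a time-ordered list of (timestamp, page name) pairs, and the function name, the result keys and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List, Tuple

Session = List[Tuple[datetime, str]]  # (timestamp, page name), in access order


def summarize(sessions: List[Session]) -> Dict[str, Counter]:
    """Count page frequency, entry pages, exit pages and accesses per hour of day."""
    page_freq: Counter = Counter()
    first_pages: Counter = Counter()
    last_pages: Counter = Counter()
    hourly: Counter = Counter()
    for session in sessions:
        if not session:
            continue
        first_pages[session[0][1]] += 1
        last_pages[session[-1][1]] += 1
        for timestamp, page in session:
            page_freq[page] += 1
            hourly[timestamp.hour] += 1
    return {"page_freq": page_freq, "first": first_pages, "last": last_pages, "hourly": hourly}


if __name__ == "__main__":
    sessions = [
        [(datetime(2013, 7, 12, 10, 5), "Home"), (datetime(2013, 7, 12, 10, 7), "Academics")],
        [(datetime(2013, 7, 12, 11, 30), "Home"), (datetime(2013, 7, 12, 11, 31), "Admission")],
    ]
    summary = summarize(sessions)
    print(summary["page_freq"].most_common(1))
    print(dict(summary["hourly"]))
```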
B. Pattern Analysis
The discovered patterns are then analyzed to find the pages that are accessed the most and the browsers and operating systems used the most, and, based on user navigation, to predict user accesses and derive the users' interest in accessing the pages.
VI. EXPERIMENTAL RESULTS
The experimental results for pattern discovery are presented in the figures below. Figure 6 a) shows the user interface for discovering patterns of user activity, such as the number of users, sessions and pages accessed, along weekly, daily and hourly dimensions. Figure 6 b) shows the number of users on the given date(s) for the selected hour. Figure 6 c) shows the number of files accessed per hour.
Fig 6. a) Pattern Discovery Interface
[Fig. 5 diagram labels: Cleaned Data, Session Data, Path Completion, Event Data, Mining algorithms for temporal aspects]
Fig 6. b) No. of users on a given date & hour
Fig 6. c) No. of files accessed per hour.
Figure 7 a) shows the user interface for daily reports, such as the browsers used, the OS used and the error types for a selected date. Figure 7 b) shows the browsers used on the given date. Figure 7 c) shows the OS used on the given date. Figure 7 d) shows the types of errors that occurred on the given date.
Fig 7. a) Daily based activity interface
Fig 7. b) Browsers used on the given date
Fig 7. c) OS used on the given date
Fig 7. d) Errors occurred on the given date
Figure 8 a) shows the user interface for per page frequency
based on individual files or based on events. Figure 8 b) shows
the pages accessed on the given date with page id replaced
with the meaningful page name.
Fig 8. a) Reports interface
Fig 8. b) Per page frequency on the given date
CONCLUSION
The analysis suggests that existing techniques need to address dynamic pages designed using a CMS, as such pages add complexity. This calls for further study and for further work that extends the existing algorithms.
ACKNOWLEDGMENT
The authors thank the Charotar University of Science and
Technology (CHARUSAT) for providing the necessary
resources to carry out this work.
REFERENCES
[1] Chungsheng Zhang and Liyan Zhuang, "New Path Filling Method on Data Preprocessing in Web Mining", Computer and Information Science Journal, August 2008.
[2] R. Cooley, B. Mobasher and J. Srivastava, "Data preparation for mining World Wide Web browsing patterns", Journal of Knowledge and Information Systems, Vol. 1, Page(s): 5-32, 1999.
[3] Cyrus Shahabi, Amir M. Zarkesh, Jafar Adibi and Vishal Shah, "Knowledge Discovery from Users Web-Page Navigation", IEEE RIDE, 1997.
[4] D. W. Cheung, B. Kao and J. W. Lee, "Discovering User Access Patterns on the World-Wide Web", Proc. First Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-97).
[5] G. Arumugam and S. Suguna, "Optimal Algorithms for Generation of User Session Sequences Using Server Side Web User Logs", Network and Service Security, 2009.
[6] G. Castellano, A. M. Fanelli and M. A. Torsello, "Log Data Preparation For Mining Web Usage Patterns", IADIS International Conference Applied Computing, ISBN: 978-972-8924-30-0, 2007.
[7] Nirali Honest, Dr. Bankim Patel and Dr. Atul Patel, "Preprocessing phase for University Website Access Domain", International Journal of Scientific & Engineering Research (IJSER), ISSN: 2229-5518, 4, No. 6, June 2013.
[8] Nirali Honest, Dr. Bankim Patel and Dr. Atul Patel, "Sessionization Process for the Pages Designed with the Concept of CMS", International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277 128X, Volume 3, Issue 9, September 2013.
[9] Yan Li, Boqin Feng and Qinjiao Mao, "Research on Path Completion Technique in Web Usage Mining", International Symposium on Computer Science and Computational Technology, 2008.
[10] Z. Chen, A. Fu, J. Tang and F. Tung, "Optimal algorithms for finding user web access sessions", Journal of World Wide Web: Internet and Information Systems, Vol. 6, Page(s): 259-279, Springer, 2003.
[11] Z. Chen, A. Fu, R. H. Fowler and C. Wang, "Efficient Web Mining of Frequent Traversal Patterns", in Anthony Scime (ed.), Web Mining: Applications and Techniques, Page(s): 322-338, Idea Group Publishing, August 2004.