A Study on Web Mining Tools and Techniques
Saleh Mowla
Department of Information and Communication Technology
Manipal Institute of Technology, Manipal, Karnataka-576104, India
Email- saleh.mowla1796@gmail.com
Ishita Bedi
Department of Information and Communication Technology
Manipal Institute of Technology, Manipal, Karnataka-576104, India
Email- ibishitabedi@gmail.com
Nisha P.Shetty
Department of Information and Communication Technology
Manipal Institute of Technology, Manipal, Karnataka-576104, India
Email- ibishitabedi@gmail.com
Abstract- The Web today has become a repository of knowledge in many forms, such as text, audio, graphics, video and multimedia. With the passage of time, the World Wide Web has become clogged with information, making the extraction of vital information arduous and cumbersome. Web mining is a branch of data mining which deals with searching, extracting and filtering useful data stored in web server databases and logs. This paper is a detailed study of the various techniques involved in mining web data on the basis of its application. The paper also describes various tools involved in web mining and concludes with the challenges faced, future aspects and applications.
Keywords: Web Content Mining, WEKA, R, Web Usage Mining, Web Structure Mining, Web Miner
I. INTRODUCTION
With the development of this huge repository of information called the Web, information is no longer limited to one computer and can be stored, accessed and updated from any computer located in any corner of the world. When a user queries a search engine, numerous links containing data pop up, and the selection of vital information from these links is crucial. Information on the Web does not adhere to one single common format, which is why mining and preprocessing the humongous data is an absolute necessity. Within the data available, there are many hidden patterns which cannot be detected in one go. To find these patterns, data mining is required, which automates the task of analyzing the information from a certain perspective to solve a given problem statement.
1.1 Web Mining Problems[1]
Data extraction: It is important to know whether the data being mined is structured or unstructured so that the appropriate machine learning and automatic extraction techniques can be used. Some data will also be incorrect or incomplete and must be examined with great accuracy. Personal data on the Web must be suitably protected from unauthorized access.
Information integration & schema matching: Different websites and pages may represent the same
information in various manners; identification of similar data and classifying or categorizing them from the
vast data warehouse (i.e. the Internet) can be a difficult task.
Opinion extraction: It is not easy to interpret the tone of opinions collected from various chatrooms,
discussion forums and blogs; misinterpretation of data gathered will give a completely different result on
analysis.
Knowledge synthesis: Concept hierarchies or ontologies can be used in a variety of applications. However, generating them manually is not feasible, as it takes a long time. The aim here is to organize the bits and pieces of information scattered around the Web and derive something valuable from them.
Detecting Noise: Very often the main content of a webpage goes unnoticed due to the surplus of hyperlinks, advertisements, copyright notices, etc. on the page. Extracting the useful information is a tedious but necessary process.
1.2 Types of Web Mining
Based on the target data, web mining can be divided into three categories [2], as shown in Figure 1.
Figure 1. Types of Web Mining
II. WEB CONTENT MINING
This technique involves procedures to extract and integrate data from varied web page contents. It aims at evaluating and mapping the information to provide adequate results to users' queries [4]. The types of web content mining are shown in Figure 2.
When extracting Web content information using web mining, there are four typical steps:
Collect: gather the contents from the Web.
Parse: mine usable data from formatted data (HTML, PDF, etc.).
Analyze: tokenize, rate, classify, cluster, filter, sort, etc.
Produce: turn the results of the analysis into something useful (a report, a search index, etc.).
Figure 2. Web Content Mining Types
2.1. Unstructured Data Mining
Almost all the data on the Web is unstructured. Unstructured data refers to data that does not fit into any database or structured form; examples include text, images, videos, etc. Because of the dominance of unstructured data over other types, it becomes essential to mine it efficiently [4, 5].
Data/Information Extraction: Data extraction helps to analyze results and provide services. Since the volume of data is very large, patterns are matched to extract meaningful information: certain keywords and phrases are traced and connections within the text are found. This technique helps extract information from large bodies of unstructured data, and missing information can be inferred using additional rules. Some predictions can be incorrect, in which case they are discarded.
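Tracing keywords and phrases of this kind can be sketched with regular expressions; the sample text and the two patterns below are fabricated and deliberately simplified illustrations, not a production extractor.

```python
import re

# Illustrative unstructured text (fabricated for this sketch).
TEXT = """Contact the lab at info@example.org or call 555-0123.
The project lead can be reached at rao@example.org."""

def extract_emails(text):
    # Trace user@domain patterns (a simplification of real e-mail syntax).
    return re.findall(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}", text)

def extract_phones(text):
    # Trace simple ddd-dddd phone patterns.
    return re.findall(r"\b\d{3}-\d{4}\b", text)

print(extract_emails(TEXT))  # ['info@example.org', 'rao@example.org']
print(extract_phones(TEXT))  # ['555-0123']
```

Real systems layer many such patterns and add rules for inferring missing fields and discarding incorrect predictions, as described above.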
Topic tracking: This technique studies the type of documents already viewed by the user and the user's profile. After analysis, it predicts and suggests documents related to the user's interests. Many search engines use this technique. Its main task is to study a stream of resources to find the particular topic contained in the positive samples [6]. This information is generally scattered, and the technique might suggest irrelevant documents as well. It can be applied to fields such as education, medicine, finance and business.
Clustering: A cluster is a group of similar objects. For cluster analysis, data is divided into groups based on similarity and labelling is done within the groups. Many types of partitioning are possible, such as soft and hard clustering: an object may be allowed or disallowed to be part of multiple clusters, or the objects may be related in a hierarchical manner. Clustering is usually done on the fly, because of which useful documents are not omitted [3]. This helps the user to select a topic of interest.
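A minimal sketch of such grouping is a greedy single-pass ("hard") clustering over document word sets; the similarity threshold and sample documents are illustrative choices, not part of any particular published algorithm.

```python
def jaccard(a, b):
    # Similarity of two documents as the overlap of their word sets.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def cluster(docs, threshold=0.3):
    # Greedy single-pass hard clustering: each document joins the first
    # cluster whose first member is similar enough, else starts its own.
    # The threshold of 0.3 is an arbitrary illustrative choice.
    clusters = []
    for doc in docs:
        for group in clusters:
            if jaccard(doc, group[0]) >= threshold:
                group.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters

docs = [
    "apple banana fruit salad",
    "banana apple fruit bowl",
    "linux kernel device drivers",
]
print(len(cluster(docs)))  # 2: the two fruit documents group together
```

Soft clustering would instead let a document belong to several groups with membership weights.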
Information Visualization: It is "the use of computer-supported, interactive, visual representations of abstract data to amplify cognition" (Card, Mackinlay, and Shneiderman, 1999). Useful for finding similar or related topics in a huge set of documents, it represents data abstractly in graphical form using feature extraction and key-term indexing. Large texts are represented hierarchically in graphical form and can be analyzed by zooming, scaling, etc. [5].
Summarization: This technique reduces the length of a document while keeping only its main points, and is useful for getting the gist of a topic. The time taken to summarize a text this way is comparatively small. The software should analyze the semantics, scan headings and subheadings, interpret meanings, etc. to summarize a given text. Examples include Microsoft's AutoSummarize and online text compactors.
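The idea can be sketched as a naive frequency-based extractive summarizer: score each sentence by how frequent its words are in the whole document and keep the top-scoring sentences. This is a toy stand-in for the tools named above, which also analyze semantics and headings.

```python
from collections import Counter

def summarize(text, n_sentences=1):
    # Score each sentence by the document-wide frequency of its words
    # and keep the top-scoring sentences in their original order.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w.lower() for s in sentences for w in s.split())
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w.lower()] for w in sentences[i].split()),
    )
    keep = sorted(ranked[:n_sentences])
    return ". ".join(sentences[i] for i in keep) + "."

text = ("Web mining mines the web. Cats sleep. "
        "Web mining helps search the web.")
print(summarize(text))  # Web mining helps search the web.
```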
2.2. Structured Data Mining
Highly organized data is classified as structured data. It can refer to data within a file or a record. The degree of organization is such that insertion into databases is seamless, and, as opposed to unstructured data, it is searchable by simple algorithms and search operations.
Web Crawler: A web crawler or web spider is an automated script that browses the web in a planned way.
The crawlers regularly scan the content of the pages on the web for the words and the location of the words
in the pages. This is converted to an index which is essentially a list of words and web pages they reside in.
The external crawler traverses an unknown website and the internal crawler traverses the internal pages of
that website. Types include focused, incremental, distributed and parallel web crawlers.
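The indexing step a crawler performs can be sketched with Python's standard html.parser. The page below is a static string fabricated for illustration; a real crawler would fetch pages over HTTP and then follow the links it records here.

```python
from html.parser import HTMLParser

class PageIndexer(HTMLParser):
    # A minimal sketch of a crawler's indexing step: record the
    # hyperlinks on a page and the set of words in its text, which a
    # search engine would fold into a word -> pages index.
    def __init__(self):
        super().__init__()
        self.links, self.words = [], set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_data(self, data):
        self.words.update(w.lower() for w in data.split())

page = '<html><body>Web mining <a href="/tools.html">tools</a></body></html>'
indexer = PageIndexer()
indexer.feed(page)
print(indexer.links)          # ['/tools.html']
print(sorted(indexer.words))  # ['mining', 'tools', 'web']
```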
Page content mining: Traditional search engines rank the pages that are retrieved and classified. The results are displayed according to rank, and classification is done as per the pages' importance based on their Page Content Rank (PCR).
Wrapper Generation: While traditional search engines simply rank web pages, wrapper generation provides information on the capability of sources: the queries they will answer and the output types they return. Based on the query, web pages are retrieved on the basis of their page rank value. Wrappers provide meta-information such as statistics, domains, etc. [5].
2.3. Semi-Structured Data Mining
Semi-structured data is not raw data but rather structured data that is not stored in a systematized manner such as tables. As documents on the Web originate from diverse sources, it is not feasible to store such data in a single format.
Object Exchange Model (OEM): The OEM helps to understand the information structure on the web more
accurately and is best suited for an ever-changing and heterogeneous environment. The structures of objects
are self-describing in nature.
Web Data Extraction Language: Web data is converted to structured form and stored in a tabular form. End
users can hence access it.
Top down Extraction: Complex objects from rich web sources are extracted and changed into less complex
objects until the most atomic ones are extracted.
2.4. Multi-media Data Mining
Multimedia data mining is the process of examining stimulating patterns from media data like graphics, audio, video
and text
SKICAT: This system is an astronomical data analysis system. It is used to produce a digital catalog of the
sky objects. It is a mix of image processing and data classification and classifies objects into human usable
classes using machine learning.
Color Histogram Matching: Its two basic constituents are equalization and smoothing. Equalization deals with the correlation between color components; it suffers from the sparse-data problem, in which unwanted artifacts appear in equalized images. Smoothing solves this problem.
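As a sketch of the equalization step, here is classic histogram equalization on a single grayscale channel in pure Python; full color equalization additionally has to handle the correlation between channels noted above, which this simplified example ignores.

```python
def equalize(pixels, levels=256):
    # Map each intensity through the normalized cumulative histogram so
    # that the values spread over the full intensity range.
    # Assumes at least two distinct intensities are present.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)  # first non-zero CDF value
    n = len(pixels)
    return [round((cdf[p] - cdf_min) / (n - cdf_min) * (levels - 1))
            for p in pixels]

# A tiny 4-pixel "image" with 4 intensity levels is stretched to use
# the whole 0..3 range after equalization.
print(equalize([0, 0, 1, 3], levels=4))  # [0, 0, 2, 3]
```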
Multimedia Miner: Here, the image excavator extracts images and videos, and the preprocessor extracts image features and stores them in a database. A search kernel matches queries with the images and videos in the database, and the discovery module runs image information mining routines to trace the patterns in images.
Shot Boundary Detection: This method helps to automatically detect the boundaries between shots in
videos.
III. WEB USAGE MINING
The process of examining the interaction of users with the Web by studying web logs in order to improve personalization
and provide better search engines [7].
3.1. Steps used in Web Usage mining
Data gathering: Web logs are the records in which the server stores information about users' activities on the Web. A log can be present on the server, client or proxy side and contains vital information such as the user's domain, subdomain, hostname, resources accessed and so on.
Data preprocessing: It involves picking out and cleaning the desired users' entries and the contents of their sessions.
Pattern discovery, analysis and visualization: Log records are scrutinized to learn the usage profile of a particular user.
Application: This knowledge can be applied to improve business in many e-commerce sectors.
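The data-gathering step above can be sketched as parsing one entry of a web server access log in the widely used Common Log Format; the log line and the fields kept here are illustrative, and real preprocessors extract many more fields (referrer, user agent, session markers).

```python
import re

# Pattern for the Common Log Format:
# host ident authuser [timestamp] "method resource protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) \S+" (?P<status>\d{3}) \S+'
)

# Fabricated example entry.
line = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
m = LOG_PATTERN.match(line)
print(m.group("host"))      # 203.0.113.7
print(m.group("resource"))  # /index.html
print(m.group("status"))    # 200
```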
3.2. Techniques
1) Association Rule Mining: The algorithms in this category are applied to improve the design of the web space: to decide where to put hyperlinks, which pages to connect, and to predict the next page likely to grab the user's attention.
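A minimal sketch of the counting at the heart of such algorithms: find page pairs that are visited together in enough sessions. Full association-rule miners such as Apriori extend this to larger itemsets and derive rules with confidence values; the sessions below are fabricated.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(sessions, min_support=2):
    # Count how often each pair of pages co-occurs in a session and
    # keep pairs meeting the minimum support threshold.
    counts = Counter()
    for session in sessions:
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
]
print(frequent_pairs(sessions))  # {('/home', '/products'): 2}
```

A frequent pair like ('/home', '/products') suggests linking those pages prominently to each other.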
2) Clustering: These algorithms group similar elements together and provide a clear distinction between divergent elements. Two varieties of clusters can be formed:
Page clusters: group pages of similar content together, and
Usage clusters: group users having analogous surfing patterns.
3) Classification: It provides classes to web users based on their browsing behaviors. This classification is useful for
sectors such as e-commerce to further their businesses.
IV. WEB STRUCTURE MINING
Web structure mining is used to find connections between different web pages which are linked either by some shared information or by a direct link. These connections are beneficial because they let search engines extract web pages from web sites based on the search query directly. This is done by spiders which scan web sites, reach the home page and then follow the reference links to get to a particular page. Web structure mining uses graph theory to do so.
Web structure mining involves two basic tasks: extraction of patterns from hyperlinks in the web and analysis of the tree
like document structure.
4.1. Algorithms
1) Google's PageRank Algorithm: The web pages are ranked based on the number of backlinks pointing to them. A total
page rank is assigned to all the pages based on the page ranks of the backlinks pointing to them [8].
Essentially, page rank is a vote given to a webpage by all other webpages based on its importance [9]. Every link counts as a vote of support, and the absence of a link means no support (there has been no vote, not a vote against the page) [9].
The page rank of a page A is given by:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where T1, ..., Tn are the pages linking to A, C(T) is the number of outbound links on page T, and d is a damping factor, usually set to 0.85 [9]. The Google toolbar shows the page rank of a webpage (actually something like the logarithm base 10 of the actual page rank).
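The formula above lends itself to a simple iterative computation, sketched below on a toy link graph; the graph and iteration count are illustrative, and production implementations use sparse matrices and convergence tests instead.

```python
def pagerank(links, d=0.85, iterations=50):
    # links maps each page to the list of pages it links to.
    # Each round applies PR(A) = (1-d) + d * sum(PR(T)/C(T)) over the
    # pages T linking to A, with C(T) the number of outbound links of T.
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[t] / len(links[t])
                                 for t in pages if p in links[t])
            for p in pages
        }
    return pr

# A links to B and C, B links to C, C links to A: page C collects
# votes from both A and B and ends up with the highest rank.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```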
2) HITS algorithm: Used by the Ask.com search engine, this algorithm makes use of the link structure of the web in
order to find and rank pages relevant to certain topics.
First, the most relevant pages to the query are retrieved which can be done in many ways; this set of most relevant
pages is called the root set. Thereafter, the links of the webpages are explored and all the web pages that are linked
from it and some of the web pages that link to it are added to the root set forming a base set. The web pages in the
base set and the hyperlinks among those pages form a subgraph on which the HITS computation occurs [10].
Two values, namely the authority value and the hub value, are defined in terms of each other. The authority value of a page is the sum of the scaled hub values of the pages that point to it; the hub value is the sum of the scaled authority values of the pages that it points to. The algorithm is query-based and iterative, with each iteration involving two basic steps:
Authority update: Each node's authority score is set to the sum of the hub scores of the nodes that point to it. Hence, a node which is linked from pages recognized as hubs for information is given a higher authority.
Hub update: Each node's hub score is set to the sum of the authority scores of the nodes that it points to. Hence, a node that links to nodes regarded as authorities on the subject receives a high hub score.
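Under those two update rules, the iteration can be sketched as follows. Scores are scaled by their sum each round for simplicity (the original algorithm normalizes by the sum of squares), and the tiny graph is purely illustrative.

```python
def hits(links, iterations=20):
    # links maps each page of the base-set subgraph to the pages it
    # links to. Returns (hub, authority) score dictionaries.
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages pointing to the node.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        s = sum(auth.values()) or 1.0
        auth = {p: v / s for p, v in auth.items()}
        # Hub update: sum of authority scores of pages the node points to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        s = sum(hub.values()) or 1.0
        hub = {p: v / s for p, v in hub.items()}
    return hub, auth

# "portal" links to both content pages, so it becomes the strongest hub
# and the two linked pages share the authority.
hub, auth = hits({"portal": ["a", "b"], "a": [], "b": []})
```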
V. TOOLS USED IN WEB MINING
Different data mining tools have different features, which is why companies have to consider a number of factors before deciding which tool to install.
Volume of Data: Based on the amount of data that needs to be analyzed, the company has to decide how powerful it
wants its data mining system to be; the more powerful it is, the more expensive it will be.
Amount of Pre-processing: If the data is retrieved from relational databases, it is easier to analyze it for most data
mining systems. However, in other cases, the data will first need to be processed in a manner that the system can
understand and hence, analyze it.
Storage: The manner in which the data is stored needs to be considered; if the data is stored in databases, then the data mining system must be able to work with databases, or else more complex systems will be required to both retrieve and analyze the data from large data streams.
Analysis Complexity: If the analysis is simple, then a simpler and affordable system can be put in place; if the
complexity of the analysis is required to be more, then a system with advanced features will be needed.
Tasks to be performed: This depends on what kinds of operations need to be performed; these include clustering, regression, association, classification, etc.
Scalability: The company's system should be able to handle larger volumes of data if its database needs to be expanded.
Flexibility: Many data mining algorithms can be implemented for the same data mining task, so the data mining system should be able to adapt to various types of analysis.
User-friendliness: Not all users of a data mining system are well-versed in the technicalities, which is why visualization tools help make the presentation of the results more appropriate and comprehensible.
Table 1. Technical Overview of Tools

1. WEKA [11]
Technical overview: released in 1997; licensed under the GNU license; platform-independent; Java-compatible.
General features: open source; variety of data mining & machine learning algorithms; provides three GUIs: Explorer, Experimenter and Knowledge Flow.
Specialization/applications: mining association rules; machine learning techniques.
Advantages: can develop new machine learning schemes; data file formats: binary, CSV, ARFF, C4.5; easy to integrate into other Java packages.
Limitations: poor documentation; poor connectivity to Excel spreadsheets & non-Java databases; weak in classical statistics; automatic parameter optimization for machine learning is unavailable.

2. KEEL [11]
Technical overview: released in 2004; licensed under the GNU license; platform-independent; Java-compatible.
General features: provides various machine learning tools; vast collection of libraries for pre- and post-processing techniques.
Specialization/applications: assessing evolutionary algorithms for data mining problems; suited for machine learning.
Advantages: includes clustering, regression, classification and pattern mining; contains hybrid models and algorithms based on computational intelligence and rule learning.
Limitations: supports a limited number of algorithms compared to other tools.

3. R [11]
Technical overview: released in 1997; licensed under the GNU license; platform-independent.
General features: open source and free; well supported; for analyses and graphical & software development activities.
Specialization/applications: statistical computing; bioinformatics; social science.
Advantages: vast statistical library; easier to optimize machine learning code; better graphics; more transparent; easier import & export of data from spreadsheets.
Limitations: hardly well-oriented to data mining; difficult to learn and progress further.

4. KNIME [11]
Technical overview: released in 2004; licensed under the GNU license; compatible with Windows, Linux & OS X; Java-compatible; works in the Eclipse development environment.
General features: modular data exploration platform; incorporates more than 100 processing nodes for various applications (I/O, pre-processing, modeling, data mining).
Specialization/applications: chemical structures and compounds; data mining; enterprise reporting; business intelligence.
Advantages: integrates all analysis modules of WEKA; allows R scripts to run; easy to install; ability to interface with programs that visualize & analyze molecular data.
Limitations: limited number of error measurement methods; wrapper methods for descriptors are unavailable; automatic parameter optimization for machine learning is unavailable.

5. RapidMiner [11]
Technical overview: released in 2006; AGPL/proprietary license; platform-independent; language-independent.
General features: uses a modular operator concept that allows complex design of nested operator chains; uses XML for representation if needed; supports about 22 file formats; includes learning algorithms from WEKA; reads & writes Excel files.
Specialization/applications: predictive analysis; statistical computing.
Advantages: effective model evaluation using cross-validation & independent validation sets; over 1500 methods for data transformation, analysis, modeling & visualization; offers numerous procedures in the areas of attribute selection & outlier detection.
Limitations: mostly suited for users who are able to work with database files.

6. ORANGE [11]
Technical overview: released in 2009; licensed under the GNU license; compatible with Python, C and C++.
General features: component-based software for data mining and machine learning; data mining is done through visual programming or Python scripting.
Specialization/applications: open-source data visualization; text mining; bioinformatics; data analytics.
Advantages: can be used as a script; works well with an ETL workflow GUI; easiest tool to learn; better debugger; simpler scripting of data categorization problems; GUI is cross-platform.
Limitations: large installation; limited number of machine learning algorithms; weak in classical statistics; provides no widgets for statistical testing; reporting capability is restricted to exporting visual representations of models.

7. Tableau
Technical overview: display of any Unicode character set.
General features: Trend Lines: regression analysis of linear, polynomial, logarithmic and exponential data functions; Forecasting: prediction of time series based on historical data.
Specialization/applications: data visualization.
Advantages: user-friendly drag-and-drop interface; can be easily integrated with R programming; works well with other data mining tools like KNIME.
Limitations: not intended for data mining or predictive analysis.

8. Scrapy
Technical overview: supports Windows, Linux and Mac OS; compatible with Python.
General features: open source and free; non-commercial; written in Python.
Specialization/applications: extraction of structured data.
Advantages: useful for testing web pages; helps in monitoring.
Limitations: more difficult to use than other tools.

9. Web Information Extractor (WIE) [12]
Technical overview: supports Windows 2000/XP/Vista; commercial tool.
General features: extraction of structured and non-structured data; data export formats are Excel (CSV) and text (TXT).
Specialization/applications: web content extraction.
Advantages: monitors the web page constantly to detect any changes; ability to multi-task; supports recursive task definition.
Limitations: loading of a website can be time-consuming; not possible to record the data.

10. Web Data Extractor (WDE)
Technical overview: supports Windows 95/98/2000/XP; commercial tool.
General features: default export file format is Excel (.csv); extraction of URLs, phone numbers, meta tags, e-mail addresses, etc.
Specialization/applications: web content extraction.
Advantages: easy to use and relatively comprehensive; settings can be changed according to user preference.
Limitations: highly automated: extensive training required.

11. Mozenda [13]
Technical overview: platform-independent (note: the Mozenda Agent Builder runs only on Windows); commercial tool.
General features: Web Console section: allows users to run agents and publish results of extracted data; Agent Producer section: Windows application to construct projects associated with data extraction.
Specialization/applications: mining and managing data.
Advantages: easy to use; smart filtering of the user's text; rotating IP prevents user identification.
Limitations: not possible to record the data.

12. Web Content Extractor (WCE) [12]
Technical overview: supports Windows OS; commercial tool.
General features: export formats: MS Excel (CSV), Access, TXT, HTML, XML, SQL, MySQL; known for crawling and web spiders; can collect data from password-protected sites.
Specialization/applications: real estate data; online auctions; job seeking.
Advantages: user-friendly wizard interface; easy to create crawling rules; ability to download data as a multi-subject.
Limitations: not possible to record the data.

13. Screen Scraper [13]
Technical overview: can be integrated with languages like PHP, Java, ASP and .NET; commercial tool.
General features: can search for content from databases [4]; extracted data can be downloaded into a spreadsheet.
Specialization/applications: meta-search engines.
Advantages: easy automation of website tasks (filling a form, opening links, etc.).
Limitations: not possible to record the data.

14. Automation Anywhere [13]
Technical overview: export formats: XML, Excel, TXT, MySQL; commercial tool.
General features: ability to repeat an action for hours, minutes or seconds; ability to specify the rate of the required action; Scheduler: schedules an action at a particular time.
Specialization/applications: web recording and web data extraction.
Advantages: easy to use and fast; it can record the data.
Limitations: highly automated: extensive training required.
Other data mining tools are listed as follows[14,15]:
Table 2. Features of Tools

1. Web Miner (Web Usage Mining): helps in the mining of useful patterns; provides user-specific information.
2. Import.Io (Web Content Mining): relatively better GUI than that of commercial tools; various crawler options are available; structured data can be extracted; integrated data can be accessed easily online.
3. i-Miner (Web Usage Mining): discovers data clusters; uses a fuzzy clustering algorithm and a fuzzy inference system.
4. Speed Tracer (Web Usage Mining): used in the mining of web server logs; reconstructs the user's navigational path for session identification.
5. Web LogMiner (Web Usage Mining): helps in the extraction and presentation of various kinds of reports; supports the extended W3C log format; data is stored in a PostgreSQL database.
6. Koinotites (Web Usage Mining): a personalization tool; used for the construction of user communities on the web.
7. Context Miner (Web Content Mining): free tool for mining online web content; export options: XML and CSV.
8. Irobotsoft (Web Content Mining): a robot performs website-related activities; multiple automatic data extraction from different websites.
9. MiningMart (Web Content Mining): information is processed from relational databases; supports PostgreSQL, MySQL and Oracle.
10. TraMineR (Web Usage Mining): an R package for mining; describes and visualizes sequences of states or events.
Table 1 and Table 2 depict an overview of all the tools in the area of web mining.
VI. CONCLUSION
Web mining is a branch of data mining which deals with mining the heterogeneous and vast data available in that gold mine of information, the World Wide Web. Web mining helps users scrutinize and filter out useful data in an effective manner. This paper incorporates a detailed study of the various tools and techniques involved in mining data on the Web. Future work with respect to web content mining would be personalization of the Web, or predicting user needs effectively through proper content interpretation and selection of appropriate data to satisfy those needs.
REFERENCES
[1] D. Jayalatchumy, Dr. P.Thambidurai, "Web Mining Research Issues and Future Directions A Survey", IOSR Journal of Computer
Engineering (IOSR-JCE),e-ISSN: 2278-0661, p- ISSN: 2278-8727,Volume 14, Issue 3 (Sep. - Oct. 2013), PP 20-27.
[2] B. Singh and H. K. Singh, "Web Data Mining research: A survey," 2010 IEEE International Conference on Computational Intelligence and
Computing Research, Coimbatore, 2010, pp. 1-10, doi: 10.1109/ICCIC.2010.5705856.
[3] Faustina Johnson and Santosh Kumar Gupta, "Web Content Mining Techniques: A Survey", International Journal of Computer Applications (0975 – 8887), Volume 47, Issue 11, June 2012.
[4] R. Malarvizhi and K. Saraswathi, "Web Content Mining Techniques Tools & Algorithms – A Comprehensive Study", International Journal of Computer Trends and Technology (IJCTT), Volume 4, Issue 8, August 2013.
[5] Shipra Saini and Hari Mohan Pandey, "Review on Web Content Mining Techniques", International Journal of Computer Applications (0975 – 8887), Volume 118, No. 18, May 2015.
[6] V. Gupta and G. S. Lehal, "A Survey of Text Mining Techniques and Applications", Journal of Emerging Technologies in Web Intelligence, Vol. 1, 2009, pp. 60-76.
[7] R. Omar, A. O. Md Tap and Z. S. Abdullah, "Web usage mining: A review of recent works," The 5th International Conference on
Information and Communication Technology for The Muslim World (ICT4M), Kuching, 2014, pp. 1-5, doi:
10.1109/ICT4M.2014.7020638.
[8] M. Sangeetha and K. S. Joseph, "Page ranking algorithms used in Web Mining," International Conference on Information Communication
and Embedded Systems (ICICES2014), Chennai, 2014, pp. 1-7, doi: 10.1109/ICICES.2014.7033794.
[9] Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry, "The PageRank Citation Ranking: Bringing Order to the
Web" Technical Report, Stanford Info Lab,1999.
[10] Pooja Devi, Ashlesha Gupta, Ashutosh Dixit, "Comparative Study of HITS and PageRank Link based Ranking Algorithms", International
Journal of Advanced Research in Computer and Communication Engineering Vol. 3, Issue 2, February 2014.
[11] Kalpana Rangra and Dr. K. L. Bansal, "Comparative Study of Data Mining Tools", International Journal of Advanced Research in
Computer Science and Software Engineering, Volume 4, Issue 6, June 2014.
[12] Abdelhakim Herrouz, Chabane Khentout and Mahieddine Djoudi, "Overview of Web Content Mining Tools", International Journal of
Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 11, November 2013.
[13] T. Shanmugapriya and P. Kiruthika, "Survey on Web Content Mining and Its Tools", International Journal of Scientific Engineering and
Research (IJSER), Volume 2, Issue 8, August 2014, ISSN: 2347-3878.
[14] Kamika Chaudhary, Santosh Kumar Gupta, "Web Usage Mining Tools & Techniques: A Survey", International Journal of Scientific
Engineering and Research (IJSER), Volume 4, Issue 6, June 2013.
[15] Prof. Prerak Thakkar, Prof. Gopi Bhatt, Prof. Anirudh Kurtkoti, Prof. Siddharth Shah and Prof. Chinmay Joshi, "A Survey- Web Mining
Tools And Technique", International Journal of Latest Trends in Engineering and Technology, Volume 7, Issue 4, pp.212-217, DOI:
http://dx.doi.org/10.21172/1.74.028, e-ISSN:2278-621X.
... To get useful information based on tourist responses, this study conducting a sentiment analysis, web mining and python programming. The advantages of web mining are to extract information on the website quickly and automatically [3]. Web mining also successful in the amount of voluminous data, so the new knowledge can be discovered to understand the patterns of customer behavior to support decision making [4]. ...
Conference Paper
Full-text available
Tourist attractions are one of the destinations for the community to eliminate fatigue and to function as a self-entertainment. Lots of favorite cities chosen by the community as a tourist destination, one of which is Batu City, East Java. This city has a variety of tourist sites, which can be categorized into two types, namely natural and artificial. Thousands of people come every year to spend their holidays in the city of Batu. For this reason, this study aims to apply web mining and sentiment analysis to determine people's sentiment or perspective on tourist attractions in Batu through the reviews found on the website. The web mining method is used to retrieve information from the travel website. Then, the retrieve information is analyzed by using python programming language. There are eight tourist sites whose reviews were taken, with the total number of reviews are 4887 reviews. The results show that the tourists sites have a polarity value with a range between 0 to 1, which indicates that all the tourists' site gets a positive sentiment. In addition, the resulting subjectivity value is in the range of 0.4 to 0.6, which indicates that all reviews given by visitors are opinions or personal opinions. It implies that each tourist site has sufficiently met the needs of visitors, regarding the facilities and infrastructure that are provided.
Article
Full-text available
Web data processing is the method of handling high volume of data. Previous research explains that handling/processing such data is not easy. Therefore, researchers utilize web mining, deals with identifying patterns, which user require. The second phase of web mining is known as web content mining, which dealt mining of pictures, text and graphs etc. The primary purpose of web content mining is to identify the relevance of content according to the queries. The focus of this paper is to present a detailed and comprehensive review of various methods applied for web mining/web content mining. The paper is divided into three parts to discuss, web content mining, web structure mining and web usage mining. Later, we presented application of these approaches for structured, unstructured, semi-structured and multimedia data mining techniques. The underlined motivation is to explore new possibilities in improving the existing techniques and identifying new ways/methods.
Article
Nowadays, the Web has become one of the most widespread platforms for information exchange and retrieval. As it becomes easier to publish documents, and as the number of users, and thus publishers, increases and the number of documents grows, searching for information is turning into a cumbersome and time-consuming operation. Due to the heterogeneity and unstructured nature of the data available on the WWW, Web mining uses various data mining techniques to discover useful knowledge from Web hyperlinks, page content and usage logs. The main uses of web content mining are to gather, categorize, organize and provide the best possible information available on the Web to the user requesting it. Mining tools are imperative for scanning the many HTML documents, images and texts, and their results are then used by search engines. In this paper, we first introduce the concepts related to web mining; we then present an overview of different Web Content Mining tools. We conclude by presenting a comparative table of these tools based on some pertinent criteria.
Conference Paper
Web Data Mining is an important area of Data Mining which deals with the extraction of interesting knowledge from the World Wide Web. It can be classified into three different types, i.e. web content mining, web structure mining and web usage mining. The aim of this paper is to review past and current developments in each of these three types and to outline key future research directions. The paper also reports comparisons and summaries of various web data mining methods and their applications, giving an overview of developments in the field and some important research issues.
Article
The World Wide Web is expanding; every day a huge amount of data is added to the web, and finding relevant information is becoming a difficult task. Web Mining is the process of analysing and mining the web to find useful information; through it we extract information that is implicitly present in the web. Web Mining is classified into Web Content Mining (WCM), Web Structure Mining (WSM) and Web Usage Mining (WUM), based on the type of data mined. Web Structure Mining analyses the structure of the web by treating it as a graph. WSM can be used to rank pages on the web and thus improve the efficiency of search engines. This paper discusses Web Mining, its types, and the various ranking algorithms used in Web Structure Mining.
Article
Web mining is the application of data mining to web data, and web usage mining is an important component of it. The goal of web usage mining is to understand the behavior of web site users through data mining of web access data. Knowledge obtained from web usage mining can be used to enhance web design, introduce personalization services and facilitate more effective browsing. This paper presents a review of the latest literature in this field. Our objective is to provide an overview of the concepts relevant to the pattern mining phase of the web usage mining process. We review pattern discovery algorithms that utilize association rules, classification and sequential patterns; since sequential pattern mining is gaining much interest from the WUM research community, extra emphasis is given to related papers.
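The pattern discovery phase described in this abstract can be sketched minimally by counting the support of page pairs across user sessions. The session data below is a hypothetical toy; real web usage mining starts from parsed server access logs:

```python
# Toy pattern-discovery sketch for web usage mining:
# count page pairs co-occurring in visitor sessions.
from itertools import combinations
from collections import Counter

sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "blog"},
]

pair_counts = Counter()
for s in sessions:
    for pair in combinations(sorted(s), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of sessions containing both pages;
# pairs with high support feed association-rule generation.
support = {p: c / len(sessions) for p, c in pair_counts.items()}
```

Here ("home", "products") appears in two of the three sessions, so its support is 2/3; Apriori-style algorithms extend this counting to larger itemsets and sequential patterns.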
Article
The quest for knowledge has led to new discoveries and inventions, and with the emergence of the World Wide Web, the web became a hub for them. Web browsers became the tool that puts information at our fingertips. As years passed, the World Wide Web became overloaded with information, and it became hard to retrieve data according to one's needs. Web mining came as a rescue for this problem. Web content mining is a subdivision of web mining. This paper studies the different techniques and patterns of content mining and the areas that have been influenced by it. The web contains structured, unstructured, semi-structured and multimedia data, and this survey focuses on how to apply content mining to each. It also points out how web content mining can be utilized in web usage mining.
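As a small illustration of content mining on semi-structured data, visible text can be pulled out of an HTML page using only the Python standard library. The page snippet below is a hypothetical example, not data from any of the cited works:

```python
# Minimal web content mining sketch: extract visible text from HTML
# with the standard-library parser (illustrative, not production-grade).
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Collect non-whitespace text nodes; tags themselves are skipped.
        if data.strip():
            self.chunks.append(data.strip())

page = "<html><body><h1>Batu City</h1><p>Top tourist spots.</p></body></html>"
ex = TextExtractor()
ex.feed(page)
print(" ".join(ex.chunks))  # → Batu City Top tourist spots.
```

The extracted text is what content mining techniques (classification, clustering, relevance ranking) then operate on.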
Article
The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer, show how to efficiently compute PageRank for large numbers of pages, and show how to apply PageRank to search and to user navigation.
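The random-surfer model behind PageRank can be sketched as a short power-iteration loop over a toy link graph. The three-page graph below is a hypothetical illustration, not data from the paper:

```python
# Minimal PageRank via power iteration on a toy link graph.

def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Each page keeps a (1 - d) teleportation share, then receives
        # a d-weighted share of rank from every page linking to it.
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs)
            for q in outs:
                new[q] += d * share
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# C accumulates the most rank, since both A and B link to it.
```

The iteration converges to the stationary distribution of the random surfer; pages with many (and highly ranked) in-links end up with higher scores.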
R. Malarvizhi and K. Saraswathi, "Web Content Mining Techniques Tools & Algorithms - A Comprehensive Study", International Journal of Computer Trends and Technology (IJCTT), Vol. 4, Issue 8, August 2013.
V. Gupta and G. S. Lehal, "A Survey of Text Mining Techniques and Applications", Journal of Emerging Technologies in Web Intelligence, Vol. 1, pp. 60-76, 2009.