Conference Paper

An Evolutionary Approach to Automatic Web Page Categorization and Updating

Authors: Loia and Luengo

Abstract

Catalogues play an important role in most of the current Web search engines. The catalogues, which organize documents into hierarchical collections, are maintained manually, increasing difficulty and costs due to the incessant growth of the WWW. This problem has stimulated many researchers to work on automatic categorization of Web documents. In reality, most of these approaches work well either on special types of documents or on restricted sets of documents. This paper presents an evolutionary approach useful to construct the catalogue automatically as well as to perform the classification of a Web document. This functionality relies on a genetic-based fuzzy clustering methodology that applies the clustering to the context of the document, as opposed to content-based clustering that works on the complete document information.
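To make the abstract's central idea concrete, the sketch below illustrates one way a genetic algorithm can drive fuzzy clustering over context features. It is an illustrative reconstruction under stated assumptions, not the authors' implementation: the context feature vectors, the fuzzy c-means objective used as fitness, and all GA parameters are invented for the example.

```python
# Illustrative sketch only: a genetic algorithm tuning fuzzy c-means centroids
# over bag-of-words "context" vectors (e.g. anchor text around links to a page).
# Feature extraction, fitness and GA settings are assumptions, not the paper's.
import numpy as np

def fuzzy_memberships(X, centroids, m=2.0, eps=1e-9):
    """Standard fuzzy c-means membership matrix U (n_docs x n_clusters)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + eps
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fcm_objective(X, centroids, m=2.0):
    """FCM cost: sum of membership^m times squared distance (lower is better)."""
    U = fuzzy_memberships(X, centroids, m)
    d2 = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) ** 2
    return float(((U ** m) * d2).sum())

def ga_fuzzy_clustering(X, n_clusters=3, pop_size=20, generations=50,
                        mutation_scale=0.1, rng=np.random.default_rng(0)):
    n, _ = X.shape
    # Each chromosome is a full set of cluster centroids, seeded from documents.
    pop = np.stack([X[rng.choice(n, n_clusters, replace=False)]
                    for _ in range(pop_size)])
    for _ in range(generations):
        fitness = np.array([fcm_objective(X, c) for c in pop])
        order = np.argsort(fitness)          # lower cost = fitter
        parents = pop[order[: pop_size // 2]]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            child = (a + b) / 2.0                                   # blend crossover
            child += rng.normal(0.0, mutation_scale, child.shape)   # mutation
            children.append(child)
        pop = np.concatenate([parents, np.stack(children)])
    best = pop[np.argmin([fcm_objective(X, c) for c in pop])]
    return best, fuzzy_memberships(X, best)

# Usage: rows of X would be TF-IDF vectors of the *context* of each document.
X = np.random.default_rng(1).random((12, 8))
centroids, U = ga_fuzzy_clustering(X)
print(U.argmax(axis=1))   # hard assignment of each document to a catalogue node
```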

... The literature also highlights the use of several meta-heuristic approaches for solving such outlier detection problems, specifically in the Web 3.0 domain, including web content [5,6] and social media [3,7]. Research also highlights several page categorization and updating approaches based on evolutionary techniques [42]. ...
Article
Full-text available
With increased digital usage, web visibility has become critically essential for organizations when catering to a larger audience. This visibility on the web is directly related to web searches on search engines, which are often governed by search engine optimization techniques like link building and link farming, among others. The current study identifies metrics for segregating websites for the purpose of link building for search engine optimization, as it is important to invest resources in the right website sources. These metrics are further used for detecting website outliers for effective optimization and subsequent search engine marketing. Two case studies of knowledge management portals from different domains, with 1682 and 1070 websites respectively, are used to validate the proposed approach. The study applies evolutionary intelligence, proposing a k-means chaotic firefly algorithm coupled with k-nearest neighbor outlier detection to solve the problem. Factors like Page Rank, Page Authority, Domain Authority, Alexa Rank, Social Shares, Google Index and Domain Age emerge as significant in the process. Further, the proposed chaotic firefly variants are compared to the k-means integrated firefly algorithm, bat algorithm and cuckoo search algorithm for accuracy and convergence, showing comparable accuracy. Findings indicate that convergence speeds are higher for the proposed chaotic firefly approach when tuning absorption and attractiveness coefficients, resulting in a faster search for optimal cluster centroids. The proposed approach contributes both theoretically and methodologically to the domain of vendor selection, identifying genuine websites and avoiding investment in untrustworthy ones.
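As an illustration of the k-nearest-neighbor outlier-detection component mentioned in the abstract, the sketch below scores websites by their mean distance to the k nearest neighbours in a feature space of SEO metrics. The metric values are synthetic, and the k-means and chaotic-firefly stages of the study's actual pipeline are omitted; this is a generic sketch, not the authors' algorithm.

```python
# Illustrative sketch only: scoring candidate link-building websites as outliers
# by their mean distance to the k nearest neighbours in a metric space built
# from features such as Page Rank, Domain Authority, Alexa Rank and Domain Age.
# The feature values below are made up for demonstration; the study itself
# couples this with k-means clustering and a chaotic firefly optimizer.
import numpy as np

def knn_outlier_scores(X, k=5):
    """Mean distance of each row to its k nearest neighbours (larger = more anomalous)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # ignore self-distance
    nearest = np.sort(d, axis=1)[:, :k]
    return nearest.mean(axis=1)

# Rows: websites, columns: normalized SEO metrics (hypothetical values).
rng = np.random.default_rng(0)
sites = rng.normal(0.5, 0.1, size=(100, 5))     # 100 "genuine" sites
sites[:3] += 1.5                                # 3 injected outliers
scores = knn_outlier_scores(sites, k=5)
suspicious = np.argsort(scores)[::-1][:3]
print("most outlying site indices:", suspicious)
```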
... Boughanem et al. [8] developed a query reformulation technique using GA, in which a GA generates several queries that explore different areas of the document space and determines the optimal one. Automatic Web page categorization and updating can also be performed using GAs [59]. Evolutionary computing techniques can also be applied to develop personalized Web services, by automatically selecting optimal combinations of Web service components from available component repositories, as in [12]. ...
Article
Full-text available
Computational Intelligence (CI) paradigms reveal themselves to be potential tools for facing uncertainty on the Web. In particular, CI techniques may be properly exploited to handle Web usage data and develop Web-based applications tailored to users' preferences. The main rationale behind this success is the synergy resulting from CI components such as fuzzy logic, neural networks and genetic algorithms. In fact, rather than being competitive, each of these computing paradigms provides complementary reasoning and searching methods that allow the use of domain knowledge and empirical data to solve complex problems. This paper focuses on the major Computational Intelligence combinations applied in the context of Web personalization, by providing different examples of intelligent systems which have been designed to provide Web users with the information they seek, without expecting them to ask for it explicitly. In particular, this paper emphasizes the suitability of hybrid schemes deriving from the profitable combination of different CI methodologies for the development of effective Web personalization systems.
... Then genetic operators and relevance judgements are applied to these descriptions in order to determine the best one in terms of classification performance in response to a specific query. Automatic web page categorization and updating can also be performed using GAs [83]. ...
Article
Full-text available
The paper summarizes the different characteristics of Web data, the basic components of Web mining and its different types, and the current state of the art. The reason for considering Web mining a separate field from data mining is explained. The limitations of some of the existing Web mining methods and tools are enunciated, and the significance of soft computing (comprising fuzzy logic (FL), artificial neural networks (ANNs), genetic algorithms (GAs), and rough sets (RSs)) is highlighted. A survey of the existing literature on "soft Web mining" is provided along with the commercially available systems. The prospective areas of Web mining where the application of soft computing needs immediate attention are outlined with justification. Scope for future research in developing "soft Web mining" systems is explained. An extensive bibliography is also provided.
... Catalogues play an important role in most of the current web search engines. In [36], Loia and Luengo present an evolutionary approach useful to automatically construct a catalogue as well as to perform the classification of web documents. The proposal faces the two fundamental problems of web clustering: the high dimensionality of the feature space and the knowledge of the entire document. ...
Article
In this contribution, different proposals found in the specialized literature for applying evolutionary computation to the field of information retrieval are reviewed. To do so, the different kinds of IR problems that have been solved by evolutionary algorithms are analyzed. Some of the existing approaches are described for each of these problems, and the results obtained are critically evaluated in order to give the reader a clear view of the topic.
... The authors of [10] developed a query reformulation technique using a GA, in which the GA generates several queries that explore different areas of the document space and determines the optimal one for approximate matching in documents. Automatic web page categorization and updating can also be performed using GAs [11]. The work in [12] focuses on the problem of index page synthesis, where an index page is a page consisting of a set of links that cover a particular topic. ...
Article
With the exponential growth of information on the World Wide Web, there is great demand for efficient methods to effectively organize the large amount of retrieved information. One of the most influential techniques has been HITS (Hyperlink Induced Topic Search), which computes hub and authority scores for web pages related to a search query. Since the HITS algorithm is not well suited to mining informative structures, we bring forward a new CGHITS Web search algorithm based on hyperlinks and a content relevance strategy, and analyze the cause of topic drift in the HITS algorithm. The CGHITS algorithm evolves a population of web pages to maximize relevance using clone, crossover and selection operators. Experiments are carried out to show that the CGHITS algorithm was able to achieve the optimal solution in most cases.
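For context, the hub/authority iteration that HITS (and hence CGHITS) builds on can be sketched in a few lines. The toy link graph below is invented, and the evolutionary clone/crossover/selection layer of CGHITS is not shown.

```python
# Illustrative sketch only: the basic HITS hub/authority iteration that CGHITS
# extends with content relevance and clone/crossover/selection operators.
# The toy link graph below is invented for demonstration.
import numpy as np

def hits(adjacency, iterations=50):
    """adjacency[i, j] = 1 if page i links to page j."""
    n = adjacency.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(iterations):
        auths = adjacency.T @ hubs          # authority: pointed to by good hubs
        hubs = adjacency @ auths            # hub: points to good authorities
        auths /= np.linalg.norm(auths)      # normalize to prevent overflow
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
hubs, auths = hits(A)
print("hub scores:", hubs.round(3))
print("authority scores:", auths.round(3))
```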
... Many other methods have been tried for web page categorization that include link and context analysis [3], analysis of the user's mouse click behavior [13], genetic-based fuzzy clustering method [9], approaches based on example based learning using SVM [14] and summarization-based classification [11]. ...
Conference Paper
Random surfers spend very little time on a web page. If the most important web page content fails to attract their attention within that short time span, they will move away to some other page, thus defeating the purpose of the web page designer. In order to predict whether the contents of a web page will catch a random surfer's attention, we propose a machine learning based approach to classify web pages into “bad” and “not bad” classes, where the “bad” class implies poor attention drawing ability. We propose to divide web page contents into “objects”, which are coherent regions of a web page conveying the same information, to develop the classifier. We surveyed 100 web pages sampled from the Internet to identify the type and frequency of objects used in web page design. From our survey, we identified six types of objects that are most important in determining the class of a web page in terms of its attention drawing capability. We used the WEKA tool to implement the machine learning approach. Two different strategies of percentage split and three different strategies of cross validation are used to check the accuracy of the classifier. We experimented with 65 algorithms supported by WEKA and found that, among the 65, the RBF network and Random subspace algorithms give the best performance, with about 83% accuracy.
... The architecture of the Yahoo search engine is a good example of such technology, where the database of the system consists of directories of web documents classified according to subject. In [25], [26] a genetic-based fuzzy clustering algorithm is used to categorize (i.e. to form a catalogue of) the web documents. The clustering method was based on the analysis of the context of the documents rather than their content. ...
Article
The Internet and World Wide Web are becoming more and more dynamic in terms of their content and use. Information retrieval (IR) efforts aim to keep up with this dynamic environment by designing intelligent systems which can deliver Web content in real time to various wired or wireless devices. Evolutionary and adaptive systems (EASs) are emerging as typical examples of such systems. This paper contains one of the first attempts to gather and evaluate the nature of current research on Web-based IR using EAS and proposes future research directions in parallel to developments on the Web environments.
... The genetic operators and the relevance judgments are applied to these descriptions in order to determine the one with the best classification performance in response to a specific query. In [24], a GA is used for automatic web page categorization and updating. ...
Article
In order to share distributed resources in a campus network and save the associated costs, this paper presents campus grid data integration middleware as an extension of campus services. It describes the key middleware technologies and architecture for campus grid data integration, discusses the role of the grid middleware architecture within the campus, and explains how grid services interact with other components to complete application requests.
Article
Catalogues play an important role in most of the current web search engines. The catalogues, which organize documents into hierarchical collections, are maintained manually, increasing difficulty and costs due to the incessant growth of the WWW. This problem has stimulated many researchers to work on automatic categorization of web documents. In reality, most of these approaches work well either on special types of documents or on restricted sets of documents. This paper presents an evolutionary approach useful to construct the catalogue automatically as well as to perform the classification of a Web document. This functionality relies on a genetic-based fuzzy clustering methodology that applies the clustering to the context of the document, as opposed to content-based clustering that works on the complete document information.
Article
World-wide-web applications have grown very rapidly and have made a significant impact on computer systems. Among them, web browsing for useful information may be the most commonly seen. Due to its tremendous amount of use, efficient and effective web retrieval has thus become a very important research topic in this field. In the past, we proposed a web-mining algorithm for extracting interesting browsing patterns from log data in web servers. It integrated fuzzy-set concepts and a data mining approach to achieve this purpose. In that algorithm, each web page used only the linguistic term with the maximum cardinality in the mining process. The number of items was thus the same as that of the original web page, which reduced the processing time. The fuzzy browsing patterns derived in this way are, however, not complete, meaning some possible patterns may be missed. This paper thus modifies it and proposes a new fuzzy web-mining algorithm for extracting all possible fuzzy interesting knowledge from log data in web servers. The proposed algorithm can derive a more complete set of browsing patterns, but with more computation time than the previous method. A trade-off thus exists between the computation time and the completeness of the browsing patterns. Choosing an appropriate mining method therefore depends on the requirements of the application domain.
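A minimal sketch of the fuzzification step the abstract refers to is given below, assuming triangular membership functions and invented linguistic terms and breakpoints for browsing duration. The two algorithm variants differ only in whether the single maximum-membership term or all non-zero terms are retained per page before rule mining.

```python
# Illustrative sketch only: mapping a page's browsing duration (seconds) onto
# fuzzy linguistic terms with triangular membership functions. Term names and
# breakpoints are assumptions; the cited algorithms differ in whether only the
# maximum-membership term or every non-zero term is carried into rule mining.
def triangular(x, a, b, c):
    """Triangular membership with peak at b and support [a, c]."""
    if b == c:                      # right-shoulder term
        return 0.0 if x <= a else min(1.0, (x - a) / (b - a))
    if a == b:                      # left-shoulder term
        return 0.0 if x >= c else min(1.0, (c - x) / (c - b))
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

TERMS = {                      # hypothetical linguistic terms for duration
    "short":  (0, 0, 40),
    "middle": (20, 60, 100),
    "long":   (80, 140, 140),
}

def fuzzify(duration, keep_all=True):
    degrees = {t: triangular(duration, *abc) for t, abc in TERMS.items()}
    degrees = {t: d for t, d in degrees.items() if d > 0}
    if keep_all:                                # newer, more complete variant
        return degrees
    best = max(degrees, key=degrees.get)        # earlier, max-cardinality variant
    return {best: degrees[best]}

print(fuzzify(35))                  # e.g. partly "short", partly "middle"
print(fuzzify(35, keep_all=False))  # only the dominant term
```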
Article
Due to the popularity of knowledge discovery and data mining, in practice as well as among academic and corporate professionals, association rule mining is receiving increasing attention. The technology of data mining is applied in analyzing data in databases. This paper puts forward a new method suited to the design of distributed databases.
Article
With the rapid increase of information on the Web, people waste a lot of time searching for the information they need, so the search engine has become an indispensable tool for Internet surfers. This paper brings forward a new Web search algorithm based on hyperlinks and a content relevance strategy. The experimental results show that the proposed algorithm focuses on mining the potentially semantic relationships between hyperlinks and performs quite well in topic-specific crawling.
Article
With the explosive growth of information sources available on the World Wide Web, how to combine the results of multiple search engines has become an important problem. In this paper, a search strategy based on genetic simulated annealing for search engines in Web mining is proposed. According to the proposed strategy, there exist important relationships among Web statistical studies, search engines and optimization techniques. We have experimentally proven the relevance of our approach to the presented queries by comparing the quality of the output pages with that of the originally downloaded pages; as the number of iterations increases, better results are obtained within reasonable execution time.
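The sketch below shows a plain simulated-annealing loop for selecting a merged result set that balances relevance against engine redundancy. The scoring function, penalty and cooling schedule are assumptions, and the genetic component of the cited strategy is omitted entirely.

```python
# Illustrative sketch only: a plain simulated-annealing loop that picks a result
# subset merged from several engines, trading off relevance against redundancy
# (too many pages from the same engine). Scores, penalty and cooling schedule
# are assumptions; the cited strategy additionally applies genetic operators.
import math
import random

random.seed(0)
pages = [{"relevance": random.random(), "engine": random.randrange(4)}
         for _ in range(50)]                       # hypothetical merged results

def value(selection, penalty=0.15):
    rel = sum(pages[i]["relevance"] for i in selection)
    engines = [pages[i]["engine"] for i in selection]
    redundancy = sum(engines.count(e) - 1 for e in set(engines))
    return rel - penalty * redundancy

def simulated_annealing(subset_size=10, steps=2000, t=1.0, cooling=0.995):
    current = random.sample(range(len(pages)), subset_size)
    best = list(current)
    for _ in range(steps):
        neighbour = list(current)
        swap_out = random.randrange(subset_size)   # replace one selected page
        outside = [i for i in range(len(pages)) if i not in current]
        neighbour[swap_out] = random.choice(outside)
        delta = value(neighbour) - value(current)
        if delta > 0 or random.random() < math.exp(delta / t):
            current = neighbour                    # accept better, or worse with prob.
        if value(current) > value(best):
            best = list(current)
        t *= cooling
    return best

print(sorted(simulated_annealing(), key=lambda i: -pages[i]["relevance"]))
```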
Article
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. Financial management is an important component and the core of enterprise management, and it plays a very important role in improving business management and enhancing the economic efficiency of enterprises. This paper proposes an improved data mining method to enhance the capability of exploring valuable information from financial statements. Experimental results indicate that the proposed method significantly improves performance.
Conference Paper
We propose a generic multiple classifier system based solely on pairwise classifiers to classify web pages. Web page classification is getting huge attention now because of its use in enhancing the accuracy of search engines and in summarizing web content for small-screen handheld devices. We have used a Support Vector Machine (SVM) as our core pairwise classifier. The proposed system has produced very encouraging results on the problem of web page classification. The proposed solution is totally generic and should be applicable in solving a wide range of multiple-class pattern recognition problems.
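A generic one-vs-one (pairwise) SVM ensemble with majority voting, the general scheme the abstract describes, can be sketched as follows with scikit-learn. The synthetic features stand in for real web-page representations such as TF-IDF vectors; this is not the authors' system.

```python
# Illustrative sketch only: a one-vs-one (pairwise) SVM ensemble with majority
# voting, the general scheme the cited system builds on. Synthetic features
# stand in for real web-page representations such as TF-IDF vectors.
from itertools import combinations
from collections import Counter

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Train one binary SVM per pair of classes.
pairwise = {}
for a, b in combinations(np.unique(y), 2):
    mask = (y == a) | (y == b)
    clf = SVC(kernel="rbf", gamma="scale")
    clf.fit(X[mask], y[mask])
    pairwise[(a, b)] = clf

def predict(x):
    """Each pairwise classifier casts one vote; the majority class wins."""
    votes = Counter(int(clf.predict(x.reshape(1, -1))[0]) for clf in pairwise.values())
    return votes.most_common(1)[0][0]

preds = np.array([predict(x) for x in X])
print("training accuracy of the pairwise ensemble:", (preds == y).mean())
```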
Article
Full-text available
Doctoral thesis, Universidad de Granada, Departamento de Ciencias de la Computación e Inteligencia Artificial. Defended on 29 April 2005.
Article
Full-text available
The present article provides a survey of the available literature on data mining using soft computing. A categorization has been provided based on the different soft computing tools and their hybridizations used, the data mining function implemented, and the preference criterion selected by the model. The utility of the different soft computing methodologies is highlighted. Generally fuzzy sets are suitable for handling the issues related to understandability of patterns, incomplete/noisy data, mixed media information and human interaction, and can provide approximate solutions faster. Neural networks are nonparametric, robust, and exhibit good learning and generalization capabilities in data-rich environments. Genetic algorithms provide efficient search algorithms to select a model, from mixed media data, based on some preference criterion/objective function. Rough sets are suitable for handling different types of uncertainty in data. Some challenges to data mining and the application of soft computing methodologies are indicated. An extensive bibliography is also included.
Conference Paper
Full-text available
Information retrieval systems are being challenged to manage larger and larger document collections. In an effort to provide better retrieval performance on large collections, more sophisticated retrieval techniques have been developed that support rich, ...
Article
Full-text available
Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organise documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult due to the sheer amount of material on the Web, and it will therefore soon be necessary to resort to techniques for automatic classification of documents. Classification is traditionally performed by extracting information for indexing a document from the document itself. The paper describes the technique of categorisation by context, which exploits the context perceivable from the structure of HTML documents to extract useful information for classifying the documents they refer to. We present the results of experiments with a preliminary implementation of the technique.
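The kind of context the technique exploits can be harvested with a few lines of standard-library HTML parsing, as sketched below. The specific heuristic of pairing the anchor text with the nearest preceding heading is an assumption made for illustration, not the paper's exact procedure.

```python
# Illustrative sketch only: harvesting "context" for categorisation by context,
# i.e. the anchor text (plus, as a simple assumption, the nearest preceding
# heading) that a referring HTML page attaches to each outgoing link.
from html.parser import HTMLParser

class LinkContextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current_heading = ""
        self.links = []          # (url, anchor text, nearest heading)
        self._in_heading = False
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._in_heading = True
            self.current_heading = ""
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        if self._in_heading:
            self.current_heading += data
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self._in_heading = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor).strip(),
                               self.current_heading.strip()))
            self._href = None

page = """<h2>Evolutionary computation</h2>
          <p>See the <a href="http://example.org/ga.html">genetic algorithms tutorial</a>.</p>"""
parser = LinkContextParser()
parser.feed(page)
print(parser.links)  # context usable as features for classifying the linked page
```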
Article
Full-text available
The recent explosion of on-line information in Digital Libraries and on the World Wide Web has given rise to a number of query-based search engines and manually constructed topical hierarchies. However, these tools are quickly becoming inadequate as query results grow incomprehensibly large and manual classification in topic hierarchies creates an immense bottleneck. We address these problems with a system for topical information space navigation that combines the query-based and taxonomic systems. We employ machine learning techniques to create dynamic document categorizations based on the full text of articles that are retrieved in response to users' queries. Our system, named SONIA (Service for Organizing Networked Information Autonomously), has been implemented as part of the Stanford Digital Libraries Testbed. It employs a combination of technologies that takes the results of queries to networked information sources and, in real time, automatically retrieves, parses and organizes the...
Article
Full-text available
Domain-specific search engines are becoming increasingly popular because they offer increased accuracy and extra features not possible with general, Web-wide search engines. Unfortunately, they are also difficult and time-consuming to maintain. This paper proposes the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. We describe new research in reinforcement learning, text classification and information extraction that enables efficient spidering, populates topic hierarchies, and identifies informative text segments. Using these techniques, we have built a demonstration system: a search engine for computer science research papers available at www.cora.justresearch.com. As the amount of information on the World Wide Web grows, it becomes increasingly difficult to find just what we want. While general-purpose search engines such as AltaVista and HotBot offer high coverage, they often provi...
Article
Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related Web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of document structures to define a distance or similarity among these documents, or use probabilistic techniques such as Bayesian classification. Many of these traditional algorithms, however, falter when the dimensionality of the feature space becomes high relative to the size of the document space. In this paper, we introduce two new clustering algorithms that can effectively cluster documents, even in the presence of a very high dimensional feature space. These clustering techniques, which are based on generalizations of graph partitioning, do not require pre-specified ad hoc distance functions, and are capable of automatically discovering document similarities or associations. We conduct several experiments on real Web data using various feature selection heuristics, and compare our clustering schemes to standard distance-based techniques, such as hierarchical agglomeration clustering, and Bayesian classification methods, such as AutoClass.
Conference Paper
The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of the research described here is to automatically create a computer understandable world wide knowledge base whose content mirrors that of the World Wide Web. Such a knowledge base would enable much more effective retrieval of Web information, and promote new uses of the Web to support knowledge-based inference and problem solving. Our approach is to develop a trainable information extraction system that takes two inputs: an ontology defining the classes and relations of interest, and a set of training data consisting of labeled regions of hypertext representing instances of these classes and relations. Given these inputs, the system learns to extract information from other pages and hyperlinks on the Web. This paper describes our general approach, several machine learning algorithms for this task, and promising initial results with a prototype system.
Conference Paper
The Construe news story categorization system assigns indexing terms to news stories according to their content using knowledge-based techniques. An initial deployment of Construe in Reuters Ltd. topic identification system (TIS) has replaced human indexing for Reuters Country Reports, an online information service based on news stories indexed by country and type of news. TIS indexing is comparable to human indexing in overall accuracy but costs much less, is more consistent, and is available much more rapidly. TIS can be justified in terms of cost savings alone, but Reuters also expects the speed and consistency of TIS to provide significant competitive advantage and, hence, an increased market share for Country Reports and other products from Reuters Historical Information Products Division.
Article
The dozens of existing search tools and the keyword-based search model have become the main issues in accessing the ever-growing WWW. Various ranking algorithms, which are used to evaluate the relevance of documents to the query, have turned out to be impractical. This is because the information given by the user is too little to give a good estimation. In this paper, we propose a new idea of searching under a multi-engine search architecture to overcome these problems. The proposal includes clustering of the search results and extraction of co-occurrence keywords which, with the user's feedback, better refine the query during the search process. Besides, our system also provides the construction of a concept space to gradually customize the search tool to fit the user's usage.
Article
We describe the design, prototyping and evaluation of ARC, a system for automatically compiling a list of authoritative web resources on any (sufficiently broad) topic. The goal of ARC is to compile resource lists similar to those provided by Yahoo! or Infoseek. The fundamental difference is that these services construct lists either manually or through a combination of human and automated effort, while ARC operates fully automatically. We describe the evaluation of ARC, Yahoo!, and Infoseek resource lists by a panel of human users. This evaluation suggests that the resources found by ARC frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic. We also provide examples of ARC resource lists for the reader to examine. Keywords: search, taxonomies, link analysis, anchor text, information retrieval.
Article
The World Wide Web has rapidly become a key medium for information dissemination to all members of society. However, its disorganized nature and sheer size can make it difficult for people to find information. Web search services have made a significant contribution towards enabling people to quickly find information on the Web. Unfortunately, as of this writing, no Web search service can conduct a comprehensive search of the Web for any topic. In addition, many major Web search services are unable to return a stable set of results. An intuitive assumption about the behavior of any search service is that the results of a given query will be unchanged unless either the documents referred to in the results change and become irrelevant or better documents become available. Unfortunately, due to a variety of real-world constr...
Article
Users of Web search engines are often forced to sift through the long ordered list of document "snippets" returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on the major search engines. The paper articulates the unique requirements of Web document clustering and reports on the first evaluation of clustering methods in this domain. A key requirement is that the methods create their clusters based on the short snippets returned by Web search engines. Surprisingly, we find that clusters based on snippets are almost as good as clusters created using the full text of Web documents. To satisfy the stringent requirements of the Web domain, we introduce an incremental, linear time (in the document collection size) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents. We show that STC is faster than standard clusteri...
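A heavily simplified stand-in for Suffix Tree Clustering is sketched below: it groups snippets by shared word n-grams and merges strongly overlapping groups. Unlike real STC it does not build a suffix tree and is not linear-time; it only illustrates the phrase-sharing idea, and the thresholds are arbitrary.

```python
# Illustrative sketch only: a heavily simplified, *non*-linear-time stand-in for
# Suffix Tree Clustering. It groups snippets that share word phrases (n-grams)
# and merges groups with large overlap; real STC builds a suffix tree instead.
from collections import defaultdict
from itertools import combinations

def phrases(text, n=2):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def phrase_clusters(snippets, n=2, min_docs=2, merge_overlap=0.5):
    by_phrase = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        for p in phrases(snippet, n):
            by_phrase[p].add(doc_id)
    # Base clusters: every phrase shared by at least `min_docs` snippets.
    clusters = [docs for docs in by_phrase.values() if len(docs) >= min_docs]
    # Greedily merge base clusters whose document sets overlap strongly.
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            a, b = clusters[i], clusters[j]
            if len(a & b) / min(len(a), len(b)) > merge_overlap:
                clusters[i] = a | b
                del clusters[j]
                merged = True
                break
    return clusters

snippets = [
    "genetic algorithms for web page clustering",
    "fuzzy clustering of web page snippets",
    "suffix tree clustering of search results",
    "search results organised by suffix tree clustering",
]
print(phrase_clusters(snippets))   # overlapping clusters of snippet indices
```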
Article
The degree to which information sources are pre-processed by Web-based information systems varies greatly. In search engines like Altavista, little pre-processing is done, while in "knowledge integration" systems, complex site-specific "wrappers" are used to integrate different information sources into a common database representation. In this paper we describe an intermediate between these two models. In our system, information sources are converted into a highly structured collection of small fragments of text. Database-like queries to this structured collection of text fragments are approximated using a novel logic called WHIRL, which combines inference in the style of deductive databases with ranked retrieval methods from information retrieval. WHIRL allows queries that integrate information from multiple Web sites, without requiring the extraction and normalization of object identifiers that can be used as keys; instead, operations that in conventional databases require equality tests...
Article
The output of major WWW search engines was analyzed and the results led to some surprising observations about their stability. Twenty-five queries were issued repeatedly to the engines and the results were compared. After one month, the top ten results returned by eight out of nine engines had changed by more than fifty percent. Furthermore, five out of the nine engines returned over a third of their URLs intermittently during the month.
Mase, H., Tsuji, H., Kinukawa, H., Hosoya, Y., Koutani, K., and Kiyota, K. (1996). Experimental simulation for automatic patent categorization. Advances in Production Management Systems, 377-382.
Lawrence, S. and Giles, C. L. (1999). Nature, 400:107-109. Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99).
Chakrabarti, S., Dom, B., Gibson, D., Kleinberg, J., Rahavan, P., and Rajagopalan, S. (1998). Automatic resource list compilation by analyzing hyperlink structure and associated text. Seventh International World Wide Web Conference.
Boley, D., Gini, M., Gross, R., Hang, E-H., Hasting, K., Karypis, G., Kumar, V., and Mobasher, B. Partitioning-based clustering for Web document categorization. Decision Support Systems.