Conference Paper

A dynamic genetic algorithm for clustering Web pages


Abstract

Although the hybrid clustering algorithm (HCA) is very effective at clustering Web pages, it relies on the auto k value calculation (AKVC) method to determine the number of clusters in advance, and its clustering result depends on that number. This paper designs a dynamic genetic algorithm (DGA) by improving the AKVC method and the HCA's population, genetic operators, and fitness function. Experiments show that DGA obtains a more accurate number of clusters than AKVC and more accurate clusters of Web pages than HCA.
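A minimal sketch (not the authors' implementation) of the core idea, assuming chromosomes encode variable-length sets of medoid indices so that the number of clusters k can evolve rather than being fixed in advance; the fitness penalty and all parameter values are illustrative assumptions:

```python
import random
import numpy as np

def fitness(X, medoids, penalty=1.0):
    """Negative (total distance to nearest medoid + penalty per cluster).

    The penalty term is an assumption; it keeps the GA from trivially
    growing k, standing in for the paper's actual fitness function."""
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return -(d.min(axis=1).sum() + penalty * len(medoids))

def mutate(medoids, n_points):
    """Swap, add, or drop a medoid, so chromosome length (k) is dynamic."""
    m = list(medoids)
    op = random.choice(["swap", "add", "remove"])
    if op == "swap":
        m[random.randrange(len(m))] = random.randrange(n_points)
    elif op == "add":
        m.append(random.randrange(n_points))
    elif op == "remove" and len(m) > 2:
        m.pop(random.randrange(len(m)))
    return sorted(set(m))

def dynamic_ga(X, pop_size=20, generations=50):
    n = len(X)
    pop = [sorted(random.sample(range(n), random.randint(2, 6)))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(X, m), reverse=True)
        elite = pop[:pop_size // 2]
        pop = elite + [mutate(random.choice(elite), n)
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=lambda m: fitness(X, m))
```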


... It was recorded that visualizing 400 documents took 2 min, while 100 documents took 30 sec. Zhengyu et al. (2010) enhanced their own work on web page document clustering presented in Zhu et al. (2007). A Dynamic Genetic Algorithm (DGA) was designed and then implemented in the Delphi language to overcome the shortcomings of their previous Hybrid Clustering Algorithm (HCA). ...
... b(i) = the distance of object i to the next neighboring cluster; c = the number of clusters (encoded by the chromosome); A_i = all the clusters of the population; d(x, y) = cosine similarity; V_i (i = 1, 2, ..., c) are the cluster centers. Objective: maximize inter-cluster and minimize intra-cluster distance (Zhengyu et al., 2010; Zhu et al., 2007). ...
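For readability, a hedged sketch of those fitness ingredients in Python: cosine similarity as d(x, y), a within-cluster tightness score over the centers V_i, and a between-cluster separation score. Function names and the averaging are illustrative, not code from either paper:

```python
import numpy as np

def cosine_sim(x, y):
    """d(x, y): cosine similarity between two term vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def intra_cluster(X, labels, centers):
    """Mean similarity of each object to its own center V_i (higher = tighter)."""
    return float(np.mean([cosine_sim(X[i], centers[labels[i]])
                          for i in range(len(X))]))

def inter_cluster(centers):
    """Mean similarity between distinct centers (lower = better separated)."""
    c = len(centers)
    sims = [cosine_sim(centers[i], centers[j])
            for i in range(c) for j in range(i + 1, c)]
    return float(np.mean(sims)) if sims else 0.0
```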
Article
Full-text available
Document clustering is the process of organizing a particular electronic corpus of documents into subgroups of similar text features. Formerly, a number of conventional algorithms were applied to perform document clustering. There are current endeavors to enhance clustering performance by employing evolutionary algorithms, and such endeavors have become an emerging topic gaining attention in recent years. The aim of this paper is to present an up-to-date and self-contained review fully devoted to document clustering via evolutionary algorithms. It first provides a comprehensive inspection of the document clustering model, revealing its various components and related concepts. It then presents and analyzes the principal research work on this topic. Finally, it compiles and classifies the various objective functions, the core of the evolutionary algorithms, from the related collection of research papers. The paper concludes by addressing some important issues and challenges that can be the subject of future work. © 2015 Sarmad Makki, Razali Yaakob, Norwati Mustapha and Hamidah Ibrahim.
... The recall-precision framework is a useful method for evaluating information retrieval system performance. It is also used to measure the quality of clustering [15]. The proportion of retrieved relevant documents to all retrieved documents is called precision. ...
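The definition above translates directly into code; a minimal sketch for a single cluster scored against a single reference class (names are illustrative):

```python
def precision_recall(cluster, relevant):
    """cluster, relevant: sets of document ids."""
    retrieved_relevant = len(cluster & relevant)
    precision = retrieved_relevant / len(cluster) if cluster else 0.0
    recall = retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall

# Example: precision_recall({1, 2, 3, 4}, {2, 3, 5}) -> (0.5, 0.666...)
```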
Article
Full-text available
As the World Wide Web has grown enormously, it has become difficult to retrieve useful information quickly. A user looking for information may have to browse many pages to get the desired information from the pool of the World Wide Web (WWW). A technique is required that organizes document content efficiently so that information can be easily obtained from the largest data repository, the WWW. Clustering is an unsupervised classification technique that puts related data in one set (cluster). Clustering can help users quickly find the information they are interested in within this abundance of information. However, clustering methods suffer from the huge number of documents and the high dimensionality of text features. We propose a web page clustering scheme that works efficiently in high dimensions. We present a method to reduce the dimensionality of the feature vector by selecting the most informative words while still maintaining the quality of the clusters.
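The paper's exact selection criterion is not given here, so as a hedged illustration, one common way to keep only informative words is a document-frequency band filter; the thresholds below are assumptions:

```python
from collections import Counter

def select_features(docs, min_df=2, max_df_ratio=0.5):
    """docs: list of token lists. Keep terms that are neither too rare
    (below min_df documents) nor too common (above max_df_ratio of docs)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    n = len(docs)
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}
```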
Article
Full-text available
This paper presents a computationally efficient and robust evolutionary algorithm to find a better permutation of weighting phase factors for minimizing envelope fluctuations of orthogonal frequency division multiplexing signals. The proposed optimization method is called the seasons algorithm, whose main inspiration is the growth and survival of trees in nature. This algorithm formulates fluctuation reduction as an optimization problem. It is combined with the partial transmit sequence method to decrease both the large fluctuations of signals and the search cost for larger sub-blocks at the same time. The search complexity of the proposed hybrid algorithm is polynomial, while the complexity of the exhaustive-search partial transmit sequence scheme increases exponentially with the number of sub-blocks. The proposed algorithm is evaluated using different benchmarks and compared with several counterpart methods on fluctuation reduction performance and search cost. The simulation results show that the proposed algorithm outperformed the existing optimization meta-heuristics in minimizing the envelope fluctuations.
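A rough sketch of the partial transmit sequence (PTS) objective the seasons algorithm searches over, assuming adjacent sub-block partitioning; this illustrates the objective only, not the paper's implementation:

```python
import numpy as np

def papr_db(x):
    """Peak-to-average power ratio of a time-domain signal, in dB."""
    p = np.abs(x) ** 2
    return 10 * np.log10(p.max() / p.mean())

def pts_papr(symbol, phases, n_blocks):
    """symbol: complex frequency-domain vector; phases: one factor per block."""
    n = len(symbol)
    x = np.zeros(n, dtype=complex)
    for b in range(n_blocks):
        mask = np.zeros(n, dtype=complex)
        mask[b * n // n_blocks:(b + 1) * n // n_blocks] = 1
        x += phases[b] * np.fft.ifft(symbol * mask)
    return papr_db(x)

# A metaheuristic (here, the seasons algorithm) searches the phase vector,
# e.g. over {1, -1, 1j, -1j} per block, to minimize pts_papr.
```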
Article
Indus-2 is a synchrotron radiation source operational at Raja Ramanna Centre for Advanced Technology (RRCAT), Indore, India. The betatron tune measurement system is an essential beam diagnostic system for smooth beam operation of Indus-2, and it is based on the swept frequency excitation method. In this method, a lookup table is used for adjustment of the excitation signal level in the drive system and for selection of receiver parameters such as sweep rate and bandwidth. It was observed that the beam excitation levels must be re-calibrated regularly due to changes in the operating parameters of Indus-2. To address this re-calibration problem, a new tune measurement system for Indus-2 was developed based on an optimal control method using a genetic algorithm. Multiple parameters and multi-objective control were required to develop this system. A variant of the genetic algorithm (GA) was developed for online control of this system. In this algorithm, real-valued variables are used and their boundaries are adaptively modified during the online optimization; hence this GA variant is called the Adaptive Variable Boundary based Real Coded Genetic Algorithm (AVB-RCGA). This system provides better results than the old tune measurement system. The new tune measurement system is faster, with a better signal-to-noise ratio (SNR), over the entire range of machine operation.
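A minimal sketch of the adaptive-boundary step described above, assuming each generation re-centers and shrinks every variable's interval around the best solution so far; the shrink factor and clamping rule are assumptions:

```python
import numpy as np

def adapt_bounds(lo, hi, best, shrink=0.9):
    """Re-center each variable's [lo, hi] interval on the current best
    solution and shrink its width, staying inside the original bounds."""
    half = shrink * (hi - lo) / 2
    return np.maximum(lo, best - half), np.minimum(hi, best + half)
```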
Article
To automatically determine the number of clusters and generate higher-quality clusters when clustering data samples, we propose a harmonious genetic clustering algorithm, named HGCA, which is based on harmonious mating in eugenic theory. Different from extant genetic clustering methods that use only fitness, HGCA aims to select the most suitable mate for each chromosome and takes chromosome gender, age, and fitness into account when computing mating attractiveness. To avoid illegal mating, we design three mating prohibition schemes, i.e., no mating prohibition, mating prohibition based on lineal relativeness, and mating prohibition based on collateral relativeness, and three mating strategies, i.e., a greedy eugenics-based mating strategy, a eugenics-based mating strategy based on weighted bipartite matching, and a eugenics-based mating strategy based on unweighted bipartite matching, for harmonious mating. In particular, a novel single-point crossover operator called variable-length-and-gender-balance crossover is devised to probabilistically guarantee the balance between population gender ratio and the dynamics of chromosome lengths. We evaluate the proposed approach on real-life and artificial datasets, and the results show that our algorithm outperforms existing genetic clustering methods in terms of robustness, efficiency, and effectiveness.
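As a hedged illustration only (the paper's actual formula is not reproduced here), a mating-attractiveness score might combine the three named factors like this, with the opposite-gender rule and the weights as assumptions:

```python
def attractiveness(a, b, w_fit=0.6, w_age=0.4):
    """a, b: dicts with 'gender', 'age', 'fitness' (fitness scaled to [0, 1]).
    Illustrative scoring only, not HGCA's definition."""
    if a["gender"] == b["gender"]:
        return 0.0  # assumption: only opposite-gender chromosomes may mate
    age_match = 1.0 / (1.0 + abs(a["age"] - b["age"]))
    return w_fit * b["fitness"] + w_age * age_match
```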
Chapter
The World Wide Web (WWW) has become the largest source of information. This abundance of information, together with the dynamic and heterogeneous nature of the web, makes information retrieval a difficult process for the average user. A technique is required that can help users organize, summarize, and browse the available information from the web with the goal of satisfying their information needs effectively. The clustering process organizes a collection of objects into related groups. Web page clustering is the key concept for quickly finding desired information in the massive store of web pages on the WWW. Many researchers have proposed various web document clustering techniques. In this paper, we present a detailed survey of existing web document clustering techniques along with document representation techniques. We also describe some evaluation measures for assessing cluster quality.
Conference Paper
Many conventional search engines satisfy the need for information retrieval from the WWW, but the results obtained still leave room for refinement and accuracy. This problem of irrelevant results is observed particularly for complex queries, i.e. queries with many keywords. We propose an intelligent method for web mining based on a Genetic Algorithm (GA). The results produced by a conventional search engine, i.e. snippets, are further processed and refined to extract only the most relevant ones. A significant improvement in the search results is observed by using a modified GA with the additional local search technique of a Memetic Algorithm (MA).
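A minimal sketch of the memetic refinement step, assuming a binary snippet-selection chromosome and a caller-supplied relevance score; everything here is illustrative:

```python
import random

def local_search(chrom, score, max_steps=10):
    """Memetic refinement: flip single bits of a binary chromosome and
    keep each flip only if it improves the relevance score."""
    best = list(chrom)
    for _ in range(max_steps):
        i = random.randrange(len(best))
        candidate = list(best)
        candidate[i] = 1 - candidate[i]
        if score(candidate) > score(best):
            best = candidate
    return best
```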
Article
Full-text available
We examined whether using a natural language processing (NLP) system results in improved accuracy and completeness of automated electronic laboratory reporting (ELR) of notifiable conditions. We used data from a community-wide health information exchange that has automated ELR functionality. We focused on methicillin-resistant Staphylococcus aureus (MRSA), a reportable infection found in unstructured, free-text culture result reports. We used the Regenstrief EXtraction tool (REX) for this work. REX processed 64,554 reports that mentioned MRSA, and we compared its output to a gold standard (human review). REX correctly identified 39,491 (99.96%) of the 39,508 reports positive for MRSA, and committed only 74 false positive errors. It achieved high sensitivity, specificity, positive predictive value, and F-measure. REX identified over twice as many MRSA-positive reports as the ELR system without NLP. Using NLP can improve the completeness and accuracy of automated ELR.
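The reported rates follow directly from the stated counts; a worked check:

```python
# 39,491 true positives out of 39,508 MRSA-positive reports; 74 false positives.
tp, fn, fp = 39_491, 39_508 - 39_491, 74
sensitivity = tp / (tp + fn)          # ~0.9996, the 99.96% reported
ppv = tp / (tp + fp)                  # positive predictive value, ~0.998
f_measure = 2 * ppv * sensitivity / (ppv + sensitivity)  # ~0.999
```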
Article
When clustering a set of Web documents, most existing clustering algorithms require a special k value, the number of clusters, to be specified manually. Based on a technique of auto-selected similarity threshold, this paper proposes an 'auto k value calculation' method to calculate the k value automatically. Then a hybrid clustering algorithm that combines a GA with k-medoids is presented. Initially, it is applied to obtain the initial k partitions of all the Web documents; these initial clusters are then agglomerated hierarchically according to the 'similarity between classes' until all of the documents are clustered into a single cluster. The final result is a tree-structured classification of all the Web documents. The GA helps avoid local optima when clustering the documents, and the k-medoids step hastens the convergence of the GA.
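A hedged sketch of the auto-k idea, assuming k is taken as the number of connected groups under a similarity threshold; the threshold auto-selection rule itself is not reproduced here:

```python
def auto_k(sim, threshold):
    """sim: n x n similarity matrix. Returns the number of groups formed
    by linking pairs whose similarity meets the threshold (union-find)."""
    n = len(sim)
    label = list(range(n))

    def find(i):
        while label[i] != i:
            label[i] = label[label[i]]  # path halving
            i = label[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                label[find(i)] = find(j)
    return len({find(i) for i in range(n)})
```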
Article
Search engines have proven their effectiveness for retrieving information from the World Wide Web. Traditionally, search results are arranged in a list ordered by popularity and relevancy. However, the enormous number of matched Web pages makes it inefficient for users to locate the most relevant ones. A proper organization of the search results is important to improve the browsability of Web searching. In this paper, we propose performing Support Vector Clustering (SVC) on the search results to reorganize them into groups of similar context, facilitating effective browsing by users. SVC is a nonparametric clustering algorithm that can find clusters with arbitrary shapes without the need to specify the number of clusters. It is a kernel clustering method that maps the data via a nonlinear function to a high-dimensional feature space. To obtain the optimal clustering result, choosing accurate parameters (kernel width and penalty coefficient) for SVC is crucial. We therefore also propose an automatic tuning method for the SVC parameters. Experimental results demonstrate the effectiveness and usefulness of the proposed method; its performance is comparable to other popular clustering techniques.
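The tuning loop itself is simple to express; a sketch assuming a placeholder svc_cluster implementation and an internal validity score (both hypothetical names, standing in for the paper's tuning method):

```python
def tune_svc(X, svc_cluster, validity, kernel_widths, penalties):
    """Grid-scan kernel width q and penalty C, keeping the pair that
    maximizes an internal cluster-validity score."""
    best = (None, None, float("-inf"))
    for q in kernel_widths:
        for C in penalties:
            labels = svc_cluster(X, q=q, C=C)   # placeholder SVC call
            score = validity(X, labels)
            if score > best[2]:
                best = (q, C, score)
    return best
```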
Article
For the past few decades the mainstream data clustering technologies have been fundamentally based on centralized operation; data sets were of small manageable sizes, and usually resided on one site that belonged to one organization. Today, data is of enormous sizes and is usually located on distributed sites; the primary example being the Web. This created a need for performing clustering in distributed environments. Distributed clustering solves two problems: infeasibility of collecting data at a central site, due to either technical and/or privacy limitations, and intractability of traditional clustering algorithms on huge data sets. In this paper we propose a distributed collaborative clustering approach for clustering Web documents in distributed environments. We adopt a peer-to-peer model, where the main objective is to allow nodes in a network to first form independent opinions of local document clusterings, then collaborate with peers to enhance the local clusterings. Information exchanged between peers is minimized through the use of cluster summaries in the form of keyphrases extracted from the clusters. This summarized view of peer data enables nodes to request merging of remote data selectively to enhance local clusters. Initial clustering, as well as merging peer data with local clusters, utilizes a clustering method, called similarity histogram-based clustering, based on keeping a tight similarity distribution within clusters. This approach achieves significant improvement in local clustering solutions without the cost of centralized clustering, while maintaining the initial local clustering structure. Results show that larger networks exhibit larger improvements, up to 15% improvement in clustering quality, albeit lower absolute clustering quality than smaller networks.
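A hedged sketch of the similarity histogram criterion, assuming a cluster's quality is the fraction of its pairwise similarities above a threshold and a document is admitted only if that fraction does not drop too far; both thresholds are assumptions:

```python
def histogram_ratio(members, sim, s_min=0.3):
    """Fraction of within-cluster pairwise similarities at or above s_min."""
    pairs = [sim[i][j] for i in members for j in members if i < j]
    if not pairs:
        return 1.0
    return sum(s >= s_min for s in pairs) / len(pairs)

def admits(cluster, doc, sim, eps=0.05):
    """Admit doc only if the cluster's similarity histogram stays tight."""
    before = histogram_ratio(cluster, sim)
    after = histogram_ratio(cluster + [doc], sim)
    return after >= before - eps
```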
Article
In this paper, a new algorithm, fuzzy co-clustering with Ruspini's condition (FCR), is proposed for co-clustering documents and words. Compared to most existing fuzzy co-clustering algorithms, FCR is able to generate fuzzy word clusters that capture the natural distribution of words, which may be beneficial for information retrieval. We discuss the principle behind the algorithm through theoretical discussions and illustrations. These, together with experiments on two standard datasets, show that FCR can discover the naturally existing document-word co-clusters.
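Ruspini's condition itself is the familiar constraint that each object's fuzzy memberships sum to one; a one-line check:

```python
import numpy as np

def satisfies_ruspini(U, tol=1e-8):
    """U: n x c membership matrix, U[i][k] = membership of object i in cluster k."""
    return bool(np.allclose(U.sum(axis=1), 1.0, atol=tol))
```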
Article
In this paper we propose a new co-clustering algorithm called possibilistic fuzzy co-clustering (PFCC) for automatic categorization of large document collections. PFCC integrates a possibilistic document clustering technique and a combined formulation of fuzzy word ranking and partitioning into a fast iterative co-clustering procedure. This novel framework simultaneously brings about several benefits, including robustness in the presence of document and word outliers, rich representations of co-clusters, highly descriptive document clusters, good performance in a high-dimensional space, and reduced sensitivity to initialization in the possibilistic clustering. We present the detailed formulation of PFCC together with explanations of the motivations behind it. The advantages over other existing works and the algorithm's proof of convergence are provided. Experiments on several large document data sets demonstrate the effectiveness of PFCC.
Article
Providing highly relevant page hits to the user is a major concern in Web search. To accomplish this goal, the user must first be allowed to express his intent precisely. Second, page hit rating mechanisms should be used that take the user's intent into account. Finally, a learning mechanism is needed that captures a user's preferences in his Web search, even when those preferences are changing dynamically. To address the first two issues, we propose a semantic taxonomy-based meta-search agent approach that incorporates the user's taxonomic search intent. It also addresses relevancy improvement of the resulting page hits by using the user's search intent and preference-based rating. To provide a learning mechanism, we first propose a connectionist model-based user profile representation approach, which can leverage all of the features of the semantic taxonomy-based information retrieval approach. A user profile learning algorithm is also devised for our proposed user profile representation framework by significantly modifying and extending a typical neural network learning algorithm. Finally, the entire methodology, including this learning mechanism, is implemented in an agent-based system, WebSifter II. Empirical results of learning performance are also discussed.
Conference Paper
Existing partitioning-based clustering algorithms, such as k-means, k-medoids and their variations, are simple in theory and fast to converge, but they often terminate at a local optimum, and they are not suitable for discovering clusters whose sizes differ greatly. This paper proposes an improved Web document clustering method using a genetic algorithm (GA) that introduces some ideas of ISODATA [6] into the design of its mutation operation. Experiments show that the GA's global search characteristic can avoid local optima, and the ISODATA-based mutation gives the improved clustering algorithm the self-adjusting ability to discover clusters of different sizes.
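A hedged sketch of an ISODATA-style mutation consistent with that description, assuming clusters split when their internal spread is large and merge when their centers are close; the thresholds are assumptions, not the paper's settings:

```python
import numpy as np

def isodata_mutate(centers, X, labels, split_std=1.0, merge_dist=0.5):
    """centers: list of center vectors; labels: np.array of cluster ids.
    Split one over-spread cluster or merge one close pair per call,
    letting the number of clusters self-adjust."""
    centers = list(centers)
    for k, c in enumerate(centers):
        pts = X[labels == k]
        if len(pts) > 1 and pts.std(axis=0).max() > split_std:
            offset = pts.std(axis=0)
            centers[k] = c + offset / 2      # split: replace one center
            centers.append(c - offset / 2)   #        with two shifted copies
            return centers
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if np.linalg.norm(centers[i] - centers[j]) < merge_dist:
                merged = (centers[i] + centers[j]) / 2
                return [c for k, c in enumerate(centers)
                        if k not in (i, j)] + [merged]
    return centers
```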