Article · PDF Available

Efficient algorithm for handling dangling pages using hypothetical node


Abstract

Dangling pages are one of the major drawbacks of the PageRank algorithm, which is used by different search engines to calculate page rank. The number of such pages keeps increasing as the Web grows as a source of knowledge, and search engines remain the main tools for knowledge extraction and information retrieval from the Web. We propose an algorithm that handles these pages using a hypothetical node and compare it with the standard PageRank algorithm.
... The static nature of the Web did not allow many changes in its link structure, but the Web is now evolving into a dynamic one driven by databases. Dangling pages also create the link rot problem because of this dynamic behavior of the Web [2]. Link rot is the problem in which a link that worked at one time no longer works, because the content has been removed, the URL has changed, or the link is broken; such requests sometimes return HTTP code 403 or 404 [7]. Penalty pages are the pages that return these HTTP codes. ...
... There are many previous works on handling dangling pages. According to [7], the dangling pages are handled by adding a hypothetical node to the web graph, connecting every dangling node in the graph to this hypothetical node, and giving the hypothetical node a self loop that points back to itself. Creating this hypothetical node makes the transition matrix stochastic, which is necessary for computing the PageRank vector because a Markov chain is only defined for a stochastic matrix. ...
... With this method, PageRank values are obtained for all nodes, whether non-dangling, dangling or hypothetical. The hypothetical node always receives a high PageRank value, which can simply be ignored according to [7]. The hypothetical node approach resolves the philosophical issue by including all dangling nodes, but it increases the computational cost. ...
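As a rough illustration of the construction described above (a sketch, not the authors' exact implementation), the following Python snippet adds a hypothetical node with a self loop to a small web graph containing a dangling page, builds the resulting stochastic transition matrix, and runs a standard power iteration; the page names, graph, and damping factor 0.85 are assumptions.

```python
import numpy as np

# Toy web graph: page C is dangling (no out-links).
links = {"A": ["B", "C"], "B": ["C"], "C": []}
nodes = list(links) + ["H"]          # "H" is the hypothetical node
idx = {p: i for i, p in enumerate(nodes)}
n = len(nodes)

# Build the transition matrix: dangling pages link to H, H gets a self loop.
T = np.zeros((n, n))
for page, outs in links.items():
    targets = outs if outs else ["H"]     # dangling page -> hypothetical node
    for t in targets:
        T[idx[page], idx[t]] = 1.0 / len(targets)
T[idx["H"], idx["H"]] = 1.0               # self loop keeps T row-stochastic

# Standard PageRank power iteration with assumed damping factor d = 0.85.
d = 0.85
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * rank @ T

# The hypothetical node's (large) score is simply dropped, as in [7].
print({p: round(float(rank[idx[p]]), 4) for p in nodes if p != "H"})
```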
... There is another proposal, from Bianchini et al. (2005) and Singh et al. (2010), to add a hypothetical node h_i with a self loop and connect all the dangling nodes to it, as shown in Fig. 4. This method also makes the transition matrix stochastic. ...
... The algorithm in (Singh et al., 2010) is run on the dataset and the results for the dangling hosts are shown. We show the rank results from the first dangling host to the last one. ...
... Modified Web Graph W using Singh et al. (2010) ...
Article
Link analysis algorithms for Web search engines determine the importance and relevance of Web pages. Among the link analysis algorithms, PageRank is the state-of-the-art ranking mechanism used in the Google search engine today. The PageRank algorithm models the behavior of a random Web surfer; this model can be seen as a Markov chain that predicts the behavior of a system travelling from one state to another based only on the current state. However, this model has the dangling node or hanging node problem, because such nodes cannot be represented in a Markov chain model. This paper focuses on the application of Markov chains to the PageRank algorithm and discusses a few methods to handle the dangling node problem. The experiment is run on the WEBSPAM-UK2007 dataset to show the rank results of the dangling nodes.
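To make the dangling-node problem concrete, here is a small hedged sketch (not taken from the paper) showing that the all-zero row of a dangling page leaves the raw transition matrix non-stochastic, so it cannot define a Markov chain until it is repaired:

```python
import numpy as np

# Rows = pages A, B, C; C has no out-links (dangling), so its row is all zeros.
T = np.array([
    [0.0, 0.5, 0.5],   # A -> B, C
    [0.0, 0.0, 1.0],   # B -> C
    [0.0, 0.0, 0.0],   # C -> nothing  <- dangling row
])

row_sums = T.sum(axis=1)
print(row_sums)                     # [1. 1. 0.]  -> not row-stochastic
print(np.allclose(row_sums, 1.0))   # False: the Markov chain is undefined here
```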
... Figure 3 shows the modified Web graph W using [3]. The new forward links from the Alumni page are shown with dotted arrows, and this makes the transition matrix T stochastic, as shown below. Another proposal [5] is to connect a hypothetical node h_i with a self loop and connect all the dangling nodes to the hypothetical node, as shown in Fig. 4. ...
... This method also makes the transition matrix stochastic. Figure 4 shows the modified Web graph W using [5]. In Fig. 4, h_i is the hypothetical node with a self loop (shown as a blue dotted line), the Alumni page is connected to it (shown as a red dotted line), and the transition matrix for the modified graph follows from the Alumni page in the modified graph of Fig. 4. ...
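The transition matrices referred to in the excerpt are not reproduced in this preview. As a hedged illustration only, for a three-page graph in which the Alumni page is the dangling one, adding the hypothetical node h_i changes the matrix roughly as follows (page ordering and values are assumptions, not the paper's figures):

```latex
% Before: the Alumni row (last row) is all zeros, so T is not stochastic.
T = \begin{pmatrix}
      0 & \tfrac{1}{2} & \tfrac{1}{2} \\
      0 & 0            & 1            \\
      0 & 0            & 0
    \end{pmatrix}
\qquad
% After: the Alumni page links to h_i (last column) and h_i has a self loop,
% so every row of T' sums to 1.
T' = \begin{pmatrix}
      0 & \tfrac{1}{2} & \tfrac{1}{2} & 0 \\
      0 & 0            & 1            & 0 \\
      0 & 0            & 0            & 1 \\
      0 & 0            & 0            & 1
    \end{pmatrix}
```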
... The collection contains 114,549 hosts, and among them 49,379 are dangling hosts. The distribution of hosts is shown below. The algorithm in [5] is run on the dataset and the results for the dangling hosts are shown, from the first dangling host to the last one. ...
Conference Paper
Full-text available
Link analysis algorithms for Web search engines determine the importance and relevance of Web pages. Among the link analysis algorithms, PageRank is the state-of-the-art ranking mechanism used in the Google search engine today. The PageRank algorithm models the behavior of a random Web surfer; this model can be seen as a Markov chain that predicts the behavior of a system travelling from one state to another based only on the current state. However, this model has the dangling node or hanging node problem, because such nodes cannot be represented in a Markov chain model. This paper focuses on the application of Markov chains to the PageRank algorithm and discusses a few methods to handle the dangling node problem. The experiment is run on the WEBSPAM-UK2007 dataset to show the rank results of the dangling nodes.
... Another method, by Ipsen and Selee (2007), also separates the hanging pages from the non-hanging ones and computes the PageRank. There were also methods proposed by Bianchini, Gori, and Scarselli (2005) and Singh, Kumar, and Goh (2010, 2012) to include hanging pages in the ranking process. All these methods either exclude the hanging pages from the rank computation or include them in it. ...
... This also makes pages E and F non-hanging. According to Langville and Meyer (2004) and Singh, Kumar, and Goh (2010), the matrix PP is stochastic and primitive, as is the original matrix proposed by Page, Brin, Motwani, and Winograd (1999). The modified web graph WG′ for this proposed probability matrix is shown in Figure 5. Here, the non-relevant hanging node is removed and the links are adjusted. ...
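As a brief hedged sketch (not taken from the cited works), one can verify numerically that the damped "Google matrix" G = dP + (1-d)E/n built from a row-stochastic P has strictly positive rows summing to one, which is what makes it stochastic and primitive:

```python
import numpy as np

# Assumed small row-stochastic matrix P (every row already sums to 1).
P = np.array([
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])
n, d = P.shape[0], 0.85

# Damped Google matrix: G = d*P + (1 - d) * (1/n) * ones
G = d * P + (1 - d) * np.ones((n, n)) / n

print(np.allclose(G.sum(axis=1), 1.0))  # True -> stochastic
print((G > 0).all())                    # True -> strictly positive, hence primitive
```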
Article
This article presents an algorithm to determine the relevancy of hanging pages in link-structure-based ranking algorithms. Hanging pages have recently become major obstacles in web information retrieval (IR). As an increasing number of meaningful hanging pages appear on the Web, their relevancy has to be determined according to the query term in order to keep search engine result pages fair and relevant. Excluding these pages from the ranking calculation can give biased or inconsistent results, but including them reduces the speed significantly. Most IR ranking algorithms exclude the hanging pages, yet there are relevant and important hanging pages on the Web that cannot be ignored. In our proposed method, we use anchor text to determine hanging-page relevancy, and we use stability analysis to show that the rank results are consistent before and after altering the link structure.
... There is another proposal, from Bianchini et al. (2005) and Singh et al. (2010), to add a hypothetical node h_i with a self loop and connect all the dangling nodes to it, as shown in Fig. 4. This method also makes the transition matrix stochastic. ...
... The distribution of hosts is shown in Fig. 5. The algorithm in (Singh et al., 2010) is run on the dataset and the results for the dangling hosts are shown. We show the rank results from the first dangling host to the last one. ...
Article
Full-text available
Link analysis algorithms for Web search engines determine the importance and relevance of Web pages. Among the link analysis algorithms, PageRank is the state-of-the-art ranking mechanism used in the Google search engine today. The PageRank algorithm models the behavior of a random Web surfer; this model can be seen as a Markov chain that predicts the behavior of a system travelling from one state to another based only on the current state. However, this model has the dangling node or hanging node problem, because such nodes cannot be represented in a Markov chain model. This paper focuses on the application of Markov chains to the PageRank algorithm and discusses a few methods to handle the dangling node problem. The experiment is run on the WEBSPAM-UK2007 dataset to show the rank results of the dangling nodes.
... Another method, by Ipsen et al. [10], also separates the hanging pages from the non-hanging ones and computes the PageRank. There were also methods proposed by M. Bianchini et al. [3] and Singh et al. [4] to include hanging pages in the ranking process. All these methods either exclude the hanging pages from the PageRank computation or include them in it. ...
Conference Paper
Full-text available
The continuous growth of hanging pages on the Web poses a significant problem for ranking in information retrieval (IR). Excluding these pages from the ranking calculation can give biased or inconsistent results, while including them reduces the speed significantly. Most IR ranking algorithms therefore exclude the hanging pages, yet there are relevant and important hanging pages on the Web that should not be ignored merely because of the computational complexity and time involved. In our proposed method, we include the relevant hanging pages in the ranking, and the relevancy or non-relevancy of hanging pages is determined by applying a Genetic Algorithm (GA).
... The basic PageRank model treats the whole Web as a directed graph G(V, E), with a vertex set V of n pages and a directed edge set E. The following matrix M is an m x m matrix where m = n + 1, i.e. the last column and the last row correspond to the virtual node used to deal with the hanging pages (Singh et al. 2010). ...
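Schematically (a hedged reconstruction, not the paper's exact notation), the (n+1) x (n+1) matrix M can be pictured with the virtual node occupying the last row and column:

```latex
% A is the n x n block of links among the n real pages; the column v has an
% entry in each hanging page's row, linking that page to the virtual node;
% the bottom row gives the virtual node a self loop so M stays stochastic.
M = \begin{pmatrix}
      A & v \\
      \mathbf{0}^{\top} & 1
    \end{pmatrix}
```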
Article
Full-text available
In this article we first explain the knowledge extraction (KE) process from the World Wide Web (WWW) using search engines. Then we explore the PageRank algorithm of Google search engine (a well-known link-based search engine) with its hidden Markov analysis. We also explore one of the problems of link-based ranking algorithms called hanging pages or dangling pages (pages without any forward links). The presence of these pages affects the ranking of Web pages. Some of the hanging pages may contain important information that cannot be neglected by the search engine during ranking. We propose methodologies to handle the hanging pages and compare the methodologies. We also introduce the TrustRank algorithm (an algorithm to handle the spamming problems in link-based search engines) and include it in our proposed methods so that our methods can combat Web spam. We implemented the PageRank algorithm and TrustRank algorithm and modified those algorithms to implement our proposed methodologies.
Conference Paper
Full-text available
This paper explores different Web spam detection algorithms such as TrustRank and its derivatives. TrustRank is implemented and compared with the state-of-the-art PageRank on a large dataset, and the experiments show that the TrustRank algorithm filters out spam effectively.
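For orientation only, here is a minimal sketch of the TrustRank idea (a biased PageRank seeded from a small set of manually vetted, trusted pages); the graph, seed set, and damping factor are assumptions, not the paper's setup:

```python
import numpy as np

# Row-stochastic transition matrix of a tiny assumed web graph.
T = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
n, d = T.shape[0], 0.85

# Trust is seeded only on vetted pages (here: page 0), not spread uniformly.
seed = np.array([1.0, 0.0, 0.0, 0.0])

trust = seed.copy()
for _ in range(100):
    trust = (1 - d) * seed + d * trust @ T   # biased (personalized) PageRank

print(trust.round(4))   # pages far from the trusted seed receive little trust
```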
Article
Full-text available
A self-consistent methodology is developed for determining citation based influence measures for scientific journals, subfields and fields. Starting with the cross citing matrix between journals or between aggregates of journals, an eigenvalue problem is formulated leading to a size independent influence weight for each journal or aggregate. Two other measures, the influence per publication and the total influence are then defined. Hierarchical influence diagrams and numerical data are presented to display journal interrelationships for journals within the subfields of physics. A wide range in influence is found between the most influential and least influential or peripheral journals.
Article
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
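As a brief hedged sketch of the hub/authority iteration described here (the adjacency matrix and iteration count are assumptions), each step updates authority scores from hubs and hub scores from authorities, converging to the principal eigenvectors mentioned in the abstract:

```python
import numpy as np

# A[i, j] = 1 if page i links to page j (assumed toy graph).
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

hubs = np.ones(4)
auths = np.ones(4)
for _ in range(50):
    auths = A.T @ hubs                 # good authorities are pointed to by good hubs
    hubs = A @ auths                   # good hubs point to good authorities
    auths /= np.linalg.norm(auths)     # normalize to keep the scores bounded
    hubs /= np.linalg.norm(hubs)

print("authorities:", auths.round(3))
print("hubs:       ", hubs.round(3))
```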
Conference Paper
The study of the Web as a graph is not only fascinating in its own right, but also yields valuable insight into Web algorithms for crawling, searching and community discovery, and the sociological phenomena which characterize its evolution. We report on experiments on local and global properties of the Web graph using two AltaVista crawls each with over 200 million pages and 1.5 billion links. Our study indicates that the macroscopic structure of the Web is considerably more intricate than suggested by earlier experiments on a smaller scale.
Article
The Web harbors a large number of communities — groups of content-creators sharing a common interest — each of which manifests itself as a set of interlinked Web pages. Newsgroups and commercial Web directories together contain of the order of 20,000 such communities; our particular interest here is in emerging communities — those that have little or no representation in such fora. The subject of this paper is the systematic enumeration of over 100,000 such emerging communities from a Web crawl: we call our process trawling. We motivate a graph-theoretic approach to locating such communities, and describe the algorithms and the algorithmic engineering necessary to find structures that subscribe to this notion, the challenges in handling such a huge data set, and the results of our experiment.
Article
When using traditional search engines, users have to formulate queries to describe their information need. This paper discusses a different approach to Web searching where the input to the search process is not a set of query terms, but instead is the URL of a page, and the output is a set of related Web pages. A related Web page is one that addresses the same topic as the original page. For example, www.washingtonpost.com is a page related to www.nytimes.com, since both are online newspapers.We describe two algorithms to identify related Web pages. These algorithms use only the connectivity information in the Web (i.e., the links between pages) and not the content of pages or usage information. We have implemented both algorithms and measured their runtime performance. To evaluate the effectiveness of our algorithms, we performed a user study comparing our algorithms with Netscape's `What's Related' service (http://home.netscape.com/escapes/related/). Our study showed that the precision at 10 for our two algorithms are 73% better and 51% better than that of Netscape, despite the fact that Netscape uses both content and usage pattern information in addition to connectivity information.
Article
The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.
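For reference, the rank of a page u in this model is usually written as follows (a standard statement of the formula, with d the damping factor, B_u the set of pages linking to u, and L(v) the number of out-links of v; the exact normalization of the first term varies between presentations):

```latex
PR(u) = \frac{1 - d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}
```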
Conference Paper
With the rapid growth of the Web, users easily get lost in the rich hyper structure. Providing the relevant information to users to cater to their needs is the primary goal of Website owners. Therefore, finding the content of the Web and retrieving the users' interests and needs from their behavior have become increasingly important. Web mining is used to categorize users and pages by analyzing user behavior, the content of the pages, and the order of the URLs that tend to be accessed. Web structure mining plays an important role in this approach. Two page ranking algorithms, HITS and PageRank, are commonly used in Web structure mining. Both algorithms treat all links equally when distributing rank scores. Several algorithms have been developed to improve the performance of these methods. The weighted PageRank algorithm (WPR), an extension to the standard PageRank algorithm, is introduced. WPR takes into account the importance of both the inlinks and the outlinks of the pages and distributes rank scores based on the popularity of the pages. The results of our simulation studies show that WPR performs better than the conventional PageRank algorithm in terms of returning a larger number of relevant pages to a given query.
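The WPR weighting scheme is roughly the following (a hedged paraphrase of the weighted PageRank formulation, where B(u) is the set of pages linking to u, R(v) is the set of pages that v links to, and I_p and O_p are the in-link and out-link counts of page p):

```latex
PR(u) = (1 - d) + d \sum_{v \in B(u)} PR(v)\, W^{\mathrm{in}}_{(v,u)}\, W^{\mathrm{out}}_{(v,u)},
\qquad
W^{\mathrm{in}}_{(v,u)} = \frac{I_u}{\sum_{p \in R(v)} I_p},
\quad
W^{\mathrm{out}}_{(v,u)} = \frac{O_u}{\sum_{p \in R(v)} O_p}
```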
Article
graph, consisting of a set of abstract nodes (the pages) joined by directional edges (the hyperlinks). Hyperlinks encode a considerable amount of latent information about the underlying collection of pages; thus, the structure of this directed graph can provide us with significant insight into its content. Within this framework, we can search for signs of meaningful graph-theoretic structure; we can ask: What are the recurring patterns of linkage that occur across the Web as a whole? The profound complexity of the WWW is a crucial challenge in this search for structure. Content on the