Conference Paper

A General Markov Framework for Page Importance Computation


Abstract

We propose a General Markov Framework for computing page importance. Under the framework, a Markov Skeleton Process is used to model the random walk conducted by the web surfer on a given graph. Page importance is then defined as the product of page reachability and page utility, which can be computed from the transition probability and the mean staying time of the pages in the Markov Skeleton Process, respectively. We show that this general framework can cover many existing algorithms as its special cases, and that the framework can help us define new algorithms to handle more complex problems. In particular, we demonstrate the use of the framework with the exploitation of a new process named the Mirror Semi-Markov Process. The experimental results validate that the Mirror Semi-Markov Process model is more effective than previous models in several tasks.
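As a rough illustration of the product formulation above, the sketch below computes reachability as the stationary distribution of a toy embedded Markov chain and weights it by mean staying times. The transition matrix and staying times are made up for the example, not taken from the paper.

```python
import numpy as np

def page_importance(P, mean_stay, tol=1e-10):
    """Importance = reachability (stationary distribution of the
    embedded chain) * utility (mean staying time), renormalized."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    while True:  # power iteration for the stationary distribution
        nxt = pi @ P
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    score = pi * mean_stay       # reachability times utility
    return score / score.sum()   # normalize to a distribution

# Made-up 3-page transition matrix and mean staying times (seconds).
P = np.array([[0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7],
              [0.6, 0.4, 0.0]])
stay = np.array([2.0, 1.0, 4.0])
print(page_importance(P, stay))
```

Here the staying-time weighting amplifies the score of the page users linger on, which is the role the utility factor plays in the framework.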




... A direct motivation for investigating such dynamics comes from information retrieval on the Web. The notion of WMSP was recently introduced in [9,10], where the authors found that WMSP is a very suitable framework for modeling user browsing behavior on the Web. When modeling user browsing behavior, the state space E is a collection of web pages, X = {X_n, n ≥ 0} describes the transitions between pages, forming a Markov chain with state space E, and Y = {Y_n, n ≥ 0} represents the staying times on the pages. ...
... Moreover, the framework is essential in designing new algorithms to handle more complex problems. For example, mirror semi-Markov processes, a new class of processes in the WMSP family, play an essential role in designing MobileRank for computing page importance of the mobile Web [9,10]. It is known that the structure of the mobile Web differs considerably from the usual Internet Web [17]. ...
... The notion of WMSP appeared recently in [9,10]. The authors found that WMSP is a very suitable framework for modeling user browsing behavior on the Web, and a very useful mathematical tool for computing web page importance. ...
Article
We propose and discuss a new class of processes, web Markov skeleton processes (WMSP), arising from the information retrieval on the Web. The framework of WMSP covers various known classes of processes, and it contains also important new classes of processes. We explore the definition, the scope and the time homogeneity of WMSPs, and discuss in detail a new class of processes, mirror semi-Markov processes. In the last section we briefly review some applications of WMSPs in computing page importance on the Web.
... The Markov chain (Norris, 1996; Gao et al., 2009) was invented by A. A. Markov, a Russian mathematician, in the early 1900s to predict the behavior of a system that moves from one state to another by considering only the current state. A Markov chain needs only a matrix and a vector for modeling and prediction. ...
... A Markov chain is a random process in which a system at any given time t = 1, 2, 3, …, n occupies one of a finite number of states (Gao et al., 2009). At each time t the system moves from state v to u with probability p_uv, which does not depend on t. p_uv is called the transition probability, an important feature of a Markov chain; it determines the next state of the object by considering only the current state and not any previous states. ...
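The memorylessness described in this excerpt can be seen in a few lines: evolving the state distribution requires only the current vector and the transition matrix. The 3-state matrix below is hypothetical.

```python
import numpy as np

# Hypothetical 3-state transition matrix: entry [v, u] holds p_uv,
# the probability of moving from state v to u; each row sums to 1.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2],
              [0.5, 0.2, 0.3]])

dist = np.array([1.0, 0.0, 0.0])  # the surfer starts in state 0
# The distribution after t steps depends only on the current vector
# and P -- no earlier history is needed.
for _ in range(3):
    dist = dist @ P
print(dist)
```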
Article
Link analysis algorithms for Web search engines determine the importance and relevance of Web pages. Among the link analysis algorithms, PageRank is the state-of-the-art ranking mechanism used in the Google search engine today. The PageRank algorithm models the behavior of a random Web surfer; this model can be seen as a Markov chain that predicts the behavior of a system traveling from one state to another by considering only the current state. However, this model has the dangling (hanging) node problem, because such nodes cannot be represented in a Markov chain model. This paper focuses on the application of the Markov chain to the PageRank algorithm and discusses a few methods to handle the dangling node problem. Experiments were run on WEBSPAM-UK2007 to show the rank results of the dangling nodes.
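One common remedy the literature discusses for dangling nodes is to redistribute their outgoing probability uniformly over all pages. Below is a minimal sketch of PageRank with that fix, on a made-up 4-page graph rather than the paper's data.

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-10):
    """Power iteration with a common dangling-node fix: probability
    mass sitting on pages with no out-links is spread uniformly over
    all pages before the teleportation step is applied."""
    n = adj.shape[0]
    out = adj.sum(axis=1)
    dangling = out == 0
    # Row-stochastic transition matrix for non-dangling rows only.
    P = np.zeros_like(adj, dtype=float)
    P[~dangling] = adj[~dangling] / out[~dangling, None]
    r = np.full(n, 1.0 / n)
    while True:
        leaked = r[dangling].sum()                    # dangling mass
        nxt = d * (r @ P + leaked / n) + (1 - d) / n  # fix + teleport
        if np.abs(nxt - r).sum() < tol:
            return nxt
        r = nxt

# Toy 4-page graph; page 3 is dangling (no out-links).
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 1],
                [1, 0, 0, 1],
                [0, 0, 0, 0]], dtype=float)
print(pagerank(adj))
```

Without the `leaked / n` term, probability mass would drain out of the chain at page 3 and the iteration would not converge to a proper distribution.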
... Recently a new class of stochastic processes, web Markov skeleton processes (WMSP), has been found to be very useful in the study of information retrieval on the Web (see [10,11]). In our previous paper [25], we initiated an exploration of the theoretical aspects of WMSPs. ...
... Moreover, the framework is essential in designing new algorithms to handle more complex problems. For example, mirror semi-Markov processes, a new class of processes in the WMSP family, play an essential role in designing MobileRank [10,11] for computing page importance of the mobile Web. Below we briefly describe several concrete examples of WMSPs and their roles in corresponding algorithms of information retrieval on the Web. ...
Article
Full-text available
Recently a new class of stochastic processes, web Markov skeleton processes (WMSP), has been found to be very useful in the study of information retrieval on the Web (see [10,11]). In our previous paper [25], we initiated an exploration of the theoretical aspects of WMSPs. We found that the framework of WMSPs is not only important in applications; it also enjoys many interesting theoretical properties of its own. In this paper we shall study further theoretical aspects of WMSPs, including the property of time-homogeneous WMSPs, the reconstruction of WMSPs, and the relation of WMSPs to multivariate point processes. In detail, the paper is organized as follows. In Section 16.1 we briefly review the definition of WMSPs and the role of WMSPs in several algorithms of information retrieval on the Web. In Section 16.2 we review the concept of time homogeneity for WMSPs and investigate its further properties. In Section 16.3 we show that we can reconstruct a time-homogeneous WMSP for a given kernel and a family of initial distributions. Based on the reconstruction we prove an existence and uniqueness result for time-homogeneous WMSPs on a canonical path space. In Section 16.4 we explore the relation between WMSPs and multivariate point processes, which reveals that the theory of multivariate point processes is very useful in the study of WMSPs. In the last section we clarify that the notion of semi-Markov processes in our context coincides with the corresponding classical notion that appeared previously in the literature.
Article
Full-text available
Link analysis algorithms for Web search engines determine the importance and relevance of Web pages. Among the link analysis algorithms, PageRank is the state-of-the-art ranking mechanism used in the Google search engine today. The PageRank algorithm models the behavior of a random Web surfer; this model can be seen as a Markov chain that predicts the behavior of a system traveling from one state to another by considering only the current state. However, this model has the dangling (hanging) node problem, because such nodes cannot be represented in a Markov chain model. This paper focuses on the application of the Markov chain to the PageRank algorithm and discusses a few methods to handle the dangling node problem. Experiments were run on WEBSPAM-UK2007 to show the rank results of the dangling nodes.
... BrowseRank [11] was proposed recently to consider rich metadata (e.g., visiting frequency and staying time) in user behavior data for page importance ranking; it is based on a new mathematical tool, the continuous-time Markov process. MobileRank [6] and BrowseRank Plus [6] further improved BrowseRank by considering more dependencies among nodes in the graph and more metadata. It has been shown that most of these algorithms can be summarized within a general framework based on the Markov skeleton process. ...
Conference Paper
Full-text available
For many Web applications, one needs to deal with the ranking problem on large-scale graphs with rich metadata. However, it is non-trivial to perform efficient and effective ranking on them. On one hand, we need to design scalable algorithms. On the other hand, we also need to develop powerful computational infrastructure to support these algorithms. This tutorial aims at giving a timely introduction to the promising advances in the aforementioned aspects in recent years, and providing the audience with a comprehensive view of the related literature.
... The Markov chain can be used in any system where there is a transition from one state to another [22]. Imagine a random surfer surfing the web, going from one page to another by randomly choosing an outgoing link from one page to go to the next. ...
Article
Full-text available
Link spammers are constantly seeking new methods and strategies to deceive search engine ranking algorithms. The search engines need to come up with new methods and approaches to challenge the link spammers and to maintain the integrity of the ranking algorithms. In this paper, we propose a methodology to detect link spam contributed by zero-out-link or dangling pages. We randomly selected a target page from live web pages, induced link spam according to our proposed methodology, and applied our algorithm to detect it. Detailed results on amazon.com pages showed a considerable improvement in their PageRank after the link spam was induced; our proposed method detected the link spam by using eigenvectors and eigenvalues.
... But the recent application of the Markov chain to the Google search engine is interesting and more challenging. A Markov chain is a random process [13] in which a system at any given time t = 1, 2, 3, …, n occupies one of a finite number of states. At each time t the system moves from state v to u with probability p_uv, which does not depend on t. p_uv is called the transition probability, an important feature of a Markov chain; it determines the next state of the object by considering only the current state and not any previous states. ...
Conference Paper
Full-text available
Link analysis algorithms for Web search engines determine the importance and relevance of Web pages. Among the link analysis algorithms, PageRank is the state-of-the-art ranking mechanism used in the Google search engine today. The PageRank algorithm models the behavior of a random Web surfer; this model can be seen as a Markov chain that predicts the behavior of a system traveling from one state to another by considering only the current state. However, this model has the dangling (hanging) node problem, because such nodes cannot be represented in a Markov chain model. This paper focuses on the application of the Markov chain to the PageRank algorithm and discusses a few methods to handle the dangling node problem. Experiments were run on WEBSPAM-UK2007 to show the rank results of the dangling nodes.
... Even if the data is naturally represented in a feature space, it is usually helpful to transform the data into a network, or graph structure (for example, by constructing a nearest neighbor graph) to better exploit the intrinsic characteristics of the data. Therefore, learning on networked data is receiving growing attention in recent years [9, 22, 1, 24, 6]. Most of the existing studies [19, 16, 2, 12] about information networks mainly work with homogeneous networks, i.e., networks composed of a single type of object, as mentioned above. ...
Conference Paper
It has been recently recognized that heterogeneous information networks composed of multiple types of nodes and links are prevalent in the real world. Both classification and ranking of the nodes (or data objects) in such networks are essential for network analysis. However, so far these approaches have generally been performed separately. In this paper, we combine ranking and classification in order to perform more accurate analysis of a heterogeneous information network. Our intuition is that highly ranked objects within a class should play more important roles in classification. On the other hand, class membership information is important for determining a quality ranking over a dataset. We believe it is therefore beneficial to integrate classification and ranking in a simultaneous, mutually enhancing process, and to this end, propose a novel ranking-based iterative classification framework, called RankClass. Specifically, we build a graph-based ranking model to iteratively compute the ranking distribution of the objects within each class. At each iteration, according to the current ranking results, the graph structure used in the ranking algorithm is adjusted so that the sub-network corresponding to the specific class is emphasized, while the rest of the network is weakened. As our experiments show, integrating ranking with classification not only generates more accurate classes than the state-of-art classification methods on networked data, but also provides meaningful ranking of objects within each class, serving as a more informative view of the data than traditional classification.
... However, Liu et al. [25] utilized users' browsing behaviors to calculate page authority from a continuous-time Markov process, which combines both how likely a web surfer is to reach a page and how long the web surfer stays on it. Their follow-up work [17] generalizes the page importance framework to a semi-Markov process in which how long a web surfer stays on a page can partially depend on where the surfer came from in the one-step transition. Since our work models web freshness from both how fresh a page is and how much other pages care about a particular page over time, we incorporate these two aspects into a semi-Markov process, which can model temporal web surfer behavior in a natural and adaptive way. ...
Conference Paper
Full-text available
The collective contributions of billions of users across the globe each day result in an ever-changing web. In verticals like news and real-time search, recency is an obvious significant factor for ranking. However, traditional link-based web ranking algorithms typically run on a single web snapshot without concern for user activities associated with the dynamics of web pages and links. Therefore, a stale page popular many years ago may still achieve a high authority score due to its accumulated in-links. To remedy this situation, we propose a temporal web link-based ranking scheme, which incorporates features from historical author activities. We quantify web page freshness over time from page and in-link activity, and design a web surfer model that incorporates web freshness, based on a temporal web graph composed of multiple web snapshots at different time points. It includes authority propagation among snapshots, enabling link structures at distinct time points to influence each other when estimating web page authority. Experiments on a real-world archival web corpus show our approach improves upon PageRank in both relevance and freshness of the search results.
Chapter
This chapter summarizes a few frontier research topics in complex network studies, including human opinion dynamics, human mobility, web pagerank algorithms, web recommender systems, network edge prediction schemes, cascading reactions and bionetworks.
Conference Paper
In recent years, much attention has been attracted by the problem of page authority computation based on user browsing behavior. However, the proposed methods have a number of limitations. In particular, they run on a single snapshot of a user browsing graph, ignoring the substantially dynamic nature of user browsing activity, which makes such methods recency-unaware. This paper proposes a new method for computing page importance, referred to as Fresh BrowseRank. The score of a page under our algorithm equals its weight in the stationary distribution of a flexible random walk, which is controlled by recency-sensitive weights of vertices and edges. Our method generalizes some previous approaches, provides better capability for capturing the dynamics of the Web and user behavior, and overcomes essential limitations of BrowseRank. The experimental results demonstrate that our method achieves more relevant and fresher ranking results than the classic BrowseRank.
Conference Paper
BrowseRank algorithm and its modifications are based on analyzing users' browsing trails. Our paper proposes a new method for computing page importance using a more realistic and effective search-aware model of user browsing behavior than the one used in BrowseRank.
Article
For many information retrieval applications, we need to deal with the ranking problem on very large scale graphs. However, it is non-trivial to perform efficient and effective ranking on them. On one hand, we need to design scalable algorithms. On the other hand, we also need to develop powerful computational infrastructure to support these algorithms. This tutorial aims at giving a timely introduction to the promising advances in the aforementioned aspects in recent years, and providing the audience with a comprehensive view of the related literature.
Article
Full-text available
This paper is concerned with Markov processes for computing page importance. Page importance is a key factor in Web search. Many algorithms such as PageRank and its variations have been proposed for computing the quantity in different scenarios, using different data sources, and with different assumptions. A question then arises as to whether these algorithms can be explained in a unified way, and whether there is a general guideline for designing new algorithms for new scenarios. To answer these questions, we introduce a General Markov Framework in this paper. Under the framework, a Web Markov Skeleton Process is used to model the random walk conducted by the web surfer on a given graph. Page importance is then defined as the product of two factors: page reachability, the average possibility that the surfer arrives at the page, and page utility, the value the page provides to the surfer during each visit.
Conference Paper
Full-text available
In contrast with current Web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable Web search at the object level. We collect Web information for objects relevant to a specific application domain and rank these objects in terms of their relevance and popularity to answer user queries. The traditional PageRank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. This paper introduces a link analysis model that can achieve significantly better ranking results than naively applying PageRank on the object graph.
Conference Paper
Full-text available
One of the premier applications on the global Internet is browsing the World Wide Web. The advent of advanced browser-enabled cell phones, high-speed wireless networks, and "unlimited-data" pricing plans is fueling the demand for Web access on mobile devices. Further, there is an increasing amount of content in the mobile Web, the set of web pages written in markup languages (CHTML, XHTML, and WML) designed specifically for consumption on mobile wireless devices. Understanding the structural properties of the WWW can be very helpful in a variety of applications, such as crawling the Web more efficiently, or performing better search results ranking. So far, however, this line of investigation has been limited to the Web consisting of HTML pages. In this study we examine the structural properties of the mobile Web graph inferred from a crawl of mobile markup pages. We find that the mobile Web graph differs in general from the fixed Web in several important ways. Its connectivity is sparser than the fixed Web's and its node degree distributions fall off much more rapidly. We further analyze the Web graph in terms of its bow-tie structure, which has been studied previously for the fixed Web. The properties of the bow-tie structure for the mobile Web are quite different from those of the fixed Web, such as having a smaller central core strongly connected component (SCC) and more disconnectedness. We also find that the CHTML and XHTML/WML subgraphs of the mobile Web differ significantly, indicating the influence of different usage and maturity of the mobile Web in Japan compared to other countries. We also consider the domain-level graphs, where all nodes of a domain are collapsed into a single node and all intra-domain edges are hidden, and find notable differences between the fixed and mobile graphs. To our knowledge this is the first study of the structural properties of the mobile Web graph. We briefly comment on the potential implications of the findings, focusing on crawling as an example application.
Conference Paper
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of fewer than 200 sites.
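The trust-propagation idea in this abstract can be sketched as a biased PageRank whose teleport vector is concentrated on the hand-verified seed pages. The matrix and seed choice below are illustrative only, not the paper's actual method details.

```python
import numpy as np

def trust_propagation(P, seeds, d=0.85, iters=100):
    """Sketch of seed-based trust propagation: instead of teleporting
    uniformly, the walk restarts at hand-verified good seed pages, so
    trust flows outward along links from the seed set.
    P is a row-stochastic link matrix; `seeds` are good-page indices."""
    n = P.shape[0]
    v = np.zeros(n)
    v[seeds] = 1.0 / len(seeds)   # biased restart distribution
    t = v.copy()
    for _ in range(iters):
        t = d * (t @ P) + (1 - d) * v
    return t

# Toy 4-page web; page 0 is a trusted seed.
P = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [1.0, 0.0, 0.0, 0.0]])
scores = trust_propagation(P, seeds=[0])
print(scores)
```

Pages reachable only through long link chains from the seeds receive little trust, which is what lets the scheme demote spam far from the reputable core.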
Article
A new class of stochastic processes, Markov skeleton processes, is introduced; these processes have the Markov property at a series of random times. Markov skeleton processes include minimal Q-processes, Doob processes, Q-processes of order one, semi-Markov processes [1], piecewise deterministic Markov processes [2], and the input processes, queue lengths, and waiting times of the GI/G/1 system as particular cases. First, the forward and backward equations are given, which are the criteria for regularity and the formulas to compute the multidimensional distributions of Markov skeleton processes. Then, three important cases of Markov skeleton processes are studied: the (H, G, Π)-processes, piecewise deterministic Markov skeleton processes, and Markov skeleton processes of Markov type. Finally, broad prospects for the application of Markov skeleton processes are presented.
Article
The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge, and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.
Conference Paper
This paper proposes a new method for computing page importance, referred to as BrowseRank. The conventional approach to compute page importance is to exploit the link graph of the web and to build a model based on that graph. For instance, PageRank is such an algorithm, which employs a discrete-time Markov process as the model. Unfortunately, the link graph might be incomplete and inaccurate with respect to data for determining page importance, because links can be easily added and deleted by web content creators. In this paper, we propose computing page importance by using a 'user browsing graph' created from user behavior data. In this graph, vertices represent pages and directed edges represent transitions between pages in the users' web browsing history. Furthermore, the lengths of staying time spent on the pages by users are also included. The user browsing graph is more reliable than the link graph for inferring page importance. This paper further proposes using the continuous-time Markov process on the user browsing graph as a model and computing the stationary probability distribution of the process as page importance. An efficient algorithm for this computation has also been devised. In this way, we can leverage hundreds of millions of users' implicit voting on page importance. Experimental results show that BrowseRank indeed outperforms the baseline methods such as PageRank and TrustRank in several tasks.
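A toy sketch of constructing such a user browsing graph from session logs; the `(page, seconds)` log format is a hypothetical simplification of the real behavior data, which is far richer.

```python
from collections import defaultdict

def build_browsing_graph(sessions):
    """Build a user browsing graph from behavior logs. Each session is
    a list of (page, seconds_spent) pairs; edges count observed
    transitions, and each page accumulates its staying times."""
    edge_counts = defaultdict(int)
    stay_times = defaultdict(list)
    for session in sessions:
        for (page, secs), (nxt, _) in zip(session, session[1:]):
            edge_counts[(page, nxt)] += 1
            stay_times[page].append(secs)
        last_page, last_secs = session[-1]
        stay_times[last_page].append(last_secs)
    mean_stay = {p: sum(t) / len(t) for p, t in stay_times.items()}
    return dict(edge_counts), mean_stay

sessions = [[("a", 30), ("b", 5), ("c", 120)],
            [("a", 20), ("c", 60)]]
edges, stay = build_browsing_graph(sessions)
print(edges)   # {('a', 'b'): 1, ('b', 'c'): 1, ('a', 'c'): 1}
print(stay)    # {'a': 25.0, 'b': 5.0, 'c': 90.0}
```

Normalizing the edge counts per source page would give the transition probabilities, and the mean staying times would supply the holding-time component of the continuous-time model.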