Application of Markov Chain in the PageRank
Algorithm
Ravi Kumar P
Department of Electrical and Computer
Engineering, Curtin University, Sarawak
Campus, Miri, Malaysia
ravi2266@gmail.com
Alex Goh Kwang Leng
Department of Electrical and Computer
Engineering, Curtin University, Sarawak
Campus, Miri, Malaysia
alexgoh.kwangleng@gmail.com
Ashutosh Kumar Singh
Department of Electrical and Computer
Engineering, Curtin University, Sarawak
Campus, Miri, Malaysia
ashutosh.s@curtin.edu.my
Abstract—Link analysis algorithms for Web search engines determine the importance and relevance of Web pages. Among the link analysis algorithms, PageRank is the state-of-the-art ranking mechanism used in the Google search engine today. The PageRank algorithm is modeled on the behavior of a random Web surfer; this model can be seen as a Markov chain, which predicts the behavior of a system that travels from one state to another considering only the current state. However, this model suffers from the dangling node (hanging node) problem, because such nodes cannot be represented in a Markov chain model. This paper focuses on the application of the Markov chain to the PageRank algorithm and discusses a few methods to handle the dangling node problem. An experiment is run on the WEBSPAM-UK2007 dataset to show the rank results of the dangling nodes.
Keywords—Markov Chain; Web Graph; Information Retrieval; PageRank; Transition Probability; Dangling Page.
I. INTRODUCTION
Ranking algorithms, or link analysis algorithms, determine the success of Web search engines, as they calculate the importance and relevance of individual pages on the World Wide Web. Examples of link analysis algorithms are HITS (Hyperlink Induced Topic Search), PageRank and SALSA (Stochastic Approach for Link Structure Analysis). These algorithms rely on the link structure of the Web pages. HITS [9], developed by Jon Kleinberg, is a query-dependent algorithm that calculates the authority and hub values of a page, while the SALSA [10] algorithm combines the random walk feature of PageRank with the hub and authority idea of the HITS algorithm.
In this paper, we focus on the PageRank algorithm. PageRank [1] is a query- and content-independent algorithm [12]. Query independent means that the PageRank algorithm ranks all the pages offline, after the crawler downloads and indexes them, and the rank remains constant for all the pages. Content independent means that the PageRank algorithm does not use the contents of a Web page for ranking; rather, it uses the link structure of the Web to calculate the rank. The PageRank algorithm is explained in Section III. When a user types a query term into the search engine, the PageRank algorithm simply finds the pages on the Web that match the query term and presents those pages to the user in the order of their PageRank. It looks simple, but the mathematical model behind it is remarkable. This paper explores the mathematical model behind the PageRank algorithm with experiments and results.
This paper is organized as follows. The next section describes the Markov chain and its mathematics. Section III explains the PageRank algorithm using a sample Web graph. Section IV describes how the Markov chain is applied in the PageRank algorithm. Experiments and results are shown in Section V. Section VI concludes this paper.
II. MARKOV CHAIN
A. Introduction
The Markov chain [7, 13] was invented by A. A. Markov, a Russian mathematician, in the early 1900s to predict the behavior of a system that moves from one state to another by considering only the current state. A Markov chain needs only a matrix and a vector to model and predict this behavior. Markov chains are used wherever there is a transition of states: in biology, economics, engineering, physics, etc. But the recent application of the Markov chain in the Google search engine is especially interesting and challenging.
A Markov chain is a random process [13] used by a system that at any given time t = 1, 2, 3, ..., n occupies one of a finite number of states. At each time t the system moves from state u to state v with a probability p_{uv} that does not depend on t. The quantity p_{uv} is called the transition probability; it is an important feature of the Markov chain, and it decides the next state of the system by considering only the current state and not any previous states.
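As a small illustration of this definition, the sketch below (in Python, using a hypothetical two-state chain that is not part of the paper) simulates such a process: at every step the next state is drawn using only the transition probabilities of the current state.

```python
import random

# Hypothetical 2-state chain: transition probabilities out of each state.
P = {"A": {"A": 0.7, "B": 0.3},
     "B": {"A": 0.4, "B": 0.6}}

def simulate(start, steps):
    # At every step the next state depends only on the current state,
    # never on the earlier history -- the Markov property.
    state, path = start, [start]
    for _ in range(steps):
        states, probs = zip(*P[state].items())
        state = random.choices(states, weights=probs)[0]
        path.append(state)
    return path

print(simulate("A", 10))
```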
B. Transition Matrix
The transition matrix T is an n x n matrix formed from the transition probabilities of the Markov process, where n represents the number of states. Each entry t_{uv} of the transition matrix is equal to the probability of moving from state u to state v in one time step. So, 0 ≤ t_{uv} ≤ 1 must hold for all u, v = 1, 2, ..., n. The following example shows a sample transition matrix of a 3-state Markov chain:
t_{uv} = \begin{bmatrix} 1/4 & 1/2 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/2 & 1/4 & 1/4 \end{bmatrix}
The transition matrix must have the following properties:
- The matrix must be square and nonnegative, i.e., the number of rows and columns must be equal and the entries must be nonnegative. Each row and column represents a state.
- All the entries in the matrix represent probabilities, so each entry must be between 0 and 1 inclusive.
- The sum of the entries in a row is the sum of the transition probabilities from that state to the other states, so the sum of the entries in any row must equal one. Such a matrix is called a stochastic matrix.
In the above transition matrix t_{uv}, we can easily read off the probability of moving from one state to another. For example, t_{3,2} = 1/4, i.e., the probability of moving from state 3 to state 2 is only 25%. Markov chains of this kind are used to predict the probability of an event.
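To make the matrix properties above concrete, here is a minimal Python sketch (an illustration added for this discussion, assuming NumPy; it is not code from the paper) that builds the 3-state transition matrix, checks that it is stochastic, and uses matrix powers to predict multi-step transition probabilities.

```python
import numpy as np

# The 3-state transition matrix t_uv from Section II-B
# (row u = current state, column v = next state).
T = np.array([[1/4, 1/2, 1/4],
              [1/2, 0.0, 1/2],
              [1/2, 1/4, 1/4]])

# Stochastic-matrix checks: entries in [0, 1] and every row sums to 1.
assert ((T >= 0) & (T <= 1)).all()
assert np.allclose(T.sum(axis=1), 1.0)

# k-step transition probabilities are given by the k-th matrix power,
# e.g. the probability of being in state 2 after 3 steps from state 1.
T3 = np.linalg.matrix_power(T, 3)
print(T3[0, 1])
```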
III. PAGERANK ALGORITHM
A. Web Graph
The PageRank algorithm treats the Web as a directed labeled graph [14] whose nodes are the pages and whose edges are the hyperlinks between them. This directed graph structure of the Web is called the Web graph. A graph G consists of two sets V and E. The set V is a finite, nonempty set of vertices. The set E is a set of pairs of vertices; these pairs are called edges. The notations V(G) and E(G) represent the sets of vertices and edges, respectively, of graph G. A graph can also be written as G = (V, E). The graph in Fig. 1 is a directed graph with 3 vertices and 5 edges.
Figure 1. A directed Web graph G
The vertices of G are V(G) = {1, 2, 3}. The edges of G are E(G) = {(1, 2), (2, 1), (2, 3), (1, 3), (3, 1)}. In a directed graph with n vertices, the maximum number of edges is n(n-1). With 3 vertices, the maximum number of edges is 3(3-1) = 6.
There are a number of link-based ranking algorithms [1, 9, 10]. Among them, PageRank is the most popular. The PageRank algorithm and Google were developed by Brin and Page [1] during their Ph.D. at Stanford University as a research project. The PageRank algorithm is the heart of the Google search engine. Google entered the search engine business in 1998. Soon after its introduction, it became one of the most efficient search engines because it is a query-independent and content-independent search engine. It produces results faster because it is query independent, i.e., the Web pages are downloaded, indexed and ranked offline. When a user types a query on the search engine, the PageRank algorithm just finds the pages on the Web that match the query term and presents those pages to the user in the order of their PageRank. The PageRank algorithm uses only the link structure of the Web to determine the importance of a page rather than going into the contents of a page. PageRank provides an efficient way to compute the importance of a Web page by counting the number of pages that link to it (in-coming links or backlinks). If an in-coming link comes from a reputed page, then that in-coming link is given a higher weighting than in-coming links from non-reputed pages. The PageRank PR of a page p can be computed by taking into account the set of pages pa(p) pointing to p, using the formula given by Brin and Page [8], shown in (1):
PR(p) = (1 - d) + d \sum_{q \in pa(p)} \frac{PR(q)}{O_q}    (1)

Here, d is a damping factor such that 0 < d < 1, and O_q is the number of out-going links of page q.
Let us take the example of the simple Web graph with 3 nodes 1, 2 and 3 shown in Fig. 1. The PageRank of pages 1, 2 and 3 can be calculated using (1). To start with, we assume an initial PageRank of 1.0 and do the calculation. The damping factor d is set to 0.85. The PageRank calculations of pages 1, 2 and 3 are shown in (2)-(7).
PR(1) = (1 - d) + d\left(\frac{PR(2)}{O_2} + \frac{PR(3)}{O_3}\right)    (2)

PR(1) = 0.15 + 0.85\left(\frac{1}{2} + \frac{1}{1}\right) = 1.425    (3)

PR(2) = (1 - d) + d\left(\frac{PR(1)}{O_1}\right)    (4)

PR(2) = 0.15 + 0.85\left(\frac{1.425}{2}\right) = 0.756    (5)

PR(3) = (1 - d) + d\left(\frac{PR(1)}{O_1} + \frac{PR(2)}{O_2}\right)    (6)

PR(3) = 0.15 + 0.85\left(\frac{1.425}{2} + \frac{0.756}{2}\right) = 1.077    (7)
This PageRank computation continues until the PageRank values converge. The computation is shown in the experiments and results section. Previous experiments [8, 11] show that PageRank converges to a reasonable tolerance.
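The hand calculation in (2)-(7) can be reproduced with a short sketch that simply iterates equation (1); this is an illustrative Python snippet, not the authors' implementation. With in-place (sweep) updates, the first sweep gives exactly the values 1.425, 0.756 and 1.077 of (3), (5) and (7).

```python
# Iterating equation (1) on the 3-node graph of Fig. 1
# (edges: 1->2, 1->3, 2->1, 2->3, 3->1).
in_links = {1: [2, 3], 2: [1], 3: [1, 2]}   # pa(p): pages pointing to p
out_deg  = {1: 2, 2: 2, 3: 1}               # O_q: number of outgoing links
d = 0.85
PR = {p: 1.0 for p in in_links}             # initial PageRank of 1.0

for _ in range(50):                         # iterate until the values settle
    for p in sorted(in_links):              # in-place sweep, as in (2)-(7)
        PR[p] = (1 - d) + d * sum(PR[q] / out_deg[q] for q in in_links[p])

print(PR)
```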
IV. USE OF MARKOV CHAIN IN PAGERANK ALGORITHM
The original PageRank paper by Brin et al. does not mention the Markov chain explicitly. However, other researchers, Langville et al. [3] and Bianchini et al. [4], explored the relationship between the PageRank algorithm and the Markov chain. This section explains that relationship. Imagine a random surfer surfing the Web, going from one page to another by randomly choosing an outgoing link on the current page. This can sometimes lead to dead ends, i.e., pages with no outgoing links, or to cycles around a group of interconnected pages. So, for a certain fraction of the time, the surfer chooses a random page from the Web. This theoretical random walk is known as a Markov chain or Markov process. The limiting probability that an infinitely dedicated random surfer visits any particular page is its PageRank.
The number of links to and from a page provides information about the importance of the page. The more back links (in-coming links) a page has, the more important the page is. Back links from good pages carry more weight than back links from less important pages, and if a good page points to several other pages, its weight is distributed equally among all those pages. According to Langville et al. [2], basic PageRank starts with (8), which defines the rank of a page p as

PR(p) = \sum_{q \in pa(p)} \frac{PR(q)}{O_q}    (8)

where p is a Web page, PR(p) is the PageRank of page p, pa(p) is the set of pages pointing to p, and O_q is the number of forward links of page q. Equation (8) is recursive. PageRank assigns an initial value of PR^{(0)}(p) = 1/n, where n is the total number of pages on the Web. The PageRank algorithm then iterates as in (9):
PR^{(k+1)}(p) = \sum_{q \in pa(p)} \frac{PR^{(k)}(q)}{O_q}, \quad k = 0, 1, 2, \ldots    (9)

where PR^{(k)}(p) is the PageRank of page p at iteration k.
Equation (9) can be written in matrix notation. Let q_k be the PageRank vector at the k-th iteration, and let T be the transition matrix for the Web; then, according to Langville et al. [2],

q_{k+1} = T q_k    (10)

If there are n pages on the Web, T is an n x n matrix such that t_{pq} is the probability of moving from page q to page p in one time step. Unfortunately, (10) has convergence problems, i.e., the iteration can cycle, or the limit may depend on the starting vector. To fix this problem, Brin and Page built an irreducible*, aperiodic+ Markov chain characterized by a primitive transition probability matrix.

The irreducibility guarantees the existence of a unique stationary distribution vector q, which becomes the PageRank vector. The power method with a primitive stochastic iteration matrix will always converge to q, independent of the starting vector.
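The convergence claim can be checked numerically. The following sketch (a hypothetical 3 x 3 example, not taken from the paper) runs the power method of (10) on a primitive column-stochastic matrix from two different starting vectors and reaches the same stationary vector q.

```python
import numpy as np

# A hypothetical primitive stochastic matrix: every column sums to 1 and
# all entries are strictly positive, matching the convention of (10).
T = np.array([[0.2, 0.4, 0.3],
              [0.5, 0.1, 0.3],
              [0.3, 0.5, 0.4]])

def power_method(T, q0, iterations=100):
    q = q0 / q0.sum()          # normalize the starting vector
    for _ in range(iterations):
        q = T @ q              # q_{k+1} = T q_k
    return q

print(power_method(T, np.array([1.0, 0.0, 0.0])))
print(power_method(T, np.array([0.2, 0.3, 0.5])))   # same limit vector
```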
The PageRank algorithm turns the hyperlink structure of the Web into a primitive stochastic matrix as follows. If there are n pages on the Web, let T be an n x n matrix whose element t_{qp} is the probability of moving from page q to page p in one step. (In the worked examples below the row index is the source page, so this T is row stochastic, i.e., the transpose of the matrix used in (10).) The basic model takes t_{qp} = 1/|O_q|, where O_q is the set of forward links of page q; all forward links are normally chosen with equal probability, as in (11):

t_{qp} = \begin{cases} 1/|O_q| & \text{if there is a link from } q \text{ to } p \\ 0 & \text{otherwise} \end{cases}    (11)
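Equation (11) is straightforward to implement. The sketch below (a toy four-page link structure invented for illustration; page "d" is dangling) builds the row-stochastic matrix T from a forward-link list.

```python
import numpy as np

# Building the transition matrix of (11) from a forward-link list.
# The 4-page link structure below is a hypothetical toy example.
links = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
pages = sorted(links)
idx = {p: i for i, p in enumerate(pages)}

n = len(pages)
T = np.zeros((n, n))
for q, outs in links.items():
    for p in outs:
        T[idx[q], idx[p]] = 1.0 / len(outs)   # t_qp = 1/|O_q| if q links to p

print(T)               # the row of the dangling page "d" is all zeros
print(T.sum(axis=1))   # rows of non-dangling pages sum to 1, the dangling row to 0
```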
Fig. 2 shows a sample Web graph extracted from a university site. It contains 7 pages, namely Home, Admin, Staff, Student, Library, Dept and Alumni. We use this sample Web graph in our Markov analysis and PageRank calculation. In the matrices that follow, the pages are numbered 1 = Staff, 2 = Student, 3 = Alumni, 4 = Library, 5 = Home, 6 = Admin and 7 = Dept.
Figure 2 A sample Web Graph W of a University
---------------------------------------------------------------------------
* A Markov chain is irreducible if there is a non-zero probability of transitioning from any state to any other state.
+ An irreducible Markov chain with a primitive transition matrix is called an aperiodic chain.
A. Transition Matrix
The transition matrix T can be produced by applying (11) to our sample Web graph in Fig. 2:
T = \begin{bmatrix}
0 & 1/3 & 0 & 1/3 & 1/3 & 0 & 0 \\
0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
1/6 & 1/6 & 1/6 & 1/6 & 0 & 1/6 & 1/6 \\
0 & 0 & 1/3 & 0 & 1/3 & 0 & 1/3 \\
0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0
\end{bmatrix}
In the transition matrix T, row q has non-zero elements in the positions that correspond to the forward links of page q, and column p has non-zero elements in the positions that correspond to the back links of page p. If page q has forward links, the sum of row q is equal to 1.
If the sum of any row in the transition matrix is zero, there is a page with no forward links. This type of page is called a dangling node or hanging node. Dangling nodes cannot be present in the Web graph if it is to be represented using a Markov model. There are a couple of methods to eliminate this dangling page problem; they are discussed using the transition matrix below.

Langville et al. [3] proposed handling dangling pages by replacing each such row with e/n, where e is a row vector of all ones and n is the order of the matrix. In our example, the value of n is 7. Using this proposal, the sample Web graph in Fig. 2 is modified as shown in Fig. 3.
Figure 3 Modified Web Graph W using [3]
The new forward links from the Alumni page are shown with dotted arrows. This makes the transition matrix T stochastic, as shown below:
T = \begin{bmatrix}
0 & 1/3 & 0 & 1/3 & 1/3 & 0 & 0 \\
0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 \\
1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
1/6 & 1/6 & 1/6 & 1/6 & 0 & 1/6 & 1/6 \\
0 & 0 & 1/3 & 0 & 1/3 & 0 & 1/3 \\
0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0
\end{bmatrix}
Row 3 of this transition matrix (the Alumni page) is now connected to all the nodes, including itself (shown by the dotted lines in Fig. 3).
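A sketch of this adjustment, assuming the transition matrix is held as a NumPy array (the 3-page matrix used below is a toy example, not the paper's graph):

```python
import numpy as np

# Langville et al.'s fix for dangling pages: replace every all-zero
# row of the transition matrix with e/n, the uniform row vector.
def patch_dangling_rows(T):
    T = T.copy()
    n = T.shape[0]
    dangling = T.sum(axis=1) == 0      # pages with no forward links
    T[dangling, :] = 1.0 / n           # e/n in place of each dangling row
    return T

# Small illustration: a 3-page matrix whose last row is dangling.
T = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
T_fixed = patch_dangling_rows(T)
print(T_fixed)                 # the dangling row becomes [1/3, 1/3, 1/3]
print(T_fixed.sum(axis=1))     # every row now sums to 1
```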
There is another proposal, from Bianchini et al. [4] and Singh et al. [5], to add a hypothetical node h_i with a self loop and to connect all the dangling nodes to this hypothetical node, as shown in Fig. 4. This method also makes the transition matrix stochastic.
Figure 4 Modified Web Graph W using [5]
In Fig. 4, h_i is the hypothetical node with a self loop (shown as a blue dotted line), and the Alumni page is connected to it (shown as a red dotted line). The transition matrix for the modified graph is shown below:
T = \begin{bmatrix}
0 & 1/3 & 0 & 1/3 & 1/3 & 0 & 0 & 0 \\
0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1/6 & 1/6 & 1/6 & 1/6 & 0 & 1/6 & 1/6 & 0 \\
0 & 0 & 1/3 & 0 & 1/3 & 0 & 1/3 & 0 \\
0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
The last row and last column of the above transition matrix T correspond to the hypothetical node h_i. The transition probability for the Alumni page in the modified graph in Fig. 4 is 1, so the Alumni page is no longer a dangling page. Similarly, for the hypothetical node h_i the transition probability is 1 because of the self loop. The Web graph in Fig. 4 is therefore also stochastic.
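The hypothetical-node adjustment of [4, 5] can be sketched in the same style; the construction below (an illustration under the stated assumptions, not the authors' code) appends one extra state with a self loop and routes every dangling page to it.

```python
import numpy as np

# Hypothetical-node fix: append one extra node h with a self loop and
# point every dangling page to h with probability 1. The enlarged
# (n+1) x (n+1) matrix is stochastic.
def add_hypothetical_node(T):
    n = T.shape[0]
    dangling = T.sum(axis=1) == 0
    T_big = np.zeros((n + 1, n + 1))
    T_big[:n, :n] = T
    T_big[:n, n] = dangling.astype(float)   # dangling pages -> h
    T_big[n, n] = 1.0                       # self loop on h
    return T_big

T = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])              # page 3 is dangling
print(add_hypothetical_node(T).sum(axis=1))  # all rows now sum to 1
```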
The stochastic property alone is not enough to guarantee that the Markov model will converge and that a steady state vector exists. Another problem is that this transition matrix T may not be regular; the general nature of the Web makes T non-regular. For irreducibility, every node in the graph needs to be reachable from every other node. But on the real Web, every page is not connected to every other page, i.e., the Web graph is not strongly connected. Brin et al. [1] forced all the entries in the transition matrix to be strictly between 0 and 1 to make it regular. This ensures convergence of the iterates q_k to a unique, positive steady state vector.
B. Google Matrix
According to Langville et al. [6], Brin and Page added a perturbation matrix E = ee^T/n to make this stochastic matrix irreducible, giving the Google matrix shown in (12):

\bar{T} = \alpha T + (1 - \alpha) E    (12)
Here α is between 0 and 1. Google believed that this new matrix \bar{T} better models the real-life surfer. A real-life surfer has a probability of 1-α of jumping to a random page on the Web, i.e., by typing a URL into the address bar of a browser, and a probability of α of clicking a forward link on the current page. Many researchers [1, 3, 4] report that the value of α used by the PageRank algorithm of Google is 0.85.
We calculate the Google matrix of (12) for the sample Web graph W using a value of 0.85 for α; the computation and the resulting matrix \bar{T} are shown below.
This matrix can be driven to a stationary vector by computing successive powers of the transition matrix; at some stage of the calculation, the values of the matrix become stationary.
\bar{T} = 0.85 \begin{bmatrix}
0 & 1/3 & 0 & 1/3 & 1/3 & 0 & 0 \\
0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 \\
1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
1/6 & 1/6 & 1/6 & 1/6 & 0 & 1/6 & 1/6 \\
0 & 0 & 1/3 & 0 & 1/3 & 0 & 1/3 \\
0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0
\end{bmatrix}
+ 0.15 \begin{bmatrix}
1/7 & \cdots & 1/7 \\
\vdots & & \vdots \\
1/7 & \cdots & 1/7
\end{bmatrix}

= \begin{bmatrix}
0.021 & 0.304 & 0.021 & 0.304 & 0.304 & 0.021 & 0.021 \\
0.021 & 0.021 & 0.304 & 0.304 & 0.304 & 0.021 & 0.021 \\
0.142 & 0.142 & 0.142 & 0.142 & 0.142 & 0.142 & 0.142 \\
0.021 & 0.021 & 0.021 & 0.021 & 0.871 & 0.021 & 0.021 \\
0.163 & 0.163 & 0.163 & 0.163 & 0.021 & 0.163 & 0.163 \\
0.021 & 0.021 & 0.304 & 0.021 & 0.304 & 0.021 & 0.304 \\
0.021 & 0.021 & 0.021 & 0.304 & 0.304 & 0.304 & 0.021
\end{bmatrix}
The stationary values are the PageRank scores for the 7 pages of the sample Web graph W. Assume that after the 30th iteration the stationary vector for our sample 7-page Web graph is

s = (0.043, 0.334, 0.047, 0.341, 0.441, 0.041, 0.041)
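A compact sketch of this computation follows: it forms the Google matrix of (12) from a stochastic matrix and runs a power iteration to obtain the stationary vector. The 3-page input matrix is a stand-in for the 7 x 7 matrix above, so the printed numbers are illustrative only.

```python
import numpy as np

# Google matrix of (12): alpha * T + (1 - alpha) * E with E = e e^T / n.
def google_matrix(T, alpha=0.85):
    n = T.shape[0]
    return alpha * T + (1 - alpha) * np.ones((n, n)) / n

# Power iteration on a row-stochastic matrix: s_{k+1} = s_k * T_bar.
def stationary_vector(T_bar, iterations=50):
    n = T_bar.shape[0]
    s = np.ones(n) / n                 # uniform starting vector
    for _ in range(iterations):
        s = s @ T_bar
    return s

# Stand-in 3-page stochastic matrix (not the 7-page graph of Fig. 2).
T = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1/3, 1/3, 1/3]])
s = stationary_vector(google_matrix(T))
print(s, s.sum())                      # s sums to 1; it is the PageRank vector
```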
C. PageRank Interpretation of Stationary Vector
PageRank interprets the stationary vector in the following way. Suppose a user enters a query in the Google search window asking for word 1 and word 2. The search engine then looks up word 1 and word 2 in the inverted index database. This database contains the list of all words or terms and, for each, the list of documents that contain it [6]. Assume the document lists stored in the inverted index database for word 1 and word 2 are as shown in Table I.
TABLE I. INVERTED INDEX DOCUMENT LIST

Query Word/Term    Document List
Word 1             Document 2, Document 4 & Document 7
Word 2             Document 1 & Document 7
So, the relevancy set for the user's query terms word 1 and word 2 is {1, 2, 4, 7}. The PageRank values of these 4 documents are compared to find the order of importance. In our sample 7-page Web, 1 is the Staff page, 2 is the Student page, 4 is the Library page and 7 is the Dept page. The respective PageRank scores are s_1 = 0.043, s_2 = 0.334, s_4 = 0.341 and s_7 = 0.041. The PageRank algorithm therefore treats document 4 (Library) as the most relevant to the given query, followed by document 2 (Student), document 1 (Staff) and document 7 (Dept). When a user types a new query term, the inverted index database is accessed again and a new relevancy set is created. This is how the PageRank algorithm works in the Google search engine.
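The query step just described can be sketched as follows; the inverted index mirrors Table I and the scores are taken from the stationary vector s above (the function and variable names are ours, not Google's).

```python
# Look up each query term in the inverted index, take the union of the
# posting lists (the relevancy set), and present the documents in
# decreasing order of their PageRank scores.
inverted_index = {"word1": {2, 4, 7}, "word2": {1, 7}}
pagerank = {1: 0.043, 2: 0.334, 3: 0.047, 4: 0.341,
            5: 0.441, 6: 0.041, 7: 0.041}

def rank_results(query_terms):
    relevancy_set = set().union(*(inverted_index[t] for t in query_terms))
    return sorted(relevancy_set, key=lambda doc: pagerank[doc], reverse=True)

print(rank_results(["word1", "word2"]))   # -> [4, 2, 1, 7]
```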
V. EXPERIMENTAL RESULTS
The dataset used in this experiment is WEBSPAM-UK2007, provided by the Laboratory of Web Algorithmics, Università degli Studi di Milano, with the support of the DELIS EU-FET research project [15]. The collection contains 114529 hosts, of which 49379 are dangling hosts. The distribution of hosts is shown below:
Label                 No. of Hosts
Dangling Hosts        49379
Non-dangling Hosts    65150
Total                 114529

Figure 5. Distribution of Web hosts
The algorithm in [5] is implemented on this dataset and the results for the dangling hosts are shown, from the first dangling host to the last. With the hypothetical node, the Web graph is stochastic. Fig. 6 shows the rank results of the dangling hosts in ascending order. The results are calculated with a damping factor α of 0.85 and 50 iterations.

Figure 6. The rank results of the dangling hosts
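The experiment can be outlined with the following sketch, which applies the hypothetical-node construction to a host graph, runs 50 power iterations with α = 0.85, and reports the scores of the dangling hosts in ascending order. The random 20-host graph is a stand-in for the WEBSPAM-UK2007 host graph, so the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
A = (rng.random((20, 20)) < 0.15).astype(float)   # random host-level adjacency
np.fill_diagonal(A, 0)
A[rng.choice(20, size=5, replace=False), :] = 0   # force five dangling hosts

outdeg = A.sum(axis=1)
dangling = outdeg == 0
T = np.divide(A, outdeg[:, None], out=np.zeros_like(A),
              where=outdeg[:, None] > 0)          # (11); dangling rows stay zero

n = T.shape[0]
T_big = np.zeros((n + 1, n + 1))                  # add the hypothetical node h
T_big[:n, :n] = T
T_big[:n, n] = dangling.astype(float)             # dangling hosts -> h
T_big[n, n] = 1.0                                 # self loop on h

alpha = 0.85
G = alpha * T_big + (1 - alpha) / (n + 1)         # Google-style matrix of (12)
s = np.ones(n + 1) / (n + 1)
for _ in range(50):                               # 50 power iterations
    s = s @ G

print(np.sort(s[:n][dangling]))                   # scores of the dangling hosts
```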
VI. CONCLUSION
This paper started with an introduction to the Markov chain and the PageRank algorithm. The mathematics behind the PageRank algorithm was then explained theoretically. The paper also clarifies how the PageRank algorithm uses the Markov chain and the transition matrix to rank the relevancy set. It highlights the different adjustments made to turn the Web graph into a Markov model; in particular, the dangling node problem and the methods to handle dangling nodes were discussed and the mathematical solutions were given. A Markov model was created for a sample Web graph and the PageRank calculation was shown for this Markov model. We implemented the PageRank algorithm to support our mathematical model and have shown the results.
REFERENCES
[1] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Computer Networks and ISDN Systems, Vol. 30, Issue 1-7, pp. 107-117, 1998.
[2] A. N. Langville and C. D. Meyer, "The Use of the Linear Algebra by Web Search Engines", Bulletin of the International Linear Algebra Society, No. 23, December 2004.
[3] A. N. Langville and C. D. Meyer, "Deeper Inside PageRank", Internet Mathematics, Vol. 1, Issue 3, pp. 335-380, 2004.
[4] M. Bianchini, M. Gori and F. Scarselli, "Inside PageRank", ACM Transactions on Internet Technology, 2005.
[5] A. K. Singh, P. R. Kumar and G. K. L. Alex, "Efficient Algorithm to Handle Dangling Pages Using Hypothetical Node", Proc. of the 6th IDC2010, Seoul, Korea, 2010.
[6] A. N. Langville and C. D. Meyer, "A Survey of Eigenvector Methods for Web Information Retrieval", SIAM Review, Vol. 47, No. 1, pp. 135-161, 2005.
[7] R. Norris, "Markov Chains", Cambridge University Press, 1996.
[8] L. Page, S. Brin, R. Motwani and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Technical Report, Stanford Digital Libraries, SIDL-WP-1999-0120, 1999.
[9] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM, Vol. 46, No. 5, pp. 604-632, 1999.
[10] R. Lempel and S. Moran, "The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect", in Proceedings of the 9th World Wide Web Conference (WWW9), Elsevier Science, pp. 387-401, 2000.
[11] C. Ridings and M. Shishigin, "PageRank Uncovered", Technical Report, 2002.
[12] A. Borodin, G. O. Roberts, J. S. Rosenthal and P. Tsaparas, "Link Analysis Ranking: Algorithms, Theory and Experiments", ACM Transactions on Internet Technology, Vol. 5, No. 1, pp. 231-297, 2005.
[13] B. Gao, T. Y. Liu, Z. Ma, T. Wang and H. Li, "A General Markov Framework for Page Importance Computation", in Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009.
[14] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins and J. Wiener, "Graph Structure in the Web", Computer Networks: The International Journal of Computer and Telecommunications Networking, Vol. 33, Issue 1-6, pp. 309-320, 2000.
[15] Yahoo! Research, "Web Spam Collections", http://barcelona.research.yahoo.net/webspam/datasets/. Crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/. URLs retrieved July 2011.