Diversified Ranking on Large Graphs:
An Optimization Viewpoint
Hanghang Tong, Jingrui He, Zhen Wen, Ravi Konuru, Ching-Yung Lin
IBM T.J. Watson Research Center
Hawthorne, NY, USA
{htong, jingruhe, zhenwen, rkonuru, chingyung}@us.ibm.com
ABSTRACT
Diversified ranking on graphs is a fundamental mining task and has
a variety of high-impact applications. There are two important open
questions here. The first challenge is the measure - how to quan-
tify the goodness of a given top-k ranking list that captures both
the relevance and the diversity? The second challenge lies in the
algorithmic aspect - how to find an optimal, or near-optimal, top-k
ranking list that maximizes the measure we defined in a scalable
way?
In this paper, we address these challenges from an optimization
point of view. Firstly, we propose a goodness measure for a given
top-k ranking list. The proposed goodness measure intuitively cap-
tures both (a) the relevance between each individual node in the
ranking list and the query; and (b) the diversity among different
nodes in the ranking list. Moreover, we propose a scalable algo-
rithm (linear wrt the size of the graph) that generates a provably
near-optimal solution. The experimental evaluations on real graphs
demonstrate its effectiveness and efficiency.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – Data
Mining
General Terms
Algorithm, experimentation
Keywords
Diversity, ranking, scalability, graph mining
1. INTRODUCTION
Given an author-paper network, how to find the top-k most re-
lated conferences for a given author? How to diversify the ranking
list so that it captures the whole spectrum of the given author’s
research interest? It is now widely realized that diversity is a key
factor in addressing the uncertainty and ambiguity in an information
need, and in covering the different aspects of that need [32].
Diversity is also positively associated with personnel performance
and job retention rate in a large organization [38].
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
KDD’11, August 21–24, 2011, San Diego, CA, USA.
Copyright 2011 ACM 978-1-60558-193-4/08/088 ...$5.00.
Despite the success of previous work (see Section 6
for a review), two important questions remain open in diversified
ranking on large graphs. The first challenge is the measure - for a
given top-k ranking list, how can we quantify its goodness? Intu-
itively, a good top-k ranking list should capture both the relevance
and the diversity. For example, given a task which typically re-
quires a set of different skills, if we want to form a team of experts,
not only should the people in the team have relevant skills, but
also they should somehow be ‘different’ from each other so that
the whole team can benefit from the diversified, complementary
knowledge and social capital. However, there does not exist such a
goodness measure for the graph data in the literature. Most of the
existing works for diversified ranking on graphs are based on some
heuristics. The only exception is [27], where the authors made an
important step towards this goal by providing some optimization
explanations, which is achieved by defining a time-varying objec-
tive function at each iteration. But still, it is not clear what overall
objective function the algorithm tries to optimize.
The second challenge lies in the algorithmic aspect - how can we
find an optimal, or near-optimal, top-k ranking list that maximizes
the goodness measure? Bringing diversity into the design objective
implies that we need to optimize on the set level. In other words, the
objective function for a subset of nodes is usually not equal to the
sum of the objective functions of the individual nodes. It is usually
very hard to perform such set-level optimization. For instance, a
straightforward method would need exponential enumerations to
find the exact optimal solution, which is infeasible even for
medium-size graphs. This, together with the fact that real graphs
are often large, reaching billions of nodes and edges, poses the
challenge for the optimization algorithm: how can we find a
near-optimal solution in a scalable way?
In this paper, we address these challenges from an optimization
point of view. We propose a goodness measure which intuitively
captures both (a) the relevance between each individual node in
the ranking list and the query node; and (b) the diversity among
different nodes in the ranking list. We further propose a scalable
algorithm (linear wrt the size of the graph) that generates a prov-
ably near-optimal top-k ranking list. To the best of our knowledge,
this is the first work for diversified ranking on large graphs that
(1) has a clear optimization formulation; (2) finds a provably near-
optimal solution; and (3) enjoys linear scalability. The main
contributions of the paper are summarized as follows:
• A measure to quantify goodness for a top-k ranking list that
captures both relevance and diversity;
• An algorithm to find a diversified top-k ranking list from
large graphs;
• Proofs and complexity analysis, showing that our method is
provably near-optimal in terms of optimization quality with
linear scalability;
• Extensive experimental evaluations, demonstrating the
effectiveness and efficiency of our method.

Table 1: Symbols
Symbol       Definition and Description
A, B, ...    matrices (bold upper case)
A(i,j)       the element at the i-th row and j-th column of A
A(i,:)       the i-th row of matrix A
A(:,j)       the j-th column of matrix A
A′           transpose of matrix A
a, b, ...    vectors
I, J, ...    sets (calligraphic)
⊗            element-wise Hadamard product
r            an n × 1 ranking vector
p            an n × 1 query vector (p(i) ≥ 0, $\sum_{i=1}^{n} p(i) = 1$)
I            an identity matrix
1            a vector/matrix with all elements set to 1s
0            a vector/matrix with all elements set to 0s
n, m         the number of nodes and edges in the graph
k            the budget (i.e., the length of the ranking list)
c            the damping factor, 0 < c < 1

The rest of the paper is organized as follows. We introduce no-
tation and formally define the problems in Section 2. We present
and analyze the proposed measure and algorithm in Section 3 and
Section 4, respectively. We provide experimental evaluation in Section 5.
We review the related work in Section 6 and conclude in
Section 7.
2. PROBLEM DEFINITIONS
Table 1 lists the main symbols we use throughout the paper. In
this paper, we consider the most general case of directed, weighted,
irreducible unipartite graphs. We represent a general graph by its
adjacency matrix¹. Following the standard notation, we use bold
upper-case for matrices (e.g., A), bold lower-case for vectors (e.g.,
a), and calligraphic fonts for sets (e.g., I). We denote the transpose
with a prime (i.e., A′ is the transpose of A). For a bipartite graph
with adjacency matrix W, we can convert it to the equivalent
unipartite graph:
$$A = \begin{pmatrix} 0 & W \\ W' & 0 \end{pmatrix}$$
We use subscripts to denote the size of matrices/vectors (e.g.,
A_{n×n} means a matrix of size n × n). When the sizes of
matrices/vectors are clear from the context, we omit such subscripts
for brevity. Also, we represent the elements in a matrix using a
convention similar to Matlab, e.g., A(i,j) is the element at the
i-th row and j-th column of the matrix A, and A(:,j) is the j-th
column of A, etc. With this notation, we can represent a sub-matrix
of A as A(I,I), which is a block of matrix A that corresponds to the
rows/columns of A indexed by the set I.
¹ In practice, we store these matrices using an adjacency list
representation, since real graphs are often very sparse.
In this paper, we focus on personalized PageRank [30, 11] since
it is one of the most fundamental ranking methods on graphs, and
has shown its success in many different application domains in the
past decade. Formally, it can be defined as follows:
$$r = cA'r + (1 - c)p \tag{1}$$
where p is an n × 1 personalized vector (p(i) ≥ 0,
$\sum_{i=1}^{n} p(i) = 1$). Sometimes, we also refer to p as the
query vector. c (0 < c < 1) is a damping factor; A is the
row-normalized adjacency matrix of the graph (i.e.,
$\sum_{j=1}^{n} A(i,j) = 1$ for i = 1,...,n); and r is the n × 1
resulting ranking vector. Note that if p(i) = 1/n (i = 1,...,n), it
is reduced to the standard PageRank [30]; if p(i) = 1 and p(j) = 0
(j ≠ i), the resulting ranking vector r gives the proximity scores
from node i to all the other nodes in the graph [37].
In order to simplify the description of our upcoming method, we
also introduce the so-called ‘Google matrix’ B:
$$B = cA' + (1 - c)\,p\,\mathbf{1}_{1 \times n} \tag{2}$$
where 1_{1×n} is a 1 × n row vector with all elements set to 1s.
Intuitively, the ‘Google matrix’ B can be viewed as the personalized
adjacency matrix that is biased towards the query vector p. It turns
out that the ranking vector r defined in eq. (1) satisfies r = Br. In
other words, the ranking vector r is the right eigenvector of the B
matrix with the eigenvalue 1. It can be verified that B is a
column-wise stochastic matrix (i.e., each column of B sums up to 1).
By the Perron-Frobenius theorem [10], it can be shown that 1 is the
largest (in modulus) simple eigenvalue of the matrix B; and the
ranking vector r is unique with all non-negative elements since the
graph is irreducible.
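
The relationship between eq. (1) and eq. (2) can be checked numerically on a small example. The following is a minimal sketch in plain Python, assuming a hypothetical 5-node weighted graph stored as a dense list of lists (a real graph would use a sparse representation); the helper names `row_normalize`, `personalized_pagerank`, and `google_matrix` are illustrative and not from the paper. It computes r by power iteration and then verifies that each column of B sums to 1 and that r = Br.

```python
# Toy check of eq. (1) and eq. (2); the graph W below is a made-up example.

def row_normalize(W):
    """Divide every row of a dense non-negative matrix by its row sum."""
    return [[w / sum(row) for w in row] for row in W]

def personalized_pagerank(A, p, c, iters=300):
    """Power iteration for r = c * A' r + (1 - c) * p (A is row-normalized)."""
    n = len(A)
    r = list(p)
    for _ in range(iters):
        # (A' r)(i) = sum_j A(j, i) * r(j)
        r = [c * sum(A[j][i] * r[j] for j in range(n)) + (1 - c) * p[i]
             for i in range(n)]
    return r

def google_matrix(A, p, c):
    """B(i, j) = c * A(j, i) + (1 - c) * p(i), i.e. B = c A' + (1 - c) p 1'."""
    n = len(A)
    return [[c * A[j][i] + (1 - c) * p[i] for j in range(n)] for i in range(n)]

W = [[0, 3, 3, 0, 0],
     [3, 0, 3, 0, 0],
     [3, 3, 0, 1, 0],
     [0, 0, 1, 0, 2],
     [0, 0, 0, 2, 0]]
A = row_normalize(W)
p = [1.0, 0.0, 0.0, 0.0, 0.0]   # the query is node 0
c = 0.85                        # damping factor (an arbitrary choice here)

r = personalized_pagerank(A, p, c)
B = google_matrix(A, p, c)
n = len(A)
col_sums = [sum(B[i][j] for i in range(n)) for j in range(n)]
Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]

print("r =", [round(x, 4) for x in r])
print("column sums of B:", [round(s, 12) for s in col_sums])
print("max |Br - r| =", max(abs(a - b) for a, b in zip(Br, r)))
```

The identity r = Br relies on the entries of r summing to 1, which the iteration preserves because it starts from p and A′ is column-stochastic.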
Our goal is two-fold: (1) we want a goodness measure to quan-
tify the quality of a given top-k ranking list that captures both the
relevance and the diversity; and (2) given the goodness measure, we
want an optimal or near-optimal algorithm to find a top-k ranking
list that maximizes such goodness measure in a scalable way. With
the above notations and assumptions, our problems can be formally
defined as follows:
PROBLEM 1. (Goodness Measure.)
Given: A large graph An×n, the query vector p, the damping fac-
tor c, and a subset of k nodes S;
Output: A goodness score f(S) of the subset of nodes S, which
measures (a) the relevance of each node in S wrt the query
vector p, and (b) the diversity among all the nodes in the
subset S.
PROBLEM 2. (Diversified Top-k Ranking Algorithm.)
Given: A large graph An×n, the query vector p, the damping fac-
tor c, and the budget k;
Find: A subset of k nodes S that maximizes the goodness measure
f(S).
In the next two sections, we present our solutions for these two
problems respectively.
3. THE PROPOSED GOODNESS MEASURE
In this section, we address Problem 1. Our goal is to define a
goodness measure to quantify the quality of a given top-k ranking
list that captures both the relevance and the diversity. We first dis-
cuss some design objectives of such a goodness measure; and then
present our solution followed by some theoretical analysis.
3.1 Design Objectives
As discussed above, a good diversified top-k ranking list should
balance relevance and diversity. The notion of relevance is clear
for personalized PageRank: a larger value in the ranking vector r
means more relevant wrt the query vector p. On the other hand, the
notion of diversity is more challenging. Intuitively, the nodes in a
diversified subset should be dis-similar to each other. Take the query
‘Find the top-k conferences for Philip Yu from the author-conference
network’ as an example. Dr. Philip Yu is a professor at University of
Illinois at Chicago. His recent major research interest lies in
databases and data mining. He also has
broad interests in several related domains, including systems, par-
allel and distributed processing, web applications, and performance
modeling, etc. A top-k ranking list for this query would have high
relevance if it consists of all the conferences from databases and
data mining community (e.g., SIGMOD, VLDB, KDD, etc) since
all these conferences are closely related to his major research inter-
est. However, such a list has low diversity since these conferences
are too similar with each other (e.g., having a large overlap of con-
tributing authors, etc). Therefore, if we replace a few databases
and data mining conferences by some representative conferences
in his other research domains (e.g., ICDCS for distributed comput-
ing systems, WWW for web applications, etc), it would make the
whole ranking list more diverse (e.g., the conferences in the list are
more dis-similar with each other).
Furthermore, if we go through the ranking list from top down,
we would like to see the most relevant conferences appear first
in the ranking list. For example, a ranking list in the order of
‘SIGMOD’, ‘ICDCS’, ‘WWW’ is better than ‘ICDCS’, ‘WWW’, ‘SIGMOD’
since databases (SIGMOD) is a more relevant research interest for
Dr. Philip Yu, compared with distributed computing systems (ICDCS),
or web applications (WWW). In this way, the user can capture Dr.
Philip Yu’s main research interest by just inspecting a few top-
ranked conferences/nodes. This suggests the so-called diminishing
returns property of the goodness measure - it would help the user
to know better about Dr. Philip Yu’s whole research interest if we
return more conferences/nodes in the ranking list; but the marginal
benefit becomes smaller and smaller as we go down the ranking list.
Another implicit design objective lies in the algorithmic aspect.
The proposed goodness measure should also allow us to develop
an effective and scalable algorithm to find an optimal (or at least
near-optimal) top-k ranking list from large graphs. We will discuss
and address this issue in the next section.
To summarize, for a given top-k ranking list, we aim to provide a
single goodness score that (1) measures the relevance between each
individual node in the list and the query vector p; (2) measures the
similarity (or dis-similarity) among all the nodes in the ranking list;
(3) exhibits some diminishing returns property wrt the size of the
ranking list; and (4) enables some effective and scalable algorithm
to find an optimal (or near-optimal) top-k ranking list.
3.2 The Proposed Measure
Let A be the row-normalized adjacency matrix of the graph, B
be the ‘Google matrix’ defined in eq (2), p be the personalized
vector and r be the ranking vector. For a given ranking list S (i.e.,
S gives the indices of the nodes in the ranking list; and |S| = k.),
the proposed goodness measure is formally defined as follows:
Goodness Measure:
$$f(\mathcal{S}) = 2\sum_{i \in \mathcal{S}} r(i) - \sum_{i,j \in \mathcal{S}} B(i,j)\, r(j) \tag{3}$$
We can also represent f(S) by using the matrix A instead:
$$f(\mathcal{S}) = 2\sum_{i \in \mathcal{S}} r(i) - c\sum_{i,j \in \mathcal{S}} A(j,i)\, r(j) - (1 - c)\sum_{j \in \mathcal{S}} r(j) \sum_{i \in \mathcal{S}} p(i)$$
where c is the damping factor in personalized PageRank, and the inner
sum over p can equivalently be written as 1_{1×|S|} p(S), with
1_{1×|S|} a row vector of length |S| with all the elements set to 1s.
It can be shown that it is equivalent to eq. (3), since
B(i,j) = cA(j,i) + (1 − c)p(i).
Notice that the goodness measure in eq (3) is independent of the
ordering of the different nodes in the subset S. If we simply change
the ordering of the nodes for the same subset S, it does not affect
the goodness score. However, as we will show in Section 4, we can
still output an ordered subset based on the diminishing returns need
when the user is seeking a diverse top-k ranking list.
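
To make the two forms of the goodness measure concrete, here is a minimal sketch in plain Python that evaluates f(S) once through B (eq. (3)) and once through A, on a hypothetical 5-node graph. The function names `goodness_via_B` and `goodness_via_A`, the toy graph, and the damping factor value are assumptions made for illustration only.

```python
# Evaluate the goodness measure in both forms; the toy graph is made up.

def row_normalize(W):
    """Divide every row of a dense non-negative matrix by its row sum."""
    return [[w / sum(row) for w in row] for row in W]

def personalized_pagerank(A, p, c, iters=300):
    """Power iteration for r = c * A' r + (1 - c) * p."""
    n = len(A)
    r = list(p)
    for _ in range(iters):
        r = [c * sum(A[j][i] * r[j] for j in range(n)) + (1 - c) * p[i]
             for i in range(n)]
    return r

def goodness_via_B(S, r, A, p, c):
    """Eq. (3), with B(i, j) = c * A(j, i) + (1 - c) * p(i) built on the fly."""
    S = list(S)
    relevance = 2.0 * sum(r[i] for i in S)
    similarity = sum((c * A[j][i] + (1 - c) * p[i]) * r[j]
                     for i in S for j in S)
    return relevance - similarity

def goodness_via_A(S, r, A, p, c):
    """The same measure written with A and p as separate terms."""
    S = list(S)
    relevance = 2.0 * sum(r[i] for i in S)
    link_term = c * sum(A[j][i] * r[j] for i in S for j in S)
    query_term = (1 - c) * sum(r[j] for j in S) * sum(p[i] for i in S)
    return relevance - link_term - query_term

W = [[0, 3, 3, 0, 0],
     [3, 0, 3, 0, 0],
     [3, 3, 0, 1, 0],
     [0, 0, 1, 0, 2],
     [0, 0, 0, 2, 0]]
A = row_normalize(W)
p = [1.0, 0.0, 0.0, 0.0, 0.0]
c = 0.85
r = personalized_pagerank(A, p, c)

for S in ([0, 1], [0, 3], [3, 0]):
    print(S, goodness_via_B(S, r, A, p, c), goodness_via_A(S, r, A, p, c))
```

Both functions sum over unordered index pairs, so permuting S changes the result only by floating-point rounding, which mirrors the order-independence noted above.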
3.3 Proofs and Analysis
Let us analyze how the proposed goodness measure meets our
design objectives in subsection 3.1.
There are two terms in eq (3). The first term is twice the sum
of the ranking scores in the ranking list. For the second term, re-
call that B can be viewed as the personalized adjacency matrix wrt
the query vector p, where B(i,j) indicates the similarity (i.e., the
strength of the connection) between nodes i and j. In other words,
the second term in eq (3) is the sum of all the similarity scores be-
tween any two nodes i,j(i,j ∈ S) in the ranking list (weighted
by r(j)). Therefore, the proposed goodness measure captures both
the relevance and the diversity. The more relevant (higher r(i))
each individual node is, the higher the goodness measure f(S). At
the same time, it encourages the diversity within the ranking list by
penalizing the (weighted) similarity between any two nodes in S.
The proposed measure f(S) also exhibits the diminishing returns
property, which is summarized in Theorem 1. The intuitions of
Theorem 1 are as follows: (1) by P1, it means that the utility of an
empty ranking list is always zero; (2) by P2, if we add more nodes
into the ranking list, the overall utility of the ranking list does not
decrease; and (3) by P3, the marginal utility of adding new nodes
is relatively small if we already have a large ranking list.
THEOREM 1. Diminishing Returns Property of f(S). Let Φ
be an empty set; I, J, R be three sets s.t. I ⊆ J and R ∩ J = Φ.
The following facts hold for f(S):
P1: f(Φ) = 0;
P2: f(S) is monotonically non-decreasing, i.e., f(I) ≤ f(J);
P3: f(S) is submodular, i.e., f(I ∪ R) − f(I) ≥ f(J ∪ R) − f(J).
PROOF of P1. It holds trivially by the definition of f(S). □
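PROOF of P2. Let T = J \ I. Substituting eq (3) into f(J) −
f(I) and canceling the common terms, we have
$$\begin{aligned}
f(\mathcal{J}) - f(\mathcal{I})
&= 2\sum_{i \in \mathcal{T}} r(i) - \sum_{j \in \mathcal{T}}\sum_{i \in \mathcal{I}} B(i,j)\, r(j) - \sum_{i \in \mathcal{T}}\sum_{j \in \mathcal{J}} B(i,j)\, r(j) \\
&= \Big(\sum_{j \in \mathcal{T}} r(j) - \sum_{j \in \mathcal{T}}\sum_{i \in \mathcal{I}} B(i,j)\, r(j)\Big)
 + \Big(\sum_{i \in \mathcal{T}} r(i) - \sum_{i \in \mathcal{T}}\sum_{j \in \mathcal{J}} B(i,j)\, r(j)\Big)
\end{aligned} \tag{4}$$
Recall that the matrix B is a column-wise stochastic matrix (i.e.,
each column of B sums up to 1). The first half of eq (4) satisfies
$$\begin{aligned}
\sum_{j \in \mathcal{T}} r(j) - \sum_{j \in \mathcal{T}}\sum_{i \in \mathcal{I}} B(i,j)\, r(j)
&= \sum_{j \in \mathcal{T}} r(j)\Big(1 - \sum_{i \in \mathcal{I}} B(i,j)\Big) \\
&= \sum_{j \in \mathcal{T}} r(j) \sum_{i \notin \mathcal{I}} B(i,j) \ge 0
\end{aligned} \tag{5}$$
For the second half of eq (4), we have that
$$\begin{aligned}
\sum_{i \in \mathcal{T}} r(i) - \sum_{i \in \mathcal{T}}\sum_{j \in \mathcal{J}} B(i,j)\, r(j)
&= \sum_{i \in \mathcal{T}} \Big(r(i) - \sum_{j \in \mathcal{J}} B(i,j)\, r(j)\Big) \\
&= \sum_{i \in \mathcal{T}} \sum_{j \notin \mathcal{J}} B(i,j)\, r(j) \ge 0
\end{aligned} \tag{6}$$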
The last equality in eq (6) is due to the fact that r = Br, and each
element in r is non-negative.
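Putting eq (4)-(6) together, we have that f(J) ≥ f(I), which
completes the proof of P2. □
PROOF of P3. Again, let T = J \ I. Substituting eq (4) into
(f(I ∪ R) − f(I)) − (f(J ∪ R) − f(J)) and canceling the common
terms, we have
$$\begin{aligned}
&\big(f(\mathcal{I} \cup \mathcal{R}) - f(\mathcal{I})\big) - \big(f(\mathcal{J} \cup \mathcal{R}) - f(\mathcal{J})\big) \\
&\quad= \Big(\sum_{i \in \mathcal{J}}\sum_{j \in \mathcal{R}} B(i,j)\, r(j) - \sum_{i \in \mathcal{I}}\sum_{j \in \mathcal{R}} B(i,j)\, r(j)\Big)
 + \Big(\sum_{i \in \mathcal{R}}\sum_{j \in \mathcal{J} \cup \mathcal{R}} B(i,j)\, r(j) - \sum_{i \in \mathcal{R}}\sum_{j \in \mathcal{I} \cup \mathcal{R}} B(i,j)\, r(j)\Big) \\
&\quad= \sum_{j \in \mathcal{R}}\sum_{i \in \mathcal{T}} B(i,j)\, r(j) + \sum_{i \in \mathcal{R}}\sum_{j \in \mathcal{T}} B(i,j)\, r(j) \ge 0
\end{aligned}$$
The inequality holds because all elements of B and r are non-negative.
Therefore, we have that f(I ∪ R) − f(I) ≥ f(J ∪ R) − f(J),
which completes the proof of P3. □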
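
The three properties in Theorem 1 can also be checked by exhaustive enumeration on a graph small enough to list every subset. The sketch below is an illustrative sanity check under stated assumptions, not part of the proofs: the 5-node graph, the tolerance `tol`, and the helper names (`subsets`, `check_theorem1`) are all hypothetical.

```python
# Brute-force check of P1-P3 on a made-up 5-node graph.
from itertools import combinations

def row_normalize(W):
    """Divide every row of a dense non-negative matrix by its row sum."""
    return [[w / sum(row) for w in row] for row in W]

def personalized_pagerank(A, p, c, iters=300):
    """Power iteration for r = c * A' r + (1 - c) * p."""
    n = len(A)
    r = list(p)
    for _ in range(iters):
        r = [c * sum(A[j][i] * r[j] for j in range(n)) + (1 - c) * p[i]
             for i in range(n)]
    return r

def goodness(S, r, A, p, c):
    """Eq. (3) with B(i, j) = c * A(j, i) + (1 - c) * p(i)."""
    return (2.0 * sum(r[i] for i in S)
            - sum((c * A[j][i] + (1 - c) * p[i]) * r[j] for i in S for j in S))

def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def check_theorem1(f, n, tol=1e-9):
    """Return True if f satisfies P1, P2 and P3 on all subsets of range(n)."""
    nodes = frozenset(range(n))
    if abs(f(frozenset())) > tol:                       # P1: f(empty) = 0
        return False
    for J in subsets(nodes):
        for I in subsets(J):                            # every I that is a subset of J
            if f(I) > f(J) + tol:                       # P2: monotone
                return False
            for R in subsets(nodes - J):                # every R disjoint from J
                if f(I | R) - f(I) + tol < f(J | R) - f(J):   # P3: submodular
                    return False
    return True

W = [[0, 3, 3, 0, 0],
     [3, 0, 3, 0, 0],
     [3, 3, 0, 1, 0],
     [0, 0, 1, 0, 2],
     [0, 0, 0, 2, 0]]
A = row_normalize(W)
p = [1.0, 0.0, 0.0, 0.0, 0.0]
c = 0.85
r = personalized_pagerank(A, p, c)
print(check_theorem1(lambda S: goodness(S, r, A, p, c), len(A)))
```

The enumeration visits every triple (I, J, R) with I ⊆ J and R ∩ J = Φ, which is 4^n triples, so it is only practical for toy graphs.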
4. THE PROPOSED ALGORITHM
In this section, we address Problem 2. Here, given the initial
query vector p and the budget k, we want to find a subset of k
nodes that maximizes the goodness measure defined in eq (3). We
first analyze the main challenges in optimizing eq (3); and then
present the proposed algorithm DRAGON, followed by some theo-
retical analysis and discussion.
4.1 Challenges
Problem 2 is essentially a subset selection problem to find the
optimal k nodes that maximize eq (3). Theorem 1 indicates that it
is not easy to find the exact optimal solution of Problem 2: it is
NP-hard to maximize a monotonic submodular function if the function
value is 0 for an empty set [18]. For instance, a straightforward
method would take exponential enumerations $\binom{n}{k}$ to find the
exact optimal k nodes, which is computationally infeasible even for a
medium-size graph (e.g., with a few hundred nodes).
We can also formulate Problem 2 as a binary quadratic programming
problem. Let x_{n×1} be a binary indicator vector (x(i) = 1
means node i is selected in the subset S, and 0 means it is not
selected). Problem 2 can be expressed as the following binary
quadratic programming problem:
$$\min_{x}\; x' D x \quad \text{subject to: } x(i) \in \{0,1\}\ (i = 1,\ldots,n), \quad \sum_{i=1}^{n} x(i) = k \tag{7}$$
where D = (B − 2I_{n×n}) diag(r), I_{n×n} is an identity matrix of
size n × n, and diag(r) is a diagonal matrix with r(i) (i = 1,...,n)
being the diagonal elements.
Eq. (7) is still not easy to solve due to (1) the binary constraints
on the variable x and (2) the quadratic term in the objective function.
If we relax the binary constraint on x as 0 ≤ x(i) ≤ 1 (i = 1,...,n),
we can solve the relaxed problem by standard quadratic programming
packages. We refer to this strategy as ‘Lin-QP’. However,
there are two major limitations of this method. First of all, we do
not know what the gap is between eq. (7) and its relaxed version.
Therefore, it is not clear how good the final solution is in terms of
maximizing the original goodness measure (eq (3)) even if we can
solve the relaxed problem optimally². Second, most, if not all, of
the existing quadratic programming packages require polynomial
complexity in computation. This makes this strategy very slow, or
even infeasible, for a graph with more than a few thousand nodes.
² It is worth pointing out that it is not even easy to find an
optimal solution for the relaxed problem by quadratic programming
because the matrix D (1) might be asymmetric and (2) is not always
positive semi-definite.
Another possible solution for eq. (7) is to remove the quadratic
term in the objective function as follows. Starting from some initial
indicator vector x̂, we iterate between the following two steps:
(1) approximate the objective function in eq. (7) by its first order
Taylor expansion around x̂; and (2) update x̂ by solving a binary
integer programming problem for the approximated objective function,
which is linear wrt x. We refer to this strategy as ‘Ite-BIP’.
However, the two main issues still exist: (1) it is not clear how
such approximation will degrade the overall optimization performance;
(2) the binary integer programming itself, again, requires
polynomial time, which does not scale to large graphs.
4.2 The Proposed DRAGON Algorithm
Our proposed DRAGON algorithm is presented in Alg. 1. In step
1, we compute the ranking vector r (e.g., by the power method,
etc). Then after some initializations (steps 2-5), we select k nodes
one-by-one as follows. At each time, we compute the score vector
s in step 7. Then, we select one node with the highest score in the
vector s and add it to the subset S (steps 8-9). After that, we use the
selected node to update the two reference vectors u and v (steps
10-11). Note that ‘⊗’ denotes the element-wise product between two
matrices/vectors. Intuitively, the score vector s keeps the marginal
contribution of each node for the goodness measure given the
current selected subset S. From step 7, it can be seen that at each
iteration, the value of such marginal contribution either stays
unchanged or decreases. This is consistent with P3 of Theorem 1: as
there are more and more nodes in the subset S, the marginal
contribution of each node is monotonically non-increasing. It is worth
pointing out that we use the original normalized adjacency matrix
A, instead of the ‘Google matrix’ B in Alg. 1. This is because for
many real graphs, the matrix A is often very sparse, whereas the
matrix B might not be³. In the case B is dense, it is not efficient in
either time or space to use B in Alg. 1.
³ To see this, notice that B is a full matrix if p is uniform.
In Alg. 1, although we try to optimize a goodness measure that is
not affected by the ordering of different nodes in the subset, we can
still output an ordered list to the user based on the iteration in
which each node is selected: earlier selected nodes in Alg. 1 are
placed at the top of the resulting top-k ranking list. This ordering
naturally meets the diminishing returns need when the user is seeking
a diverse top-k ranking list as we analyzed in subsection 3.1.
4.3 Proofs and Analysis
Here, we analyze the optimality as well as the complexity of the
proposed algorithm. We show that our DRAGON leads to a
near-optimal solution, and at the same time it enjoys linear
scalability in both time and space.
Optimality. The optimality of the proposed DRAGON is given
in Lemma 1. According to Lemma 1, our DRAGON is near-optimal:
the goodness of its solution is at least a fixed fraction
(1 − 1/e ≈ 0.63) of that of the global optimal one. Given the hardness
of Problem 2, such near-optimality is acceptable in terms of
optimization quality.
LEMMA 1. Near-Optimality of DRAGON. Let S be the subset
found by DRAGON; |S| = k; and S* = argmax_{|S|=k} f(S). We
have that f(S) ≥ (1 − 1/e)f(S*), where e is the base of the natural
logarithm.
PROOF. Omitted for brevity. □
Algorithm 1 DRAGON for Problem 2
Input: The row-normalized adjacency matrix A of the graph, the
damping factor c, the query vector p, and the budget k;
Output: A subset of k nodes S.
1: Compute the ranking vector r: r = cA′r + (1 − c)p;
2: Initialize S as the empty set; set u = v = 0_{n×1};
3: for i = 1 : n do
4:   Initialize ŝ(i) = (2 − cA(i,i) − (1 − c)p(i)) r(i);
5: end for
6: for iter = 1 : k do
7:   Compute the score vector s = ŝ − u ⊗ r − v;
8:   Find i = argmax_j s(j) (j = 1,...,n; j ∉ S);
9:   Add node i into S;
10:  Update u ← u + cA(:,i) + (1 − c)p(i)1_{n×1};
11:  Update v ← v + cA′(:,i)r(i) + (1 − c)r(i)p;
12: end for
13: Return the subset S
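
The steps of Alg. 1 can be sketched in plain Python as follows. This is an illustrative transcription under simplifying assumptions: A is held as a dense list of lists (so each iteration costs O(n) here but storage is O(n²), unlike the sparse setting the paper targets), indices are 0-based, and the toy graph, the function names, and the brute-force comparison `brute_force_best` are hypothetical additions used only to sanity-check the (1 − 1/e) bound of Lemma 1 on a tiny instance.

```python
# Dense, toy-scale transcription of Alg. 1 plus a brute-force comparison.
from itertools import combinations
import math

def row_normalize(W):
    """Divide every row of a dense non-negative matrix by its row sum."""
    return [[w / sum(row) for w in row] for row in W]

def personalized_pagerank(A, p, c, iters=300):
    """Step 1: power iteration for r = c * A' r + (1 - c) * p."""
    n = len(A)
    r = list(p)
    for _ in range(iters):
        r = [c * sum(A[j][i] * r[j] for j in range(n)) + (1 - c) * p[i]
             for i in range(n)]
    return r

def goodness(S, r, A, p, c):
    """Eq. (3) with B(i, j) = c * A(j, i) + (1 - c) * p(i)."""
    return (2.0 * sum(r[i] for i in S)
            - sum((c * A[j][i] + (1 - c) * p[i]) * r[j] for i in S for j in S))

def dragon(A, p, c, k, r):
    """Steps 2-13 of Alg. 1; returns nodes in the order they were selected."""
    n = len(A)
    s_hat = [(2 - c * A[i][i] - (1 - c) * p[i]) * r[i] for i in range(n)]
    u = [0.0] * n
    v = [0.0] * n
    selected, chosen = [], set()
    for _ in range(k):
        s = [s_hat[j] - u[j] * r[j] - v[j] for j in range(n)]        # step 7
        i = max((j for j in range(n) if j not in chosen),
                key=lambda j: s[j])                                   # step 8
        selected.append(i)                                            # step 9
        chosen.add(i)
        for j in range(n):
            u[j] += c * A[j][i] + (1 - c) * p[i]                      # step 10
            v[j] += (c * A[i][j] + (1 - c) * p[j]) * r[i]             # step 11
    return selected

def brute_force_best(n, k, f):
    """Exhaustively find a size-k subset maximizing f (toy graphs only)."""
    return max(combinations(range(n), k), key=f)

W = [[0, 3, 3, 0, 0],
     [3, 0, 3, 0, 0],
     [3, 3, 0, 1, 0],
     [0, 0, 1, 0, 2],
     [0, 0, 0, 2, 0]]
A = row_normalize(W)
p = [1.0, 0.0, 0.0, 0.0, 0.0]
c, k = 0.85, 3
r = personalized_pagerank(A, p, c)
S = dragon(A, p, c, k, r)
best = brute_force_best(len(A), k, lambda T: goodness(T, r, A, p, c))
print("DRAGON order:", S, "f =", goodness(S, r, A, p, c))
print("brute force :", best, "f =", goodness(best, r, A, p, c))
print("bound holds :",
      goodness(S, r, A, p, c) >= (1 - 1 / math.e) * goodness(best, r, A, p, c))
```

In this sketch the vectors u and v accumulate, for every candidate node, the column and row similarity mass toward the already-selected nodes, so the score in step 7 equals the marginal gain f(S ∪ {j}) − f(S) without re-evaluating eq. (3) from scratch.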
Time Complexity. The time complexity of the proposed DRAGON
is given in Lemma 2. According to Lemma 2, our DRAGON has
linear time complexity wrt the size of the graph. Therefore it is
scalable to large graphs in terms of computational time.
LEMMA 2. Time Complexity of DRAGON. The time complexity
of Alg. 1 is O(m + nk).
PROOF. Omitted for brevity. □
We would like to point out that the proposed DRAGON can be
further sped up. Firstly, notice that the O(m) term in Lemma 2
comes from computing the ranking vector r (step 1) by the most
commonly used power method. There are a lot of fast methods for
computing r, either by effective approximation (e.g., [37]), or by
parallelism (e.g. [13]). These methods can be naturally plugged in
our DRAGON, which might lead to further computational savings.
Secondly, the O(nk) term in Lemma 2 comes from the greedy
selection in steps 6-12. Thanks to the monotonicity of f(S) shown
in Theorem 1, we can use a lazy evaluation strategy similar to
[20] to speed up this process, without sacrificing the optimization
quality.
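
One common way to realize lazy evaluation for greedy selection is a priority queue of stale marginal gains, re-evaluating only the entry at the top. The sketch below illustrates that general idea under stated assumptions; it is not necessarily the exact procedure of [20], and the names `lazy_greedy`, `plain_greedy`, and the toy graph are hypothetical. It relies on the fact that marginal gains never increase as S grows (P3), so a stale gain is a valid upper bound.

```python
# Lazy greedy selection with a max-heap of (possibly stale) marginal gains.
import heapq

def row_normalize(W):
    """Divide every row of a dense non-negative matrix by its row sum."""
    return [[w / sum(row) for w in row] for row in W]

def personalized_pagerank(A, p, c, iters=300):
    """Power iteration for r = c * A' r + (1 - c) * p."""
    n = len(A)
    r = list(p)
    for _ in range(iters):
        r = [c * sum(A[j][i] * r[j] for j in range(n)) + (1 - c) * p[i]
             for i in range(n)]
    return r

def goodness(S, r, A, p, c):
    """Eq. (3) with B(i, j) = c * A(j, i) + (1 - c) * p(i)."""
    return (2.0 * sum(r[i] for i in S)
            - sum((c * A[j][i] + (1 - c) * p[i]) * r[j] for i in S for j in S))

def plain_greedy(n, k, gain):
    """Re-evaluate every remaining node's marginal gain in each round."""
    S = []
    for _ in range(k):
        S.append(max((j for j in range(n) if j not in S),
                     key=lambda j: gain(j, S)))
    return S

def lazy_greedy(n, k, gain):
    """Select k nodes, refreshing a gain only when its entry tops the heap."""
    S = []
    # Each entry is (-gain, node, |S| at the time the gain was computed).
    heap = [(-gain(j, S), j, 0) for j in range(n)]
    heapq.heapify(heap)
    evaluations = n
    while len(S) < k and heap:
        neg_gain, j, stamp = heapq.heappop(heap)
        if stamp == len(S):
            S.append(j)            # gain is fresh and beats all upper bounds
        else:
            heapq.heappush(heap, (-gain(j, S), j, len(S)))
            evaluations += 1
    return S, evaluations

W = [[0, 3, 3, 0, 0],
     [3, 0, 3, 0, 0],
     [3, 3, 0, 1, 0],
     [0, 0, 1, 0, 2],
     [0, 0, 0, 2, 0]]
A = row_normalize(W)
p = [1.0, 0.0, 0.0, 0.0, 0.0]
c, k = 0.85, 3
r = personalized_pagerank(A, p, c)
gain = lambda j, S: goodness(S + [j], r, A, p, c) - goodness(S, r, A, p, c)

S_lazy, evals = lazy_greedy(len(A), k, gain)
print("lazy :", S_lazy, "gain evaluations:", evals)
print("plain:", plain_greedy(len(A), k, gain))
```

For clarity the gain here is computed from eq. (3) directly; in a real implementation it would be read off the incrementally maintained score of step 7 instead.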
Space Complexity. The space complexity of the proposed DRAGON
is given in Lemma 3. According to Lemma 3, our DRAGON has
linear space complexity wrt the size of the graph. Therefore it is also
scalable to large graphs in terms of space cost.
LEMMA 3. Space Complexity of DRAGON. The space complexity
of Alg. 1 is O(m + n + k).
PROOF. Omitted for brevity. □
4.4 Discussion - Comparisons
In the literature, there exist two other methods to encourage diversity
in the top-k ranking list for personalized PageRank. Here, we make
a comparison in terms of optimality, convergence, and scalability
of different methods. ARW [42] is based on an intuitive heuristic
by greedily selecting the highest ranked node and setting it as the
absorbing state. From a theoretical point of view, it is not clear what
ARW [42] tries to optimize. Also, it requires a matrix inverse
of the same size as the graph, which is not scalable to large graphs.
RRW [27] is based on vertex reinforced random walk [31]. Compared
with ARW [42], it makes an important step forward by providing
some optimization explanations via defining a time-varying
objective function that changes at each iteration step. However, it is
still not clear what overall metric it tries to measure; and how good
its optimization solution is. Moreover, RRW [27] introduced some
modifications and approximation techniques to the original vertex
reinforced random walk, and it is not clear how the modified vertex
reinforced random walk converges⁴.

Table 2: Comparison of different methods.
DRAGON is the only method that leads to a near-optimal
solution with linear scalability.
Method                Measure   Optimality     Scalability   Convergence
ARW [42]              NA        NA             No            Yes
RRW [27]              Partial   NA             Yes           NA
Our proposed DRAGON   Yes       Near-optimal   Yes           Yes

5. EXPERIMENTAL EVALUATIONS
In this section, we provide empirical evaluations for the proposed
DRAGON. Our evaluations mainly focus on (1) the effectiveness
and (2) the efficiency of the proposed DRAGON.
5.1 Experimental Setup
Data sets. We use the DBLP publication data⁵ to construct a
co-authorship network, where each node is an author and the edge
weight is the number of the co-authored papers between the two
corresponding persons. Overall, we have n = 418,236 nodes
and m = 2,753,798 edges. We also construct much smaller
co-authorship networks, using the authors from only one confer-
ence (e.g., KDD, SIGIR, SIGMOD, etc.). For example, KDD is the
co-authorship network for the authors in the ‘KDD’ conference.
These smaller co-authorship networks typically have a few thousand
nodes and up to a few tens of thousands of edges. We also construct
co-authorship networks using the authors from multiple
conferences (e.g., KDD+SIGIR). For these graphs, we denote them
as Sub(n,m), where n and m are the numbers of nodes and edges in
the graph, respectively.
Machine configurations. For the computational cost and scala-
bility, we report the wall-clock time. All the experiments ran on the
same machine with four 2.4GHz AMD CPUs and 48GB memory,
running Linux (2.6 kernel). For all the quantitative results, we ran-
domly generate a query vector p and feed it into different methods
for a top-k ranking list with the same length. We repeat it 100 times
and report the average.
⁴ Even if it converges, its stationary state might not be unique
according to [31].
⁵ http://www.informatik.uni-trier.de/~ley/db/
Evaluation criteria. To the best of our knowledge, there is no
universally accepted measure for diversity. In [27], the authors
suggested an intuitive notion based on the density of the induced
subgraph from the original graph A by the subset S. The intuition is as
follows: the lower the density (i.e., the fewer 1-step neighbors) of the
induced subgraph, the more diverse the subset S. Here, we generalize
this notion to the t-step graph in order to also take into account
the effect of those in-direct neighbors. Let Sign(.) be a binary
function operated element-wise on a matrix, i.e., Y = Sign(X), where
Y is a matrix of the same size as X, Y(i,j) = 1 if X(i,j) > 0,
Y(i,j) = 0 otherwise. We define the t-step connectivity matrix
C_t as $C_t = \mathrm{Sign}\big(\sum_{i=1}^{t} A^i\big)$. That is,
C_t(i,j) = 1 (0) means that node i can (cannot) reach node j on the
graph A within t steps/hops. With this C_t matrix, we define the
diversity of a given subset S as eq (8). Here, the value of Div(t) is
always between 0.5 and 1 - higher means more diverse. If all the nodes
in S are reachable from each other within t-steps, we say that the
subset S