
GraRep: Learning Graph Representations with

Global Structural Information

Shaosheng Cao

Xidian University

shelsoncao@gmail.com

Wei Lu

Singapore University of

Technology and Design

luwei@sutd.edu.sg

Qiongkai Xu

IBM Research - China

Australian National University

xuqkai@cn.ibm.com

ABSTRACT

In this paper, we present GraRep, a novel model for learning vertex representations of weighted graphs. This model learns low-dimensional vectors to represent vertices appearing in a graph and, unlike existing work, integrates global structural information of the graph into the learning process. We also formally analyze the connections between our work and several previous research efforts, including the DeepWalk model of Perozzi et al. [20] as well as the skip-gram model with negative sampling of Mikolov et al. [18].

We conduct experiments on a language network, a social network and a citation network, and show that our learned global representations can be effectively used as features in tasks such as clustering, classification and visualization. Empirical results demonstrate that our representations significantly outperform other state-of-the-art methods in such tasks.

Categories and Subject Descriptors

I.2.6 [Artificial Intelligence]: Learning; H.2.8 [Database Management]: Database Applications - Data Mining

General Terms

Algorithms, Experimentation

Keywords

Graph Representation, Matrix Factorization, Feature Learning, Dimension Reduction

1. INTRODUCTION

In many real-world problems, information is often organized using graphs. For example, in social network research, classification of users into meaningful social groups based on social graphs can lead to many useful practical applications such as user search, targeted advertising and recommendations. Therefore, it is essential to accurately learn useful


information from the graphs. One strategy is to learn graph representations: each vertex of the graph is represented with a low-dimensional vector in which the meaningful semantic, relational and structural information conveyed by the graph can be accurately captured.

Recently, there has been a surge of interest in learning

graph representations from data. For example, DeepWalk

[20], one recent model, transforms a graph structure into a

sample collection of linear sequences consisting of vertices

using uniform sampling (which is also called truncated ran-

dom walk). The skip-gram model [18], originally designed

for learning word representations from linear sequences, can

also be used to learn the representations of vertices from such

samples. Although this method is empirically effective, the exact loss function defined over the graph that is involved in its learning process is not well understood.

In this work, we ﬁrst present an explicit loss function of

the skip-gram model deﬁned over the graph. We show that

essentially we can use the skip-gram model to capture the

k-step (k = 1, 2, 3, ...) relationship between each vertex and its k-step neighbors in the graph with different values of k. One limitation of the skip-gram model is that it projects all such k-step relational information into a common subspace. We

argue that such a simple treatment can lead to potential

issues. The above limitation is overcome in our proposed

model through the preservation of diﬀerent k-step relational

information in distinct subspaces.

Another recently proposed work is LINE [25], which has

a loss function to capture both 1-step and 2-step local re-

lational information. To capture certain complex relations

in such local information, they also learn non-linear trans-

formations from such data. While their model cannot be easily extended to capture k-step (with k > 2) relational

information for learning their graph representation, one im-

portant strategy used to enhance the eﬀectiveness of their

model is to consider higher-order neighbors for vertices with

small degrees. This strategy implicitly captures certain k-

step information into their model to some extent. We believe

k-step relational information between diﬀerent vertices, with

diﬀerent values of k, reveals the useful global structural in-

formation associated with the graph, and it is essential to

explicitly take full advantage of this when learning a good

graph representation.

In this paper, we propose GraRep, a novel model for learn-

ing graph representations for knowledge management. The

model captures the different k-step relational information, with different values of k, amongst vertices from the graph directly by manipulating different global transition matrices defined over the graph, without involving slow and complex

sampling processes. Unlike existing work, our model deﬁnes

diﬀerent loss functions for capturing the diﬀerent k-step lo-

cal relational information (i.e., a diﬀerent k). We optimize

each model with matrix factorization techniques, and con-

struct the global representations for each vertex by combin-

ing diﬀerent representations learned from diﬀerent models.

Such learned global representations can be used as features

for further processing.

We give a formal treatment of this model, showing the

connections between our model and several previous mod-

els. We also demonstrate the empirical eﬀectiveness of the

learned representations in solving several real-world prob-

lems. Speciﬁcally, we conducted experiments on a language

network clustering task, a social network multi-label clas-

siﬁcation task, as well as a citation network visualization

task. In all such tasks, GraRep outperforms other graph

representation methods, and is trivially parallelizable.

Our contributions are as follows:

•We introduce a novel model to learn latent represen-

tations of vertices on graphs, which can capture global

structural information associated with the graph.

• We provide, from a probabilistic perspective, an understanding of the uniform sampling method used in

DeepWalk for learning graph representations, which

translates a graph structure into linear sequences. Fur-

thermore, we explicitly deﬁne their loss function over

graphs and extend it to support weighted graphs.

•We formally analyze the deﬁciency associated with the

skip-gram model with negative sampling. Our model

deﬁnes a more accurate loss function that allows non-

linear combinations of diﬀerent local relational infor-

mation to be integrated.

The organization of this paper is described below. Sec-

tion 2 discusses related work. Section 3 proposes our loss

function and states the optimization method using matrix

factorization. Section 4 presents the overall algorithm. Sec-

tion 5 gives a mathematical explanation to elucidate the rationale of the proposed work and shows its connection

to previous work. Section 6 discusses the evaluation data

and introduces baseline algorithms. Section 7 presents ex-

periments as well as analysis on the parameter sensitivity.

Finally, we conclude in Section 8.

2. RELATED WORK

2.1 Linear Sequence Representation Methods

Natural language corpora, consisting of streams of words,

can be regarded as special graph structures, that is, linear

chains. Currently, there are two mainstream methods for

learning word representations: neural embedding methods

and matrix factorization based approaches.

Neural embedding methods employ a fixed sliding window to capture the context words of the current word. Models like skip-gram [18] have been proposed, which provide an efficient approach to learning word representations. While these methods may yield good performance on some tasks, they capture useful information poorly since they use separate local context windows instead of global co-occurrence counts [19]. On

the other hand, the family of matrix factorization methods

can utilize global statistics [5]. Previous work includes Latent Semantic Analysis (LSA) [15], which decomposes a term-document matrix and yields latent semantic representations. Lund et al. [17] put forward Hyperspace Analogue to Language (HAL), factorizing a word-word co-occurrence count matrix to generate word representations. Levy et al. [4] presented matrix factorization over a shifted positive Pointwise Mutual Information (PMI) matrix for learning word representations and showed that the Skip-Gram model with Negative Sampling (SGNS) can be regarded as a model that implicitly factorizes such a matrix [16].

2.2 Graph Representation Approaches

There exist several classical approaches to learning low di-

mensional graph representations, such as multidimensional

scaling (MDS) [8], IsoMap [28], LLE [21], and Laplacian

Eigenmaps [3]. Recently, Tang et al. [27] presented meth-

ods for learning latent representational vectors of the graphs

which can then be applied to social network classiﬁcation.

Ahmed et al. [1] proposed a graph factorization method,

which used stochastic gradient descent to optimize matrices

from large graphs. Perozzi et al. [20] presented an approach which transforms the graph structure into several linear vertex sequences using a truncated random walk algorithm and generates vertex representations using the skip-gram model. This can be considered as an equally weighted linear combination of k-step information. Tang et al. [25] later proposed a

large-scale information network embedding, which optimizes

a loss function where both 1-step and 2-step relational in-

formation can be captured in the learning process.

3. GRAREP MODEL

In this section, we define our task and present the loss function for our task, which is then optimized with a matrix factorization method.

3.1 Graphs and Their Representations

Definition 1. (Graph) A graph is defined as G = (V, E). V = {v_1, v_2, ..., v_n} is the set of vertices, with each vertex indicating one object, while E = {e_{i,j}} is the set of edges, with each edge indicating the relationship between two vertices. A path is a sequence of edges which connects a sequence of vertices.

We first define the adjacency matrix S for a graph. For an unweighted graph, we have S_{i,j} = 1 if and only if there exists an edge from v_i to v_j, and S_{i,j} = 0 otherwise. For a weighted graph, S_{i,j} is a real number called the weight of the edge e_{i,j}, which indicates the significance of the edge. Although weights can be negative, we only consider non-negative weights in this paper. For notational convenience, we also use w and c to denote vertices throughout this paper.

The following diagonal matrix D is known as the degree matrix for a graph with adjacency matrix S:

D_{ij} = \begin{cases} \sum_p S_{ip}, & \text{if } i = j \\ 0, & \text{if } i \neq j \end{cases}

Assume we would like to capture the transitions from one vertex to another, and assume the entry S_{i,j} of the adjacency matrix is proportional to the transition probability from v_i to v_j. We can then define the following (1-step) probability transition matrix A:

A = D^{-1} S

where A_{i,j} is the probability of a transition from vertex v_i to vertex v_j within one step. It can be observed that the matrix A can be regarded as a re-scaled matrix S whose rows are normalized.

Figure 1: The importance of capturing different k-step information in the graph representations. Here we give examples for k = 1, 2, 3, 4. [Panels (a)-(h), showing vertices A1 and A2 connected directly or through intermediate vertices B and C, are omitted.]

Definition 2. (Graph Representations with Global Structural Information) Given a graph G, the task of Learning Graph Representations with Global Structural Information aims to learn a global representation matrix W ∈ R^{|V| × d} for the complete graph, whose i-th row W_i is a d-dimensional vector representing the vertex v_i in the graph G, such that the global structural information of the graph is captured in these vectors.

In this paper, global structural information serves two functions: 1) capturing the long-distance relationships between two different vertices, and 2) considering the distinct connections in terms of different transitional steps. This will be further illustrated later.

As we have discussed earlier, we believe the k-step (with varying k) relational information from the graph needs to be captured when constructing such global graph representations. To validate this point, Figure 1 gives some illustrative examples showing the importance of the k-step (for k = 1, 2, 3, 4) relational information that needs to be captured between two vertices A1 and A2. In the figure, a thick line indicates a strong relation between two vertices, while a thin line indicates a weaker relation. Here, (a) and (e) show the importance of capturing the simple 1-step information between the two vertices which are directly connected to each other, where one has a stronger relation and the other has a weaker relation. In (b) and (f), 2-step information is shown, where in (b) both vertices share many common neighbors, and in (f) only one neighbor is shared between them. Clearly, 2-step information is important in capturing how strong the connection between the two vertices is: the more common neighbors they share, the stronger the relation between them is. In (c) and (g), the importance of 3-step information is illustrated. Specifically, in (g), despite the strong relation between A1 and B, the relation between A1 and A2 can be weakened due to the two weaker edges connecting B and C, as well as C and A2. In contrast, in (c), the relation between A1 and A2 remains strong because of the large number of common neighbors between B and A2, which strengthens their relation. Clearly, such 3-step information is essential to capture when learning a good graph representation with global structural information. Similarly, the 4-step information can also be crucial in revealing the global structural properties of the graph, as illustrated in (d) and (h). Here, in (d), the relation between A1 and A2 is clearly strong, while in (h) the two vertices are unrelated since there does not exist a path from one vertex to the other. In the absence of 4-step relational information, such important distinctions cannot be properly captured.

Figure 2: The importance of maintaining different k-step information separately in the graph representations. [Panels (a) and (b), each showing a vertex A with a neighbor B and vertices C1-C4, are omitted.]

We additionally argue that it is essential to treat different k-step information differently when learning graph representations. We present one simple graph in (a) of Figure 2. Let us focus on learning the representation for the vertex A in the graph. We can see that A receives two types of information when learning its representation: 1-step information from B, as well as 2-step information from all C vertices. We note that if we do not distinguish these two different types of information, we can construct an alternative graph as shown in (b) of Figure 2, where A receives exactly the same information as in (a), but has a completely different structure.

In this paper, we propose a novel framework for learning accurate graph representations, integrating various k-step information which together captures the global structural information associated with the graph.

3.2 Loss Function On Graph

We discuss in this section our loss function used for learning graph representations with global structural information. Assume we are now given a graph with a collection of vertices and edges. Consider a vertex w together with another vertex c. In order for us to learn global representations that capture their relations, we need to understand how strongly these two vertices are connected to each other.

Let us begin with a few questions. Does there exist a path from w to c? If so, if we randomly sample a path starting with w, how likely is it for us to reach c (possibly within a fixed number of steps)? To answer such questions, we first use the term p_k(c|w) to denote the probability of a transition from w to c in exactly k steps. We already know the 1-step transition probabilities; to compute the k-step transition probabilities we introduce the following k-step probability transition matrix:

A^k = \underbrace{A \cdots A}_{k}

We can observe that A^k_{i,j} refers exactly to the transition probability from vertex i to vertex j, where the transition consists of exactly k step(s). This directly leads to:

p_k(c|w) = A^k_{w,c}

where A^k_{w,c} is the element in the w-th row and c-th column of the matrix A^k.
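As a concrete illustration of the matrices above, the sketch below (our own, not code from the paper) computes A = D^{-1} S and the k-step matrices A^k with dense NumPy arrays; it assumes every vertex has at least one non-negative edge so that all row sums are non-zero.

```python
import numpy as np

def transition_matrices(S, K):
    """Return [A^1, ..., A^K] for a dense adjacency matrix S (rows must have non-zero sums)."""
    A = S / S.sum(axis=1, keepdims=True)   # A = D^{-1} S: divide each row by its degree
    powers = [A]
    for _ in range(1, K):
        powers.append(powers[-1] @ A)      # A^k = A^{k-1} A
    return powers

# p_k(c|w) is then powers[k-1][w, c] for vertex indices w and c.
```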

Let us consider a particular k first. Given a graph G, consider the collection of all paths consisting of k steps that can be sampled from the graph which start with w and end with c (here we call w the “current vertex” and c the “context vertex”). Our objective aims to maximize: 1) the probability that these pairs come from the graph, and 2) the probability that all other pairs do not come from the graph.

Motivated by the skip-gram model of Mikolov et al. [18], we employ noise contrastive estimation (NCE), proposed by Gutmann et al. [11], to define our objective function. Following a similar discussion presented in [16], we first introduce our k-step loss function defined over the complete graph as follows:

L_k = \sum_{w \in V} L_k(w)

where

L_k(w) = \left( \sum_{c \in V} p_k(c|w) \log \sigma(\vec{w} \cdot \vec{c}) \right) + \lambda \, \mathbb{E}_{c' \sim p_k(V)} \left[ \log \sigma(-\vec{w} \cdot \vec{c'}) \right]

Here p_k(c|w) describes the k-step relationship between w and c (the k-step transition probability from w to c), \sigma(\cdot) is the sigmoid function defined as \sigma(x) = (1 + e^{-x})^{-1}, \lambda is a hyper-parameter indicating the number of negative samples, and p_k(V) is the distribution over the vertices in the graph. The term \mathbb{E}_{c' \sim p_k(V)}[\cdot] is the expectation when c' follows the distribution p_k(V), where c' is an instance obtained from negative sampling. This term can be explicitly expressed as:

\mathbb{E}_{c' \sim p_k(V)} \left[ \log \sigma(-\vec{w} \cdot \vec{c'}) \right] = p_k(c) \cdot \log \sigma(-\vec{w} \cdot \vec{c}) + \sum_{c' \in V \setminus \{c\}} p_k(c') \cdot \log \sigma(-\vec{w} \cdot \vec{c'})

This leads to a local loss defined over a specific pair (w, c):

L_k(w, c) = p_k(c|w) \cdot \log \sigma(\vec{w} \cdot \vec{c}) + \lambda \cdot p_k(c) \cdot \log \sigma(-\vec{w} \cdot \vec{c})

In this work, we set a maximal length for each path we consider. In other words, we assume 1 ≤ k ≤ K. In fact, when k is large enough, the transition probabilities converge to certain fixed values. The distribution p_k(c) can be computed as follows:

p_k(c) = \sum_{w'} q(w') \, p_k(c|w') = \frac{1}{N} \sum_{w'} A^k_{w',c}

Note that here N is the number of vertices in the graph G, and q(w') is the probability of selecting w' as the first vertex in the path, which we assume follows a uniform distribution, i.e., q(w') = 1/N. This leads to:

L_k(w, c) = A^k_{w,c} \cdot \log \sigma(\vec{w} \cdot \vec{c}) + \frac{\lambda}{N} \sum_{w'} A^k_{w',c} \cdot \log \sigma(-\vec{w} \cdot \vec{c})

Following [16], we define e = \vec{w} \cdot \vec{c} and set \frac{\partial L_k}{\partial e} = 0. This yields the following:

\vec{w} \cdot \vec{c} = \log \left( \frac{A^k_{w,c}}{\sum_{w'} A^k_{w',c}} \right) - \log(\beta)

where \beta = \lambda / N.

This shows that we essentially need to factorize a matrix Y into two matrices W and C, where each row of W and each row of C consists of a vector representation for the vertex w and c respectively, and the entries of Y are:

Y^k_{i,j} = W^k_i \cdot C^k_j = \log \left( \frac{A^k_{i,j}}{\sum_t A^k_{t,j}} \right) - \log(\beta)

Now we have defined our loss function and shown that optimizing the proposed loss essentially involves a matrix factorization problem.

3.3 Optimization with Matrix Factorization

Following the work of Levy et al. [16], to reduce noise, we replace all negative entries in Y^k with 0. This gives us a positive k-step log probabilistic matrix X^k, where

X^k_{i,j} = \max(Y^k_{i,j}, 0)

While various techniques for matrix factorization exist, in

this work we focus on the popular singular value decompo-

sition (SVD) method due to its simplicity. SVD has been

shown successful in several matrix factorization tasks [9, 14],

and is regarded as one of the important methods that can be

used for dimensionality reduction. It was also used in [16].

For the matrix X^k, SVD factorizes it as:

X^k = U^k \Sigma^k (V^k)^T

where U and V are orthonormal matrices and \Sigma is a diagonal matrix consisting of an ordered list of singular values.

We can approximate the original matrix X^k with X^k_d:

X^k \approx X^k_d = U^k_d \Sigma^k_d (V^k_d)^T

where \Sigma^k_d is the diagonal matrix composed of the top d singular values, and U^k_d and V^k_d are the first d columns of U^k and V^k, respectively (which are the first d eigenvectors of X X^T and X^T X, respectively).

This way, we can factorize our matrix X^k as:

X^k \approx X^k_d = W^k C^k

where

W^k = U^k_d (\Sigma^k_d)^{\frac{1}{2}}, \quad C^k = (\Sigma^k_d)^{\frac{1}{2}} (V^k_d)^T

The resulting W^k gives the representations of current vertices as its row vectors, and C^k gives the representations of context vertices as its column vectors [6, 5, 30]. The final matrix W^k is returned from the algorithm as the low-dimensional representations of the vertices, which capture the k-step global structural information in the graph.
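The sketch below (our own illustration with dense NumPy arrays; a large graph would call for sparse matrices and a truncated SVD) carries one k-step matrix A^k through the steps just described: build Y^k, truncate to the positive matrix X^k, and factorize with SVD to obtain W^k and C^k. It assumes every vertex is reachable, so that no column of A^k sums to zero.

```python
import numpy as np

def k_step_embedding(Ak, d, beta):
    """One k-step representation pair (W^k, C^k) from A^k, following Section 3.3."""
    gamma = Ak.sum(axis=0)                            # Gamma_j = sum_t A^k_{t,j}
    with np.errstate(divide='ignore'):                # log(0) -> -inf is clipped below
        Y = np.log(Ak / gamma) - np.log(beta)         # Y^k_{i,j}
    X = np.maximum(Y, 0.0)                            # X^k: keep only positive entries
    U, sigma, Vt = np.linalg.svd(X)
    W = U[:, :d] * np.sqrt(sigma[:d])                 # W^k = U_d (Sigma_d)^{1/2}
    C = np.sqrt(sigma[:d])[:, None] * Vt[:d, :]       # C^k = (Sigma_d)^{1/2} (V_d)^T
    return W, C
```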

In our algorithm we consider all k-step transitions with k = 1, 2, ..., K, where K is a pre-selected constant. We integrate all such k-step information when learning our graph representation, as discussed next.

Note that here we are essentially finding a projection from the row space of X^k to the row space of W^k with a lower rank. Thus, alternative approaches other than the popular SVD can also be exploited. Examples include incremental SVD [22], independent component analysis (ICA) [7, 13], and deep neural networks [12]. Our focus in this work is on the novel model for learning graph representations, so we do not pursue any alternative methods. In fact, if an alternative method such as a sparse auto-encoder were used at this step, it would become relatively harder to justify whether the empirical effectiveness of our representations is due to our novel model, or comes from any non-linearity introduced in this dimensionality reduction step. To maintain consistency with Levy et al. [16], we only employ SVD in this work.

4. ALGORITHM

We detail our learning algorithm in this section. In general, graph representations are extracted as features for other applications, such as classification and clustering. An effective way to encode the k-step representations in practice is to concatenate them into a global feature for each vertex, since each different step's representation reflects different local information. Table 1 shows the overall algorithm, and we explain its essential steps here.

Step 1. Get the k-step transition probability matrix A^k for each k = 1, 2, ..., K.

As shown in Section 3, given a graph G, we can calculate the k-step transition probability matrix A^k through the product of the inverse of the degree matrix D and the adjacency matrix S. For a weighted graph, S is a real matrix, while for an unweighted graph, S is a binary matrix. Our algorithm is applicable to both cases.

Step 2. Get each k-step representation.

We get the k-step log probability matrix, subtract log(β) from each entry, and replace the negative entries with zeros, obtaining X^k. After that, we construct the representational vectors as rows of W^k, where we factorize the positive log probability matrix X^k using SVD. Finally, we get all k-step representations for each vertex in the graph.

Step 3. Concatenate all k-step representations.

We concatenate all k-step representations to form a global representation, which can be used in other tasks as features.
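Putting the three steps together, a compact end-to-end sketch of Table 1 (again our own dense NumPy illustration; a production version would use sparse matrices and a truncated SVD) might look as follows.

```python
import numpy as np

def grarep(S, K, d, beta=None):
    """GraRep sketch: S adjacency matrix, K max step, d dimension per step, beta log-shift factor."""
    N = S.shape[0]
    beta = 1.0 / N if beta is None else beta              # the paper's experiments use beta = 1/N
    A = S / S.sum(axis=1, keepdims=True)                  # Step 1: A = D^{-1} S
    Ak, blocks = np.eye(N), []
    for _ in range(K):
        Ak = Ak @ A                                        # A^k for k = 1 .. K
        with np.errstate(divide='ignore'):
            Y = np.log(Ak / Ak.sum(axis=0)) - np.log(beta)  # Step 2.1: shifted log probabilities
        X = np.maximum(Y, 0.0)                             # drop negative entries
        U, sigma, _ = np.linalg.svd(X)                     # Step 2.2: SVD of X^k
        blocks.append(U[:, :d] * np.sqrt(sigma[:d]))       # W^k = U_d (Sigma_d)^{1/2}
    return np.hstack(blocks)                               # Step 3: W = [W^1, ..., W^K]
```

In the experiments (Section 6.3), the concatenated representation is additionally L2-normalized per vertex before being used as features.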

5. SKIP-GRAM MODEL AS A SPECIAL CASE OF GRAREP

GraRep aims to learn representations for graphs where

we optimize the loss function based on matrix factorization.

On the other hand, SGNS has been shown to be successful

in handling linear structures such as natural language sen-

tences. Is there any intrinsic relationship between them? In

this section, we provide a view of SGNS as a special case of

the GraRep model.

5.1 Explicit Loss of Skip-gram Model on Graph

SGNS aims at representing words in linear sequences, so we need to translate a graph structure into linear structures. DeepWalk reveals an effective way to do this with uniform sampling (truncated random walk).

Table 1: Overall Algorithm

GraRep Algorithm
Input:
    Adjacency matrix S of the graph
    Maximum transition step K
    Log shifted factor β
    Dimension of representation vector d
1. Get the k-step transition probability matrices A^k
    Compute A = D^{-1} S
    Calculate A^1, A^2, ..., A^K, respectively
2. Get each k-step representation
    For k = 1 to K
        2.1 Get the positive log probability matrix
            Calculate Γ^k_1, Γ^k_2, ..., Γ^k_N, where Γ^k_j = Σ_p A^k_{p,j}
            Calculate {X^k_{i,j}}:  X^k_{i,j} = log( A^k_{i,j} / Γ^k_j ) − log(β)
            Assign negative entries of X^k to 0
        2.2 Construct the representation vectors W^k
            [U^k, Σ^k, (V^k)^T] = SVD(X^k)
            W^k = U^k_d (Σ^k_d)^{1/2}
    End for
3. Concatenate all the k-step representations
    W = [W^1, W^2, ..., W^K]
Output:
    Matrix of the graph representation W

This method first samples a vertex uniformly at random from the graph, then walks randomly to one of its neighbors and repeats this process. If the length of the vertex sequence reaches a certain preset value, the walk stops and a new sequence is started. This procedure can be used to produce a large number of sequences from the graph.

Essentially, for an unweighted graph, this strategy of uni-

form sampling works, while for a weighted graph, a proba-

bilistic sampling method based on the weights of the edges is

needed, which is not employed in DeepWalk. In this paper,

we propose an Enhanced SGNS (E-SGNS) method suitable

for weighted graphs. We also note that DeepWalk optimizes

an alternative loss function (hierarchical softmax) that is

diﬀerent from negative sampling.
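For completeness, here is a small sketch (our own, not from the paper) of the weight-proportional sampling just described: truncated random walks where the next vertex is drawn with probability proportional to the edge weight, which reduces to uniform sampling on an unweighted graph.

```python
import numpy as np

def sample_walks(S, walk_length, walks_per_vertex, seed=0):
    """Truncated random walks with transition probability proportional to edge weight."""
    rng = np.random.default_rng(seed)
    P = S / S.sum(axis=1, keepdims=True)         # row-normalized transition probabilities
    N = S.shape[0]
    walks = []
    for _ in range(walks_per_vertex):
        for start in range(N):
            walk = [start]
            for _ in range(walk_length - 1):
                walk.append(int(rng.choice(N, p=P[walk[-1]])))   # next vertex ~ edge weights
            walks.append(walk)
    return walks
```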

First, we consider the total K-step loss L on the whole graph:

L = f(L_1, L_2, \ldots, L_K)

where f(\cdot) is a linear combination of its arguments defined as follows:

f(\varphi_1, \varphi_2, \ldots, \varphi_K) = \varphi_1 + \varphi_2 + \cdots + \varphi_K

We focus on the loss of a specific pair (w, c), which are the i-th and j-th vertices in the graph. Similar to Section 3.2, we set the partial derivative to 0, and get

Y^{E-SGNS}_{i,j} = \log \left( \frac{M_{i,j}}{\sum_t M_{t,j}} \right) - \log(\beta)

where M is the transition probability matrix within K step(s), and M_{i,j} refers to the transition probability from vertex i to vertex j. Y^{E-SGNS} is the matrix factorized by E-SGNS, and

M = A^1 + A^2 + \cdots + A^K

The difference between E-SGNS and the GraRep model lies in the definition of f(\cdot). E-SGNS can be considered as a linear combination of the K step-wise losses, where each loss has an equal weight. Our GraRep model does not make such a strong assumption, but allows their (potentially non-linear) relationship to be learned from data in practice. Intuitively, different k-step transition probabilities should have different weights, and a linear combination of these may not achieve desirable results for heterogeneous network data.
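To make the contrast concrete, the sketch below (our illustration) constructs the single matrix that E-SGNS implicitly factorizes, in which all K step-wise transition matrices are summed with equal weight; GraRep instead keeps each A^k in its own matrix Y^k and factorizes them separately.

```python
import numpy as np

def esgns_target_matrix(S, K, beta):
    """Y^{E-SGNS}: equal-weight combination M = A^1 + ... + A^K, then one shifted log matrix."""
    A = S / S.sum(axis=1, keepdims=True)
    M, Ak = np.zeros_like(A), np.eye(S.shape[0])
    for _ in range(K):
        Ak = Ak @ A
        M += Ak                                             # equal weight for every step k
    with np.errstate(divide='ignore'):
        return np.log(M / M.sum(axis=0)) - np.log(beta)     # log(M_ij / sum_t M_tj) - log(beta)
```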

5.2 Intrinsic Relation Between Sampling and Transition Probabilities

In our approach, we use transition probabilities to measure the relationship between vertices. Is this reasonable? In this subsection, we articulate the intrinsic relation between sampling and transition probabilities.

Among the sequences generated by random walks, we assume vertex w occurs a total of T = γ·K times:

\#(w) = \gamma \cdot K

where γ is a variable related to \#(w). We regard vertex w as the current vertex; then the expected number of times that we see c_1 as its direct neighbor (1 step away from w) is:

\#(w, c_1) = \gamma \cdot K \cdot p_1(c|w)

This holds both for uniform sampling on unweighted graphs and for our proposed probabilistic sampling on weighted graphs. Further, we analyze the expected number of co-occurrences of w and c with a context window of size 2:

\#(w, c_2) = \gamma \cdot K \sum_{c'} p(c_2|c') \cdot p(c'|w) = \gamma \cdot K \cdot p_2(c|w)

where c' can be any vertex bridging w and c_2; that is, c' is a shared neighbor between w and c_2. Similarly, we can derive the equations for k = 3, 4, \ldots, K:

\#(w, c_3) = \gamma \cdot K \cdot p_3(c|w)
  \vdots
\#(w, c_K) = \gamma \cdot K \cdot p_K(c|w)

We then add them up and divide both sides by K, leading to:

\#(w, c) = \gamma \cdot \sum_{k=1}^{K} p_k(c|w)

where \#(w, c) is the expected co-occurrence count between w and c within K step(s).

According to the definition of M_{w,c}, we can get

\#(w, c) = \gamma \cdot M_{w,c}

Now, we can also compute the expected number of times we see c as the context vertex, \#(c):

\#(c) = \gamma \cdot \sum_t M_{t,c}

where we consider the transitions from all possible vertices to c.

Finally, we plug these expected counts into the equation for Y^{E-SGNS}_{i,j}, and we arrive at:

Y^{E-SGNS}_{w,c} = \log \left( \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)} \right) - \log(\lambda)

where D is the collection of all observed (vertex, context) pairs in the sequences. Here |D| = \gamma K N, where N is the total number of vertices in the graph. This matrix Y^{E-SGNS} becomes exactly the same as that of SGNS as described in [16].
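The right-hand side is the familiar shifted PMI of [16]. A small sketch (ours, under the assumption that the counts are collected in a dense matrix) of that quantity computed directly from a co-occurrence count matrix, with λ negative samples, is:

```python
import numpy as np

def shifted_pmi(counts, lam):
    """Shifted PMI: log(#(w,c) * |D| / (#(w) * #(c))) - log(lambda), from a count matrix."""
    total = counts.sum()                          # |D|: total number of observed (w, c) pairs
    row = counts.sum(axis=1, keepdims=True)       # #(w)
    col = counts.sum(axis=0, keepdims=True)       # #(c)
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.log(counts * total / (row * col)) - np.log(lam)
```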

This shows that SGNS is essentially a special version of our GraRep model that deals with linear sequences sampled from graphs. Our approach avoids the slow and expensive sampling process, which typically involves several parameters to tune, such as the maximum length of the linear sequences, the sampling frequency for each vertex, and so on.

6. EXPERIMENTAL DESIGN

In this section, we assess the eﬀectiveness of our GraRep

model through experiments. We conduct experiments on several real-world datasets for several different tasks, and make comparisons with baseline algorithms.

6.1 Datasets and Tasks

In order to demonstrate the performance of GraRep, we

conducted the experiments across three diﬀerent types of

graphs – a social network, a language network and a cita-

tion network, which include both weighted and unweighted

graphs. We conducted experiments across three diﬀerent

types of tasks, including clustering, classiﬁcation, and vi-

sualization. As we mentioned in earlier sections, this work

focuses on proposing a novel framework for learning good

representations of graphs with global structural information, and aims to validate the effectiveness of our proposed model. Thus we do not employ alternative, more efficient matrix factorization methods apart from SVD, and we focus on the following three real-world datasets in this section.

1. 20-Newsgroup¹ is a language network, which has approximately 20,000 newsgroup documents partitioned into 20 different groups. In this network, each document is represented by a vector of tf-idf scores of its words, and the cosine similarity is used to calculate the similarity between two documents. Based on these similarity scores of

each pair of documents, a language network is built. Fol-

lowing [29], in order to show the robustness of our model,

we also construct the following 3 graphs built from 3, 6 and

9 diﬀerent newsgroups respectively (note that NG refers to

“Newsgroups”):

3-NG:comp.graphics, comp.graphics and talk.politics.guns;

6-NG: alt.atheism, comp.sys.mac.hardware, rec.motorcycles,

rec.sport.hockey, soc.religion.christian and talk.religion.misc;

9-NG: talk.politics.mideast, talk.politics.misc, comp.os.ms-

windows.misc, sci.crypt, sci.med, sci.space, sci.electronics,

misc.forsale, and comp.sys.ibm.pc.hardware

Besides randomly sampling 200 documents from each topic as described in [29], we also conduct experiments on all documents for comparison. The topic label of each document is considered to be the ground truth.

This language network is a fully connected and weighted graph, and we will demonstrate the results of clustering, using the graph representations as features.

¹ qwone.com/~jason/20Newsgroups/
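As an illustration of this graph construction, the sketch below (our own, with scikit-learn; the category subset is a placeholder and not necessarily the exact groups used in the paper) builds the weighted, fully connected language network from tf-idf vectors and cosine similarities.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cats = ['comp.graphics', 'rec.motorcycles', 'talk.politics.guns']   # hypothetical 3-NG subset
docs = fetch_20newsgroups(subset='all', categories=cats)
tfidf = TfidfVectorizer().fit_transform(docs.data)   # tf-idf vector per document
S = cosine_similarity(tfidf)                         # pairwise cosine similarities as edge weights
np.fill_diagonal(S, 0.0)                             # drop self-loops; S is the weighted adjacency matrix
labels = docs.target                                 # topic labels treated as ground truth
```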

2. Blogcatalog² is a social network, where each vertex indicates a blogger and each edge corresponds to the relationship between two bloggers. 39 different types of topic categories provided by the authors serve as labels.

Blogcatalog is an unweighted graph, and we test the performance of the learned representations on the multi-label classification task, where we classify each author vertex into a set of labels. The graph representations generated by our model and each baseline algorithm are used as features.

3. DBLP Network³ is a citation network. We extract an author citation network from DBLP, where each vertex indicates one author and the number of references from one author to another is recorded as the weight of the edge between these two authors. Following [26], we select 6 different popular conferences in total and assign them into 3 groups, where WWW and KDD are grouped as data mining, NIPS and ICML as machine learning, and CVPR and ICCV as computer vision. We visualize the learned representations from all systems using the visualization tool t-SNE [31], which provides both qualitative and quantitative results for the learned representations. We give more details in Section 6.4.3.

To summarize, in this paper we conduct experiments on

both weighted and unweighted graphs, and both sparse and

dense graphs, where three diﬀerent types of learning tasks

are carried out. More details of the graphs we used are shown in Table 2.

6.2 Baseline Algorithms

We use the following methods of graph representation as

baseline algorithms.

1. LINE [25]. LINE is a recently proposed method for

learning graph representations on large-scale informa-

tion networks. LINE deﬁnes a loss function based on

1-step and 2-step relational information between ver-

tices. One strategy to improve the performance for vertices with small degrees is to make the graph denser by expanding their neighbors. LINE obtains its best performance when the representations of 1-step and 2-step relational information are concatenated and the threshold on the maximum number of vertices is tuned.

2. DeepWalk [20]. DeepWalk is a method that learns

the representations of social networks. The original model only works for unweighted graphs. For each ver-

tex, truncated random walk is used to translate graph

structure into linear sequences. The skip-gram model

with hierarchical softmax is used as the loss function.

3. E-SGNS. Skip-gram is an efficient model that learns the representation of each word in a large corpus [18]. For this enhanced version, we first utilize uniform sampling for unweighted graphs and probabilistic sampling proportional to edge weights for weighted graphs to generate linear vertex sequences, and then apply SGNS for optimization. This method can be regarded as a special case of our model, where the representational vectors of the different k-step information are effectively averaged.

² leitang.net/code/social-dimension/data/blogcatalog.mat
³ aminer.org/billboard/citation

4. Spectral Clustering [23]. Spectral clustering is a reasonable baseline algorithm, which aims at minimizing the Normalized Cut (NCut). Like our method, spectral clustering also factorizes a matrix, but it focuses on a different matrix of the graph, the Laplacian matrix. Essentially, the difference between spectral clustering and E-SGNS lies in their different loss functions.

6.3 Parameter Settings

As suggested in [25], for LINE, we set the mini-batch size of stochastic gradient descent (SGD) to 1, the starting learning rate to 0.025, the number of negative samples to 5, and the total number of samples to 10 billion. We also

concatenate both 1-step and 2-step relational information

to form the representations and employ the reconstruction

strategy for vertices with small degrees to achieve the opti-

mal performance. As mentioned in [20], for DeepWalk and E-SGNS, we set the window size to 10, the walk length to 40, and the number of walks per vertex to 80. According to [25], LINE yielded better

results when the learned graph representations are L2 nor-

malized, while DeepWalk and E-SGNS can achieve optimal

performance without normalization. For GraRep, we found that L2 normalization yielded better results. We report all the results based on these findings for each system accordingly. For a fair comparison, the dimension d of the representations is set to 128 for the Blogcatalog and DBLP networks, as used in [25], and to 64 for the 20-NewsGroup network, as used in [29]. For GraRep, we set β = 1/N and the maximum matrix transition step K = 6 for the Blogcatalog and DBLP networks and K = 3 for the 20-NewsGroup network. To demonstrate the advantage of our model, which captures global information of the graph, we also conducted experiments under various parameter settings for each baseline system, which are discussed in detail in the next section.

6.4 Experimental Results

In this section, we present empirical justiﬁcations that our

GraRep model can integrate diﬀerent k-step local relational

information into a global graph representation for diﬀerent

types of graphs, which can then be eﬀectively used for dif-

ferent tasks. We make the source code of GraRep available

at http://shelson.top/.

6.4.1 20-Newsgroup Network

We ﬁrst conduct an experiment on a language network

through a clustering task by employing the learned repre-

sentations in a k-means algorithm [2].

To assess the quality of the results, we report the aver-

aged Normalized Mutual Information (NMI) score [24] over

10 different runs for each system. To understand the effect of different dimensionality d on the end results, we also show the results when the dimension d is set to 192 for DeepWalk,

E-SGNS and Spectral Clustering. For LINE, we employ the

reconstruction strategy proposed by their work by adding

neighbors of neighbors as additional neighbors to improve

performance. We set k-max=0, 200, 500, 1000 for experi-

ments, where k-max is a parameter that is used to control

how the higher-order neighbors are added to each vertex in

the graph.
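A sketch of this evaluation protocol (ours, using scikit-learn rather than the exact toolkit used in the paper) is shown below: run k-means on the learned representations and average the NMI over several runs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def average_nmi(W, labels, n_clusters, runs=10):
    """Average NMI of k-means clustering on the representation matrix W over several runs."""
    scores = []
    for r in range(runs):
        pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=r).fit_predict(W)
        scores.append(normalized_mutual_info_score(labels, pred))
    return float(np.mean(scores))
```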

As shown in Table 3, the highest results are highlighted

in bold for each column. We can see that GraRep con-

sistently outperforms other baseline methods for this task.

For DeepWalk, E-SGNS and Spectral Clustering, increasing

Table 2: Statistics of the real-world graphs

               Language Network              Language Network            Social Network    Citation Network
Name           20-NewsGroup (200 samples)    20-NewsGroup (all data)     Blogcatalog       DBLP (author citation)
Type           weighted                      weighted                    unweighted        weighted
#(V)           600, 1200 and 1800            1,720, 3,224 and 5,141      10,312            7,314
#(E)           Fully connected               Fully connected             333,983           72,927
Avg. degree    —                             —                           64.78             19.94
#Labels        3, 6 and 9                    3, 6 and 9                  39                3
Task           Clustering                    Clustering                  Classification    Visualization

Table 3: Results on 20-NewsGroup

200 samples all data

Algorithm 3NG(200) 6NG(200) 9NG(200) 3NG(all) 6NG(all) 9NG(all)

GraRep 81.12 67.53 59.43 81.44 71.54 60.38

LINE (k-max=0) 80.36 64.88 51.58 80.58 68.35 52.30

LINE (k-max=200) 78.69 66.06 54.14 80.68 68.83 53.53

DeepWalk 65.58 63.66 48.86 65.67 68.38 49.19

DeepWalk (192dim) 60.89 59.89 47.16 59.93 65.68 48.61

E-SGNS 69.98 65.06 48.47 69.04 67.65 50.59

E-SGNS (192dim) 63.55 64.85 48.65 66.64 66.57 49.78

Spectral Clustering 49.04 51.02 46.92 62.41 59.32 51.91

Spectral Clustering (192dim) 28.44 27.80 36.05 44.47 36.98 47.36

Spectral Clustering (16dim) 69.91 60.54 47.39 78.12 68.78 57.87

Table 4: Results on Blogcatalog

Metric Algorithm 10% 20% 30% 40% 50% 60% 70% 80% 90%

Micro-F1

GraRep 38.24 40.31 41.34 41.87 42.60 43.02 43.43 43.55 44.24

LINE 37.19 39.82 40.88 41.47 42.19 42.72 43.15 43.36 43.88

DeepWalk 35.93 38.38 39.50 40.39 40.79 41.28 41.60 41.93 42.17

E-SGNS 35.71 38.34 39.64 40.39 41.23 41.66 42.01 42.16 42.25

Spectral Clustering 37.16 39.45 40.22 40.87 41.27 41.50 41.48 41.62 42.12

Macro-F1

GraRep 23.20 25.55 26.69 27.53 28.35 28.78 29.67 29.96 30.93

LINE 19.63 23.04 24.52 25.70 26.65 27.26 27.94 28.68 29.38

DeepWalk 21.02 23.81 25.39 26.27 26.85 27.36 27.67 27.96 28.41

E-SGNS 21.01 24.09 25.61 26.59 27.64 28.08 28.33 28.34 29.26

Spectral Clustering 19.26 22.24 23.51 24.33 24.83 25.19 25.36 25.52 26.21

Figure 3: Visualization of the author citation network with (a) Spectral Clustering, (b) DeepWalk, (c) E-SGNS, (d) LINE and (e) GraRep. Each point indicates one author. Green: data mining, magenta: computer vision, blue: machine learning. [Plots omitted.]

the dimension d of the representations does not appear to be effective in improving the performance. We believe this is because a higher dimension does not provide additional complementary information to the representations. For LINE, the reconstruction strategy does help, since it can capture additional structural information of the graph beyond the 1-step and 2-step local information.

It is worth mentioning that GraRep and LINE can achieve

good performance with a small graph. We believe this is be-

cause both approaches can capture rich local relational in-

formation even when the graph is small. Besides, for tasks

with more labels, such as 9NG, GraRep and LINE can pro-

vide better performances than other methods, with GraRep

providing the best. An interesting ﬁnding is that Spectral

Clustering arrives at its best performance when setting di-

mension dto 16. However, for all other algorithms, their

best results are obtained when dis set to 64. We show these

results in Figure 5 when we assess the sensitivity of param-

eters.

Due to limited space, we do not report the detailed results

with d=128 and d=256 for all baseline methods, as well as

k-max=500, 1000 and without reconstruction strategy for

LINE, since these results are no better than the results pre-

sented in Table 3. In Section 6.5, we will further discuss the

issue of parameter sensitivity.

6.4.2 Blogcatalog Network

In this experiment, we focus on a supervised task on a so-

cial network. We evaluate the eﬀectiveness of diﬀerent graph

representations through a multi-label classiﬁcation task by

regarding the learned representations as features.

Following [20, 27], we use the LibLinear package [10] to

train one-vs-rest logistic regression classiﬁers, and we run

this process for 10 rounds and report the averaged Micro-

F1 and Macro-F1 measures. For each round, we randomly

sample 10% to 90% of the vertices and use these samples

for training, and use the remaining vertices for evaluation.

As suggested in [25], we set k-max as 0, 200, 500 and 1000,

respectively, and we report the best performance with k-

max=500.
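The classification protocol can be sketched as follows (our illustration with scikit-learn on top of LIBLINEAR; Y is the binary label-indicator matrix and train_ratio the fraction of labelled vertices, both assumed names).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def multilabel_f1(W, Y, train_ratio, seed=0):
    """One-vs-rest logistic regression on features W; returns (Micro-F1, Macro-F1)."""
    W_tr, W_te, Y_tr, Y_te = train_test_split(W, Y, train_size=train_ratio, random_state=seed)
    clf = OneVsRestClassifier(LogisticRegression(solver='liblinear', max_iter=1000)).fit(W_tr, Y_tr)
    pred = clf.predict(W_te)
    return f1_score(Y_te, pred, average='micro'), f1_score(Y_te, pred, average='macro')
```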

Table 4 reports the results of GraRep and each baseline al-

gorithm. The highest performance is highlighted in bold for

each column which corresponds to one particular train/test

split. Overall, GraRep signiﬁcantly outperforms other meth-

ods, especially under the setting where only 10% of the ver-

tices are used for training. This indicates diﬀerent types

of rich local structural information learned by the GraRep

model can be used to complement each other to capture

the global structural properties of the graph, which serves

Table 5: Final KL divergence for the DBLP dataset

Algorithm GraRep LINE DeepWalk E-SGNS

KL divergence 1.0070 1.0816 1.1115 1.1009

as a distinctive advantage over existing baseline approaches,

especially when the data is relatively scarce.

6.4.3 DBLP Network

In this experiment, we focus on visualizing the learned rep-

resentations by examining a real citation network – DBLP.

We feed the learned graph representations into the standard

t-SNE tool [31] to lay out the graph, where the authors from

the same research area (one of data mining, machine learn-

ing or computer vision, see Section 6.1) share the same color.

The graphs are visualized on a 2-dimensional space and the

Kullback-Leibler divergence is reported [31], which captures

the errors between the input pairwise similarities and their

projections in the 2-dimensional map as displayed (a lower

KL divergence score indicates a better performance).
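A minimal sketch of this step (ours, with scikit-learn's t-SNE; W and area_ids stand for the learned representation matrix and the per-author area labels, both assumed here):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=0)            # same t-SNE configuration for every system
coords = tsne.fit_transform(W)                         # W: learned representation matrix (assumed)
print('final KL divergence:', tsne.kl_divergence_)     # the quantity reported in Table 5
plt.scatter(coords[:, 0], coords[:, 1], c=area_ids, s=3)   # area_ids: research-area labels (assumed)
plt.show()
```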

With the t-SNE toolkit, we use the same parameter configuration and import the representations generated by each algorithm. From Figure 3, we can see that the layout using Spectral Clustering is not very informative, since vertices of different colors are mixed with each other. For DeepWalk and E-SGNS, the results look much better, as most vertices of the same color appear to form groups. However, vertices still do not appear in clearly separable regions with clear boundaries. For LINE and GraRep, the boundaries of each group become much clearer, with vertices of different colors appearing in clearly distinguishable regions. The results for GraRep appear to be better, with clearer boundaries for each region as compared to LINE.

Table 5 reports the KL divergence at the end of iterations.

Under the same parameter setting, a lower KL divergence

indicates a better graph representation. From this result, we can also see that GraRep yields better representations than

all other baseline methods.

6.5 Parameter Sensitivity

We discuss the parameter sensitivity in this section. Specifically, we assess how the different choices of the maximal k-step size K, as well as the dimension d of our representations, can affect our results.

Figure 4: Performance over steps, K. [(a) Blogcatalog, Micro-F1; (b) Blogcatalog, Macro-F1; curves for K = 1, 2, 5, 6, 7 over 1%-9% labeled nodes; plots omitted.]

Figure 4 shows the Micro-F1 and Macro-F1 scores over different choices of K on the Blogcatalog data. We can observe that the setting K=2 yields a significant improvement over the setting K=1, and K=3 further outperforms K=2. This confirms that different k-steps can capture complementary local information. In addition, as mentioned in Section 3, when k is large enough, the learned k-step relational information becomes weak and shifts towards a steady distribution. Empirically, we can observe that the performance of K=7 is no better than that of K=6. In this figure, for readability and clarity, we only present the results for K=1, 2, 3, 6 and 7. We found the performance of K=4 to be slightly better than K=3, while the results for K=5 are comparable to K=4.

Figure 5: Performance over dimensions, d. [(a) 3NG, NMI; (b) 9NG, NMI; curves for GraRep, DeepWalk, E-SGNS and LINE over d = 32, 64, 128, 256; plots omitted.]

Figure 5 shows the NMI scores of each algorithm over different settings of the dimension d on the 3NG and 9NG data. We can observe that GraRep consistently outperforms the other baseline algorithms when learning representations with the same dimension. This set of experiments serves as an additional supplement to Table 3. Interestingly, all algorithms obtain their optimal performance with d = 64. As we increase d from 64 to larger values, it appears that the performance of all the algorithms starts to drop. Nevertheless, our GraRep algorithm is still superior to the baseline systems across different values of d.

Figure 6: Running time as a function of (a) dimension and (b) size of graph. [(a) Blogcatalog, running time over concatenated dimensions 128 to 128*7; (b) 20NG, running time over graph sizes from 600 to 5,141 vertices; plots omitted.]

Figure 6 shows the running time over different dimensions as well as over different graph sizes. In Figure 6(a), we set K from 1 to 7, where each complementary feature vector has a dimension of 128. This set of experiments is conducted on the Blogcatalog dataset, which contains around 10,000 vertices. It can be seen that the running time increases approximately linearly as the dimension increases. In Figure 6(b), we analyze the running time with respect to graph size. This set of experiments is conducted on different sizes of the 20-NewsGroup dataset. The result shows a significant increase in running time as the graph size increases. The large increase in running time is mainly due to the high time complexity involved in the computation of matrix powers and the SVD procedure.

7. CONCLUSIONS

In this paper, GraRep, a novel model for learning better graph representations, is proposed. Our model, with k-step loss functions defined on graphs which integrate rich local structural information associated with the graph, captures the global structural properties of the graph. We also provide mathematical derivations justifying the model and establish the connections to previous research efforts. Empirically, the learned representations can be effectively used as features in other learning problems such as clustering and classification. This model comes with one limitation: the expensive computation of matrix powers and the SVD involved in the learning process. Future work includes the investigation of efficient and online methods to approximate these matrix algebraic manipulations, as well as the investigation of alternative methods employing deep architectures for learning low-dimensional representations in place of SVD.

8. ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers

for their helpful comments. The authors would also like to

thank Fei Tian, Qingbiao Miao and Jie Liu for their suggestions and help with this work. This work was done while the first author was an intern at SUTD. This work was sup-

ported by SUTD grant ISTD 2013 064 and Temasek Lab

project IGDST1403013.

9. REFERENCES

[1] A. Ahmed, N. Shervashidze, S. Narayanamurthy,

V. Josifovski, and A. J. Smola. Distributed large-scale

natural graph factorization. In WWW, pages 37–48.

International World Wide Web Conferences Steering

Committee, 2013.

[2] D. Arthur and S. Vassilvitskii. k-means++: The

advantages of careful seeding. In SODA, pages

1027–1035. Society for Industrial and Applied

Mathematics, 2007.

[3] M. Belkin and P. Niyogi. Laplacian eigenmaps and

spectral techniques for embedding and clustering. In

NIPS, volume 14, pages 585–591, 2001.

[4] J. A. Bullinaria and J. P. Levy. Extracting semantic

representations from word co-occurrence statistics: A

computational study. BRM, 39(3):510–526, 2007.

[5] J. A. Bullinaria and J. P. Levy. Extracting semantic

representations from word co-occurrence statistics:

stop-lists, stemming, and svd. BRM, 44(3):890–907,

2012.

[6] J. Caron. Experiments with lsa scoring: Optimal rank

and basis. In CIR, pages 157–169, 2001.

[7] P. Comon. Independent component analysis, a new

concept? Signal processing, 36(3):287–314, 1994.

[8] T. F. Cox and M. A. Cox. Multidimensional scaling.

CRC Press, 2000.

[9] C. Eckart and G. Young. The approximation of one

matrix by another of lower rank. Psychometrika,

1(3):211–218, 1936.

[10] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang,

and C.-J. Lin. Liblinear: A library for large linear

classiﬁcation. JMLR, 9:1871–1874, 2008.

[11] M. U. Gutmann and A. Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. JMLR, 13(1):307–361, 2012.

[12] G. E. Hinton and R. R. Salakhutdinov. Reducing the

dimensionality of data with neural networks. Science,

313(5786):504–507, 2006.

[13] C. Jutten and J. Herault. Blind separation of sources,

part i: An adaptive algorithm based on neuromimetic

architecture. Signal processing, 24(1):1–10, 1991.

[14] V. Klema and A. J. Laub. The singular value

decomposition: Its computation and some

applications. Automatic Control, 25(2):164–176, 1980.

[15] T. K. Landauer, P. W. Foltz, and D. Laham. An

introduction to latent semantic analysis. Discourse

processes, 25(2-3):259–284, 1998.

[16] O. Levy and Y. Goldberg. Neural word embedding as

implicit matrix factorization. In NIPS, pages

2177–2185, 2014.

[17] K. Lund and C. Burgess. Producing high-dimensional

semantic spaces from lexical co-occurrence. BRMIC,

28(2):203–208, 1996.

[18] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and

J. Dean. Distributed representations of words and

phrases and their compositionality. In NIPS, pages

3111–3119, 2013.

[19] J. Pennington, R. Socher, and C. D. Manning. Glove:

Global vectors for word representation. EMNLP, 12,

2014.

[20] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk:

Online learning of social representations. In SIGKDD,

pages 701–710. ACM, 2014.

[21] S. T. Roweis and L. K. Saul. Nonlinear dimensionality

reduction by locally linear embedding. Science,

290(5500):2323–2326, 2000.

[22] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl.

Incremental singular value decomposition algorithms

for highly scalable recommender systems. In ICIS,

pages 27–28. Citeseer, 2002.

[23] J. Shi and J. Malik. Normalized cuts and image

segmentation. PAMI, 22(8):888–905, 2000.

[24] A. Strehl, J. Ghosh, and R. Mooney. Impact of

similarity measures on web-page clustering. In

Workshop on Artiﬁcial Intelligence for Web Search

(AAAI 2000), pages 58–64, 2000.

[25] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and

Q. Mei. Line: Large-scale information network

embedding. In WWW. ACM, 2015.

[26] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su.

Arnetminer: extraction and mining of academic social

networks. In SIGKDD, pages 990–998. ACM, 2008.

[27] L. Tang and H. Liu. Relational learning via latent

social dimensions. In SIGKDD, pages 817–826. ACM,

2009.

[28] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A

global geometric framework for nonlinear

dimensionality reduction. Science,

290(5500):2319–2323, 2000.

[29] F. Tian, B. Gao, Q. Cui, E. Chen, and T.-Y. Liu.

Learning deep representations for graph clustering. In

AAAI, 2014.

[30] P. D. Turney. Domain and function: A dual-space

model of semantic relations and compositions. JAIR,

pages 533–585, 2012.

[31] L. Van der Maaten and G. Hinton. Visualizing data

using t-sne. JMLR, 9(2579-2605):85, 2008.