
WALK EXTRACTION STRATEGIES FOR NODE EMBEDDINGS WITH RDF2VEC IN KNOWLEDGE GRAPHS

A PREPRINT

Gilles Vandewiele, IDLab, Ghent University – imec, Belgium (gilles.vandewiele@ugent.be)
Bram Steenwinckel, IDLab, Ghent University – imec, Belgium (bram.steenwinckel@ugent.be)
Pieter Bonte, IDLab, Ghent University – imec, Belgium (pieter.bonte@ugent.be)
Michael Weyns, IDLab, Ghent University – imec, Belgium (michael.weyns@ugent.be)
Heiko Paulheim, Data and Web Science Group, University of Mannheim, Germany (heiko@informatik.uni-mannheim.de)
Petar Ristoski, IBM Research Almaden, IBM, United States of America (petar.ristoski@ibm.com)
Filip De Turck, IDLab, Ghent University – imec, Ghent, Belgium (filip.deturck@ugent.be)
Femke Ongenae, IDLab, Ghent University – imec, Ghent, Belgium (femke.ongenae@ugent.be)

September 10, 2020

ABSTRACT

As knowledge graphs (KGs) are symbolic constructs, specialized techniques have to be applied in order to make them compatible with data mining techniques. RDF2Vec is an unsupervised technique that can create task-agnostic numerical representations of the nodes in a KG by extending successful language modelling techniques. The original work proposed the Weisfeiler-Lehman (WL) kernel to improve the quality of the representations. However, in this work, we show both formally and empirically that the WL kernel does little to improve walk embeddings in the context of a single KG. As an alternative to the WL kernel, we propose five different strategies to extract information complementary to basic random walks. We compare these walks on several benchmark datasets to show that the n-gram strategy performs best on average on node classification tasks and that tuning the walk strategy can result in improved predictive performances.

Keywords: Knowledge graphs · Embeddings · Representation Learning

1 Introduction

As a result of the recent data deluge, we are increasingly confronted with more information than we can meaningfully make sense of. Above all, this information is characterised by contextual heterogeneity: its origins are semantically and syntactically diverse. Insights derived through traditional data mining procedures will be constrained precisely because such procedures fail to account for this staggering diversity. To deal with such a myriad of contextual environments and backgrounds, the Semantic Web's (SW) Linked Open Data (LOD) initiative can be used to interlink various data sources and unite them under a common queryable interface. The product of such a consolidation effort is often called a Knowledge Graph (KG). In addition to unifying information from various sources, KGs are able to enrich classical data formats by explicitly encoding relations between different data points in the form of edges.

Using KGs to enhance traditional data mining techniques with background knowledge is a relatively recent endeavour [1]. Because KGs are symbolic constructs, their compatibility with such techniques is rather limited. In fact, data mining techniques usually require inputs to be presented as numerical feature vectors, and are therefore unable to process background knowledge directly. With this in mind, some of the earliest knowledge-enhanced data mining approaches proceeded by extracting custom features from specific and generic relations inside the graph [2]. While these approaches produce human-interpretable variables, they have to be tailored to the task at hand and therefore require extensive effort. As an alternative to feature-based approaches, techniques can be applied to learn vector representations, called embeddings, for each of the entities inside a graph based on a limited set of global latent features [3, 4]. These techniques are task-agnostic, which allows them to be used for different downstream tasks, such as predicting missing links inside a graph or categorizing different nodes [5].

Natural language and graphs often share similarities. As an example, frequencies of language symbols and graph structures both tend to approximate Zipf's law. Techniques such as DeepWalk [6] and Node2Vec [7] were among the first to leverage these similarities, by extending successful language modelling techniques, such as Word2Vec [8], to deal with graph-based data. Their proposed techniques rely on the extraction of sequences of graph vertices, which are then fed as sentences to language models. Similarly, work on (deep) graph kernels also relies on language modelling to learn the latent representations of graph substructures [9, 10, 11]. RDF2Vec is a technique that builds on the progress made by these previous two types of techniques by adapting random walks and the Weisfeiler-Lehman (WL) subtree kernel to directed graphs with labelled edges, i.e. KGs [12].

In this work, we show that the WL kernel, while effective for measuring similarities between nodes or when working with regular graphs, offers little improvement with respect to walk embeddings in the context of a single KG. In response to this observation, we propose various alternative walk strategies for RDF data to improve upon basic random walks and compare them on different benchmark datasets.

The remainder of this paper is structured as follows. In Section 2, background information is provided on KGs, RDF2Vec, walk embeddings and the WL kernel. Next, in Section 3, we provide a formal discussion of our claim with respect to the feasibility of Weisfeiler-Lehman subtree RDF graph kernels. Following this, Section 4 discusses a number of possible alternative walk strategies, including pseudo-code listings for each algorithm. Section 5 then describes the datasets used to evaluate these alternative strategies and lists the corresponding results. These results are subsequently discussed in Section 6. Finally, in Section 7 we conclude this work with a general reflection.

2 Background

2.1 Knowledge graphs

A KG is a multi-relational directed graph, G = (V, E, ℓ), where V is the set of vertices in our graph, E the set of edges or predicates, and ℓ a labelling function that maps each vertex or edge onto its corresponding label. It should further be noted that we can identify three different types of vertices: (i) entities, (ii) blank nodes, and (iii) literals. We can simplify further analysis by applying a transformation to the knowledge graph which removes the multi-relational aspect, as done by de Vries et al. [13]. This is done by representing each (subject, predicate, object) triple from the original knowledge graph by three labelled nodes and two unlabelled edges (subject → predicate and predicate → object).
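To make this transformation concrete, the following minimal sketch (in Python, using networkx) converts a list of triples into the simplified graph; the function name and the choice to reify each predicate occurrence as its own node are assumptions on our part.

import networkx as nx

def simplify_kg(triples):
    """Convert (subject, predicate, object) triples into a directed graph
    with unlabelled edges, representing each triple by three labelled nodes
    and two edges: subject -> predicate -> object.

    Each predicate occurrence is reified as its own node (an assumption on
    our part) so that the path s -> p -> o stays specific to one triple."""
    g = nx.DiGraph()
    for i, (s, p, o) in enumerate(triples):
        p_node = "p%d" % i                 # unique id per predicate occurrence
        g.add_node(s, label=s)
        g.add_node(p_node, label=p)        # node labelled with the predicate
        g.add_node(o, label=o)
        g.add_edge(s, p_node)
        g.add_edge(p_node, o)
    return g

g = simplify_kg([("Alice", "knows", "Bob"), ("Bob", "livesIn", "Ghent")])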

2.2 RDF2Vec

Machine learning algorithms cannot work directly with graph-based data, as they require numerical vectors as input. RDF2Vec [12] is an unsupervised, task-agnostic approach that solves this problem by first transforming the information of the nodes in the graph into numerical data, which are called latent representations or embeddings. The goal is to capture as much of the semantics as possible in the numerical representation, i.e. entities that are semantically related should be close to each other in the embedded space. RDF2Vec builds on word embedding techniques, which have shown great success in the domain of natural language processing. These word embedding techniques take a corpus of sentences as input, and learn a latent representation for each of the unique words within the corpus. Learning this latent representation can be done, for example, by learning to predict a word based on its context (continuous bag-of-words) or by predicting the context based on a target word (skip-gram) [14, 8].

2.3 (Random) walk embeddings

In the context of (knowledge) graphs, we can construct an input corpus by extracting walks. A walk is a sequence of vertices that can be found in the graph by traversing the directed links. We can notate a walk of length n as follows:

v_0 → v_1 → ... → v_{n−1}   (1)

We can then use the graph labelling function to create a sentence:

ℓ(v_0) → ℓ(v_1) → ... → ℓ(v_{n−1})   (2)

It should be noted that, due to the previously discussed transformation of our KG, nodes with an even index in the walk correspond to entities of the original knowledge graph, while nodes with an odd index correspond to predicates. The most straightforward strategy to extract walks is by doing a breadth-first traversal of the graph starting from the nodes of interest. Since the total number of walks that can be extracted grows exponentially in function of the depth, sampling can be applied after each iteration of breadth-first traversal. This sampling can either be guided by some metric, resulting in a collection of biased walks [15], or can be performed at random, which results in random walks.
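As an illustration, a minimal sketch of breadth-first walk extraction with per-level random sampling is given below; the function and parameter names are our own, and g is assumed to map each vertex to its successors.

import random

def extract_walks(g, root, depth, max_per_level=None):
    """Breadth-first walk extraction starting from `root`.

    After each iteration, the set of partial walks can be randomly
    down-sampled to `max_per_level` walks, since the number of walks
    grows exponentially with the depth."""
    walks = [(root,)]
    for _ in range(depth):
        extended = [walk + (n,) for walk in walks for n in g[walk[-1]]]
        if not extended:              # no outgoing edges anywhere: stop early
            break
        if max_per_level is not None and len(extended) > max_per_level:
            extended = random.sample(extended, max_per_level)  # random walks
        walks = extended
    return walks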

2.4 Weisfeiler-Lehman kernel

The WL kernel extends the labelling function introduced above. The WL algorithm was originally proposed to test whether two graphs are isomorphic in polynomial time [16]. The intuition behind the algorithm is to assign new labels to each of the nodes, where each newly assigned label captures the information of an entire subgraph up to a certain depth. This algorithm was later adapted to serve as a kernel, or similarity measure, between graphs [17, 18], by counting the number of WL labels two graphs have in common. The WL relabelling for a node v is performed by recursively hashing the concatenation of the label of the node and the sorted labels of its neighbours:

WL_0(v) = ℓ(v)
WL_k(v) = HASH(WL_{k−1}(v) ⊕ SORT({WL_{k−1}(v_n) | v_n ∈ N(v)}))   (3)

with k ≥ 1, WL_k(v) the WL label of node v after k iterations, HASH a hashing function, SORT a sorting function, ⊕ the string concatenation operator and N(v) the neighbourhood of v. We assume that the hashing function does not produce collisions. This can easily be achieved by, for example, mapping each unique label to an integer. The neighbourhood of a vertex v is defined as the set of vertices v′ with an edge going to v:

N(v) = {v′ | (v′, v) ∈ E}   (4)
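The relabelling in Eq. 3 can be sketched as follows; the dictionary-based hashing, which maps each unique label string to a fresh integer, is one possible collision-free choice, and all names are our own.

def wl_relabel(labels, in_neighbours, iterations):
    """Iterative WL relabelling (a sketch under our own naming).

    `labels` maps each vertex to its initial label l(v); `in_neighbours`
    maps each vertex to the vertices with an edge *to* it (Eq. 4)."""
    hash_map = {}
    def hash_label(s):
        # Collision-free hash: each unique string gets a fresh integer
        return hash_map.setdefault(s, len(hash_map))

    wl = dict(labels)                                   # WL_0(v) = l(v)
    for _ in range(iterations):
        wl = {
            v: hash_label(
                str(wl[v]) + "|" + str(sorted(wl[n] for n in in_neighbours[v]))
            )
            for v in wl
        }
    return wl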

3 Weisfeiler-Lehman kernel for Knowledge Graphs

Ristoski et al. proposed to use the WL kernel in order to relabel nodes as an alternative to extracting random walks [12]. We will refer to this as the WL strategy in the remainder of this paper. However, we argue that this WL strategy provides no additional information with respect to entity representations when extracting a fixed number of random walks from a knowledge graph. We now formally demonstrate this claim for all three types of vertices.

3.1 Entities

The WL strategy brings little to no added value in comparison to random walks when applied to entities in a knowledge graph, due to the properties of RDF data. Entities in RDF are represented by Uniform Resource Identifiers (URIs), which need to be unique (https://www.w3.org/DesignIssues/Axioms.html). As such:

ℓ(x) = ℓ(y) ⇐⇒ x = y   (5)

Due to this property, WL relabelling, when applied on RDF data, is nothing more than a bijection from the hops in random walks to the hops in the walks obtained through WL relabelling. This means that WL relabelling does not add any useful additional information: when two WL labels are equal, their underlying entities are always equal as well. We will now prove this formally. First, from Eq. 3 we can deduce that when the WL labels of two nodes are equal, then at least the labels of these nodes must be equal and they must have the same neighbours (we omit the proof of this intermediate step due to space restrictions, but as WL is recursive, it can be proven through induction). Formally, this means that:

WL_k(v_j) = WL_k(v_i) ⇐⇒ ℓ(v_j) = ℓ(v_i) ∧ N(v_j) = N(v_i)   (6)

Proof: WL_k(v_j) = WL_k(v_i) ⇐⇒ v_j = v_i

Step 1: WL_k(v_j) = WL_k(v_i) =⇒ v_j = v_i

WL_k(v_j) = WL_k(v_i) =⇒ ℓ(v_j) = ℓ(v_i) ∧ N(v_j) = N(v_i)   (Eq. 6)
                      =⇒ ℓ(v_j) = ℓ(v_i)                     (Conjunction Elim.)
                      =⇒ v_j = v_i                            (Eq. 5)

Step 2: v_j = v_i =⇒ WL_k(v_j) = WL_k(v_i)

v_j = v_i =⇒ ℓ(v_j) = ℓ(v_i)                                  (Eq. 5)
          =⇒ ℓ(v_j) = ℓ(v_i) ∧ N(v_j) = N(v_i)                (Eq. 4 and v_j = v_i)
          =⇒ WL_k(v_j) = WL_k(v_i)                             (Eq. 6)

3.2 Blank nodes and literals

In contrast to entities, Eq. 5 does not hold for blank nodes and literals. This implies that multiple nodes could have the same original label but different WL labels (a one-to-many mapping). However, the added value of WL is limited even in these cases, because blank nodes rarely have the exact same neighbourhoods and because literals only have one incoming edge and no outgoing edges. Moreover, because RDF2Vec treats each hop in the walk as categorical data, it does not handle literals well.

4 Custom walk extraction strategies

Based on the observations discussed previously, we now identify two types of strategies to construct a corpus of walks:

Type 1 - Extraction: strategies that define how walks for each of the entities are extracted. The random walk strategy is an example of such a strategy, where breadth-first traversal is applied to extract walks.

Type 2 - Transformation: strategies that transform walks extracted by a Type 1 strategy. The WL strategy is an example of this type. In order for such a strategy to provide information complementary to the originally provided walks, it must define a one-to-many or many-to-one mapping from the original labels to the new labels.

We now propose five different strategies complementary to the random strategy. One of these strategies can be classified as being of Type 1, while the other four are of Type 2.

4.1 Community hops

As opposed to iteratively extending the walk with neighbours of a vertex, we can, with a certain probability, allow teleportation to a node that has properties similar to a certain neighbour [19]. In order to group nodes with similar properties together, unsupervised community detection can be applied [20]. In this work, we opted to use the Louvain method [21] due to its excellent trade-off between speed and clustering quality. The idea of introducing community hops is to capture implicit relations between nodes that are not explicitly modelled in the KG, and to allow for including related pieces of knowledge in the walks which are otherwise out of reach. We provide pseudo-code for this strategy in Algorithm 1. This strategy is of Type 1. We will refer to this strategy as community.

4.2 Anonymous walks

The random walks discussed in the previous section can be anonymized, which transforms label information into positional information. More formally, a walk w = v_0 → v_1 → ... → v_n is transformed into f(v_0) → f(v_1) → ... → f(v_n), with f(v_i) = min({j | w[j] = v_i}), which corresponds to the first index at which v_i can be found in the walk w [22].


Alg. 1: community_walk(G, v, depth, p, hop_prob)

# List of communities and a dictionary {vertex: community index}
com, com_map = com_detection(G)
walks = { (v,) }
for d in range(depth):
    new = set()
    for walk in walks:
        for n in get_neighbours(G, walk[-1]):
            # Keep this extension of the walk with probability p
            if random() < p:
                new.add(walk + (n,))
            # With probability hop_prob, teleport to a random
            # member of the community that n belongs to
            if random() < hop_prob:
                c_n = com[com_map[n]]
                hop = choice(c_n)
                new.add(walk + (hop,))
    walks = new
return walks

The notion behind anonymous walks is that local graph structures often bear enough information for encoding and reconstructing a graph, even when ignoring the node labels, i.e., the mere topology surrounding a node is often sufficient for identifying that node. Ignoring the labels, on the other hand, allows for a computationally efficient generation of the walks. We present pseudo-code for this transformation in Algorithm 2. This strategy is of Type 2. We will refer to this strategy as anonymous.

Alg. 2: anonymize(walks)

anon_walks = [ ]
for walk in walks:
    # Keep the root label and replace every subsequent hop
    # by the first index at which it occurs in the walk
    new = [ walk[0] ]
    for hop in walk[1:]:
        new.append(walk.index(hop))
    anon_walks.append(new)
return anon_walks

4.3 Walklets

Walks can be transformed into walklets, which are walks of length two consisting of the root of the original walk and one of the hops. Provided a walk w = v_0 → v_1 → ... → v_n, we can construct the set of walklets {(v_0, v_i) | 1 ≤ i ≤ n} [23]. While standard RDF2Vec does not consider the distance between two nodes in a walk, walklets are explicitly created for different scales. Hence, they allow for such a distinction between a direct neighbour and a node which is further away. Pseudo-code for this approach is provided in Algorithm 3. This strategy is of Type 2. We will refer to this strategy as walklet.

Alg. 3: walklets(walks)

walklets = set()
for walk in walks:
    for i in range(1, len(walk)):
        walklets.add((walk[0], walk[i]))
return walklets

4.4 Hierarchical random walk (HALK)

The frequency of entities in a knowledge graph often follows a long-tailed distribution, similar to natural language. Rarely occurring entities often carry little information, and increase the number of hops between the root and potentially more interesting entities. As such, the removal of rare entities from the random walks can increase the quality of the generated embeddings while decreasing the memory usage [24]. Pseudo-code for this strategy is provided in Algorithm 4. This strategy is of Type 2. We will refer to this strategy as HALK.

Alg. 4: halk(walks, thresholds)

# For each hop, record the indices of the walks it occurs in
counts = { }
for i in range(len(walks)):
    for hop in walks[i]:
        if hop not in counts:
            counts[hop] = {i}
        else:
            counts[hop].add(i)

# Drop hops whose fraction of walks is below the threshold
halk_walks = [ ]
for thresh in thresholds:
    for walk in walks:
        new = [ walk[0] ]
        for hop in walk[1:]:
            if len(counts[hop]) / len(walks) >= thresh:
                new.append(hop)
        halk_walks.append(new)
return halk_walks

4.5 N-Gram walks

Another approach that defines a one-to-many mapping is relabelling n-grams in the random walks. The intuition behind this is that the predecessors of a node that two different walks have in common can be different. Additionally, we can inject wildcards into the walk before relabelling n-grams [25]. The injection of wildcards allows subsequences with small differences to be mapped onto the same label. Pseudo-code for this strategy is provided in Algorithm 5. This strategy is of Type 2. We will refer to this strategy as n-gram.

Alg. 5: ngram(walks, n, n_wild)

# Introduce wildcards in copies of the walks
extended_walks = list(walks)
for walk in walks:
    idx = range(1, len(walk))
    combs = combinations(idx, n_wild)
    for comb in combs:
        new = list(walk)
        for i in comb:
            new[i] = '*'
        extended_walks.append(new)

# Relabel n-grams in the walks
ngram_walks = [ ]
ngram_map = { }
for walk in extended_walks:
    new = walk[:n]
    for i in range(n, len(walk) + 1):
        ngram = tuple(walk[i-n:i])
        if ngram not in ngram_map:
            ngram_map[ngram] = len(ngram_map)
        new.append(ngram_map[ngram])
    ngram_walks.append(new)
return ngram_walks

4.6 Example

In order to further clarify each of the proposed strategies, we provide an example in Figure 1.

[Figure 1: a small example graph with nodes A through I; for each strategy (Random, Community, Anonymous, Walklet, HALK with threshold 0.2, and N-Gram with n = 2 and 1 wildcard), the walks obtained from nodes "A" and "F" are listed.]

Figure 1: An example of each of our proposed strategies. We extract walks of exactly depth 2 from nodes "A" and "F" with the random and community strategies. For the other strategies, we transform the walks extracted by the random strategy. Nodes "C" and "H" belong to the same community.

5 Results

To evaluate the impact of custom walking strategies, we measure the predictive performance on different datasets and

various tasks.

5.1 Datasets

Three different types of datasets are used, in order to ensure enough variation in our evaluation. Moreover, these

datasets are commonly used in (knowledge) graph-based machine learning studies.

5.1.1 Node classification benchmark datasets

We use four benchmark data sets (AIFB, MUTAG, BGS and AM), each describing a knowledge graph, that serve as benchmarks for node classification and are available from a public repository set up by Ristoski et al. [26]. For each of these data sets, we remove triples with specific predicates that would leak the target from our knowledge graph, as specified by the original authors. Moreover, a predefined split into train and test set, with the corresponding ground truth, is provided by the authors, which we used in our experiments.

5.1.2 Citation networks

We converted three citation networks [27], which describe scientific papers, to knowledge graphs. The three citation networks used are CORA, CITESEER and PUBMED. Each paper is represented by a bag-of-words or tf-idf representation of its content and a list of citations to other papers in the network. A fixed train/test split is provided for each of the datasets and the associated task is to categorize each of the papers into the correct research domain, which can be seen as a node classification task. For each paper p, we obtained the words w from the bag-of-words or tf-idf vector with a value greater than 0 and add the following triples to our KG: {(p, hasWord, w) | f(p, w) > 0}, with f(p, w) a function that retrieves the bag-of-words or tf-idf value of word w for paper p. Moreover, for each paper p′ cited by p, we add the following triple: (p, cites, p′). A sketch of this conversion is given below.
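This is a minimal sketch of the conversion; the variable names and the dense feature-matrix layout are assumptions on our part.

def citation_network_to_triples(features, vocabulary, citations):
    """Convert a citation network into KG triples.

    `features[p][j]` is assumed to hold the bag-of-words or tf-idf value of
    word j for paper p, `vocabulary[j]` the corresponding word, and
    `citations` a list of (p, p2) pairs meaning that p cites p2."""
    triples = []
    for p, row in features.items():
        for j, value in enumerate(row):
            if value > 0:                          # f(p, w) > 0
                triples.append((p, "hasWord", vocabulary[j]))
    for p, cited in citations:
        triples.append((p, "cites", cited))
    return triples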


5.1.3 DBpedia

We use the English version of the 2016-10 DBpedia dataset [28], which contains 4,356,314 entities and 52,689,448 triples in total. In our evaluation, we only consider object properties, and ignore datatype properties and literals. We use the obtained embeddings in multiple different downstream tasks: five different classification tasks (AAUP, Cities, Forbes, Albums and Movies), document similarity and entity relatedness. For more details on each of these tasks, we refer the reader to the original RDF2Vec paper by Ristoski et al. [12].

5.2 Setup

For each of the entities in all of the datasets, walks of depth 4 are exhaustively extracted. A depth of 4 is chosen as it results in the best predictive performances on average over all strategies and datasets. Only for the entities of DBpedia, the maximum number of walks per entity is limited to 500. These walks are then provided to a Word2Vec model to create 500-dimensional embeddings. The hyper-parameters of the Word2Vec model are the same for all experiments in this study: Skip-Gram is used, the window size is equal to 5 and the maximum number of iterations is equal to 10, with negative sampling set to 25. These configurations are identical to the original RDF2Vec study. The embeddings are learned, in an unsupervised manner, for both the train and test set. For node classification tasks, embeddings are fed to a Support Vector Machine (SVM) classifier with Radial Basis Function (RBF) kernel. The regularization strength of the SVM is tuned to be one of {0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0}. For tasks other than node classification, an evaluation framework is used [29]. For document similarity, we measure Pearson's linear correlation coefficient, Spearman's rank correlation and their harmonic mean. For entity relatedness, we measure Kendall's rank correlation coefficient. For the benchmark datasets and citation networks, a predefined train/test split is used and experiments are repeated 5 times in order to report a corresponding standard deviation. For the tasks involving DBpedia data, 10-fold cross-validation is used and experiments are only run once for timing reasons. Moreover, the community strategy was excluded from the DBpedia experiments, as it cannot be efficiently performed on large knowledge graphs. A sketch of this pipeline is given below.
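This minimal sketch assumes the gensim (4.x parameter names) and scikit-learn libraries; walks, train_entities and train_labels are placeholders for the extracted walks and the labelled training entities.

from gensim.models import Word2Vec
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# `walks` is a list of walks, each walk a list of string labels
model = Word2Vec(
    sentences=walks,
    vector_size=500,   # 500-dimensional embeddings
    window=5,          # window size of 5
    sg=1,              # Skip-Gram
    negative=25,       # negative sampling set to 25
    epochs=10,         # maximum number of iterations
    min_count=1,
)
embeddings = [model.wv[entity] for entity in train_entities]

# Tune the SVM regularization strength over the listed grid
clf = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]},
)
clf.fit(embeddings, train_labels)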

For each of the walking strategies, we tune the following hyper-parameters using either a provided validation set or by using cross-validation on the train set:

• The random, anonymous and walklet strategies are hyper-parameter-free.

• For the n-gram walker, we tune n ∈ {1, 2, 3} and the number of introduced wildcards to be either 0 or 1.

• For the community strategy, we set the resolution of the Louvain algorithm to 1.0 [30] and the probability to teleport to a node from the community to 10%.

• For the WL strategy, we use the original algorithm used by Ristoski et al. [12]. We set the number of iterations of the Weisfeiler-Lehman kernel to 4 and extract walks of fixed depth for each of the iterations, including zero. This causes the WL walker to extract 5 times as many walks as the random walker, which causes the results to differ from those of the random walk strategy.

• For the HALK strategy, we extract sets of walks using the thresholds {0.0, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001}.

5.3 Evaluation results

The results for the various classification tasks are provided in Table 1. The results for the document similarity and entity relatedness tasks are provided in Table 2.

6 Discussion

Based on the provided results, several observations can be made. The random and WL strategies were used in the original RDF2Vec study [12]. As such, the results reported in this study can be seen as a reproduction of those results. It is important to note here that the only reason why the results obtained by the WL and random strategy differ in this and the original work is that walks are extracted after each iteration of the WL relabelling algorithm. This results in k times as many walks, with k the number of iterations in the relabelling algorithm. If walks from only one of the iterations were used, the results would be identical to those of the random strategy. Nevertheless, this simple trick does often result in increased predictive performances, as was empirically shown by Ristoski et al. [12]. We hypothesize that this is due to more weight being given, internally in Word2Vec, to the entities from which many walks can be extracted. While the original WL and random strategies result in very strong performances, especially on the downstream tasks on DBpedia, they are often outperformed by the custom strategies proposed in this work.


Dataset       Random        WL            Walklets      Anonymous     HALK          N-Gram        Community
AIFB          86.11 ± 2.48  91.67 ± 0.00  63.89 ± 0.00  41.67 ± 0.00  86.11 ± 0.00  88.33 ± 1.11  88.89 ± 1.76
MUTAG         76.76 ± 0.59  75.00 ± 2.46  72.06 ± 0.00  66.18 ± 0.00  75.00 ± 0.00  77.65 ± 2.85  74.71 ± 3.99
BGS           79.31 ± 0.00  80.69 ± 6.40  65.52 ± 0.00  65.52 ± 0.00  80.00 ± 4.57  83.45 ± 4.02  84.14 ± 3.52
AM            75.56 ± 2.70  82.53 ± 1.68  47.47 ± 0.00  34.85 ± 0.00  80.10 ± 0.88  84.44 ± 2.22  73.94 ± 2.70
CORA          77.20 ± 0.00  74.32 ± 1.56  58.20 ± 0.00  14.30 ± 0.00  76.62 ± 0.36  76.46 ± 0.78  67.92 ± 1.22
CITESEER      64.68 ± 1.58  64.02 ± 1.46  38.40 ± 0.00  16.00 ± 0.00  66.90 ± 0.00  65.38 ± 1.22  58.66 ± 0.50
PUBMED        75.66 ± 1.36  73.70 ± 2.87  68.30 ± 0.00  24.20 ± 0.00  75.56 ± 0.08  78.48 ± 0.35  54.64 ± 2.40
DBP: AAUP     67.94         69.88         69.27         54.73         60.08         66.96         /
DBP: Cities   79.07         79.12         79.08         55.34         73.34         79.79         /
DBP: Forbes   63.73         64.60         62.28         55.16         60.98         63.65         /
DBP: Albums   75.24         79.31         79.99         54.45         66.89         79.38         /
DBP: Movies   80.06         80.48         78.89         59.40         68.11         78.84         /

Table 1: The accuracy scores obtained by the various strategies on the different datasets.

Document similarity                          Entity relatedness
Strategy     Pears. r  Spear. ρ  µ           Strategy     Kendall τ
Random       0.578     0.390     0.466       Random       0.523
Anonymous    0.321     0.324     0.322       Anonymous    0.243
Walklets     0.528     0.372     0.437       Walklets     0.520
HALK         0.455     0.376     0.412       HALK         0.424
N-grams      0.551     0.353     0.431       N-grams      0.483
WL           0.576     0.412     0.480       WL           0.516

Table 2: Document similarity and entity relatedness results.


While the results indicate that there is no one-size-fits-all walking strategy for all tasks and datasets, it seems that the n-gram strategy results in the best predictive performances on average for node classification tasks. The average rank of the n-gram strategy on the four node classification and three citation network datasets, using all seven techniques, is equal to 1.86, followed by 3.00 for the HALK strategy and 3.07 for both the random and WL strategy. An average rank of 1 would mean that the technique outperforms all others on each dataset. The average rank of the n-gram strategy on all the node classification tasks, excluding the community strategy, is equal to 2.08, followed by 2.375, 2.875 and 3.67 for random, WL and HALK respectively.

The performance of the community strategy varies considerably. On some datasets, such as AIFB and BGS, its performance is among the best, while on others it performs a lot worse than random walks. This is because the quality of the walks is highly dependent on the quality of the community detection. If the groups of nodes clustered by the community detection do not align well with the downstream task, the performance worsens.

Further, it is important to note that the various strategies are complementary to each other. Even when equal accuracies are achieved, the confusion matrices can differ. Therefore, the combination of several strategies can further increase the predictive performance. There are different points within the pipeline where the combination of strategies can take place: (i) at corpus level, before feeding the walks to Word2Vec, (ii) at embedding level, by combining the different produced embeddings, and (iii) at prediction level, by aggregating the predictions of the different models. We consider this combination of strategies to be an interesting future step; a sketch of an embedding-level combination is given below. While the predictive performances of some of the strategies proposed in this work, such as the anonymous and walklet strategies, often do not come near that of the random strategy, a combination of these strategies could improve performance.
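As an illustration of option (ii), an embedding-level combination could simply concatenate, per entity, the vectors produced by the different strategies; the sketch below is an assumption on our part and was not evaluated in this work.

import numpy as np

def combine_embeddings(per_strategy_embeddings):
    """Embedding-level combination: concatenate the vector each strategy
    produced for the same entity into one longer feature vector.

    `per_strategy_embeddings` is assumed to be a list of arrays of shape
    (n_entities, dim), one per walking strategy, with aligned rows."""
    return np.concatenate(per_strategy_embeddings, axis=1)

# combined = combine_embeddings([emb_random, emb_ngram, emb_halk])
# `combined` can then be fed to the SVM classifier as before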

Some limitations of this study can be identified. First, no comparisons with other techniques are performed. Here, it is important to note that RDF2Vec is an unsupervised and task-agnostic technique. As such, comparisons with supervised techniques specifically trained for certain tasks, such as Relational Graph Convolutional Networks [31], are rather unfair. In the original work of Ristoski et al. [12] it was already shown that RDF2Vec outperforms other unsupervised variants such as TransE, TransH and TransR. This was independently confirmed by Zouaq and Martel, who additionally showed that RDF2Vec outperformed ComplEx and DistMult as well [32]. Second, a fixed depth and fixed hyper-parameters for the Word2Vec model were used within this study. While tuning these hyper-parameters could possibly result in increased predictive performances, it should be noted that the hyper-parameter space of a Word2Vec model is very large and that the time required to generate the embeddings is significant. We therefore opted to fix the hyper-parameters to sensibly chosen values, as was done by Ristoski et al.

7 Conclusion

In this work, five walk strategies that can serve as an alternative to the basic random walk approach are proposed in response to the observation that the WL kernel offers little improvement in the context of a single KG. Results indicate that there is no one-size-fits-all strategy for all datasets and tasks, and that tuning the strategy to a specific objective, as opposed to simply using the random walk approach, can result in increased predictive performances.

There are several future directions that we deem interesting. First, it would be interesting to study the impact on the performance when the strategies are combined with different biased walk strategies and with embedding algorithms that differ from the Word2Vec model used within this work. Second, all of the strategies proposed in this work are unsupervised, but supervised approaches that sacrifice generality to gain predictive performance could be evaluated. Third, as already mentioned, the walking strategies are complementary to each other and combining them could potentially result in increased predictive performances. Therefore, an evaluation of different combination strategies would be an interesting addition.

Reproducibility and code availability

We provide a Python implementation of RDF2Vec which can be combined with any of the walking strategies discussed in this work (github.com/IBCNServices/pyRDF2Vec). Moreover, we provide all code required to reproduce the reported results (github.com/GillesVandewiele/WalkExperiments).

Acknowledgements

GV (1S31417N) and BS (1SA0219N) are funded by a strategic base research grant of the Fund for Scientific Research Flanders (FWO).

References

[1] Xander Wilcke, Peter Bloem, and Victor De Boer. The knowledge graph as the default data model for machine learning. Data Science, 1:1–0, 2017.

[2] Petar Ristoski and Heiko Paulheim. A comparison of propositionalization strategies for creating features from linked open data. Linked Data for Knowledge Discovery, 6, 2014.

[3] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2015.

[4] William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. Preprint of article to appear in the IEEE Data Engineering Bulletin, 2017.

[5] Petar Ristoski and Heiko Paulheim. Semantic web in data mining and knowledge discovery: A comprehensive survey. Journal of Web Semantics, 36:1–22, 2016.

[6] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710, 2014.

[7] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864, 2016.

[8] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[9] S. Vichy N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, and Karsten M. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.

[10] Pinar Yanardag and S.V.N. Vishwanathan. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374, 2015.

[11] Nils M. Kriege, Fredrik D. Johansson, and Christopher Morris. A survey on graph kernels. Applied Network Science, 5(1):1–42, 2020.

[12] Petar Ristoski, Jessica Rosati, Tommaso Di Noia, Renato De Leone, and Heiko Paulheim. RDF2Vec: RDF graph embeddings and their applications. Semantic Web, 10(4):721–752, 2019.

[13] Gerben Klaas Dirk de Vries and Steven de Rooij. Substructure counting graph kernels for machine learning from RDF data. Web Semantics: Science, Services and Agents on the World Wide Web, 35:71–84, 2015.

[14] Yoav Goldberg and Omer Levy. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.

[15] Michael Cochez, Petar Ristoski, Simone Paolo Ponzetto, and Heiko Paulheim. Biased graph walks for RDF graph embeddings. In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, pages 1–12, 2017.

[16] Boris Weisfeiler and Andrei A. Lehman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968.

[17] Gerben K.D. de Vries. A fast approximation of the Weisfeiler-Lehman graph kernel for RDF data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 606–621. Springer, 2013.

[18] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.

[19] Mohammad Mehdi Keikha, Maseud Rahgozar, and Masoud Asadpour. Community aware random walk for network embedding. Knowledge-Based Systems, 148:47–54, 2018.

[20] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.

[21] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[22] Sergey Ivanov and Evgeny Burnaev. Anonymous walk embeddings. arXiv preprint arXiv:1805.11921, 2018.

[23] Bryan Perozzi, Vivek Kulkarni, Haochen Chen, and Steven Skiena. Don't walk, skip! Online learning of multi-scale network embeddings. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, pages 258–265, 2017.

[24] Jörg Schlötterer, Martin Wehking, Fatemeh Salehi Rizi, and Michael Granitzer. Investigating extensions to random walk based graph embedding. In 2019 IEEE International Conference on Cognitive Computing (ICCC), pages 81–89. IEEE, 2019.

[25] Gilles Vandewiele, Bram Steenwinckel, Femke Ongenae, and Filip De Turck. Inducing a decision tree with discriminative paths to classify entities in a knowledge graph. In SEPDA2019, the 4th International Workshop on Semantics-Powered Data Mining and Analytics, pages 1–6, 2019.

[26] Petar Ristoski, Gerben Klaas Dirk de Vries, and Heiko Paulheim. A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In International Semantic Web Conference, pages 186–194. Springer, 2016.

[27] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008.

[28] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. DBpedia: a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.

[29] Maria Angela Pellegrino, Abdulrahman Altabb, Martina Garofalo, Petar Ristoski, and Michael Cochez. GEval: a modular and extensible evaluation framework for graph embedding techniques. In European Semantic Web Conference. Springer, 2020.

[30] Renaud Lambiotte, J.-C. Delvenne, and Mauricio Barahona. Laplacian dynamics and multiscale modular structure in networks. arXiv preprint arXiv:0812.1770, 2008.

[31] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.

[32] Amal Zouaq and Felix Martel. What is the schema of your knowledge graph? Leveraging knowledge graph embeddings and clustering for expressive taxonomy learning. In Proceedings of The International Workshop on Semantic Big Data, pages 1–6, 2020.