Walk Extraction Strategies for Node Embeddings with RDF2Vec in Knowledge Graphs

arXiv:2009.04404v1 [cs.LG] 9 Sep 2020
Gilles Vandewiele
Ghent University – imec
Bram Steenwinckel
Ghent University – imec
Pieter Bonte
Ghent University – imec
Michael Weyns
Ghent University – imec
Heiko Paulheim
Data and Web Science Group
University of Mannheim
Petar Ristoski
IBM Research Almaden
United States of America
Filip De Turck
Ghent University – imec
Ghent, Belgium
Femke Ongenae
Ghent University – imec
Ghent, Belgium
September 10, 2020
As KGs are symbolic constructs, specialized techniques have to be applied in order to make them
compatible with data mining techniques. RDF2Vec is an unsupervised technique that can create task-
agnostic numerical representations of the nodes in a KG by extending successful language modelling
techniques. The original work proposed the Weisfeiler-Lehman (WL) kernel to improve the quality
of the representations. However, in this work, we show both formally and empirically that the WL
kernel does little to improve walk embeddings in the context of a single KG. As an alternative to
the WL kernel, we propose five different strategies to extract information complementary to basic
random walks. We compare these walks on several benchmark datasets to show that the n-gram
strategy performs best on average on node classification tasks and that tuning the walk strategy can
result in improved predictive performances.
Keywords Knowledge graphs · Embeddings · Representation Learning
1 Introduction
As a result of the recent data deluge, we are increasingly confronted with more information than we can meaningfully
make sense of. Above all, this information is characterised by contextual heterogeneity: its origins are semantically
and syntactically diverse. Insights derived through traditional data mining procedures will be constrained precisely
because such procedures fail to account for this staggering diversity. To deal with such a myriad of contextual
environments and backgrounds, the Semantic Web’s (SW) Linked Open Data (LOD) initiative can be used to interlink
various data sources and unite them under a common queryable interface. The product of such a consolidation
effort is often called a Knowledge Graph (KG). In addition to unifying information from various sources, KGs are
able to enrich classical data formats by explicitly encoding relations between different data points in the form of edges.
Using KGs to enhance traditional data mining techniques with background knowledge is a relatively recent endeavour [1].
Because KGs are symbolic constructs, their compatibility with such techniques is rather limited. In fact, data
mining techniques usually require inputs to be presented as numerical feature vectors, and are therefore unable to
process background knowledge directly. With this in mind, some of the earliest knowledge-enhanced data mining
approaches proceeded by extracting custom features from specific and generic relations inside the graph [2]. While
these approaches produce human-interpretable variables, they have to be tailored to the task at hand and therefore
require extensive effort. As an alternative to feature-based approaches, techniques can be applied to learn vector
representations, called embeddings, for each of the entities inside a graph based on a limited set of global latent
features [3, 4]. These techniques are task-agnostic, which allows them to be used for different downstream tasks, such
as predicting missing links inside a graph or categorizing different nodes [5].
Natural language and graphs often share similarities. As an example, frequencies of language symbols or graph
structures both tend to approximate Zipf’s law. Techniques such as DeepWalk [6] and Node2Vec [7] were among the
first to leverage these similarities, by extending successful language modelling techniques, such as Word2Vec [8],
to deal with graph-based data. Their proposed techniques rely on the extraction of sequences of graph vertices,
which are then fed as sentences to language models. Similarly, work on (deep) graph kernels also relies on language
modelling to learn the latent representations of graph substructures [9, 10, 11]. RDF2Vec is a technique that builds on
the progress made by these previous two types of techniques by adapting random walks and the Weisfeiler-Lehman
(WL) subtree kernel to directed graphs with labelled edges, i.e. KGs [12].
In this work, we show that the WL kernel, while effective for measuring similarities between nodes or when working
with regular graphs, offers little improvement in the context of a single KG with respect to walk embeddings. In
response to this observation, we propose various alternative walk strategies for RDF data to improve upon basic
random walks and compare them on different benchmark datasets.
The remainder of this paper is structured as follows. In Section 2, background information is provided on KGs,
RDF2Vec, walk embeddings and the WL kernel. Next, in Section 3, we provide a formal discussion of our claim
with respect to the feasibility of Weisfeiler-Lehman subtree RDF graph kernels. Following this, Section 4 discusses
a number of possible alternative walk strategies, including pseudo-code listings for each algorithm. Section 5 then
describes the datasets used to evaluate these alternative strategies and lists the corresponding results. These results are
subsequently discussed in Section 6. Finally, in Section 7 we conclude this work with a general reflection.
2 Background
2.1 Knowledge graphs
A KG is a multi-relational directed graph, $G = (V, E, \ell)$, where $V$ are the vertices in our graph, $E$ the edges or
predicates and $\ell$ a labelling function that maps each vertex or edge onto its corresponding label. It should further be
noted that we can identify three different types of vertices: (i) entities, (ii) blank nodes, and (iii) literals. We can
simplify further analysis by applying a transformation to the knowledge graph which removes the multi-relational
aspect, as done by de Vries et al. [13]. This is done by representing each (subject, predicate, object) triple
from the original knowledge graph by three labelled nodes and two unlabelled edges (subject $\rightarrow$ predicate and
predicate $\rightarrow$ object).
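The transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of de Vries et al.: the function name `remove_multirelational`, the dictionary-based output and the choice to identify each predicate node by its full triple are hypothetical.

```python
from collections import defaultdict

def remove_multirelational(triples):
    """Turn (subject, predicate, object) triples into labelled nodes
    connected by unlabelled edges.

    Assumption: every triple gets its own predicate node, identified by
    the triple itself, so that two edges sharing a predicate stay distinct.
    """
    labels = {}                      # node id -> label
    successors = defaultdict(list)   # node id -> successor node ids
    for s, p, o in triples:
        pred_node = (s, p, o)        # hypothetical id of the predicate node
        labels[s] = s
        labels[o] = o
        labels[pred_node] = p
        successors[s].append(pred_node)   # subject -> predicate
        successors[pred_node].append(o)   # predicate -> object
    return labels, dict(successors)

triples = [("A", "knows", "B"), ("B", "knows", "C")]
labels, successors = remove_multirelational(triples)
```

Each triple contributes one extra node, so the two triples above yield three entity nodes and two predicate nodes.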
2.2 RDF2Vec
Machine learning algorithms cannot work directly with graph-based data, as they require numerical vectors as
input. RDF2Vec [12] is an unsupervised, task-agnostic approach that solves this problem by first transforming the
information of the nodes in the graph into numerical data, which are called latent representations or embeddings.
The goal is to capture as much of the semantics as possible in the numerical representation, i.e. entities that are
semantically related should be close to each other in the embedded space. RDF2Vec builds on word embedding
techniques, which have shown great success in the domain of natural language processing. These word embedding
techniques take a corpus of sentences as input, and learn a latent representation for each of the unique words within
the corpus. Learning this latent representation can be done, for example, by learning to predict a word based on its
context (continuous bag-of-words) or predicting the context based on a target word (skip-gram) [14, 8].
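To make the skip-gram objective mentioned above more concrete, the sketch below lists the (target, context) pairs that a sentence gives rise to for a given window size. It only shows how training pairs are formed; the neural model, negative sampling and all other parts of Word2Vec are left out, and the function name `skipgram_pairs` is an assumption made for the example.

```python
def skipgram_pairs(sentence, window):
    """Return (target, context) pairs: every word is paired with the
    words at most `window` positions to its left or right."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

# A toy sentence of alternating entity and predicate labels
sentence = ["A", "p", "B", "q", "C"]
pairs = skipgram_pairs(sentence, window=1)
```

Continuous bag-of-words uses the same windows in the opposite direction: the context words jointly predict the target.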
2.3 (Random) walk embeddings
In the context of (knowledge) graphs, we can construct an input corpus by extracting walks. A walk is a sequence of
vertices that can be found in the graph by traversing the directed links. We can notate a walk of length $n$ as follows:
$$v_0 \rightarrow v_1 \rightarrow \ldots \rightarrow v_n \qquad (1)$$
We can then use the graph labelling function to create a sentence:
$$\ell(v_0) \rightarrow \ell(v_1) \rightarrow \ldots \rightarrow \ell(v_n) \qquad (2)$$
It should be noted that, due to the previously discussed transformation of our KG, nodes with an even index in the
walk correspond to entities of the original knowledge graph, while nodes with an odd index correspond to predicates.
The most straightforward strategy to extract walks is by doing a breadth-first traversal of the graph starting from the
nodes of interest. Since the total number of walks that can be extracted grows exponentially as a function of the depth,
sampling can be applied after each iteration of breadth-first traversal. This sampling can either be guided by some
metric, resulting in a collection of biased walks [15], or can be performed at random, which results in random walks.
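The breadth-first extraction with random sampling can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name, the `max_walks` and `seed` parameters, and the decision to keep walks that reach a vertex without successors are assumptions.

```python
import random

def extract_walks(successors, root, depth, max_walks=None, seed=None):
    """Breadth-first walk extraction starting from `root`.

    After each iteration at most `max_walks` walks are kept, sampled
    uniformly at random (None keeps every walk, i.e. exhaustive extraction).
    """
    rng = random.Random(seed)
    walks = {(root,)}
    for _ in range(depth):
        extended = set()
        for walk in walks:
            neighbours = successors.get(walk[-1], [])
            if not neighbours:
                extended.add(walk)  # assumption: keep walks that hit a dead end
            for n in neighbours:
                extended.add(walk + (n,))
        if max_walks is not None and len(extended) > max_walks:
            # sorted() only gives rng.sample a sequence in a stable order
            extended = set(rng.sample(sorted(extended), max_walks))
        walks = extended
    return walks

# Toy transformed graph: entities A, B, C and predicate nodes p1, p2, p3
successors = {"A": ["p1", "p2"], "p1": ["B"], "p2": ["C"], "B": ["p3"], "p3": ["C"]}
walks = extract_walks(successors, "A", depth=2)
```

Replacing the uniform sample by a weighted one would give a biased walk strategy.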
2.4 Weisfeiler-Lehman kernel
The WL kernel was proposed as an extension to the labelling function. The underlying WL algorithm was originally
introduced to test, in polynomial time, whether two graphs are isomorphic [16]. The intuition behind the algorithm is
to assign new labels to each of the nodes, where each newly assigned label captures the information of an entire
subgraph up to a certain depth. This algorithm was later adapted to serve as a kernel, or similarity measure, between
graphs [17, 18], by counting the number of WL labels two graphs have in common. The WL relabelling for a node $v$ is
performed by recursively hashing the concatenation of the label of the node and the (sorted) labels of its neighbours:
$$WL_0(v) = \ell(v)$$
$$WL_k(v) = \mathrm{HASH}(WL_{k-1}(v) \oplus \mathrm{SORT}(\{WL_{k-1}(v_n) \mid v_n \in N(v)\})) \qquad (3)$$
with $k \geq 1$, $WL_k(v)$ the corresponding WL label of a node after $k$ iterations, HASH a hashing function, SORT a sorting
function, $\oplus$ the string concatenation operator and $N(v)$ the neighbourhood of $v$. We assume that the hashing function
does not produce collisions. This can easily be achieved by, for example, mapping each unique label to an integer. The
neighbourhood of a vertex $v$ is defined as the set of vertices $v'$ with an edge going to $v$:
$$N(v) = \{v' \mid (v', v) \in E\} \qquad (4)$$
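The relabelling of Eq. 3 can be sketched as below. This is an illustration under assumptions rather than the original implementation: the function name and data layout are hypothetical, a tuple stands in for the string concatenation, and the collision-free hash is realised by mapping each unique tuple to an integer, as suggested in the text.

```python
def wl_relabel(labels, predecessors, iterations):
    """Return a list of dicts: wl[k][v] is the WL label of v after k iterations.

    `predecessors[v]` plays the role of N(v) in Eq. 4, i.e. the vertices
    with an edge going to v.
    """
    wl = [dict(labels)]
    for _ in range(iterations):
        prev = wl[-1]
        table = {}   # collision-free "hash": unique key -> integer
        new = {}
        for v in labels:
            neigh = tuple(sorted(prev[u] for u in predecessors.get(v, [])))
            key = (prev[v], neigh)   # tuple instead of string concatenation
            new[v] = table.setdefault(key, len(table))
        wl.append(new)
    return wl

# Toy graph in which every vertex has a unique label
labels = {"a": "a", "b": "b", "c": "c"}
predecessors = {"b": ["a"], "c": ["a", "b"]}
wl = wl_relabel(labels, predecessors, iterations=2)
```

Because the key always contains the previous label of the vertex itself, vertices that start out with distinct labels keep distinct labels in every iteration.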
3 Weisfeiler-Lehman kernel for Knowledge Graphs
Ristoski et al. proposed to use the WL kernel in order to relabel nodes as an alternative to extracting random walks [12].
We will refer to this as the WL strategy in the remainder of this paper. However, we argue that this WL strategy provides
no additional information with respect to entity representations when extracting a fixed number of random walks from
a knowledge graph. We now formally demonstrate this claim for all three types of vertices:
3.1 Entities
The WL strategy brings little to no added value in comparison to random walks, when applied to entities in a knowledge
graph, due to the properties of RDF data. Entities in RDF are represented by Uniform Resource Identifiers (URI), which
need to be unique (footnote 1). As such:
$$\ell(x) = \ell(y) \iff x = y \qquad (5)$$
Due to this property, WL relabelling, when applied on RDF data, is nothing more than a bijection from the hops in
random walks to the hops in the walks obtained through WL relabelling. This means that WL relabelling does not add
any useful additional information. When two WL labels are equal, their underlying entities are always equal as well.
We will now prove this formally. First, from Eq. 3 we can deduce that when the WL labels of two nodes are equal,
then at least the labels of these nodes should be equal and they should have the same neighbours (footnote 2). Formally, this means:
$$WL_k(v_j) = WL_k(v_i) \iff \ell(v_j) = \ell(v_i) \land N(v_j) = N(v_i) \qquad (6)$$
The claim to prove is:
$$WL_k(v_j) = WL_k(v_i) \iff v_j = v_i$$
Step 1: $WL_k(v_j) = WL_k(v_i) \implies v_j = v_i$
$$\begin{aligned}
WL_k(v_j) = WL_k(v_i) &\implies \ell(v_j) = \ell(v_i) \land N(v_j) = N(v_i) && \text{(Eq. 6)} \\
&\implies \ell(v_j) = \ell(v_i) && \text{(Conjunction Elim.)} \\
&\implies v_j = v_i && \text{(Eq. 5)}
\end{aligned}$$
Step 2: $v_j = v_i \implies WL_k(v_j) = WL_k(v_i)$
$$\begin{aligned}
v_j = v_i &\implies \ell(v_j) = \ell(v_i) && \text{(Eq. 5)} \\
&\implies \ell(v_j) = \ell(v_i) \land N(v_j) = N(v_i) && \text{(Eq. 4 and } v_j = v_i\text{)} \\
&\implies WL_k(v_j) = WL_k(v_i) && \text{(Eq. 6)}
\end{aligned}$$
3.2 Blank nodes and literals
In contrast to entities, Eq. 5 does not hold for blank nodes and literals. This implies that multiple nodes could have the
same original label but have different WL labels (one-to-many mapping). However, the added value of WL is limited
even in these cases due to the fact that blank nodes rarely have the exact same neighbourhoods and because literals
only have one incoming edge and no outgoing edges. Moreover, due to the fact that RDF2Vec treats each hop in the
walk as categorical data, RDF2Vec does not handle literals well.
4 Custom walk extraction strategies
Based on the observations previously discussed, we now identify two types of strategies to construct a corpus of walks:
Type 1 - Extraction: strategies that define how walks for each of the entities are extracted. The random walk strategy
is an example of such a strategy, where breadth-first traversal is applied to extract walks.
Type 2 - Transformation: strategies that transform walks extracted by a Type 1 strategy. The WL strategy is an
example of this type. In order for such a strategy to provide information complementary to the originally
provided walks, it must define a one-to-many or many-to-one mapping from the original labels to the new
labels.
We now propose five different strategies complementary to the random strategy. One of these strategies can be classified
as being of Type 1 while the other four are of Type 2.
4.1 Community hops
As opposed to iteratively extending the walk with neighbours of a vertex, we could allow with a certain probability
for teleportation to a node that has properties similar to a certain neighbor [19]. In order to group nodes with similar
properties together, unsupervised community detection can be applied [20]. In this work, we opted to use the Louvain
method [21] due to its excellent trade-off between speed and clustering quality. The idea of introducing community
hops is to capture implicit relations between nodes that are not explicitly modelled in the KG, and to allow for including
related pieces of knowledge in the walks which are otherwise out of reach. We provide pseudo-code for this strategy
in Algorithm 1. This strategy is of Type 1. We will refer to this strategy as community.
Alg. 1: community_walk(G, v, depth, p, hop_prob)
    # random() and choice() come from Python's random module;
    # com_detection() and get_neighbours() are helpers that are not listed here
    # List of communities and dictionary {vertex: community}
    com, com_map = com_detection(G)
    walks = { (v,) }
    for d in range(depth):
        new = set()
        for walk in walks:
            # Extend the walk from its last vertex
            for n in get_neighbours(G, walk[-1]):
                # Sample neighbourhood
                if random() < p:
                    new.add(walk + (n,))
                # Hop to community
                if random() < hop_prob:
                    c_n = com[com_map[n]]
                    hop = choice(c_n)
                    new.add(walk + (hop,))
        walks = new
    return walks
Footnote 2: Proof omitted due to space restrictions, but as WL is recursive, it can be proven through induction.
4.2 Anonymous walks
The random walks discussed in the previous section can be anonymized, which transforms label information into
positional information. More formally, a walk $w = v_0 \rightarrow v_1 \rightarrow \ldots \rightarrow v_n$ is transformed into
$f(v_0) \rightarrow f(v_1) \rightarrow \ldots \rightarrow f(v_n)$ with $f(v_i) = \min(\{j \mid w[j] = v_i\})$, which corresponds to the first index where
$v_i$ can be found in the walk $w$ [22]. The notion behind anonymous walks is that local graph structures often bear
enough information for encoding and reconstructing a graph, even when ignoring the node labels, i.e., the mere
topology surrounding a node is often sufficient for identifying that node. Ignoring the labels, on the other hand,
allows for a computationally efficient generation of the walks. We present pseudo-code for this transformation in
Algorithm 2. This strategy is of Type 2. We will refer to this strategy as anonymous.
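Alg. 2: anonymize(walks)
    anon_walks = [ ]
    for walk in walks:
        # Keep the label of the root
        new = [ walk[0] ]
        for hop in walk[1:]:
            # f(hop): first index at which the hop occurs in the walk
            new.append(walk.index(hop))
        anon_walks.append(new)
    return anon_walks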
4.3 Walklets
Walks can be transformed into walklets, which are walks of length two consisting of the root of the original walk and
one of the hops. Provided a walk $w = v_0 \rightarrow v_1 \rightarrow \ldots \rightarrow v_n$, we can construct sets of walklets
$\{(v_0, v_i) \mid 1 \leq i \leq n\}$ [23]. While standard RDF2Vec does not consider the distance between two nodes in a walk,
walklets are explicitly created for different scales. Hence, they allow for such a distinction between a direct neighbor
and a node which is further away. Pseudo-code for this approach is provided in Algorithm 3. This strategy is of Type 2.
We will refer to this strategy as walklet.
Alg. 3: walklets(walks)
    walklets = set()
    for walk in walks:
        for i in range(1, len(walk)):
            walklets.add((walk[0], walk[i]))
    return walklets
4.4 Hierarchical random walk (HALK)
The frequency of entities in a knowledge graph often follows a long-tailed distribution, similar to natural language.
Rarely occurring entities often carry little information, and increase the number of hops between the root and potentially
more interesting entities. As such, the removal of rare entities from the random walks can increase the quality of
the generated embeddings while decreasing the memory usage [24]. Pseudo-code for this strategy is provided in
Algorithm 4. This strategy is of Type 2. We will refer to this strategy as HALK.
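Alg. 4: halk(walks, thresholds)
    # Count nr. of walks a hop occurs
    counts = { }
    for i in range(len(walks)):
        for hop in walks[i]:
            if hop not in counts:
                counts[hop] = {i}
            else:
                counts[hop].add(i)
    # Skip rare hops
    halk_walks = [ ]
    for thresh in thresholds:
        for walk in walks:
            new = [ walk[0] ]
            for hop in walk[1:]:
                # Assumed reconstruction: the source condition is truncated
                # after |counts[hop]|; here a hop is kept when the fraction
                # of walks containing it exceeds the threshold
                if len(counts[hop]) / len(walks) > thresh:
                    new.append(hop)
            halk_walks.append(new)
    return halk_walks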
4.5 N-Gram walks
Another approach that defines a one-to-many mapping is relabelling n-grams in the random walks. The intuition
behind this is that the predecessors of a node that two different walks have in common can be different. Additionally,
we can inject wildcards into the walk before relabelling n-grams [25]. The injection of wildcards allows subsequences
with small differences to be mapped onto the same label. Pseudo-code for this strategy is provided in Algorithm 5.
This strategy is of Type 2. We will refer to this strategy as n-gram.
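Alg. 5: ngram(walks, n, n_wild)
    # combinations() comes from Python's itertools module; walks are tuples
    # Introduce wildcards in the walks
    extended_walks = set(walks)
    for walk in walks:
        idx = range(1, len(walk))
        combs = combinations(idx, n_wild)
        for comb in combs:
            new = list(walk)  # copy, so the original walk stays unchanged
            for i in comb:
                new[i] = '*'
            extended_walks.add(tuple(new))
    # Relabel ngrams in the walk
    ngram_walks = [ ]
    ngram_map = { }
    for walk in extended_walks:
        new = list(walk[:n])
        for i in range(n, len(walk) + 1):
            ngram = tuple(walk[i-n:i])
            if ngram not in ngram_map:
                ngram_map[ngram] = len(ngram_map)
            # Assumed reconstruction: the source listing is cut off here;
            # the label of the n-gram is appended to the new walk
            new.append(ngram_map[ngram])
        ngram_walks.append(new)
    return ngram_walks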
4.6 Example
In order to further clarify each of the proposed strategies, we provide an example in Figure 1.
[Figure 1: graphic not recoverable from the extracted text. Surviving panel titles: HALK (thresh = 0.2) and N-Gram (n = 2, 1 wildcard).]
Figure 1: An example of each of our proposed strategies. We extract walks with the random and community strategy
of exactly depth 2 from nodes "A" and "F". For other strategies, we transform the walks extracted by the random
strategy. Nodes "C" and "H" belong to the same community.
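As a small runnable complement, the sketch below applies two of the Type 2 transformations to toy walks. The walks are hypothetical and are not the ones shown in Figure 1. The `anonymize` function follows the first-index rule $f$ of Section 4.2 while keeping the root label, as Algorithm 2 does, and `walklets` pairs the root with each later hop as in Section 4.3.

```python
def anonymize(walk):
    """Keep the root and replace every other hop by the first index at
    which it occurs in the walk."""
    return (walk[0],) + tuple(walk.index(hop) for hop in walk[1:])

def walklets(walk):
    """Pair the root of the walk with each of its later hops."""
    return {(walk[0], hop) for hop in walk[1:]}

# Hypothetical walks; the second one returns to its root
walks = [("A", "C", "E"), ("A", "D", "A")]
anon = [anonymize(w) for w in walks]
pairs = set().union(*(walklets(w) for w in walks))
```

The second walk shows how anonymization encodes structure: revisiting the root yields index 0, regardless of which labels are involved.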
5 Results
To evaluate the impact of custom walking strategies, we measure the predictive performance on different datasets and
various tasks.
5.1 Datasets
Three different types of datasets are used, in order to ensure enough variation in our evaluation. Moreover, these
datasets are commonly used in (knowledge) graph-based machine learning studies.
5.1.1 Node classification benchmark datasets
We will be using four benchmark data sets, each describing knowledge graphs, that serve as benchmarks for node
classification and are available from a public repository set up by Ristoski et al. [26]. The names of these benchmark
datasets are AIFB, MUTAG, BGS and AM. For each of these data sets, we remove triples with specific predicates that
would leak the target from our knowledge graph, as provided by the original authors. Moreover, a predefined split into
train and test set, with the corresponding ground truth, is provided by the authors, which we used in our experiments.
5.1.2 Citation networks
We converted three citation networks [27], which describe scientific papers, to knowledge graphs. The three citation
networks used are CORA, CITESEER and PUBMED. Each paper is represented by a bag-of-words or tf-idf representation
of its content and a list of citations to other papers in the network. A fixed train/test split is provided for each
of the datasets and the associated task is to categorize each of the papers into the correct research domain, which can
be seen as a node classification task. For each paper $p$, we obtained the words $w$ from the bag-of-words or tf-idf vector
with a value greater than 0 and added the following triples to our KG: $\{(p, \text{hasWord}, w) \mid f(p, w) > 0\}$, with $f(p, w)$
a function that retrieves the bag-of-words or tf-idf value of word $w$ for paper $p$. Moreover, for each paper $p'$ cited by $p$
we add the following triple: $(p, \text{cites}, p')$.
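The conversion described above can be sketched as follows. The function name and the dictionary-based input format are assumptions made for illustration; only the two predicates, hasWord and cites, come from the text.

```python
def citation_triples(features, citations):
    """Build (subject, predicate, object) triples from per-paper word
    weights (bag-of-words or tf-idf) and per-paper citation lists."""
    triples = []
    for paper, weights in features.items():
        for word, value in weights.items():
            if value > 0:                       # f(p, w) > 0
                triples.append((paper, "hasWord", word))
    for paper, cited in citations.items():
        for other in cited:                     # p cites p'
            triples.append((paper, "cites", other))
    return triples

# Hypothetical toy input
features = {"p1": {"graph": 2, "kernel": 0}, "p2": {"graph": 0.5}}
citations = {"p1": ["p2"]}
triples = citation_triples(features, citations)
```

Words with a zero weight, such as "kernel" for p1, do not produce a triple.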
5.1.3 DBpedia
We use the English version of the 2016-10 DBpedia dataset [28], which contains 4,356,314 entities and 52,689,448
triples in total. In our evaluation, we only consider object properties, and ignore datatype properties and literals. We
use the obtained embeddings in multiple different downstream tasks: 5 different classification tasks (AAUP, Cities,
Forbes, Albums and Movies), document similarity and entity relatedness. For more details on each of these tasks, we
refer the reader to the original RDF2Vec paper by Ristoski et al. [12].
5.2 Setup
For each of the entities in all of the datasets, walks of depth 4 are exhaustively extracted. A depth of 4 is chosen as it
results in the best predictive performances on average for all strategies and datasets. Only for the entities of DBpedia,
the maximum number of walks per entity is limited to 500. These walks are then provided to a Word2Vec model to
create 500-dimensional embeddings. The hyper-parameters of the Word2Vec model are the same for all experiments in
this study. Skip-Gram is used, the window size is equal to 5 and the maximum number of iterations is equal to 10 with
negative sampling set to 25. These configurations are identical to the original RDF2Vec study. The embeddings are
learned, in an unsupervised manner, for both the train and test set. For node classification tasks, embeddings are fed to
a Support Vector Machine (SVM) classifier with Radial Basis Function (RBF) kernel. The regularization strength of
the SVM is tuned to be one of {0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0}. For tasks other than node classification, an
evaluation framework is used [29]. For document similarity, we measure the Pearson’s linear correlation coefficient,
Spearman’s rank correlation and their harmonic mean. For entity relatedness, we measure the Kendall’s rank correlation
coefficient. For the benchmark datasets and citation networks, a pre-defined train/test split is used and experiments
are repeated 5 times in order to report a corresponding standard deviation. For the tasks involving DBpedia data,
10-fold cross-validation is used and experiments are only repeated once for timing reasons. Moreover, the community
strategy was excluded from the DBpedia experiments, as it cannot be efficiently performed on large knowledge graphs.
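The harmonic mean used for the document similarity task combines the two correlation coefficients into a single score. As a worked check, the snippet below applies it to the Pearson and Spearman values reported in Table 2 for the random and WL strategies; the helper name `harmonic_mean` is an assumption made for the example.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers."""
    return 2 * a * b / (a + b)

# (Pearson, Spearman) scores taken from Table 2
mu_random = harmonic_mean(0.578, 0.390)
mu_wl = harmonic_mean(0.576, 0.412)
print(round(mu_random, 3), round(mu_wl, 3))  # prints 0.466 0.48
```

Both values agree with the µ column of Table 2 after rounding to three decimals.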
For each of the walking strategies, we tune the following hyper-parameters using either a provided validation set or by
using cross-validation on the train set:
• The random, anonymous and walklet strategies are hyper-parameter-free.
• For the n-gram walker, we tune $n \in [1, 2, 3]$ and the number of introduced wildcards to be either 0 or 1.
• For the community strategy, we set the resolution of the Louvain algorithm to 1.0 [30] and the probability to
  teleport to a node from the community to 10%.
• For the WL strategy, we use the original algorithm used by Ristoski et al. [12]. We set the number of iterations
  of the Weisfeiler-Lehman kernel to 4 and extract walks of fixed depth for each of the iterations, including zero.
  This causes the WL walker to extract 5 times as many walks as the random walker, which causes the results
  to differ from those of the random walk strategy.
• For the HALK strategy, we extract sets of walks using different thresholds:
5.3 Evaluation results
The results for the various classification tasks are provided in Table 1. The results for the document similarity and
entity relatedness tasks are provided in Table 2.
6 Discussion
Based on the provided results, several observations can be made. The random and WL strategies are used in the original
RDF2Vec study [12]. As such, the results reported in this study can be seen as a reproduction of those results. It is
important to note here that the only reason why the results obtained by the WL and random strategy differ in this and
the original work, is because walks are extracted after each iteration of the WL relabelling algorithm. This results in
$k$ times as many walks, with $k$ the number of iterations in the relabelling algorithm. If walks from only one of the
iterations were used, the results would be identical to those of the random strategy. Nevertheless, this simple
trick does often result in increased predictive performances, as was empirically shown by Ristoski et al. [12]. We
hypothesize that this is due to more weight being given, internally in Word2Vec, to the entities from which many walks
can be extracted. While the original WL and random strategies result in very strong performances, especially on the
downstream tasks of DBpedia, they are often outperformed by custom strategies proposed in this work.

Dataset       Random        WL            Walklets      Anonymous     HALK          N-Gram        Community
AIFB          86.11 ± 2.48  91.67 ± 0.00  63.89 ± 0.00  41.67 ± 0.00  86.11 ± 0.00  88.33 ± 1.11  88.89 ± 1.76
MUTAG         76.76 ± 0.59  75.00 ± 2.46  72.06 ± 0.00  66.18 ± 0.00  75.00 ± 0.00  77.65 ± 2.85  74.71 ± 3.99
BGS           79.31 ± 0.00  80.69 ± 6.40  65.52 ± 0.00  65.52 ± 0.00  80.00 ± 4.57  83.45 ± 4.02  84.14 ± 3.52
AM            75.56 ± 2.70  82.53 ± 1.68  47.47 ± 0.00  34.85 ± 0.00  80.10 ± 0.88  84.44 ± 2.22  73.94 ± 2.70
CORA          77.20 ± 0.00  74.32 ± 1.56  58.20 ± 0.00  14.30 ± 0.00  76.62 ± 0.36  76.46 ± 0.78  67.92 ± 1.22
CITESEER      64.68 ± 1.58  64.02 ± 1.46  38.40 ± 0.00  16.00 ± 0.00  66.90 ± 0.00  65.38 ± 1.22  58.66 ± 0.50
PUBMED        75.66 ± 1.36  73.70 ± 2.87  68.30 ± 0.00  24.20 ± 0.00  75.56 ± 0.08  78.48 ± 0.35  54.64 ± 2.40
DBP: AAUP     67.94         69.88         69.27         54.73         60.08         66.96         /
DBP: Cities   79.07         79.12         79.08         55.34         73.34         79.79         /
DBP: Forbes   63.73         64.60         62.28         55.16         60.98         63.65         /
DBP: Albums   75.24         79.31         79.99         54.45         66.89         79.38         /
DBP: Movies   80.06         80.48         78.89         59.40         68.11         78.84         /
Table 1: The accuracy scores obtained by various techniques on different datasets.

Strategy      Pearson r     Spearman ρ    µ
Random        0.578         0.390         0.466
Anonymous     0.321         0.324         0.322
Walklets      0.528         0.372         0.437
HALK          0.455         0.376         0.412
N-grams       0.551         0.353         0.431
WL            0.576         0.412         0.480

Strategy      Kendall τ
Random        0.523
Anonymous     0.243
Walklets      0.520
HALK          0.424
N-grams       0.483
WL            0.516
Table 2: Document similarity and entity relatedness results
While the results indicate that there is no one-size-fits-all walking strategy for all tasks and datasets, it seems that the
n-gram strategy results in the best predictive performances on average for node classification tasks. The average rank
of the n-gram strategy on the four node classification and three citation network datasets, using all seven techniques,
is equal to 1.86, followed by 3 for the HALK strategy and 3.07 for both the random and WL strategy. An average rank
of 1 would mean that the technique outperforms all others on each dataset. The average rank of the n-gram strategy
on all the node classification tasks, excluding the community strategy, is equal to 2.08, followed by 2.375, 2.875 and
3.67 for random, WL and HALK respectively.
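The average ranks over the seven datasets with all seven techniques can be recomputed from the accuracies in Table 1. The sketch below assumes that tied strategies share the mean of the ranks they occupy; this tie-handling rule is an assumption, although it is the one under which the quoted values of 1.86, 3 and 3.07 are recovered.

```python
def average_ranks(scores):
    """scores: {dataset: {strategy: accuracy}}. Rank 1 is the best score;
    tied strategies share the mean of the ranks they occupy."""
    totals = {}
    for per_dataset in scores.values():
        ordered = sorted(per_dataset.items(), key=lambda kv: -kv[1])
        i = 0
        while i < len(ordered):
            j = i
            while j < len(ordered) and ordered[j][1] == ordered[i][1]:
                j += 1
            shared = (i + 1 + j) / 2   # mean of ranks i+1 .. j
            for name, _ in ordered[i:j]:
                totals[name] = totals.get(name, 0.0) + shared
            i = j
    return {name: total / len(scores) for name, total in totals.items()}

names = ["random", "WL", "walklet", "anonymous", "HALK", "n-gram", "community"]
table1 = {  # mean accuracies from Table 1
    "AIFB":     [86.11, 91.67, 63.89, 41.67, 86.11, 88.33, 88.89],
    "MUTAG":    [76.76, 75.00, 72.06, 66.18, 75.00, 77.65, 74.71],
    "BGS":      [79.31, 80.69, 65.52, 65.52, 80.00, 83.45, 84.14],
    "AM":       [75.56, 82.53, 47.47, 34.85, 80.10, 84.44, 73.94],
    "CORA":     [77.20, 74.32, 58.20, 14.30, 76.62, 76.46, 67.92],
    "CITESEER": [64.68, 64.02, 38.40, 16.00, 66.90, 65.38, 58.66],
    "PUBMED":   [75.66, 73.70, 68.30, 24.20, 75.56, 78.48, 54.64],
}
scores = {d: dict(zip(names, row)) for d, row in table1.items()}
avg = average_ranks(scores)
```

Rounded to two decimals, `avg` gives 1.86 for n-gram, 3.0 for HALK and 3.07 for both random and WL.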
The performance of the community strategy varies a lot. On some datasets, such as AIFB and BGS, its performance
is among the best while it performs a lot worse than random walks on others. This is due to the fact that the quality
of the walks is highly dependent on the quality of the community detection. If the groups of nodes, clustered by the
community detection, do not align well with the downstream task, the performance worsens.
Further, it is important to note that the various strategies are complementary to each other. Even when equal accuracies
are achieved, the confusion matrices can differ. Therefore, the combination of several strategies can further increase
the predictive performance. There are different points within the pipeline where the combination of strategies can
take place: (i) at corpus level before feeding the walks to Word2Vec, (ii) at embedding level, by combining the
different produced embeddings, and (iii) at prediction level, by aggregating the predictions of the different models.
We consider this combination of strategies to be an interesting future step. While the predictive performances of some
of the strategies proposed in this work, such as the anonymous and walklet strategy, often do not come near that of the
random strategy, a combination of these strategies could improve performance.
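The three combination points listed above can be illustrated with a short sketch. None of this was evaluated in this work: the function names are hypothetical, and pooling, concatenation and majority voting are merely one simple choice for each level.

```python
from collections import Counter

def combine_corpora(*corpora):
    """(i) Corpus level: pool the walks of several strategies."""
    combined = []
    for corpus in corpora:
        combined.extend(corpus)
    return combined

def concatenate_embeddings(*embeddings):
    """(ii) Embedding level: concatenate the vectors of every entity that
    occurs in all of the given {entity: vector} dictionaries."""
    shared = set(embeddings[0]).intersection(*embeddings[1:])
    return {e: [x for emb in embeddings for x in emb[e]] for e in shared}

def majority_vote(*predictions):
    """(iii) Prediction level: per sample, take the most frequent label
    (ties go to the label seen first)."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

corpus = combine_corpora([("A", "p", "B")], [("A", 1, 2)])
joint = concatenate_embeddings({"A": [1.0, 2.0]}, {"A": [3.0], "B": [4.0]})
voted = majority_vote(["x", "y"], ["x", "z"], ["y", "z"])
```

Which of the three levels works best, if any, is exactly the open question raised in the text.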
Some limitations of this study can be identified. Firstly, no comparisons with other techniques are performed. Here,
it is important to note that RDF2Vec is an unsupervised and task-agnostic technique. As such, comparisons with
supervised techniques, specifically trained for certain tasks, such as Relational Graph Convolutional Networks [31]
are rather unfair. In the original work of Ristoski et al. [12] it was already shown that RDF2Vec outperforms other
unsupervised variants such as TransE, TransH and TransR. This was independently confirmed by Zouaq and Martel,
who additionally showed that RDF2Vec outperformed ComplEx and DistMult as well [32]. Second, a fixed depth and
fixed hyper-parameters for the Word2Vec model were used within this study. While tuning these hyper-parameters
could possibly result in increased predictive performances, it should be noted that the number of hyper-parameters of
a Word2Vec model and their ranges are very large and that the time required to generate the embeddings is significant.
We therefore opted to fix the hyper-parameters on sensibly chosen values, as was done by Ristoski et al.
7 Conclusion
In this work, five walk strategies that can serve as an alternative to the basic random walk approach are proposed
as a response to the observation that the WL kernel offers little improvement in the context of a single KG. Results
indicate that there is no one-size-fits-all strategy for all datasets and tasks, and that tuning the strategy to a specific
objective, as opposed to simply using the random walk approach, can result in increased predictive performances.
There are several future directions that we deem interesting. First, it would be interesting to study the impact on
the performance when the strategies are combined with different biased walk strategies and embedding algorithms
that differ from the Word2Vec model used within this work. Second, all of the strategies proposed in this work are
unsupervised, but supervised approaches could be evaluated that sacrifice generality to gain predictive performance.
Third, as already mentioned, the walking strategies are complementary to each other and combining them could
potentially result in increased predictive performances. Therefore an evaluation of different combination strategies
would be an interesting addition.
Reproducibility and code availability
We provide a Python implementation of RDF2Vec which can be combined with any of the walking strategies discussed
in this work (footnote 3). Moreover, we provide all code required to reproduce the reported results (footnote 4).
GV (1S31417N) and BS (1SA0219N) are funded by a strategic base research grant of the Fund for Scientific Research
Flanders (FWO).
[1] Xander Wilcke, Peter Bloem, and Victor De Boer. The Knowledge Graph as the Default Data Model for Machine
Learning. Data Science, 1:1–0, 2017.
[2] Petar Ristoski and Heiko Paulheim. A comparison of propositionalization strategies for creating features from
linked open data. Linked Data for Knowledge Discovery, 6, 2014.
[3] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine
learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2015.
[4] William L Hamilton, Rex Ying, and Jure Leskovec. Representation Learning on Graphs: Methods and
Applications. Preprint of article to appear in the IEEE Data Engineering Bulletin, 2017.
[5] Petar Ristoski and Heiko Paulheim. Semantic web in data mining and knowledge discovery: A comprehensive
survey. Journal of Web Semantics, 36:1–22, 2016.
[6] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages
701–710, 2014.
[7] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd
ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016.
[8] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words
and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119,
2013.
[9] S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and Karsten M Borgwardt. Graph kernels. Journal
of Machine Learning Research, 11(Apr):1201–1242, 2010.
[10] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages 1365–1374, 2015.
[11] Nils M Kriege, Fredrik D Johansson, and Christopher Morris. A survey on graph kernels. Applied Network
Science, 5(1):1–42, 2020.
[12] Petar Ristoski, Jessica Rosati, Tommaso Di Noia, Renato De Leone, and Heiko Paulheim. RDF2Vec: RDF graph
embeddings and their applications. Semantic Web, 10(4):721–752, 2019.
[13] Gerben Klaas Dirk de Vries and Steven de Rooij. Substructure counting graph kernels for machine learning from
RDF data. Web Semantics: Science, Services and Agents on the World Wide Web, 35:71–84, 2015.
[14] Yoav Goldberg and Omer Levy. word2vec explained: deriving Mikolov et al.’s negative-sampling
word-embedding method. arXiv preprint arXiv:1402.3722, 2014.
[15] Michael Cochez, Petar Ristoski, Simone Paolo Ponzetto, and Heiko Paulheim. Biased graph walks for RDF graph
embeddings. In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics,
pages 1–12, 2017.
[16] Boris Weisfeiler and Andrei A Lehman. A reduction of a graph to a canonical form and an algebra arising during
this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968.
[17] Gerben KD de Vries. A fast approximation of the Weisfeiler-Lehman graph kernel for RDF data. In Joint European
Conference on Machine Learning and Knowledge Discovery in Databases, pages 606–621. Springer, 2013.
[18] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt.
Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
[19] Mohammad Mehdi Keikha, Maseud Rahgozar, and Masoud Asadpour. Community aware random walk for
network embedding. Knowledge-Based Systems, 148:47–54, 2018.
[20] Santo Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.
[21] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of
communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008.
[22] Sergey Ivanov and Evgeny Burnaev. Anonymous walk embeddings. arXiv preprint arXiv:1805.11921, 2018.
[23] Bryan Perozzi, Vivek Kulkarni, Haochen Chen, and Steven Skiena. Don’t walk, skip! online learning of multi-
scale network embeddings. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in
Social Networks Analysis and Mining 2017, pages 258–265, 2017.
[24] Jörg Schlötterer, Martin Wehking, Fatemeh Salehi Rizi, and Michael Granitzer. Investigating extensions to
random walk based graph embedding. In 2019 IEEE International Conference on Cognitive Computing (ICCC),
pages 81–89. IEEE, 2019.
[25] Gilles Vandewiele, Bram Steenwinckel, Femke Ongenae, and Filip De Turck. Inducing a decision tree with
discriminative paths to classify entities in a knowledge graph. In SEPDA2019, the 4th International Workshop
on Semantics-Powered Data Mining and Analytics, pages 1–6, 2019.
[26] Petar Ristoski, Gerben Klaas Dirk De Vries, and Heiko Paulheim. A collection of benchmark datasets for
systematic evaluations of machine learning on the semantic web. In International Semantic Web Conference,
pages 186–194. Springer, 2016.
[27] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective
classification in network data. AI magazine, 29(3):93–93, 2008.
[28] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian
Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. DBpedia–a large-scale, multilingual knowledge
base extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.
[29] Maria Angela Pellegrino, Abdulrahman Altabb, Martina Garofalo, Petar Ristoski, and Michael Cochez. Geval:
a modular and extensible evaluation framework for graph embedding techniques. In European Semantic Web
Conference. Springer, 2020.
[30] Renaud Lambiotte, J-C Delvenne, and Mauricio Barahona. Laplacian dynamics and multiscale modular structure
in networks. arXiv preprint arXiv:0812.1770, 2008.
[31] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling.
Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607.
Springer, 2018.
[32] Amal Zouaq and Felix Martel. What is the schema of your knowledge graph? leveraging knowledge graph
embeddings and clustering for expressive taxonomy learning. In Proceedings of The International Workshop on
Semantic Big Data, pages 1–6, 2020.