SJoin: A Semantic Join Operator to Integrate
Heterogeneous RDF Graphs
Mikhail Galkin¹,²,⁵, Diego Collarana¹,², Ignacio Traverso-Ribón³,
Maria-Esther Vidal²,⁴, Sören Auer¹,²
¹ Enterprise Information Systems (EIS), University of Bonn
{galkin|collaran|vidal|auer}@cs.uni-bonn.de
² Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS)
³ FZI Research Center for Information Technology, Germany
traverso@fzi.de
⁴ Universidad Simón Bolívar, Venezuela
⁵ ITMO University, Saint Petersburg, Russia
Abstract. Semi-structured data models like the Resource Description
Framework (RDF) naturally allow for modeling the same real-world
entity in various ways. For example, different RDF vocabularies en-
able the definition of various RDF graphs representing the same drug
in Bio2RDF or Drugbank. Albeit semantically equivalent, these RDF
graphs may be syntactically different, i.e., they have distinctive graph
structure or entity identifiers and properties. Existing data-driven inte-
gration approaches only consider syntactic matching criteria or similar-
ity measures to solve the problem of integrating RDF graphs. However,
syntactic-based approaches are unable to semantically integrate hetero-
geneous RDF graphs. We devise SJoin, a semantic similarity join op-
erator to solve the problem of matching semantically equivalent RDF
graphs, i.e., syntactically different graphs corresponding to the same
real-world entity. Two physical implementations are proposed for SJoin
which follow blocking or non-blocking data processing strategies, i.e.,
RDF graphs can be merged in a batch or incrementally. We empirically
evaluate the effectiveness and efficiency of the SJoin physical operators
with respect to baseline similarity join algorithms. Experimental results
suggest that SJoin outperforms baseline approaches, i.e., non-blocking
SJoin incrementally produces results faster, while the blocking SJoin ac-
curately matches all semantically equivalent RDF graphs.
1 Introduction
The support that Open Data and Semantic Web initiatives have received from
society has resulted in the publication of a large number of publicly available
datasets, e.g., United Nations Data⁶ or the Linked Open Data cloud⁷ allow for
accessing billions of records. In the context of the Semantic Web, the Resource
⁶ http://data.un.org/
⁷ http://stats.lod2.eu/
[Figure 1: graph diagram. Drugbank resources drugbank:DB00316 and
drugbank:DB01050 and DBpedia resources dbr:Paracetamol, dbr:Acetaminophen, and
dbr:Ibuprofen, each with label, CAS number, and IUPAC name values.]
Fig. 1: Motivating Example. The Ibuprofen and Paracetamol real-world entities
are modeled in different ways by Drugbank and DBpedia. Syntactically, the
properties and objects are different, but semantically they represent the
same drugs. Drug drugbank:DB01050 matches 1-1 with dbr:Ibuprofen, while
drugbank:DB00316 matches 1-2 with dbr:Paracetamol and dbr:Acetaminophen.
Description Framework (RDF) is utilized for semantically enriching data with
vocabularies or ontologies. Albeit expressive, the RDF data model allows (e.g.,
due to the non-unique names assumption) multiple representations of a real-
world entity using different vocabularies.
To illustrate this, consider chemicals and drugs represented in the Drug-
bank and DBpedia knowledge graphs. Using different vocabularies, drugs are
represented from different perspectives. DBpedia contains more general informa-
tion, whereas Drugbank provides more domain-specific facts, e.g., the chemical
composition and properties, pharmacology, and interactions with other drugs.
Fig. 1 illustrates representations of two drugs in Drugbank and DBpedia:
Ibuprofen, a drug for treating pain, inflammation, and fever, and Paracetamol,
a drug with analgesic and antipyretic effects. Firstly, Drugbank Uniform
Resource Identifiers (URIs) are textual IDs, e.g., drugbank:DB00316⁸
corresponds to Acetaminophen and drugbank:DB01050 to Ibuprofen. In contrast,
DBpedia utilizes human-readable URIs (e.g., dbr:Acetaminophen and
dbr:Ibuprofen) to identify drugs. Secondly, the same attributes are encoded
differently with various property URIs, e.g., chemicalIupacName and
casRegistryNumber in Drugbank, and iupacName and casNumber in DBpedia,
respectively. Thirdly, some drugs might be linked to more
than one analogue, e.g., Acetaminophen in Drugbank (drugbank:DB00316) corre-
sponds to two DBpedia resources: dbr:Paracetamol, and dbr:Acetaminophen.
Traditional join operators, e.g., Hash Join [2] or XJoin [11], are not capable
of joining those resources as neither URIs nor properties match syntactically.
Similarity join operators [3, 5, 6, 8, 12] tackle this heterogeneity issue, but
string and set similarity techniques are limited by the same syntactic
mismatch when deciding whether two RDF resources should be joined or not.
Therefore, we identify the need for a semantic similarity join operator able to
satisfy the following requirements: R1) Applicable to heterogeneous RDF
knowledge graphs.
⁸ Prefixes are as specified on http://prefix.cc/
R2) Able to identify joinable tuples leveraging semantic relatedness between
RDF graphs. R3) Capable of performing perfect matching for one-to-one
integration, and fuzzy conditional matching for integrating groups of N entities
from one graph with M entities from another knowledge graph. R4) Support of
a blocking operation mode for batch processing, and a non-blocking mode for
on-demand, real-time cases whenever results are expected incrementally.
We present SJoin – a semantic join operator which meets these requirements.
The contributions of this article include: 1) Definition and description of SJoin, a
semantic join operator for integrating heterogeneous RDF graphs. 2) Algorithms
and complexity study of a blocking SJoin for 1-1 integration and a non-blocking
SJoin for the N-M similarity case. 3) An extensive evaluation that demonstrates
the benefits of SJoin in terms of efficiency, effectiveness and completeness
over time in various heterogeneity conditions and confidence levels.
The article is organized as follows: The problem addressed in this work is
defined in Section 2. Section 3 presents the SJoin operator, as well as the
blocking and non-blocking physical implementations, as solutions for detecting
semantically equivalent entities in RDF knowledge graphs. Results from our
experimental study are reported in Section 4. Related work on traditional binary
joins and similarity joins is analyzed in Section 5. Finally, we
sum up the lessons learned and outline future research directions in Section 6.
2 Problem Statement
In this work, we tackle the problem of identifying semantically equivalent RDF
molecules from RDF graphs. Given an RDF graph G, we call a subgraph M
of G an RDF molecule [4] iff the RDF triples of M = {t_1, ..., t_n} share the
same subject, i.e., ∀ i, j ∈ {1, ..., n} (subject(t_i) = subject(t_j)). An RDF
molecule can be represented as a pair M = (R, T), where R corresponds to the
URI (or blank node) of the molecule subject, and T is a set of pairs
p = (prop, val) such that the triple (R, prop, val) belongs to M. We name R and
T the head and the tail of the RDF molecule M, respectively. For example, an
RDF molecule of the drug Paracetamol is (dbr:Paracetamol,
{(rdfs:label, "Paracetamol@en"), (dbo:casNumber, "103-90-2"),
(dbo:iupacName, "N-(4-hydroxyphenyl)ethanamide")}).
An RDF graph G can be described in terms of its RDF molecules as follows:
\[ \phi(G) = \{ M = (R, T) \mid t = (R, prop, val) \in G \text{ and } (prop, val) \in T \} \tag{1} \]
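The decomposition of a graph into molecules can be illustrated with a minimal Python sketch. It assumes triples are plain (subject, property, value) string tuples; the function name `molecules` and the sample triples are chosen for illustration and are not part of the SJoin implementation.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]                      # (subject, prop, val)
Molecule = Tuple[str, FrozenSet[Tuple[str, str]]]  # (head R, tail T)


def molecules(graph: Iterable[Triple]) -> Dict[str, Molecule]:
    """Group triples by subject: one molecule (R, T) per distinct subject R."""
    tails = defaultdict(set)
    for subject, prop, val in graph:
        tails[subject].add((prop, val))
    return {r: (r, frozenset(t)) for r, t in tails.items()}


# Illustrative triples taken from the DBpedia side of Fig. 1.
G = [
    ("dbr:Paracetamol", "rdfs:label", "Paracetamol@en"),
    ("dbr:Paracetamol", "dbo:casNumber", "103-90-2"),
    ("dbr:Ibuprofen", "dbo:casNumber", "15687-27-1"),
]
phi_G = molecules(G)
```

Here `phi_G` holds two molecules, one per subject URI, and each tail is the set of (prop, val) pairs of that subject.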
Definition 1 (Problem of Semantically Equivalent RDF Graphs). Given
sets of RDF molecules φ(G), φ(D), and φ(F), and an RDF molecule M_e in
φ(F) which corresponds to an entity e represented by different RDF molecules
M_G and M_D in φ(G) and φ(D), respectively. The problem of identifying
semantically equivalent entities between sets of RDF molecules φ(G) and φ(D)
consists of providing a homomorphism θ: φ(G) ∪ φ(D) → 2^{φ(F)}, such that if two
RDF molecules M_G and M_D represent the RDF molecule M_e, then M_e ∈ θ(M_G)
and M_e ∈ θ(M_D); otherwise, θ(M_G) ≠ θ(M_D).
Definition 1 considers perfect 1-1 matching, e.g., determining 1-1 semantic
equivalences between drugbank:DB01050 and dbr:Ibuprofen, as well as N-M
matching, e.g., drugbank:DB00316 with both dbr:Paracetamol and dbr:Acetaminophen.
3 Proposed Solution: The SJoin Operator
We propose a similarity join operator named SJoin, able to identify joinable
entities between RDF graphs, i.e., SJoin implements the homomorphism θ(.).
SJoin is based on the Resource Similarity Molecule (RSM) structure, which, in
combination with a similarity function Simf and a threshold γ, produces a list
of matching entity pairs. RSM is defined as follows:
Definition 2 (Resource Similarity Molecule (RSM)). Given a set M of
RDF molecules, a similarity function Simf, and a threshold γ. A Resource
Similarity Molecule is a pair RSM = ($\mathcal{M}$, $\mathcal{T}$), where:
• $\mathcal{M}$ = (R, T) is the head of RSM and the RDF molecule described in RSM.
• $\mathcal{T}$ is the tail of RSM and represents an ordered list of RDF
  molecules M_i = (R_i, T_i). $\mathcal{T}$ meets the following conditions:
  • $\mathcal{M}$ is highly similar to M_i, i.e., Simf(R, R_i) ≥ γ.
  • For all M_i = (R_i, T_i) ∈ $\mathcal{T}$, Simf(R, R_i) ≥ Simf(R, R_{i+1}).
An RSM is composed of a head and a tail that correspond to an RDF molecule
and a list of molecules whose similarity score is at least a specified threshold
γ, respectively. For example, an RSM of Ibuprofen (with omitted tails of
property:value pairs) is ((dbr:Ibuprofen, T)[(drugbank:DB01050, T_1),
(chebi:5855, T_2), (wikidata:Q186969, T_3)]) given a similarity function Simf,
a threshold γ, and
Simf(dbr:Ibuprofen, drugbank:DB01050) ≥ Simf(dbr:Ibuprofen, chebi:5855), and
Simf(dbr:Ibuprofen, chebi:5855) ≥ Simf(dbr:Ibuprofen, wikidata:Q186969).
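As an illustration of Definition 2, the following sketch builds an RSM for the Ibuprofen example. The class `RSM`, the helper `build_rsm`, and the lookup-table similarity `toy_simf` are hypothetical names; the scores other than 0.8 are invented purely to produce the ordering of the example and do not come from GADES.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Molecule = Tuple[str, frozenset]  # (URI R, set of (prop, val) pairs T)


@dataclass
class RSM:
    head: Molecule                                        # molecule described by the RSM
    tail: List[Molecule] = field(default_factory=list)   # similar molecules, best first


def build_rsm(head, candidates, simf, gamma):
    """Keep candidates with simf(R, R_i) >= gamma, ordered by descending score."""
    scored = [(simf(head[0], m[0]), m) for m in candidates]
    kept = [(s, m) for s, m in scored if s >= gamma]
    kept.sort(key=lambda sm: sm[0], reverse=True)
    return RSM(head, [m for _, m in kept])


# Made-up scores standing in for a semantic similarity function.
TOY_SCORES = {
    "drugbank:DB01050": 0.8,
    "chebi:5855": 0.7,
    "wikidata:Q186969": 0.6,
    "drugbank:DB00316": 0.1,
}


def toy_simf(r, r_i):
    return TOY_SCORES.get(r_i, 0.0)


head = ("dbr:Ibuprofen", frozenset())
candidates = [
    (uri, frozenset())
    for uri in ["chebi:5855", "drugbank:DB00316", "wikidata:Q186969", "drugbank:DB01050"]
]
rsm = build_rsm(head, candidates, toy_simf, gamma=0.5)
```

With γ = 0.5 the tail of `rsm` contains the Drugbank, ChEBI, and Wikidata molecules in that order, while drugbank:DB00316 is filtered out.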
The SJoin operator is a two-fold algorithm that performs: first, Similarity
Partitioning, and second, Similarity Probing to identify semantically equiva-
lent RDF molecules. To address batch and real-time processing scenarios, we
present two implementations of SJoin. Blocking SJoin Operator solves the
1-1 weighted perfect matching problem allowing for a batch processing of the
graphs. Non-Blocking SJoin Operator employs fuzzy conditional matching
for identifying communities of N-M entities in graphs, covering the on-demand
case whenever results are expected to be produced incrementally.
3.1 Blocking SJoin Operator
Fig. 2 illustrates the intuition behind the blocking SJoin operator. Similarity
Partitioning and Probing steps are executed sequentially. Thus, blocking SJoin
operator completely evaluates both datasets of RDF molecules in the Partition-
ing step, and then fires the Probing step to produce the whole output.
The Similarity Partitioning step is described in Algorithm 1. The operator
initializes two lists of RSMs for two RDF graphs and incoming RDF molecules
[Figure 2: diagram. Molecules from Dataset A and Dataset B are inserted into
lists of RSMs with empty tails (Similarity Partitioning, using Simf and γ);
1-1 Perfect Matching then outputs pairs (Similarity Probing).]
Fig. 2: SJoin Blocking Operator. Similarity Partitioning step initializes lists
of RSMs and populates their tails through a similarity function Simf and a
threshold γ. Similarity Probing step performs 1-1 weighted perfect matching
and outputs the perfect pairs of semantically equivalent molecules (M_iA, M_jB).
Algorithm 1: Similarity Partitioning step for Blocking SJoin operator
according to similarity function Simf and threshold γ
Data: Dataset φ(D_A), Simf, γ
Result: List of RSM_A, List of RSM_B
while getMolecule(φ(D_A)) do
    M_iA ← getMolecule(φ(D_A));
    R_iA ← head(M_iA);  // Get URI
    for RSM_jB ∈ List of RSM_B do
        RSM_jB = ((R_jB, T_jB)[(R_lA, T_lA), ..., (R_kA, T_kA)]);
        R_jB ← head(head(RSM_jB));  // Get URI
        if Simf(R_jB, R_iA) ≥ γ then  // Probe
            tail(RSM_jB) ← tail(RSM_jB) + (M_iA);
return sort(List of RSM_A), sort(List of RSM_B)
are inserted into the respective list with a filled head M and an empty tail T. To
populate the tail of an RSM in list A, SJoin resorts to a semantic similarity
function for computing a similarity score between the RSM and all RSMs in
the opposite list B. If the similarity score exceeds a certain threshold γ, then
the molecule from list B is appended to the tail of the RSM. Finally, the
tail is sorted in descending order of similarity score such that the most similar
RDF molecule obtains the top position in the tail. For instance, the semantic
similarity function GADES [10] is able to decide relatedness between the RDF
molecules of dbr:Ibuprofen and drugbank:DB01050 in Fig. 1, and assigns a
similarity score of 0.8. The algorithm supports datasets with arbitrary numbers
of molecules. However, in order to guarantee 1-1 perfect matching, we place the
restriction card(φ(D_A)) = card(φ(D_B)), i.e., the number of molecules in φ(D_A)
and φ(D_B) must be the same. Thus, card(List of RSM_A) = card(List of RSM_B).
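The partitioning step can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: molecules are reduced to their URIs, the similarity function is assumed to be symmetric so that a single call fills both tails, and the names `similarity_partitioning`, `SCORES`, and `table_simf` are hypothetical.

```python
def similarity_partitioning(uris_a, uris_b, simf, gamma):
    """Build RSM tails for both datasets.

    Returns two dicts mapping each URI to a list of (score, other_uri)
    entries with score >= gamma, sorted by descending score.
    """
    rsm_a = {r: [] for r in uris_a}
    rsm_b = {r: [] for r in uris_b}
    for ra in uris_a:
        for rb in uris_b:
            score = simf(ra, rb)  # assumed symmetric: simf(ra, rb) == simf(rb, ra)
            if score >= gamma:
                rsm_a[ra].append((score, rb))
                rsm_b[rb].append((score, ra))
    for rsms in (rsm_a, rsm_b):
        for tail in rsms.values():
            tail.sort(key=lambda entry: entry[0], reverse=True)
    return rsm_a, rsm_b


# Invented scores for two molecules per dataset.
SCORES = {("a1", "b1"): 0.9, ("a1", "b2"): 0.4, ("a2", "b1"): 0.5, ("a2", "b2"): 0.7}


def table_simf(x, y):
    return SCORES.get((x, y), 0.0)


rsm_a, rsm_b = similarity_partitioning(["a1", "a2"], ["b1", "b2"], table_simf, gamma=0.45)
```

The nested loop makes the quadratic number of similarity computations explicit; the pair (a1, b2) with score 0.4 falls below the threshold and appears in neither tail.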
A 1-1 weighted perfect matching is applied at the Similarity Probing stage in
the Blocking SJoin operator. It accepts the lists of RSM_A and RSM_B created and
populated during the previous Similarity Partitioning step. This step aims at
producing perfect pairs of semantically equivalent RDF molecules (M_iA, M_jB), i.e.,
[Figure 3: diagram. (a) 1-1 matching from the bipartite graph of RSMs, linking
(R_iA, T_iA)[(R_jB, T_jB), ..., (R_kB, T_kB)] in the List of RSM_A with
(R_jB, T_jB)[(R_iA, T_iA), ..., (R_mA, T_mA)] in the List of RSM_B.
(b) Matched pairs: n pairs such as ((R_iA, T_iA), (R_jB, T_jB)).]
Fig. 3: 1-1 Weighted Perfect Matching. (a) The matching is identified from
the lists of RSM_A and RSM_B; RDF molecules M_iA = (R_iA, T_iA) and M_jB =
(R_jB, T_jB) are semantically equivalent whenever R_iA and R_jB are reciprocally
the most similar RDF molecules according to Simf.
Algorithm 2: 1-1 Weighted Perfect Matching of RSMs bipartite graph
Data: List of RSM_A, List of RSM_B
Result: List of pairs LP = ((R_iA, T_iA), (R_jB, T_jB))
for RSM_iA ∈ List of RSM_A do
    RSM_iA = ((R_iA, T_iA)[(R_jB, T_jB), ..., (R_kB, T_kB)]);  // Ordered Set
    for (R_jB, T_jB) ∈ tail(RSM_iA) do
        RSM_jB ← Find in the List of RSM_B;
        RSM_jB = ((R_jB, T_jB)[(R_lA, T_lA), ..., (R_zA, T_zA)]);  // Ordered Set
        if (R_lA, T_lA) = (R_iA, T_iA) and (R_iA, T_iA) ∉ LP then
            LP ← LP + ((R_iA, T_iA), (R_jB, T_jB));  // Add to result
        else
            for (R_lA, T_lA) ∈ tail(RSM_jB) do
                find the position of (R_iA, T_iA);
return LP
max(Simf(M_iA, RSM_B)) = max(Simf(M_jB, RSM_A)) = Simf(M_iA, M_jB).
That is, for a given molecule M_iA, there is no molecule in the list of RSM_B which
has a similarity score higher than Simf(M_iA, M_jB), and vice versa. Algorithm 2
describes how perfect pairs are created; Fig. 3 illustrates the algorithm.
Traversing the List of RSM_A, the algorithm iterates over each RSM_iA. Then,
the tail of RSM_iA, i.e., an ordered list of highly similar molecules, is extracted.
The first molecule of the tail, RSM_jB, corresponds to the most similar molecule
from the List of RSM_B. The algorithm searches for RSM_jB in the List of RSM_B
and examines whether the molecule (R_iA, T_iA) is the first one in the tail of
RSM_jB. If this condition holds and (R_iA, T_iA) is not already matched with
another RSM, then the pair ((R_iA, T_iA), (R_jB, T_jB)) is identified as a perfect
pair and is appended to the result list of pairs LP (cf. Fig. 3a). Otherwise,
the algorithm finds the first occurrence of (R_iA, T_iA) in the tail of RSM_jB and
appends the result pair to LP. When all RSMs are matched, the algorithm
yields the list of perfectly matched pairs (cf. Fig. 3b).
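The reciprocal check at the core of this step can be sketched as follows. The sketch covers only the first branch of Algorithm 2 (both molecules sit at the top of each other's tail); the fallback branch is deliberately omitted, so it is not a full weighted perfect matching. Tails are assumed to be lists of URIs already sorted by descending similarity, and the name `reciprocal_pairs` is hypothetical.

```python
def reciprocal_pairs(rsm_a, rsm_b):
    """Pair a with b when b tops a's tail and a tops b's tail."""
    pairs = []
    for ra, tail_a in rsm_a.items():
        if not tail_a:
            continue                      # no candidate above the threshold
        rb = tail_a[0]                    # most similar molecule in dataset B
        tail_b = rsm_b.get(rb, [])
        if tail_b and tail_b[0] == ra:    # ra is also the best match of rb
            pairs.append((ra, rb))
    return pairs


# Tails ordered by descending similarity (illustrative data).
example_rsm_a = {"a1": ["b1"], "a2": ["b2", "b1"], "a3": ["b1"]}
example_rsm_b = {"b1": ["a1", "a2", "a3"], "b2": ["a2"]}
perfect_pairs = reciprocal_pairs(example_rsm_a, example_rsm_b)
```

In this example a3 stays unmatched because its preferred partner b1 prefers a1; a complete implementation would need the fallback of Algorithm 2 or an assignment algorithm such as the Hungarian method to resolve such cases.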
[Figure 4: two diagrams of the Dataset A and Dataset B lists with insert and
probe steps (Similarity Partitioning / Similarity Probing, using Simf and γ).
(a) Molecule (R_1A, T_1A) yields a pair ((R_1A, T_1A), (R_2B, T_2B)).
(b) Molecule (R_3B, T_3B) yields a pair ((R_3B, T_3B), (R_2A, T_2A)).]
Fig. 4: SJoin Non-Blocking Operator. Identifies N-M matchings and produces
results as soon as a new molecule arrives. When a molecule (R_iA, T_iA)
arrives, it is inserted into the relevant list and probed against the other
list. If the similarity score exceeds the threshold γ, a new matching is produced.
3.2 Non-Blocking SJoin Operator
The Non-Blocking SJoin operator aims at identifying N-M matchings, i.e.,
an RSM_iA might be associated with multiple RSMs, e.g., RSM_jB or RSM_kB.
Therefore, 1-1 weighted perfect matching is not executed, which enables the
operator to produce results as soon as new molecules arrive, i.e., in a
non-blocking, on-demand manner. The operator receives two sets of RDF molecules
φ(D_A) and φ(D_B). Lists of RSM_A and RSM_B are initialized as empty lists.
Algorithm 3 describes the join procedure and Fig. 4 illustrates the algorithm.
For every incoming molecule M_iA from φ(D_A), Algorithm 3 performs the
same two steps: Similarity Partitioning and Similarity Probing. The URI R_iA
of an RDF molecule extracted from the tuple (R_iA, T_iA) is probed against URIs
of all existing RSMs in the List of RSM_B (cf. Fig. 4). If the similarity score
Simf(R_iA, R_jB) exceeds the threshold γ, then the pair ((R_iA, T_iA), (R_jB, T_jB))
is considered a matching and appended to the results list LP. During the
Similarity Insert step, an RSM_iA is initialized, the molecule (R_iA, T_iA) becomes
its head, and the RSM is added to the respective List of RSM_A. Algorithm 3 is
applied to both φ(D_A) and φ(D_B) and is able to produce results while the Lists
of RSMs are constantly updated, supporting the non-blocking operation workflow.
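The probe-then-insert behavior lends itself to a generator. The sketch below is an illustration under simplifying assumptions, not the published code: both inputs are merged into a single stream of (side, URI, tail) tuples, the similarity function is a symmetric lookup table with invented scores, and the names `nonblocking_sjoin` and `pair_simf` are hypothetical.

```python
def nonblocking_sjoin(stream, simf, gamma):
    """Yield matching pairs as soon as each molecule arrives.

    `stream` yields (side, uri, tail) with side in {"A", "B"}.
    Every emitted pair lists the molecule from A first.
    """
    lists = {"A": [], "B": []}
    for side, uri, tail in stream:
        other = "B" if side == "A" else "A"
        for other_uri, other_tail in lists[other]:   # probe the opposite list
            if simf(uri, other_uri) >= gamma:
                pair = ((uri, tail), (other_uri, other_tail))
                yield pair if side == "A" else pair[::-1]
        lists[side].append((uri, tail))              # insert into own list


# Invented scores reproducing the two matchings of Fig. 4.
PAIR_SCORES = {("R1A", "R2B"): 0.8, ("R2A", "R3B"): 0.7}


def pair_simf(x, y):
    return PAIR_SCORES.get((x, y), PAIR_SCORES.get((y, x), 0.0))


arrivals = [
    ("B", "R1B", ()), ("B", "R2B", ()), ("A", "R1A", ()),
    ("A", "R2A", ()), ("B", "R3B", ()),
]
results = list(nonblocking_sjoin(arrivals, pair_simf, gamma=0.5))
```

Because the function is a generator, a consumer can process the first pair right after R1A arrives, without waiting for the remaining molecules.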
3.3 Time Complexity Analysis
The SJoin binary operator receives two RDF graphs of n RDF molecules each.
To estimate the complexity of the blocking SJoin operator, the three most
expensive operations have to be analyzed. Table 1 gives an overview of the
analysis. The complexity of the Data Partitioner module depends on Algorithm 1,
i.e., the construction of the Lists of RSM_A and RSM_B, and on the similarity
function Simf. The asymptotic approximation equals O(n^2 · O(Simf)). To produce
ordered tails of RSMs, the similar molecules in the tail have to be sorted in the descending
Algorithm 3: The Non-Blocking SJoin operator executes both Similarity
Partitioning and Probing steps as soon as an RDF molecule arrives from
an RDF graph.
Data: Dataset φ(D_A), Simf, γ
Result: List of pairs LP = ((R_iA, T_iA), (R_jB, T_jB))
while getMolecule(φ(D_A)) do
    M_iA ← getMolecule(φ(D_A));
    R_iA ← head(M_iA), T_iA ← tail(M_iA);  // Get URI, tail
    for RSM_jB ∈ List of RSM_B do
        RSM_jB = ((R_jB, T_jB)[]);
        R_jB ← head(head(RSM_jB));  // Get URI
        T_jB ← tail(head(RSM_jB));  // Get tail
        if Simf(R_iA, R_jB) ≥ γ then  // Probe
            LP ← LP + ((R_iA, T_iA), (R_jB, T_jB));
    head(RSM_iA) ← M_iA, tail(RSM_iA) ← [];
    List of RSM_A ← List of RSM_A + RSM_iA;  // Insert
return LP
Table 1: The SJoin Time Complexity. Results for the steps of Partitioning,
Sorting, and Matching, where n is the number of RDF molecules.
Stage          Blocking SJoin Complexity     Non-Blocking SJoin Complexity
Partitioning   O(n^2 · O(Simf))              O(n^2 · O(Simf))
Sorting        O(n log n)                    -
Matching       O(n^3)                        -
Overall        O(n^2 · O(Simf)) + O(n^3)     O(n^2 · O(Simf))
similarity score order. The applicable merge sort and heapsort algorithms have
O(n log n) asymptotic complexity. The 1-1 Weighted Perfect Matching component
has O(n^3) complexity in the worst case according to Algorithm 2. Likewise,
the Hungarian algorithm [7], a standard approach for 1-1 weighted perfect
matching, has the same O(n^3) complexity. Partitioning, sorting, and
perfect matching are executed sequentially. Therefore, the overall complexity
is the sum of the complexities, i.e., O(n^2 · O(Simf)) + O(n log n) + O(n^3),
which equals O(n^2 · O(Simf)) + O(n^3). We thus deduce that the SJoin
complexity depends on the complexity of the chosen similarity measure, whereas
the lowest achievable order of complexity is limited to O(n^3).
The complexity of the non-blocking SJoin operator stems from the analysis
of Algorithm 3. The most expensive step of the algorithm is to compute a
similarity score between an RSM_iA and the RSMs in the List of RSM_B. Applied to
both φ(D_A) and φ(D_B), the complexity converges to O(n^2 · O(Simf)).
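The two bounds can be spelled out in a short derivation. It assumes that both inputs hold n molecules and that every pair (M_iA, M_jB) is probed exactly once, namely when the later of the two molecules arrives; the symbols T_blocking and T_non-blocking are introduced here only for illustration.

```latex
\[
\#\text{probes} \;=\; \sum_{i=1}^{n} \sum_{j=1}^{n} 1 \;=\; n^{2}
\quad\Longrightarrow\quad
T_{\text{non-blocking}}(n) \;=\; O\!\left(n^{2} \cdot O(Sim_f)\right)
\]
\[
T_{\text{blocking}}(n)
\;=\; O\!\left(n^{2} \cdot O(Sim_f)\right) + O(n \log n) + O(n^{3})
\;=\; O\!\left(n^{2} \cdot O(Sim_f)\right) + O(n^{3}),
\qquad \text{since } n \log n \le n^{3} \text{ for } n \ge 1 .
\]
```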
4 Empirical Study
An empirical evaluation is conducted to study the efficiency and effectiveness of
SJoin in blocking and non-blocking conditions on RDF graphs from DBpedia and
Table 2: Benchmark Description. RDF datasets used in the evaluation.
            Experiment 1: People       Experiment 2: People
            DBpedia D1   DBpedia D2    DBpedia   Wikidata   DBpedia   Wikidata
Molecules   500          500           500       500        1000      1000
Triples     17,951       17,894        29,263    16,307     54,590    29,138
Wikidata. We assess the following research questions: RQ1) Does blocking SJoin
integrate RDF graphs more efficiently and effectively compared to the state of
the art? RQ2) What is the impact of threshold values on the completeness of
a non-blocking SJoin? RQ3) What is the effect of a similarity function on the
SJoin results? The experimental configuration is as follows:
Benchmark: Experiment 1 is executed against a dataset of 500 molecules⁹
of type Person extracted from the live version of DBpedia (February 2017).
Based on the original molecules, we created two sets of molecules by randomly
deleting or editing triples in the two sets. Sharing the same DBpedia vocabulary,
Experiment 1 datasets have a higher resemblance degree compared to Experi-
ment 2. Experiment 2 employs subsets of DBpedia and Wikidata of the Person
class. Assessing SJoin in the higher heterogeneity settings, we sampled datasets
of 500 and 1000 molecules varying triples count from 16K up to 55K¹⁰. Table 2
provides basic statistics on the experimental datasets. DBpedia D1 and D2 refer
to the dumps of 500 molecules. Further, the dumps of 500 and 1000 molecules
for Experiment 2 are extracted from DBpedia and Wikidata.
Baseline: Gold standards for blocking operators comparison include the
original DBpedia Person descriptions (Experiment 1) and owl:sameAs links be-
tween DBpedia and Wikidata (Experiment 2). We compare SJoin with a Hash
Join operator. For a fair comparison, the Hash Join was extended to support sim-
ilarity functions at the Probing stage. That is, blocking SJoin is compared against
blocking similarity Hash Join and non-blocking SJoin is evaluated against
non-blocking Symmetric Hash Join. The gold standard for evaluating non-blocking
operators comprises the precomputed numbers of pairs whose similarity
score exceeds a predefined threshold; gold standards are computed offline.
Metrics: We report on execution time (ET in secs) as the elapsed time
required by the SJoin operator to produce all the answers. Furthermore, we
measure Precision and Recall, and report the F1-measure during the experiments
with blocking operators. Precision is the fraction of the RDF molecules
identified and integrated by the operator (M) that also appear in the Gold
Standard (GS), i.e., Precision = |M ∩ GS| / |M|. Recall is the fraction of the
molecules in the Gold Standard that were identified, i.e.,
Recall = |M ∩ GS| / |GS|. Comparing non-blocking operators, we measure
Completeness over time, i.e., the fraction of results produced at a certain
time stamp. The timeout is set to one hour (3,600 seconds), and the operators'
results are checked every second. Ten thresholds in the range [0.1 : 1.0]
with step 0.1 were applied in Experiment 1. In Experiment 2, five thresholds in
⁹ https://github.com/RDF-Molecules/Test-DataSets/tree/master/DBpedia-People/20160819
¹⁰ https://github.com/RDF-Molecules/Test-DataSets/tree/master/DBpedia-WikiData/operators_evaluation
[Figure 5: bar charts of execution time (ET, sec; partitioning and probing)
per threshold from 0.1 to 1.0, with the F1 score on the right axis.
(a) SJoin performance. (b) Hash Join performance.]
Fig. 5: Experiment 1 (GADES) with blocking operators. The partitioning
bar shows the time taken to partition the molecules in RSMs, probing indicates
the time required for 1-1 weighted perfect matching. Black line chart on the right
axis denotes F1 score. (a) SJoin demonstrates higher F1 score while consuming
more time for perfect matching. (b) Baseline Hash Join demonstrates less than
0.25 F1 score even on lower thresholds spending less time on probing.
the range [0.1 : 0.5] were evaluated because no pair of entities in the sampled
RDF datasets has a GADES similarity score higher than 0.5.
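The metrics can be computed as in the following sketch. Precision and Recall follow the definitions above; F1 is taken as their harmonic mean, which is the standard definition and is assumed here since the text does not spell it out. The function names and the sample pairs are illustrative.

```python
def precision_recall_f1(matched, gold):
    """Precision = |M ∩ GS| / |M|, Recall = |M ∩ GS| / |GS|, F1 = harmonic mean."""
    matched, gold = set(matched), set(gold)
    hits = len(matched & gold)
    precision = hits / len(matched) if matched else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def completeness(produced_so_far, expected_total):
    """Share of the expected results already produced, in percent."""
    return 100.0 * produced_so_far / expected_total


# Illustrative pairs: two of three produced pairs are in the gold standard.
produced = {("a1", "b1"), ("a2", "b2"), ("a3", "b4")}
gold_standard = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
p, r, f1 = precision_recall_f1(produced, gold_standard)
```

For this toy input, Precision is 2/3, Recall is 1/2, and F1 is 4/7.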
Implementation: Both blocking and non-blocking SJoin operators are
implemented in Python 2.7.10¹¹. Baseline improved Hash Joins are implemented
in Python as well¹². The experiments were executed on an Ubuntu 16.04 (64
bits) Dell PowerEdge R805 server, AMD Opteron 2.4GHz CPU, 64 cores, 256GB
RAM. We evaluated two similarity functions: GADES [10] and Semantic Jaccard
(SemJaccard) [1]. GADES relies on semantic descriptions encoded in ontologies
to determine relatedness, while SemJaccard requires the materialization of
implicit knowledge and mappings. To evaluate the schema heterogeneity of DBpedia
and Wikidata in Experiment 2, the similarity function is fixed to GADES.
4.1 DBpedia – DBpedia People
Experiment 1 evaluates the performance and effectiveness of blocking and
non-blocking SJoin compared to the respective Hash Join implementations. The
testbed includes two split DBpedia dumps with semantically equivalent entities
but non-matching resource URIs and randomly distributed properties, together
with the GADES and SemJaccard similarity functions. Both graphs are described
in terms of the same DBpedia ontology. Fig. 5 visualizes the results obtained
when applying the GADES semantic similarity function to identify a perfect
matching of graph resources, i.e., in blocking conditions. SJoin exhibits a
better F1 score up to the very
¹¹ https://github.com/RDF-Molecules/operators/tree/master/mFuhsion
¹² https://github.com/RDF-Molecules/operators/tree/master/baseline_ops
[Figure 6: line charts of Completeness (%) over Time (0 to 3,600 sec) for hash
join and sjoin. (a) T = 0.1, # triples: 166573. (b) T = 0.3, # triples: 108922.
(c) T = 0.5, # triples: 15148. (d) T = 0.8, # triples: 406.]
Fig. 6: Experiment 1 (GADES) with non-blocking operators. SJoin
produces complete results at all thresholds in contrast to Hash Join.
high 0.9 threshold value. Moreover, the effectiveness of more than 80% is ensured
up to 0.6 threshold value whereas Hash Join barely reaches 25% even on lower
thresholds. The partitioning time is constant for both operators but Hash Join
performs the partitioning slower due to the application of a hash function to
all incoming molecules. However, high effectiveness of SJoin is achieved at the
expense of time efficiency. SJoin has to complete a 1-1 perfect matching algo-
rithm against a large 500x500 matrix whereas Hash Join performs the perfect
matching three times but for smaller matrices equal to the size of its buckets,
e.g., about 166x166 for three buckets which is faster due to the cubic complexity
of the weighted perfect matching algorithm.
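The gap in probing time follows directly from the cubic cost. A rough operation count, assuming three equally sized buckets as in the example above and ignoring constant factors:

```latex
\[
500^{3} = 1.25 \times 10^{8},
\qquad
3 \cdot 166^{3} = 3 \cdot 4{,}574{,}296 = 13{,}722{,}888 \approx 1.37 \times 10^{7},
\qquad
\frac{500^{3}}{3 \cdot 166^{3}} \approx 9.1 .
\]
```

Under these assumptions, matching inside three buckets needs roughly nine times fewer elementary steps than matching the full 500x500 matrix, at the price of never comparing molecules that land in different buckets.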
Fig. 6 shows the results of the evaluation of non-blocking operators with
GADES. SJoin outperforms the baseline Hash Join in terms of completeness
over time in all four cases with the threshold in the range 0.1-0.8. Fig. 6a demon-
strates that the SJoin operator is capable of producing 100% of results within
the timeframe whereas the Hash Join operator outputs only about 10% of the
expected tuples. In Fig. 6b, SJoin achieves the full completeness even faster. In
Fig. 6c both operators finish after 18 minutes, but SJoin retains full complete-
ness while Hash Join reaches only 35%. Finally, with the 0.8 threshold in Fig. 6d,
Hash Join performs very fast but still struggles to attain the full completeness;
SJoin takes more time but consistently achieves answer completeness. One of
the reasons why Hash Join performs worse is its hash function, which does not
consider the semantics encoded in the molecule descriptions. Therefore, the hash
function partitions RDF molecules into buckets almost randomly, while it was
originally envisioned to place similar entities in the same buckets.
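The effect of a semantics-agnostic hash function can be illustrated with a small sketch. The hash function used by the baseline is not specified in the text, so `zlib.crc32` over the URI string serves here as a stand-in assumption, and the names `bucket_of` and `candidate_pairs` are hypothetical.

```python
import zlib
from collections import defaultdict
from itertools import product


def bucket_of(uri, n_buckets=3):
    # crc32 is a stand-in hash: it depends only on the characters of the URI,
    # not on the entity the URI denotes.
    return zlib.crc32(uri.encode("utf-8")) % n_buckets


def candidate_pairs(uris_a, uris_b, n_buckets=3):
    """Return the pairs a bucketed join would ever compare."""
    buckets_a, buckets_b = defaultdict(list), defaultdict(list)
    for uri in uris_a:
        buckets_a[bucket_of(uri, n_buckets)].append(uri)
    for uri in uris_b:
        buckets_b[bucket_of(uri, n_buckets)].append(uri)
    return [
        pair
        for key in buckets_a
        for pair in product(buckets_a[key], buckets_b[key])
    ]


dbpedia_uris = ["dbr:Ibuprofen", "dbr:Paracetamol", "dbr:Acetaminophen"]
drugbank_uris = ["drugbank:DB01050", "drugbank:DB00316"]
compared = candidate_pairs(dbpedia_uris, drugbank_uris)
```

Only pairs that share a bucket are ever passed to the similarity function. Whether dbr:Ibuprofen and drugbank:DB01050 end up in the same bucket depends on the hash values of two unrelated strings, which is why equivalent molecules can be missed regardless of the threshold.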
Fig. 7 presents the efficiency and effectiveness of blocking SJoin and Hash Join
when applying the SemJaccard similarity function. Because SemJaccard is an
unsophisticated measure,
[Figure 7: bar charts of execution time (ET, sec; partitioning and probing)
per threshold from 0.1 to 1.0, with the F1 score on the right axis.
(a) SJoin performance. (b) Hash Join performance.]
Fig. 7: Experiment 1 (SemJaccard) with blocking operators. (a) SJoin
takes less time to compute similarity scores while F1 score quickly deteriorates
after threshold 0.5. (b) Baseline Hash Join in most cases consumes more time
and produces less reliable matchings.
[Figure 8: line charts of Completeness (%) over Time (0 to 3,600 sec) for hash
join and sjoin. (a) T = 0.4, GADES, # triples: 50857. (b) T = 0.4, Jaccard,
# triples: 486.]
Fig. 8: Experiment 1 with fixed threshold. GADES identifies two orders of
magnitude more results than Jaccard while SJoin still achieves full completeness.
operators require less time for both the partitioning and probing stages.
That is, due to the heterogeneous nature of the compared datasets, SemJaccard
is not able to produce similarity scores higher than 0.4. On the other hand,
the simplicity of SemJaccard leads to significant deterioration of the F1 score
already at low thresholds, i.e., 0.3-0.4.
Fig. 8 illustrates the difference in elapsed time and achieved completeness of
SJoin and Hash Join applying GADES or SemJaccard similarity functions. Evi-
dently, SemJaccard outputs fewer tuples even on lower thresholds, e.g., 486 pairs
at 0.4 threshold against 50,857 pairs by GADES. We therefore demonstrate that
plain set similarity measures such as SemJaccard, which consider only an
intersection of exactly the same triples, are ineffective in integrating
heterogeneous RDF graphs.
4.2 DBpedia - Wikidata People
The distinctive feature of the experiment consists in completely different vo-
cabularies used to semantically describe the same people. Therefore, traditional
[Figure 9: (a) GADES distribution. (b) SJoin and (c) Hash Join: bar charts of
execution time (ET, sec; partitioning and probing) per threshold from 0.1 to
0.5, with the F1 score on the right axis.]
Fig. 9: Experiment 2 (GADES) with blocking operators, 500 molecules.
(a) The distribution of GADES similarity scores shows that there are few pairs
whose score exceeds the 0.4 threshold. (b) SJoin requires more time but achieves
more than 0.9 F1 score until T0.3. (c) Baseline Hash Join operates faster but
achieves less than 0.25 F1 accuracy.
joins and set similarity joins, e.g., Jaccard, are not applicable. We evaluate the
performance of SJoin employing GADES semantic similarity measure.
Fig. 9 reports the efficiency and effectiveness of SJoin compared to Hash Join
in the 500 molecules setup. Fig. 9a justifies the range of selected thresholds, as
only a small number of pairs have a similarity score higher than 0.5. Blocking
SJoin achieves a higher F1 score (max 95%) up to a threshold value of 0.3,
but requires significantly more time to compute the perfect matching.
Results of non-blocking SJoin and Hash Join executed against the 500 and 1000
molecules configurations are reported in Fig. 10. The observed behavior of these
operators resembles the one in Experiment 1, i.e., SJoin outputs complete results
within a predefined time frame, while Hash Join barely achieves 40% completeness
in the case with a relatively high threshold of 0.4 and a small number of outputs.
Analyzing the observed empirical results, we are able to answer our research
questions: RQ1) Blocking SJoin consistently exhibits higher F1 scores, and the
results are more reliable. However, time efficiency depends on the input graphs
and applied similarity functions. RQ2) A threshold value prunes the amount of
expected results and does not affect the completeness of SJoin. RQ3) Clearly, a
semantic similarity function allows for matching RDF graphs more accurately.
5 Related Work
[Figure 10, panels: (a) T = 0.2, 500 molecules (# triples: 153904); (b) T = 0.4, 500 molecules (# triples: 639); (c) T = 0.2, 1000 molecules (# triples: 160062); (d) T = 0.4, 1000 molecules (# triples: 3466). x-axis: Time, sec (0 to 3600); y-axis: Completeness, %; series: hash join, sjoin.]
Fig. 10: Experiment 2. Non-blocking operators in different dataset
sizes. In larger setups, SJoin still reaches full completeness.
Traditional binary join operators require join variable instantiations to be
exactly the same. For example, the XJoin [11] and Hash Join [2] (chosen as a baseline
in this paper) operators abide by this condition. At the Insert step, both blocking
and non-blocking Hash Join algorithms partition incoming tuples into a number
of buckets based on the assumption that after applying a hash function similar
tuples will reside in the same bucket. The assumption holds true in cases of simple
data structures, e.g., numbers or strings. However, applying hash functions
to string representations of complex data structures such as RDF molecules or
RSMs tends to produce more collisions rather than efficient partitions. At the
Probe stage, Hash Join performs matching with respect to a specified join variable. Thus,
with a URI as the join variable, semantically equivalent RSMs with different URIs
cannot be joined by Hash Join.
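The limitation described above can be sketched as follows. This is a simplified illustration and not the baseline implementation used in the experiments. The functions insert and probe, the bucket count, and both example URIs are assumptions made for the sketch.

```python
from collections import defaultdict


def insert(buckets, molecule, n_buckets=8):
    """Insert step: place a molecule into a bucket chosen by hashing its URI."""
    buckets[hash(molecule["uri"]) % n_buckets].append(molecule)


def probe(buckets, molecule, n_buckets=8):
    """Probe step: return stored molecules whose URI equals the probing URI."""
    candidates = buckets.get(hash(molecule["uri"]) % n_buckets, [])
    return [m for m in candidates if m["uri"] == molecule["uri"]]


# Two hypothetical molecules describing the same person under different URIs.
left = {"uri": "http://example.org/graphA/person/1", "name": "Ada Lovelace"}
right = {"uri": "http://example.org/graphB/Q1", "label": "Ada Lovelace"}

buckets = defaultdict(list)
insert(buckets, left)

# Probing with the other graph's URI finds nothing, even if both URIs
# happen to hash into the same bucket, because equality is checked on the URI.
matches_other_uri = probe(buckets, right)
# Probing with the identical URI is the only case that succeeds.
matches_same_uri = probe(buckets, left)
```

Because both the bucket choice and the final comparison depend only on the URI string, no amount of shared content between the two molecules can make them join.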
Similarity join algorithms are able to match syntactically different entities
and address the heterogeneity issue. String similarity join techniques reported
in [3, 5, 12] rely on various metrics to compute a distance between two strings.
Set similarity joins [6, 8] identify matches between sets. String and set similarity
techniques are, however, inefficient when applied to RDF data as they do not
consider the graph nature of semantic data. There exist graph similarity joins [9,
13] which traverse graph data in order to identify similar nodes. However,
those operators do not consider the semantics encoded in knowledge graphs
and are tailored to specific similarity functions.
In contrast, SJoin, presented in this paper, is a semantic similarity operator
that fully leverages RDF and OWL semantics encoded in the RDF graphs. Moreover,
SJoin can operate either in a blocking manner, i.e., 1-1 perfect matching,
or in a non-blocking manner, i.e., incremental N-M matching, allowing for on-demand and
ad-hoc semantic data integration pipelines. Additionally, SJoin is flexible and is
able to employ various similarity functions and metrics, e.g., from simple Jaccard
similarity to complex NED [14] or GADES [10] measures, achieving best
performance with semantic similarity functions.
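The two processing modes named above can be contrasted in a small sketch. It is an illustration under stated assumptions, not the SJoin algorithms themselves. The functions blocking_match and non_blocking_match are hypothetical, Jaccard stands in for any pluggable similarity function, and the 1-1 matching is found by exhaustive search, which is feasible only for tiny inputs.

```python
from itertools import permutations


def jaccard(a, b):
    """Stand-in similarity function; any measure with values in [0, 1] would do."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def blocking_match(left, right, sim, threshold):
    """1-1 matching that maximizes total similarity, found by exhaustive search.

    Needs both inputs in full before producing output.
    Assumes len(left) <= len(right).
    """
    best_score, best_perm = -1.0, ()
    for perm in permutations(range(len(right)), len(left)):
        total = sum(sim(left[i], right[j]) for i, j in enumerate(perm))
        if total > best_score:
            best_score, best_perm = total, perm
    return [(i, j) for i, j in enumerate(best_perm)
            if sim(left[i], right[j]) >= threshold]


def non_blocking_match(stream, sim, threshold):
    """Incremental N-M matching: pairs are emitted as soon as an item arrives.

    `stream` yields (side, item) tuples with side being "L" or "R".
    """
    seen = {"L": [], "R": []}
    for side, item in stream:
        other = "R" if side == "L" else "L"
        for candidate in seen[other]:
            if sim(item, candidate) >= threshold:
                yield (item, candidate) if side == "L" else (candidate, item)
        seen[side].append(item)


# Hypothetical molecules reduced to sets of abstract features.
left = [frozenset("abc"), frozenset("xy")]
right = [frozenset("xyz"), frozenset("ab"), frozenset("ad")]

one_to_one = blocking_match(left, right, jaccard, 0.5)

stream = [("L", left[0]), ("R", right[0]), ("R", right[1]),
          ("L", left[1]), ("R", right[2])]
incremental = list(non_blocking_match(stream, jaccard, 0.2))
```

In the blocking call each left item receives at most one partner, whereas the incremental generator may report the same left item with several right items, here left[0] with both right[1] and right[2].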
6 Conclusions and Future Work
We presented SJoin, an operator for detecting semantically equivalent RDF
molecules from RDF graphs. SJoin implements two operators: Blocking and
Non-Blocking, which rely on similarity measures and ontologies to effectively
detect equivalent entities in heterogeneous RDF graphs. Moreover, the time
complexity of the SJoin operators depends on the time complexity of the similarity
measure, i.e., SJoin does not introduce additional overhead. The behavior
of SJoin was empirically studied on the real-world DBpedia and Wikidata RDF
graphs using the Jaccard and GADES similarity measures. The observed results suggest
that SJoin is able to identify and merge semantically equivalent entities, and
is empowered by the semantics encoded in ontologies and exploited by similarity
measures. As future work, we plan to define new SJoin operators to compute
on-demand integration of RDF graphs and address streams of RDF data.
Acknowledgments
Mikhail Galkin is supported by the project Open Budgets (GA 645833). This
work is also funded in part by the European Union under the Horizon 2020
Framework Program for the project BigDataEurope (GA 644564), and the German
Ministry of Education and Research with grant no. 13N13627 (LiDaKra).
References
1. D. Collarana, M. Galkin, C. Lange, I. Grangel-González, M. Vidal, and S. Auer.
Fuhsen: A federated hybrid search engine for building a knowledge graph on-
demand (short paper). In ODBASE, pages 752–761, 2016.
2. A. Deshpande, Z. G. Ives, and V. Raman. Adaptive query processing. Foundations
and Trends in Databases, 1(1):1–140, 2007.
3. J. Feng, J. Wang, and G. Li. Trie-join: a trie-based method for efficient string
similarity joins. VLDB J., 21(4):437–461, 2012.
4. J. D. Fernández, A. Llaves, and Ó. Corcho. Efficient RDF interchange (ERI) format
for RDF data streams. In ISWC, pages 244–259, 2014.
5. G. Li, D. Deng, J. Wang, and J. Feng. PASS-JOIN: A partition-based method for
similarity joins. PVLDB, 5(3):253–264, 2011.
6. W. Mann, N. Augsten, and P. Bouros. An empirical evaluation of set similarity
join techniques. PVLDB, 9(9):636–647, 2016.
7. J. Munkres. Algorithms for the assignment and transportation problems. Journal
of the Society for Industrial and Applied Mathematics, 5(1):32–38, 1957.
8. L. A. Ribeiro, A. Cuzzocrea, K. A. A. Bezerra, and B. H. B. do Nascimento.
Incorporating clustering into set similarity join algorithms: The sjclust framework.
In DEXA 2016, Porto, Portugal, pages 185–204, 2016.
9. Z. Shang, Y. Liu, G. Li, and J. Feng. K-join: Knowledge-aware similarity join.
IEEE Trans. Knowl. Data Eng., 28(12):3293–3308, 2016.
10. I. Traverso, M.-E. Vidal, B. Kämpgen, and Y. Sure-Vetter. Gades: A graph-based
semantic similarity measure. In SEMANTiCS, pages 101–104. ACM, 2016.
11. T. Urhan and M. J. Franklin. Xjoin: A reactively-scheduled pipelined join operator.
IEEE Data Eng. Bull., 23(2):27–33, 2000.
12. S. Wandelt, D. Deng, S. Gerdjikov, S. Mishra, P. Mitankin, M. Patil, E. Siragusa,
A. Tiskin, W. Wang, J. Wang, and U. Leser. State-of-the-art in string similarity
search and join. SIGMOD Record, 43(1):64–76, 2014.
13. Y. Wang, H. Wang, J. Li, and H. Gao. Efficient graph similarity join for information
integration on graphs. Frontiers of Computer Science, 10(2):317–329, 2016.
14. H. Zhu, X. Meng, and G. Kollios. NED: an inter-graph node metric based on edit
distance. PVLDB, 10(6):697–708, 2017.