SJoin: A Semantic Join Operator to Integrate
Heterogeneous RDF Graphs
Mikhail Galkin¹,²,⁵, Diego Collarana¹,², Ignacio Traverso-Ribón³,
Maria-Esther Vidal²,⁴, Sören Auer¹,²
¹ Enterprise Information Systems (EIS), University of Bonn
{galkin|collaran|vidal|auer}@cs.uni-bonn.de
² Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS)
³ FZI Research Center for Information Technology, Germany
traverso@fzi.de
⁴ Universidad Simón Bolívar, Venezuela
⁵ ITMO University, Saint Petersburg, Russia
Abstract. Semi-structured data models like the Resource Description Framework (RDF) naturally allow for modeling the same real-world entity in various ways. For example, different RDF vocabularies enable the definition of various RDF graphs representing the same drug in Bio2RDF or Drugbank. Albeit semantically equivalent, these RDF graphs may be syntactically different, i.e., they have different graph structures or entity identifiers and properties. Existing data-driven integration approaches only consider syntactic matching criteria or similarity measures to solve the problem of integrating RDF graphs. However, syntactic-based approaches are unable to semantically integrate heterogeneous RDF graphs. We devise SJoin, a semantic similarity join operator to solve the problem of matching semantically equivalent RDF graphs, i.e., syntactically different graphs corresponding to the same real-world entity. Two physical implementations are proposed for SJoin, which follow blocking or non-blocking data processing strategies, i.e., RDF graphs can be merged in a batch or incrementally. We empirically evaluate the effectiveness and efficiency of the SJoin physical operators with respect to baseline similarity join algorithms. Experimental results suggest that SJoin outperforms baseline approaches, i.e., non-blocking SJoin incrementally produces results faster, while the blocking SJoin accurately matches all semantically equivalent RDF graphs.
1 Introduction
The support that Open Data and Semantic Web initiatives have received from society has resulted in the publication of a large number of publicly available datasets, e.g., United Nations Data⁶ or the Linked Open Data cloud⁷ allow for accessing billions of records. In the context of the Semantic Web, the Resource Description Framework (RDF) is utilized for semantically enriching data with vocabularies or ontologies. Albeit expressive, the RDF data model allows (e.g., due to the non-unique names assumption) multiple representations of a real-world entity using different vocabularies.

⁶ http://data.un.org/
⁷ http://stats.lod2.eu/

[Figure 1: RDF graphs describing Ibuprofen and Paracetamol in Drugbank (drugbank:DB01050, drugbank:DB00316) and in DBpedia (dbr:Ibuprofen, dbr:Paracetamol, dbr:Acetaminophen), with their label, CAS number, ATC code, and IUPAC name properties.]
Fig. 1: Motivating Example. The Ibuprofen and Paracetamol real-world entities are modeled in different ways by Drugbank and DBpedia. Syntactically the properties and objects are different, but semantically they represent the same drugs. Drug drugbank:DB01050 matches 1-1 with dbr:Ibuprofen, while drugbank:DB00316 matches 1-2 with dbr:Paracetamol and dbr:Acetaminophen.
To illustrate this, consider chemicals and drugs represented in the Drugbank and DBpedia knowledge graphs. Using different vocabularies, drugs are represented from different perspectives. DBpedia contains more general information, whereas Drugbank provides more domain-specific facts, e.g., the chemical composition and properties, pharmacology, and interactions with other drugs. Fig. 1 illustrates representations of two drugs in Drugbank and DBpedia: Ibuprofen, a drug for treating pain, inflammation and fever, and Paracetamol, a drug with analgesic and antipyretic effects. Firstly, Drugbank Uniform Resource Identifiers (URIs) are textual IDs, e.g., drugbank:DB00316⁸ corresponds to Acetaminophen and drugbank:DB01050 to Ibuprofen. In contrast, DBpedia utilizes human-readable URIs (e.g., dbr:Acetaminophen and dbr:Ibuprofen) to identify drugs. Secondly, the same attributes are encoded differently with various property URIs, e.g., chemicalIupacName and casRegistryNumber in Drugbank, and iupacName and casNumber in DBpedia, respectively. Thirdly, some drugs might be linked to more than one analogue, e.g., Acetaminophen in Drugbank (drugbank:DB00316) corresponds to two DBpedia resources: dbr:Paracetamol and dbr:Acetaminophen.

⁸ Prefixes are as specified on http://prefix.cc/

Traditional join operators, e.g., Hash Join [2] or XJoin [11], are not capable of joining those resources, as neither URIs nor properties match syntactically. Similarity join operators [3, 5, 6, 8, 12] tackle this heterogeneity issue, but given this degree of syntactic difference, string and set similarity techniques are limited in deciding whether two RDF resources should be joined or not. Therefore, we identify the need for a semantic similarity join operator able to satisfy the following requirements: R1) Applicable to heterogeneous RDF knowledge graphs. R2) Able to identify joinable tuples leveraging semantic relatedness between RDF graphs. R3) Capable of performing perfect matching for one-to-one integration, and fuzzy conditional matching for integrating groups of N entities from one graph with M entities from another knowledge graph. R4) Support of a blocking operation mode for batch processing, and a non-blocking mode for on-demand real-time cases whenever results are expected incrementally.
We present SJoin, a semantic join operator that meets these requirements. The contributions of this article include: 1) Definition and description of SJoin, a semantic join operator for integrating heterogeneous RDF graphs. 2) Algorithms and complexity study of a blocking SJoin for 1-1 integration and a non-blocking SJoin for the N-M similarity case. 3) An extensive evaluation that demonstrates the benefits of SJoin in terms of efficiency, effectiveness and completeness over time in various heterogeneity conditions and confidence levels.
The article is organized as follows: The problem addressed in this work is defined in Section 2. Section 3 presents the SJoin operator, as well as the blocking and non-blocking physical implementations, as solutions for detecting semantically equivalent entities in RDF knowledge graphs. Results from our experimental study are reported in Section 4. Related work on traditional binary joins and similarity joins is analyzed in Section 5. Finally, we sum up the lessons learned and outline future research directions in Section 6.
2 Problem Statement
In this work, we tackle the problem of identifying semantically equivalent RDF molecules from RDF graphs. Given an RDF graph $G$, we call a subgraph $M$ of $G$ an RDF molecule [4] iff the RDF triples of $M = \{t_1, \dots, t_n\}$ share the same subject, i.e., $\forall i, j \in \{1, \dots, n\}\ (subject(t_i) = subject(t_j))$. An RDF molecule can be represented as a pair $M = (R, T)$, where $R$ corresponds to the URI (or blank node) of the molecule subject, and $T$ is a set of pairs $p = (prop, val)$ such that the triple $(R, prop, val)$ belongs to $M$. We name $R$ and $T$ the head and the tail of the RDF molecule $M$, respectively. For example, an RDF molecule of the drug Paracetamol is (dbr:Paracetamol, {(rdfs:label, "Paracetamol@en"), (dbo:casNumber, "103-90-2"), (dbo:iupacName, "N-(4-hydroxyphenyl)ethanamide")}). An RDF graph $G$ can be described in terms of its RDF molecules as follows:

$$\phi(G) = \{ M = (R, T) \mid t = (R, prop, val) \in G \land (prop, val) \in T \} \qquad (1)$$
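Equation (1) can be read as grouping the triples of a graph by subject. The following is a minimal sketch of that reading, assuming triples are plain Python tuples of strings; the function name `molecules` and the toy graph are illustrative and not part of the original implementation.

```python
from collections import defaultdict


def molecules(graph):
    """Group (subject, prop, val) triples by subject.

    Returns a dict mapping each head R to its tail T, i.e. the set of
    (prop, val) pairs of all triples whose subject is R.
    """
    tails = defaultdict(set)
    for subject, prop, val in graph:
        tails[subject].add((prop, val))
    return dict(tails)


# Toy graph; the values are copied from Fig. 1, plain strings stand in
# for URIs and literals (an assumption made for brevity).
G = [
    ("dbr:Paracetamol", "rdfs:label", "Paracetamol@en"),
    ("dbr:Paracetamol", "dbo:casNumber", "103-90-2"),
    ("dbr:Ibuprofen", "dbo:casNumber", "15687-27-1"),
]

phi_G = molecules(G)
for head, tail in phi_G.items():
    print(head, sorted(tail))
```

Each dictionary entry corresponds to one molecule $M = (R, T)$; the key is the head and the value is the tail.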
Definition 1 (Problem of Semantically Equivalent RDF Graphs). Given sets of RDF molecules $\phi(G)$, $\phi(D)$, and $\phi(F)$, and an RDF molecule $M_e$ in $\phi(F)$ which corresponds to an entity $e$ represented by different RDF molecules $M_G$ and $M_D$ in $\phi(G)$ and $\phi(D)$, respectively. The problem of identifying semantically equivalent entities between sets of RDF molecules $\phi(G)$ and $\phi(D)$ consists of providing a homomorphism $\theta: \phi(G) \cup \phi(D) \rightarrow 2^{\phi(F)}$, such that if two RDF molecules $M_G$ and $M_D$ represent the RDF molecule $M_e$, then $M_e \in \theta(M_G)$ and $M_e \in \theta(M_D)$; otherwise, $\theta(M_G) \neq \theta(M_D)$.
Definition 1 considers perfect 1-1 matching, e.g., determining 1-1 semantic equivalences between drugbank:DB01050 and dbr:Ibuprofen, as well as N-M matching, e.g., drugbank:DB00316 with both dbr:Paracetamol and dbr:Acetaminophen.
3 Proposed Solution: The SJoin Operator
We propose a similarity join operator named SJoin, able to identify joinable entities between RDF graphs, i.e., SJoin implements the homomorphism $\theta(\cdot)$. SJoin is based on the Resource Similarity Molecule (RSM) structure, which, in combination with a similarity function $Sim_f$ and a threshold $\gamma$, produces a list of matching entity pairs. RSM is defined as follows:
Definition 2 (Resource Similarity Molecule (RSM)). Given a set $\mathcal{M}$ of RDF molecules, a similarity function $Sim_f$, and a threshold $\gamma$, a Resource Similarity Molecule is a pair $RSM = (M, \mathcal{T})$, where:
- $M = (R, T)$ is the head of RSM and the RDF molecule described in RSM.
- $\mathcal{T}$ is the tail of RSM and represents an ordered list of RDF molecules $M_i = (R_i, T_i)$. $\mathcal{T}$ meets the following conditions:
  - $M$ is highly similar to $M_i$, i.e., $Sim_f(R, R_i) \geq \gamma$.
  - For all $M_i = (R_i, T_i) \in \mathcal{T}$, $Sim_f(R, R_i) \geq Sim_f(R, R_{i+1})$.
An RSM is composed of a head and a tail that correspond to an RDF molecule and a list of molecules whose similarity score is higher than a specified threshold $\gamma$, respectively. For example, an RSM of Ibuprofen (with omitted tails of property:value pairs) is ((dbr:Ibuprofen, T)[(drugbank:DB01050, T1), (chebi:5855, T2), (wikidata:Q186969, T3)]) given a similarity function $Sim_f$, a threshold $\gamma$, and $Sim_f$(dbr:Ibuprofen, drugbank:DB01050) $\geq$ $Sim_f$(dbr:Ibuprofen, chebi:5855), and $Sim_f$(dbr:Ibuprofen, chebi:5855) $\geq$ $Sim_f$(dbr:Ibuprofen, wikidata:Q186969).
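As a rough illustration of Definition 2, the sketch below builds one RSM from a head molecule and a list of candidates. The class `RSM`, the helper `build_rsm`, and the lookup-table similarity `toy_sim` are hypothetical names introduced here; the scores are invented for the example, apart from the 0.8 that the text reports for dbr:Ibuprofen and drugbank:DB01050, and a real run would call a semantic measure such as GADES instead.

```python
from dataclasses import dataclass, field


@dataclass
class RSM:
    head: tuple                                # molecule (R, T)
    tail: list = field(default_factory=list)   # similar molecules, best first


def build_rsm(head, candidates, sim_f, gamma):
    """Keep candidates with sim_f(R, R_i) >= gamma, sorted by descending score."""
    scored = [(sim_f(head[0], m[0]), m) for m in candidates]
    kept = sorted((sm for sm in scored if sm[0] >= gamma),
                  key=lambda sm: sm[0], reverse=True)
    return RSM(head=head, tail=[m for _, m in kept])


# Hypothetical similarity scores against dbr:Ibuprofen (illustrative only).
SCORES = {
    "drugbank:DB01050": 0.8,
    "chebi:5855": 0.7,
    "wikidata:Q186969": 0.6,
    "drugbank:DB00316": 0.1,
}


def toy_sim(r, r_i):
    return SCORES.get(r_i, 0.0)


# Tails of property:value pairs are omitted (empty frozensets), as in the text.
candidates = [(uri, frozenset()) for uri in
              ["wikidata:Q186969", "drugbank:DB00316",
               "drugbank:DB01050", "chebi:5855"]]
rsm = build_rsm(("dbr:Ibuprofen", frozenset()), candidates, toy_sim, gamma=0.5)
tail_uris = [m[0] for m in rsm.tail]
print(tail_uris)
```

With these assumed scores and $\gamma = 0.5$, the tail reproduces the ordering of the Ibuprofen example, and drugbank:DB00316 is filtered out by the threshold.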
The SJoin operator is a two-fold algorithm that performs, first, Similarity Partitioning and, second, Similarity Probing to identify semantically equivalent RDF molecules. To address batch and real-time processing scenarios, we present two implementations of SJoin. The Blocking SJoin Operator solves the 1-1 weighted perfect matching problem, allowing for batch processing of the graphs. The Non-Blocking SJoin Operator employs fuzzy conditional matching for identifying communities of N-M entities in graphs, covering the on-demand case whenever results are expected to be produced incrementally.
3.1 Blocking SJoin Operator
Fig. 2 illustrates the intuition behind the blocking SJoin operator. The Similarity Partitioning and Probing steps are executed sequentially. Thus, the blocking SJoin operator completely evaluates both datasets of RDF molecules in the Partitioning step, and then fires the Probing step to produce the whole output.

[Figure 2: molecules of Dataset A and Dataset B are inserted into lists of RSMs and compared through $Sim_f$ and $\gamma$ (Similarity Partitioning), then paired by 1-1 perfect matching (Similarity Probing).]
Fig. 2: SJoin Blocking Operator. The Similarity Partitioning step initializes lists of RSMs and populates their tails through a similarity function $Sim_f$ and a threshold $\gamma$. The Similarity Probing step performs 1-1 weighted perfect matching and outputs the perfect pairs of semantically equivalent molecules $(M_{iA}, M_{jB})$.

Algorithm 1: Similarity Partitioning step for Blocking SJoin operator according to similarity function $Sim_f$ and threshold $\gamma$
Data: Dataset φ(D_A), Sim_f, γ
Result: List of RSM_A, List of RSM_B
    while getMolecule(φ(D_A)) do
        M_iA ← getMolecule(φ(D_A));
        R_iA ← head(M_iA);   // Get URI
        for RSM_jB ∈ List of RSM_B do
            RSM_jB = ((R_jB, T_jB)[(R_lA, T_lA), ..., (R_kA, T_kA)]);
            R_jB ← head(head(RSM_jB));   // Get URI
            if Sim_f(R_jB, R_iA) ≥ γ then   // Probe
                tail(RSM_jB) ← tail(RSM_jB) + (M_iA);
    return sort(List of RSM_A), sort(List of RSM_B)

The Similarity Partitioning step is described in Algorithm 1. The operator initializes two lists of RSMs for two RDF graphs, and incoming RDF molecules are inserted into the respective list with a filled head $M$ and an empty tail $\mathcal{T}$. To populate the tail of an RSM in list A, SJoin resorts to a semantic similarity function for computing a similarity score between the RSM and all RSMs in the opposite list B. If the similarity score exceeds a certain threshold $\gamma$, then the molecule from list B is appended to the tail of the RSM. Finally, the tail is sorted in descending order of similarity score such that the most similar RDF molecule obtains the top position in the tail. For instance, the semantic similarity function GADES [10] is able to decide relatedness between the RDF molecules of dbr:Ibuprofen and drugbank:DB01050 in Fig. 1, and assigns a similarity score of 0.8. The algorithm supports datasets with arbitrary numbers of molecules. However, in order to guarantee 1-1 perfect matching, we place a restriction $card(\phi(D_A)) = card(\phi(D_B))$, i.e., the number of molecules in $\phi(D_A)$ and $\phi(D_B)$ must be the same. Thus, card(List of $RSM_A$) = card(List of $RSM_B$).
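The partitioning step can be sketched in a few lines of Python. This is a simplified batch version, assuming both datasets are already fully available as lists of (URI, tail) tuples, whereas Algorithm 1 inserts molecules one at a time; the function name `similarity_partitioning`, the toy URIs, and the score table are illustrative assumptions.

```python
def similarity_partitioning(molecules_a, molecules_b, sim_f, gamma):
    """Build two lists of RSMs, each RSM being a (head, tail) pair.

    The tail holds the molecules of the opposite dataset whose similarity
    to the head reaches gamma, sorted by descending similarity score.
    """
    def partition(own, other):
        rsms = []
        for head in own:
            scored = [(sim_f(head[0], m[0]), m) for m in other]
            scored = [sm for sm in scored if sm[0] >= gamma]
            scored.sort(key=lambda sm: sm[0], reverse=True)
            rsms.append((head, [m for _, m in scored]))
        return rsms

    return partition(molecules_a, molecules_b), partition(molecules_b, molecules_a)


# Hypothetical symmetric similarity scores between toy URIs.
SCORES = {
    frozenset(("a1", "b1")): 0.9,
    frozenset(("a1", "b2")): 0.4,
    frozenset(("a2", "b1")): 0.5,
    frozenset(("a2", "b2")): 0.7,
}


def toy_sim(x, y):
    return SCORES.get(frozenset((x, y)), 0.0)


A = [("a1", frozenset()), ("a2", frozenset())]
B = [("b1", frozenset()), ("b2", frozenset())]
rsm_a, rsm_b = similarity_partitioning(A, B, toy_sim, gamma=0.45)

tails_a = [[m[0] for m in tail] for _, tail in rsm_a]
tails_b = [[m[0] for m in tail] for _, tail in rsm_b]
print(tails_a, tails_b)
```

Every pair of molecules is compared once per direction, which matches the quadratic number of similarity evaluations discussed in the complexity analysis.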
A 1-1 weighted perfect matching is applied at the Similarity Probing stage in the Blocking SJoin operator. It accepts the lists of $RSM_A$ and $RSM_B$ created and populated during the previous Similarity Partitioning step. This step aims at producing perfect pairs of semantically equivalent RDF molecules $(M_{iA}, M_{jB})$, i.e., $\max(Sim_f(M_{iA}, RSM_B)) = \max(Sim_f(M_{jB}, RSM_A)) = Sim_f(M_{iA}, M_{jB})$. That is, for a given molecule $M_{iA}$, there is no molecule in the List of $RSM_B$ that has a similarity score higher than $Sim_f(M_{iA}, M_{jB})$, and vice versa. Algorithm 2 describes how perfect pairs are created; Fig. 3 illustrates the algorithm.
Traversing the List of $RSM_A$, the algorithm iterates over each $RSM_{iA}$. Then, the tail of $RSM_{iA}$, i.e., an ordered list of highly similar molecules, is extracted. The first molecule of the tail, $RSM_{jB}$, corresponds to the most similar molecule from the List of $RSM_B$. The algorithm searches for $RSM_{jB}$ in the List of $RSM_B$ and examines whether the molecule $(R_{iA}, T_{iA})$ is the first one in the tail of $RSM_{jB}$. If this condition holds and $(R_{iA}, T_{iA})$ is not already matched with another RSM, then the pair $((R_{iA}, T_{iA}), (R_{jB}, T_{jB}))$ is identified as a perfect pair and is appended to the result list of pairs $LP$ (cf. Fig. 3a). Otherwise, the algorithm finds the first occurrence of $(R_{iA}, T_{iA})$ in the tail of $RSM_{jB}$ and appends the result pair to $LP$. When all RSMs are matched, the algorithm yields the list of perfectly matched pairs (cf. Fig. 3b).

[Figure 3: (a) 1-1 matching from the bipartite graph of RSMs, built from the List of $RSM_A$ and the List of $RSM_B$; (b) matched pairs (n pairs).]
Fig. 3: 1-1 Weighted Perfect Matching. (a) The matching is identified from the lists of $RSM_A$ and $RSM_B$; RDF molecules $M_{iA} = (R_{iA}, T_{iA})$ and $M_{jB} = (R_{jB}, T_{jB})$ are semantically equivalent whenever $R_{iA}$ and $R_{jB}$ are reciprocally the most similar RDF molecules according to $Sim_f$.

Algorithm 2: 1-1 Weighted Perfect Matching of RSMs bipartite graph
Data: List of RSM_A, List of RSM_B
Result: List of pairs LP = ((R_iA, T_iA), (R_jB, T_jB))
    for RSM_iA ∈ List of RSM_A do
        RSM_iA = ((R_iA, T_iA)[(R_jB, T_jB), ..., (R_kB, T_kB)]);   // Ordered Set
        for (R_jB, T_jB) ∈ tail(RSM_iA) do
            RSM_jB ← Find in the List of RSM_B;
            RSM_jB = ((R_jB, T_jB)[(R_lA, T_lA), ..., (R_zA, T_zA)]);   // Ordered Set
            if (R_lA, T_lA) = (R_iA, T_iA) and (R_iA, T_iA) ∉ LP then
                LP ← LP + ((R_iA, T_iA), (R_jB, T_jB));   // Add to result
            else
                for (R_lA, T_lA) ∈ tail(RSM_jB) do
                    find the position of (R_iA, T_iA);
    return LP
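The reciprocal-best-match idea can be sketched as follows. This is a simplified greedy variant written for illustration, not a transcription of Algorithm 2 and not the Hungarian algorithm: it first pairs molecules that are each other's top candidate, then lets unmatched molecules fall back to their best still-free candidate. The function name `perfect_matching` and the toy RSM lists are assumptions.

```python
def perfect_matching(rsm_a, rsm_b):
    """Greedy 1-1 matching over two lists of (head, tail) RSMs.

    Pass 1 pairs heads that are reciprocally the top entry of each other's
    tail. Pass 2 assigns each remaining head of list A to its best candidate
    that is still free and that also lists the head in its own tail.
    """
    tails_b = {head[0]: [m[0] for m in tail] for head, tail in rsm_b}
    heads_b = {head[0]: head for head, _ in rsm_b}
    pairs, used_a, used_b = [], set(), set()

    def match(head_a, uri_b):
        pairs.append((head_a, heads_b[uri_b]))
        used_a.add(head_a[0])
        used_b.add(uri_b)

    for head_a, tail_a in rsm_a:                      # pass 1: reciprocal best
        if tail_a:
            best_b = tail_a[0][0]
            ranked = tails_b.get(best_b, [])
            if best_b not in used_b and ranked and ranked[0] == head_a[0]:
                match(head_a, best_b)

    for head_a, tail_a in rsm_a:                      # pass 2: fallback
        if head_a[0] in used_a:
            continue
        for cand in tail_a:
            if cand[0] not in used_b and head_a[0] in tails_b.get(cand[0], []):
                match(head_a, cand[0])
                break
    return pairs


def mol(uri):
    return (uri, frozenset())   # tails of property:value pairs omitted


# Toy RSM lists: a1 and a2 both rank b1 first, but b1 prefers a1.
rsm_a = [(mol("a1"), [mol("b1")]), (mol("a2"), [mol("b1"), mol("b2")])]
rsm_b = [(mol("b1"), [mol("a1"), mol("a2")]), (mol("b2"), [mol("a2")])]

matched = [(a[0], b[0]) for a, b in perfect_matching(rsm_a, rsm_b)]
print(matched)
```

A greedy pass like this does not guarantee a maximum-weight matching in general; an optimal assignment would require a method such as the Hungarian algorithm [7].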
[Figure 4: two snapshots of Dataset A and Dataset B being processed by the combined Similarity Partitioning / Similarity Probing step (insert, probe, $Sim_f$ and $\gamma$). (a) Molecule $(R_{1A}, T_{1A})$ yields a pair $((R_{1A}, T_{1A}), (R_{2B}, T_{2B}))$. (b) Molecule $(R_{3B}, T_{3B})$ yields a pair $((R_{3B}, T_{3B}), (R_{2A}, T_{2A}))$.]
Fig. 4: SJoin Non-Blocking Operator. Identifies N-M matchings and produces results as soon as a new molecule arrives. When a molecule $(R_{iA}, T_{iA})$ arrives, it is inserted into the relevant list and probed against the other list. If the similarity score exceeds the threshold $\gamma$, a new matching is produced.
3.2 Non-Blocking SJoin Operator
The Non-Blocking SJoin operator aims at identifying N-M matchings, i.e., an $RSM_{iA}$ might be associated with multiple RSMs, e.g., $RSM_{jB}$ or $RSM_{kB}$. Therefore, 1-1 weighted perfect matching is not executed, which enables the operator to produce results as soon as new molecules arrive, i.e., in a non-blocking, on-demand manner. The operator receives two sets of RDF molecules $\phi(D_A)$ and $\phi(D_B)$. The Lists of $RSM_A$ and $RSM_B$ are initialized as empty lists. Algorithm 3 describes the join procedure and Fig. 4 illustrates the algorithm.
For every incoming molecule $M_{iA}$ from $\phi(D_A)$, Algorithm 3 performs the same two steps: Similarity Partitioning and Similarity Probing. The URI $R_{iA}$ of an RDF molecule extracted from the tuple $(R_{iA}, T_{iA})$ is probed against the URIs of all existing RSMs in the List of $RSM_B$ (cf. Fig. 4). If the similarity score $Sim_f(R_{iA}, R_{jB})$ exceeds the threshold $\gamma$, then the pair $((R_{iA}, T_{iA}), (R_{jB}, T_{jB}))$ is considered a matching and appended to the results list $LP$. During the Similarity Insert step, an $RSM_{iA}$ is initialized, the molecule $(R_{iA}, T_{iA})$ becomes its head, and it is eventually added to the respective List of $RSM_A$. Algorithm 3 is applied to both $\phi(D_A)$ and $\phi(D_B)$ and is able to produce results while constantly updating the Lists of RSMs, which supports the non-blocking operation workflow.
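The probe-then-insert loop lends itself to a generator that emits a pair as soon as it is found. The sketch below assumes a single interleaved stream of (source, molecule) items and keeps plain lists of seen molecules instead of RSMs with empty tails; `non_blocking_sjoin`, `toy_sim`, and the scores are hypothetical (only the 0.8 for Ibuprofen is taken from the text).

```python
def non_blocking_sjoin(stream, sim_f, gamma):
    """Yield (molecule_A, molecule_B) pairs incrementally.

    `stream` yields (source, molecule) items with source "A" or "B".
    Each arriving molecule is probed against everything seen so far from
    the opposite source, then inserted into the list of its own source.
    """
    seen = {"A": [], "B": []}
    for source, molecule in stream:
        other = "B" if source == "A" else "A"
        for candidate in seen[other]:                      # probe
            if sim_f(molecule[0], candidate[0]) >= gamma:
                if source == "A":
                    yield (molecule, candidate)
                else:
                    yield (candidate, molecule)
        seen[source].append(molecule)                      # insert


# Hypothetical symmetric scores for the drugs of Fig. 1.
SCORES = {
    frozenset(("drugbank:DB00316", "dbr:Paracetamol")): 0.9,
    frozenset(("drugbank:DB00316", "dbr:Acetaminophen")): 0.9,
    frozenset(("drugbank:DB01050", "dbr:Ibuprofen")): 0.8,
}


def toy_sim(x, y):
    return SCORES.get(frozenset((x, y)), 0.1)


stream = [
    ("A", ("drugbank:DB00316", frozenset())),
    ("B", ("dbr:Paracetamol", frozenset())),
    ("B", ("dbr:Acetaminophen", frozenset())),
    ("B", ("dbr:Ibuprofen", frozenset())),
    ("A", ("drugbank:DB01050", frozenset())),
]

results = [(a[0], b[0]) for a, b in non_blocking_sjoin(stream, toy_sim, gamma=0.5)]
print(results)
```

In this toy run drugbank:DB00316 is paired with both dbr:Paracetamol and dbr:Acetaminophen, which is the N-M behaviour the operator is meant to support, and each pair is emitted before the rest of the stream has been read.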
3.3 Time Complexity Analysis
The SJoin binary operator receives two RDF graphs of $n$ RDF molecules each. To estimate the complexity of the blocking SJoin operator, the three most expensive operations have to be analyzed. Table 1 gives an overview of the analysis. The complexity of the Data Partitioner module depends on Algorithm 1, i.e., the construction of the Lists of $RSM_A$ and $RSM_B$ and a similarity function $Sim_f$. The asymptotic approximation equals $O(n^2 \cdot O(Sim_f))$. To produce ordered tails of RSMs, the similar molecules in the tail have to be sorted in descending order of similarity score. The applicable merge sort and heapsort algorithms have $O(n \log n)$ asymptotic complexity. The 1-1 Weighted Perfect Matching component has $O(n^3)$ complexity in the worst case according to Algorithm 2. The Hungarian algorithm [7], a standard approach for 1-1 weighted perfect matching, has the same $O(n^3)$ complexity. Partitioning, sorting, and perfect matching are executed sequentially. Therefore, the overall complexity is the sum of the complexities, i.e., $O(n^2 \cdot O(Sim_f)) + O(n \log n) + O(n^3)$, which equals $O(n^2 \cdot O(Sim_f)) + O(n^3)$. We thus deduce that the SJoin complexity depends on the complexity of the chosen similarity measure, whereas the lowest achievable order of complexity is limited to $O(n^3)$.

Table 1: The SJoin Time Complexity. Results for the steps of Partitioning, Sorting, and Matching, where $n$ is the number of RDF molecules.

Stage          Blocking SJoin Complexity       Non-Blocking SJoin Complexity
Partitioning   O(n^2 · O(Sim_f))               O(n^2 · O(Sim_f))
Sorting        O(n log n)                      -
Matching       O(n^3)                          -
Overall        O(n^2 · O(Sim_f)) + O(n^3)      O(n^2 · O(Sim_f))

Algorithm 3: The Non-Blocking SJoin operator executes both Similarity Partitioning and Probing steps as soon as an RDF molecule arrives from an RDF graph.
Data: Dataset φ(D_A), Sim_f, γ
Result: List of pairs LP = ((R_iA, T_iA), (R_jB, T_jB))
    while getMolecule(φ(D_A)) do
        M_iA ← getMolecule(φ(D_A));
        R_iA ← head(M_iA), T_iA ← tail(M_iA);   // Get URI, tail
        for RSM_jB ∈ List of RSM_B do
            RSM_jB = ((R_jB, T_jB)[]);
            R_jB ← head(head(RSM_jB));   // Get URI
            T_jB ← tail(head(RSM_jB));   // Get tail
            if Sim_f(R_iA, R_jB) ≥ γ then   // Probe
                LP ← LP + ((R_iA, T_iA), (R_jB, T_jB));
        head(RSM_iA) ← M_iA, tail(RSM_iA) ← [];
        List of RSM_A ← List of RSM_A + RSM_iA;   // Insert
    return LP

The complexity of the non-blocking SJoin operator stems from the analysis of Algorithm 3. The most expensive step of the algorithm is to compute a similarity score between an $RSM_{iA}$ and the RSMs in the List of $RSM_B$. Applied to both $\phi(D_A)$ and $\phi(D_B)$, the complexity converges to $O(n^2 \cdot O(Sim_f))$.
4 Empirical Study
An empirical evaluation is conducted to study the efficiency and effectiveness of SJoin in blocking and non-blocking conditions on RDF graphs from DBpedia and Wikidata. We assess the following research questions: RQ1) Does blocking SJoin integrate RDF graphs more efficiently and effectively compared to the state of the art? RQ2) What is the impact of threshold values on the completeness of a non-blocking SJoin? RQ3) What is the effect of a similarity function on the SJoin results? The experimental configuration is as follows:

Table 2: Benchmark Description. RDF datasets used in the evaluation.

            Experiment 1: People        Experiment 2: People
            DBpedia D1   DBpedia D2     DBpedia   Wikidata   DBpedia   Wikidata
Molecules   500          500            500       500        1000      1000
Triples     17,951       17,894         29,263    16,307     54,590    29,138
Benchmark: Experiment 1 is executed against a dataset of 500 molecules⁹ of type Person extracted from the live version of DBpedia (February 2017). Based on the original molecules, we created two sets of molecules by randomly deleting or editing triples in the two sets. Sharing the same DBpedia vocabulary, Experiment 1 datasets have a higher resemblance degree compared to Experiment 2. Experiment 2 employs subsets of DBpedia and Wikidata of the Person class. To assess SJoin in higher heterogeneity settings, we sampled datasets of 500 and 1000 molecules, varying the triple count from 16K up to 55K¹⁰. Table 2 provides basic statistics on the experimental datasets. DBpedia D1 and D2 refer to the dumps of 500 molecules. Further, the dumps of 500 and 1000 molecules for Experiment 2 are extracted from DBpedia and Wikidata.
Baseline: Gold standards for the comparison of blocking operators include the original DBpedia Person descriptions (Experiment 1) and owl:sameAs links between DBpedia and Wikidata (Experiment 2). We compare SJoin with a Hash Join operator. For a fair comparison, the Hash Join was extended to support similarity functions at the Probing stage. That is, blocking SJoin is compared against a blocking similarity Hash Join, and non-blocking SJoin is evaluated against a non-blocking Symmetric Hash Join. The gold standard for evaluating non-blocking operators consists of the precomputed numbers of pairs whose similarity score exceeds a predefined threshold; gold standards are computed offline.
⁹ https://github.com/RDF-Molecules/Test-DataSets/tree/master/DBpedia-People/20160819
¹⁰ https://github.com/RDF-Molecules/Test-DataSets/tree/master/DBpedia-WikiData/operators_evaluation

[Figure 5: two bar charts, (a) SJoin performance and (b) Hash Join performance. X-axis: Threshold (0.1 to 1.0); left y-axis: ET, sec (0 to 800), split into partitioning and probing time; right y-axis: F1 score (0 to 1).]
Fig. 5: Experiment 1 (GADES) with blocking operators. The partitioning bar shows the time taken to partition the molecules into RSMs; probing indicates the time required for 1-1 weighted perfect matching. The black line chart on the right axis denotes the F1 score. (a) SJoin demonstrates a higher F1 score while consuming more time for perfect matching. (b) Baseline Hash Join demonstrates an F1 score below 0.25 even at lower thresholds while spending less time on probing.

Metrics: We report on execution time (ET in secs) as the elapsed time required by the SJoin operator to produce all the answers. Furthermore, we measure Precision and Recall, and report the F1-measure during the experiments with blocking operators. Precision is the fraction of RDF molecules that have been identified and integrated ($M$) that intersects with the Gold Standard ($GS$), i.e., $Precision = \frac{|M \cap GS|}{|M|}$. Recall corresponds to the fraction of the identified similar molecules in the Gold Standard, i.e., $Recall = \frac{|M \cap GS|}{|GS|}$. Comparing non-blocking operators, we measure Completeness over time, i.e., the fraction of results produced at a certain time stamp. The timeout is set to one hour (3,600 seconds), and the operators' results are checked every second. Ten thresholds in the range [0.1 : 1.0] with step 0.1 were applied in Experiment 1. In Experiment 2, five thresholds in the range [0.1 : 0.5] were evaluated because no pair of entities in the sampled RDF datasets has a GADES similarity score higher than 0.5.
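The metrics above are straightforward to compute over sets of matched pairs. The sketch below is an illustration under the assumption that results and gold standard are sets of (URI, URI) pairs; F1 is taken to be the usual harmonic mean of Precision and Recall, and the function names and toy pairs are not from the original implementation.

```python
def precision_recall_f1(matched, gold):
    """Precision = |M ∩ GS| / |M|, Recall = |M ∩ GS| / |GS|, F1 = harmonic mean."""
    matched, gold = set(matched), set(gold)
    hits = len(matched & gold)
    precision = hits / len(matched) if matched else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def completeness(produced_so_far, expected_total):
    """Percentage of the expected results produced up to a given time stamp."""
    return 100.0 * produced_so_far / expected_total


# Toy data: 4 pairs reported, 3 of them correct, 6 pairs in the gold standard.
gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"),
        ("a4", "b4"), ("a5", "b5"), ("a6", "b6")}
matched = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b9")}

p, r, f1 = precision_recall_f1(matched, gold)
print(p, r, round(f1, 3), completeness(35, 100))
```

For the non-blocking operators, `completeness` would be sampled once per second against the precomputed number of expected pairs for the chosen threshold.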
Implementation: Both blocking and non-blocking SJoin operators are implemented in Python 2.7.10¹¹. Baseline improved Hash Joins are implemented in Python as well¹². The experiments were executed on an Ubuntu 16.04 (64 bits) Dell PowerEdge R805 server, AMD Opteron 2.4GHz CPU, 64 cores, 256GB RAM. We evaluated two similarity functions: GADES [10] and Semantic Jaccard (SemJaccard) [1]. GADES relies on semantic descriptions encoded in ontologies to determine relatedness, while SemJaccard requires the materialization of implicit knowledge and mappings. When evaluating the schema heterogeneity of DBpedia and Wikidata in Experiment 2, the similarity function is fixed to GADES.

¹¹ https://github.com/RDF-Molecules/operators/tree/master/mFuhsion
¹² https://github.com/RDF-Molecules/operators/tree/master/baseline_ops

4.1 DBpedia – DBpedia People
Experiment 1 evaluates the performance and effectiveness of blocking and non-blocking SJoin compared to the respective Hash Join implementations. The testbed includes two split DBpedia dumps with semantically equivalent entities but non-matching resource URIs and randomly distributed properties, together with the GADES and SemJaccard similarity functions. That is, both graphs are described in terms of the same DBpedia ontology. Fig. 5 visualizes the results obtained when applying the GADES semantic similarity function in order to identify a perfect matching of graph resources, i.e., in blocking conditions. SJoin exhibits a better F1 score up to the very high threshold value of 0.9. Moreover, an effectiveness of more than 80% is ensured up to the 0.6 threshold value, whereas Hash Join barely reaches 25% even at lower thresholds. The partitioning time is constant for both operators, but Hash Join performs the partitioning more slowly due to the application of a hash function to all incoming molecules. However, the high effectiveness of SJoin is achieved at the expense of time efficiency. SJoin has to complete a 1-1 perfect matching algorithm against a large 500x500 matrix, whereas Hash Join performs the perfect matching three times but for smaller matrices equal to the size of its buckets, e.g., about 166x166 for three buckets, which is faster due to the cubic complexity of the weighted perfect matching algorithm.

[Figure 6: four line charts of Completeness, % (0 to 100) over Time, sec (0 to 3600) for hash join and sjoin. (a) T = 0.1, # triples: 166573; (b) T = 0.3, # triples: 108922; (c) T = 0.5, # triples: 15148; (d) T = 0.8, # triples: 406.]
Fig. 6: Experiment 1 (GADES) with non-blocking operators. SJoin produces complete results at all thresholds in contrast to Hash Join.
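The effect of the cubic matching cost on the two blocking operators in Experiment 1 can be made concrete with a rough operation count. The figures below ignore constant factors and assume exactly three equally sized buckets of 166 molecules, which is an idealization of the "about 166x166" case mentioned in the text.

```latex
\begin{align*}
\text{one } 500 \times 500 \text{ matching:} \quad
  & 500^3 = 125{,}000{,}000 \\
\text{three } 166 \times 166 \text{ matchings:} \quad
  & 3 \cdot 166^3 = 3 \cdot 4{,}574{,}296 = 13{,}722{,}888 \\
\text{ratio:} \quad
  & \frac{125{,}000{,}000}{13{,}722{,}888} \approx 9.1
\end{align*}
```

Under these assumptions, splitting the matching into three buckets reduces the matching work by roughly a factor of nine, at the price of never comparing molecules that land in different buckets.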
Fig. 6 shows the results of the evaluation of non-blocking operators with GADES. SJoin outperforms the baseline Hash Join in terms of completeness over time in all four cases with the threshold in the range 0.1-0.8. Fig. 6a demonstrates that the SJoin operator is capable of producing 100% of the results within the timeframe, whereas the Hash Join operator outputs only about 10% of the expected tuples. In Fig. 6b, SJoin achieves full completeness even faster. In Fig. 6c both operators finish after 18 minutes, but SJoin retains full completeness while Hash Join reaches only 35%. Finally, with the 0.8 threshold in Fig. 6d, Hash Join performs very fast but still struggles to attain full completeness; SJoin takes more time but consistently achieves answer completeness. One of the reasons why Hash Join performs worse is its hash function, which does not consider the semantics encoded in the molecule descriptions. Therefore, the hash function partitions RDF molecules into buckets almost randomly, while it was originally envisioned to place similar entities in the same buckets.
Fig. 7 presents the efficiency and effectiveness of blocking SJoin and Hash Join when applying the SemJaccard similarity function. Because SemJaccard is an unsophisticated measure, the operators require less time for both the partitioning and probing stages. That is, due to the heterogeneous nature of the compared datasets, SemJaccard is not able to produce similarity scores higher than 0.4. On the other hand, the simplicity of SemJaccard leads to significant deterioration of the F1 score already at low thresholds, i.e., 0.3-0.4.

[Figure 7: two bar charts, (a) SJoin performance and (b) Hash Join performance. X-axis: Threshold (0.1 to 1.0); left y-axis: ET, sec (0 to 400), split into partitioning and probing time; right y-axis: F1 score (0 to 1).]
Fig. 7: Experiment 1 (SemJaccard) with blocking operators. (a) SJoin takes less time to compute similarity scores while the F1 score quickly deteriorates after threshold 0.5. (b) Baseline Hash Join in most cases consumes more time and produces less reliable matchings.

[Figure 8: two line charts of Completeness, % (0 to 100) over Time, sec (0 to 3600) for hash join and sjoin. (a) T = 0.4, GADES, # triples: 50857; (b) T = 0.4, Jaccard, # triples: 486.]
Fig. 8: Experiment 1 with fixed threshold. GADES identifies two orders of magnitude more results than Jaccard while SJoin still achieves full completeness.
Fig. 8 illustrates the difference in elapsed time and achieved completeness of
SJoin and Hash Join when applying the GADES or SemJaccard similarity functions.
Evidently, SemJaccard outputs fewer tuples even at lower thresholds, e.g., 486
pairs at the 0.4 threshold against 50,857 pairs by GADES. We therefore
demonstrate that plain set similarity measures such as SemJaccard, which
consider only the intersection of exactly the same triples, are ineffective in
integrating heterogeneous RDF graphs.
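A small sketch illustrates why a measure based on exact triple overlap scores heterogeneous descriptions so low. It uses plain Jaccard over sets of triples as a simplification; the exact definition of SemJaccard is not restated in this section and may differ in detail. The function jaccard and all ex1:/ex2: identifiers are hypothetical.

```python
def jaccard(a, b):
    """Plain Jaccard similarity: |A intersect B| / |A union B| over sets of triples."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# The same entity described with different identifiers and properties:
# no triple is exactly shared, so the score is zero.
m1 = {("ex1:Ada", "ex1:birthYear", "1815"), ("ex1:Ada", "ex1:name", "Ada Lovelace")}
m2 = {("ex2:e42", "ex2:p569", "1815"), ("ex2:e42", "ex2:label", "Ada Lovelace")}
score_heterogeneous = jaccard(m1, m2)

# Two descriptions that reuse one vocabulary and share one of three distinct triples.
m3 = {("ex1:Ada", "ex1:birthYear", "1815"), ("ex1:Ada", "ex1:name", "Ada Lovelace")}
m4 = {("ex1:Ada", "ex1:birthYear", "1815"), ("ex1:Ada", "ex1:field", "Mathematics")}
score_shared_vocabulary = jaccard(m3, m4)
```

In this toy setting the heterogeneous pair scores 0 and the shared-vocabulary pair scores 1/3, so any positive threshold already discards the heterogeneous pair.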
4.2 DBpedia - Wikidata People
The distinctive feature of this experiment is that completely different
vocabularies are used to semantically describe the same people. Therefore,
traditional joins and set similarity joins, e.g., Jaccard, are not applicable.
We evaluate the performance of SJoin employing the GADES semantic similarity
measure.
[Figure 9, panels: (a) GADES distribution, (b) SJoin, (c) Hash Join. Panels (b) and (c): x-axis: Threshold (0.1-0.5); left y-axis: ET, sec (partitioning and probing); right y-axis: F1 score.]
Fig. 9: Experiment 2 (GADES) with blocking operators, 500 molecules.
(a) The distribution of GADES similarity scores shows that few pairs have a
score exceeding the 0.4 threshold. (b) SJoin requires more time but achieves an
F1 score above 0.9 up to threshold 0.3. (c) Baseline Hash Join operates faster
but achieves an F1 score below 0.25.
Fig. 9 reports the efficiency and effectiveness of SJoin compared to Hash Join
in the 500 molecules setup. Fig. 9a justifies the range of selected thresholds,
as only a small number of pairs have a similarity score higher than 0.5.
Blocking SJoin manages to achieve a higher F1 score (max 95%) up to the 0.3
threshold value, but requires significantly more time to accomplish the perfect
matching.
Results of non-blocking SJoin and Hash Join executed against the 500 and 1000
molecules configurations are reported in Fig. 10. The observed behavior of these
operators resembles the one in Experiment 1, i.e., SJoin outputs complete results
within a predefined time frame, while Hash Join barely achieves 40% completeness
in the case with a relatively high threshold of 0.4 and a small number of
outputs.
Analyzing the observed empirical results, we are able to answer our research
questions: RQ1) Blocking SJoin consistently exhibits higher F1 scores, and the
results are more reliable. However, time efficiency depends on the input graphs
and applied similarity functions. RQ2) A threshold value prunes the amount of
expected results and does not affect the completeness of SJoin. RQ3) Clearly, a
semantic similarity function allows for matching RDF graphs more accurately.
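The interplay between the threshold and the F1 score discussed for RQ1 and RQ2 can be sketched with a common pair-level definition of precision, recall, and F1. This is an assumption about the evaluation protocol, which is not restated in this section; the function f1_at_threshold, the scores, and the gold standard below are made up for illustration.

```python
def f1_at_threshold(scored_pairs, gold, threshold):
    """F1 of the pairs whose similarity reaches the threshold, against a gold standard."""
    predicted = {pair for pair, score in scored_pairs.items() if score >= threshold}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical similarity scores and gold-standard matches.
scored = {("a1", "b1"): 0.9, ("a2", "b2"): 0.35, ("a3", "b9"): 0.2}
gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

f1_by_threshold = {t: f1_at_threshold(scored, gold, t) for t in (0.1, 0.3, 0.5, 0.95)}
```

In this toy data a moderate threshold removes the single wrong pair and raises F1, while higher thresholds start pruning correct pairs and F1 drops again.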
5 Related Work
Traditional binary join operators require join variable instantiations to be
exactly the same. For example, the XJoin [11] and Hash Join [2] (chosen as a
baseline in this paper) operators abide by this condition. At the Insert step,
both blocking and non-blocking Hash Join algorithms partition incoming tuples
into a number of buckets based on the assumption that after applying a hash
function similar tuples will reside in the same bucket. The assumption holds
true in cases of simple data structures, e.g., numbers or strings. However,
applying hash functions to string representations of complex data structures
such as RDF molecules or RSMs tends to produce collisions rather than efficient
partitions. At the Probe stage, Hash Join performs matching with respect to a
specified join variable. Thus, having a URI as the join variable, semantically
equivalent RSMs with different URIs cannot be joined by Hash Join.
[Figure 10, panels: (a) T = 0.2, 500 molecules (# triples: 153904), (b) T = 0.4, 500 molecules (# triples: 639), (c) T = 0.2, 1000 molecules (# triples: 160062), (d) T = 0.4, 1000 molecules (# triples: 3466). x-axis: Time, sec (0-3600); y-axis: Completeness, %; series: hash join, sjoin.]
Fig. 10: Experiment 2. Non-blocking operators in different dataset
sizes. In larger setups, SJoin still reaches full completeness.
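The exact-match requirement of the Insert and Probe steps can be sketched with a textbook build/probe hash join keyed on the URI. This is a generic illustration rather than the baseline implementation used in the experiments; the function hash_join, the row layout, and the identifiers are hypothetical.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Build/probe hash join: rows match only when their key values are identical."""
    buckets = defaultdict(list)
    for row in left:                      # Insert (build) step
        buckets[row[key]].append(row)
    for row in right:                     # Probe step
        for match in buckets.get(row[key], []):
            yield match, row

left = [{"uri": "ex1:Ada", "name": "Ada Lovelace"},
        {"uri": "ex:shared", "name": "Shared entity"}]
right = [{"uri": "ex2:e42", "label": "Ada Lovelace"},
         {"uri": "ex:shared", "label": "Shared entity"}]

joined = list(hash_join(left, right, "uri"))
# Only the rows with the identical URI "ex:shared" are joined; the two
# descriptions of Ada Lovelace are missed because their URIs differ.
```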
Similarity join algorithms are able to match syntactically different entities
and address the heterogeneity issue. String similarity join techniques reported
in [3, 5, 12] rely on various metrics to compute a distance between two strings.
Set similarity joins [6, 8] identify matches between sets. String and set
similarity techniques are, however, inefficient when applied to RDF data, as
they do not consider the graph nature of semantic data. There exist graph
similarity joins [9, 13] which traverse graph data in order to identify similar
nodes. However, those operators do not tackle the semantics encoded in
knowledge graphs and are tailored for specific similarity functions.
In contrast, SJoin, presented in this paper, is a semantic similarity operator
that fully leverages RDF and OWL semantics encoded in the RDF graphs. Moreover,
SJoin is able to operate in a blocking, i.e., 1-1 perfect matching, or a
non-blocking, i.e., incremental N-M, manner, allowing for on-demand and
ad-hoc semantic data integration pipelines. Additionally, SJoin is flexible and is
able to employ various similarity functions and metrics, e.g., from simple Jac-
card similarity to complex NED [14] or GADES [10] measures, achieving best
performance with semantic similarity functions.
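The contrast between the two processing strategies can be sketched as follows. This is a toy illustration under stated assumptions, not the SJoin algorithms themselves: the blocking variant brute-forces the best 1-1 assignment over permutations (feasible only for tiny inputs, and assuming the left input is not larger than the right one) and applies the threshold afterwards, while the non-blocking variant is a simple incremental nested-loop join. The function names and the similarity table are hypothetical.

```python
from itertools import permutations

def blocking_one_to_one(left, right, sim, threshold):
    """Wait for both inputs, then keep the 1-1 assignment with maximal total similarity."""
    best, best_score = (), -1.0
    for perm in permutations(right, len(left)):   # assumes len(left) <= len(right)
        score = sum(sim(l, r) for l, r in zip(left, perm))
        if score > best_score:
            best, best_score = tuple(zip(left, perm)), score
    return [(l, r) for l, r in best if sim(l, r) >= threshold]

def non_blocking_n_to_m(stream, sim, threshold):
    """Emit every pair above the threshold as soon as both of its members have arrived."""
    seen = {"L": [], "R": []}
    for side, item in stream:
        other = "R" if side == "L" else "L"
        for candidate in seen[other]:
            pair = (item, candidate) if side == "L" else (candidate, item)
            if sim(*pair) >= threshold:
                yield pair
        seen[side].append(item)

# Hypothetical pairwise similarity scores.
scores = {("a1", "b1"): 0.9, ("a1", "b2"): 0.6, ("a2", "b1"): 0.7, ("a2", "b2"): 0.8}

def sim(l, r):
    return scores.get((l, r), 0.0)

one_to_one = blocking_one_to_one(["a1", "a2"], ["b1", "b2"], sim, 0.5)
stream = [("L", "a1"), ("R", "b1"), ("L", "a2"), ("R", "b2")]
n_to_m = list(non_blocking_n_to_m(stream, sim, 0.5))
```

With these scores the blocking variant returns the two pairs of the best assignment only after all input is available, whereas the non-blocking variant emits four pairs one by one while the stream is still being consumed.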
6 Conclusions and Future Work
We presented SJoin, an operator for detecting semantically equivalent RDF
molecules from RDF graphs. SJoin implements two operators: Blocking and
Non-Blocking, which rely on similarity measures and ontologies to effectively
detect equivalent entities from heterogeneous RDF graphs. Moreover, the time
complexity of SJoin operators depends on the time complexity of the similar-
ity measure, i.e., SJoin does not introduce additional overhead. The behavior
of SJoin was empirically studied on real-world RDF graphs from DBpedia and
Wikidata, using the Jaccard and GADES similarity measures. Observed results
suggest that SJoin is able to identify and merge semantically equivalent entities, and
is empowered by the semantics encoded in ontologies and exploited by similarity
measures. As future work, we plan to define new SJoin operators to compute
on-demand integration of RDF graphs and address streams of RDF data.
Acknowledgments
Mikhail Galkin is supported by the project Open Budgets (GA 645833). This
work is also funded in part by the European Union under the Horizon 2020
Framework Program for the project BigDataEurope (GA 644564), and the Ger-
man Ministry of Education and Research with grant no. 13N13627 (LiDaKra).
References
1. D. Collarana, M. Galkin, C. Lange, I. Grangel-González, M. Vidal, and S. Auer.
Fuhsen: A federated hybrid search engine for building a knowledge graph on-
demand (short paper). In ODBASE, pages 752–761, 2016.
2. A. Deshpande, Z. G. Ives, and V. Raman. Adaptive query processing. Foundations
and Trends in Databases, 1(1):1–140, 2007.
3. J. Feng, J. Wang, and G. Li. Trie-join: a trie-based method for efficient string
similarity joins. VLDB J., 21(4):437–461, 2012.
4. J. D. Fernández, A. Llaves, and Ó. Corcho. Efficient RDF interchange (ERI) format
for RDF data streams. In ISWC, pages 244–259, 2014.
5. G. Li, D. Deng, J. Wang, and J. Feng. PASS-JOIN: A partition-based method for
similarity joins. PVLDB, 5(3):253–264, 2011.
6. W. Mann, N. Augsten, and P. Bouros. An empirical evaluation of set similarity
join techniques. PVLDB, 9(9):636–647, 2016.
7. J. Munkres. Algorithms for the assignment and transportation problems. Journal
of the society for industrial and applied mathematics, 5(1):32–38, 1957.
8. L. A. Ribeiro, A. Cuzzocrea, K. A. A. Bezerra, and B. H. B. do Nascimento.
Incorporating clustering into set similarity join algorithms: The sjclust framework.
In DEXA 2016, Porto, Portugal, pages 185–204, 2016.
9. Z. Shang, Y. Liu, G. Li, and J. Feng. K-join: Knowledge-aware similarity join.
IEEE Trans. Knowl. Data Eng., 28(12):3293–3308, 2016.
10. I. Traverso, M.-E. Vidal, B. Kämpgen, and Y. Sure-Vetter. Gades: A graph-based
semantic similarity measure. In SEMANTiCS, pages 101–104. ACM, 2016.
11. T. Urhan and M. J. Franklin. Xjoin: A reactively-scheduled pipelined join operator.
IEEE Data Eng. Bull., 23(2):27–33, 2000.
12. S. Wandelt, D. Deng, S. Gerdjikov, S. Mishra, P. Mitankin, M. Patil, E. Siragusa,
A. Tiskin, W. Wang, J. Wang, and U. Leser. State-of-the-art in string similarity
search and join. SIGMOD Record, 43(1):64–76, 2014.
13. Y. Wang, H. Wang, J. Li, and H. Gao. Efficient graph similarity join for information
integration on graphs. Frontiers of Computer Science, 10(2):317–329, 2016.
14. H. Zhu, X. Meng, and G. Kollios. NED: an inter-graph node metric based on edit
distance. PVLDB, 10(6):697–708, 2017.
... The definition of a novel structure extracted from a tightly coupled semantic graph, which we call the hybrid Molecules, is one of the major contributions of this study as it is a means to provide innovative search results covering both the domain-specific and the structural-based dimensions of the documents. The Hybrid Molecules consist of well-defined sub-graphs that we formally define in view of the characteristics of a tightly coupled semantic graph and the definition of a molecule concept in the literature [27,28,30,36,69]. They are hybrid as they encapsulate domain-specific information coupled with related structural-based information of the documents. ...
... In this section, we introduce hybrid molecules which we build upon the definitions of molecules in the literature [27,28,30,36,69], yet regardless of the serialization technology. Molecules are sub-graphs of connected nodes. ...
... The hybrid moleculebased query answers bring in helpful contextual information of the documents improving the search results and reducing users' efforts in tracking and interpreting them. We formally define the hybrid molecule's structure in view of the characteristics of a tightly coupled semantic graph and the definition of a molecule concept in the literature[27,28,30,36,69]. We then integrate the notion of hybrid molecules in a query processing pipeline, where users submit their natural language (e.g., plain English text) queries over a heterogeneous document corpus and obtain relevant answers in the form of hybrid molecules. ...
Thesis
Full-text available
The recent advances of Information and Communication Technology (ICT) have resulted in the development of several industries. Adopting semantic technologies has proven several benefits for enabling a better representation of the data and empowering reasoning capabilities over it, especially within an Information Retrieval (IR) application. This has, however, few applications in the industries as there are still unresolved issues, such as the shift from heterogeneous interdependent documents to semantic data models and the representation of the search results while considering relevant contextual information. In this thesis, we address two main challenges. The first one focuses on the representation of the collective knowledge embedded in a heterogeneous document corpus covering both the domain-specific content of the documents, and other structural aspects such as their metadata, their dependencies (e.g., references), etc. The second one focuses on providing users with innovative search results, from the heterogeneous document corpus, helping the users in interpreting the information that is relevant to their inquiries and tracking cross document dependencies.To cope with these challenges, we first propose a semantic representation of a heterogeneous document corpus that generates a semantic graph covering both the structural and the domain-specific dimensions of the corpus. Then, we introduce a novel data structure for query answers, extracted from this graph, which embeds core information together with structural-based and domain-specific context. In order to provide such query answers, we propose an innovative query processing pipeline, which involves query interpretation, search, ranking, and presentation modules, with a focus on the search and ranking modules.Our proposal is generic as it can be applicable in different domains. 
However, in this thesis, it has been experimented in the Architecture, Engineering and Construction (AEC) industry using real-world construction projects.
... In order to efficiently integrate big data sources and to address interoperability conflicts, several integration approaches have been devised to collect domain independent data, whereas others integrate data particularly from biomedical domain. KARMA [31], MINTE [8], SILK [25], SJoin [18], LDIF [2], Sieve [34], LIMES [36], and RapidMiner LOD Extension [44] are generic approaches for semantic data integration. KARMA is a semi-automatic approach capable to resolve interoperability conflicts among structured sources. ...
... Further, Hu et al. [23] perform various link analysis methods against, e.g., data link analysis, entity link analysis, and term link analysis; the results of link analysis are exploited for solving interoperability conflicts and for facilitating data integration. X X X MINTE [8] X X X SILK [25] X X SJoin [18] X X LDIF [2] X X Sieve [34] X X X LIMES [36] X X RapidMiner [44] X Knowledge-driven framework ...
... Queries can be written in SPARQL, and Ontario decides the subqueries that need to be executed over each knowledge graph to collect the data required for query answer. Additionally, Ontario executes physical operators, e.g., symmetric join [18] and gjoin [1], and is able to relate during query execution, RDF triples stored in different knowledge graphs. To illustrate this feature, consider the following query: "Mutations of the type confirmed somatic variant located in transcripts which are translated as proteins that are transporters of the drug docetaxel" which is represented by SPARQL query in Listing 1. Ontario collects the results and merges them in order to project out the names of the mutations. ...
Chapter
Full-text available
Big biomedical data has grown exponentially during the last decades and a similar growth rate is expected in the next years. Likewise, semantic web technologies have also advanced during the last years, and a great variety of tools, e.g., ontologies and query languages, have been developed by different scientific communities and practitioners. Although a rich variety of tools and big data collections are available, many challenges need to be addressed in order to discover insights from which decisions can be taken. For instance, different interoperabil-ity conflicts can exist among data collections, data may be incomplete, and entities may be dispersed across different datasets. These issues hinder knowledge exploration and discovery, being thus required data integration in order to unveil meaningful outcomes. In this chapter, we address these challenges and devise a knowledge-driven framework that relies on semantic web technologies to enable knowledge exploration and discovery. The framework receives big data sources and integrates them into a knowledge graph. Semantic data integration methods are utilized for identifying equivalent entities, i.e., entities that correspond to the same real-world elements. Fusion policies enable the merging of equivalent entities inside the knowledge graph, as well as with entities in other knowledge graphs, e.g., DBpedia and Bio2RFD. Knowledge discovery allows for the exploration of knowledge graphs in order to uncover novel patterns and relations. As proof of concept, we report on the results of applying the knowledge-driven framework in the EU funded project iASiS 3 in order to transform big data into actionable knowledge, paving thus the way for personalised medicine.
... PSJ has received much attention in many domains. Galkin et al. [15] applied PSJ to integrate heterogeneous RDF graphs by introducing an equivalent semantics for RDF graphs. Ma et al. [23] proposed an effective filter-based method for high-dimensional vector similarity join. ...
... the queues cl .q y of cell cl (lines[9][10][11][12][13][14][15]. Note that, each object o p y ∈ cl .q ...
Conference Paper
For decades, the join operator over fast data streams has always drawn much attention from the database community, due to its wide spectrum of real-world applications, such as online clustering, intrusion detection, sensor data monitoring, and so on. Existing works usually assume that the underlying streams to be joined are complete (without any missing values). However, this assumption may not always hold, since objects from streams may contain some missing attributes, due to various reasons such as packet losses, network congestion/failure, and so on. In this paper, we formalize an important problem, namely join over incomplete data streams (Join-iDS), which retrieves joining object pairs from incomplete data streams with high confidences. We tackle the Join-iDS problem in the style of "data imputation and query processing at the same time". To enable this style, we design an effective and efficient cost-model-based imputation method via deferential dependency (DD), devise effective pruning strategies to reduce the Join-iDS search space, and propose efficient algorithms via our proposed cost-model-based data synopsis/indexes. Extensive experiments have been conducted to verify the efficiency and effectiveness of our proposed Join-iDS approach on both real and synthetic data sets.
... Authors in [9] propose a vertex join with link creation (Figure 1c): each social network user is matched to the paper where they appear as a first author. The black dashed edges in the figure provide the output of the link creation. ...
Conference Paper
Despite the growing popularity of techniques related to graph summarization, a general operator for joining graphs on both the vertices and the edges is still missing. Current languages such as Cypher and SPARQL express binary joins through the non-scalable and inefficient composition of multiple traversal and graph creation operations. In this paper, we propose an efficient equi-join algorithm that is able to perform vertex and path joins over a secondary memory indexed graph, also the resulting graph is serialised in secondary memory. The results show that the implementation of the proposed model outperforms solutions based on graphs, such as Neo4J and Virtuoso, and the relational model, such as PostgreSQL. Moreover, we propose two ways how edges can be combined, namely the conjunctive and disjunctive semantics, Preliminary experiments on the graph conjunctive join are also carried out with incremental updates, thus suggesting that our solution outperforms materialized views over PostgreSQL.
... PSJ has received much attention in many domains. Galkin et al. [15] applied PSJ to integrate heterogeneous RDF graphs by introducing an equivalent semantics for RDF graphs. Ma et al. [23] proposed an effective filter-based method for high-dimensional vector similarity join. ...
Preprint
Full-text available
For decades, the join operator over fast data streams has always drawn much attention from the database community, due to its wide spectrum of real-world applications, such as online clustering, intrusion detection, sensor data monitoring, and so on. Existing works usually assume that the underlying streams to be joined are complete (without any missing values). However, this assumption may not always hold, since objects from streams may contain some missing attributes, due to various reasons such as packet losses, network congestion/failure, and so on. In this paper, we formalize an important problem, namely join over incomplete data streams (Join-iDS), which retrieves joining object pairs from incomplete data streams with high confidences. We tackle the Join-iDS problem in the style of "data imputation and query processing at the same time". To enable this style, we design an effective and efficient cost-model-based imputation method via deferential dependency (DD), devise effective pruning strategies to reduce the Join-iDS search space, and propose efficient algorithms via our proposed cost-model-based data synopsis/indexes. Extensive experiments have been conducted to verify the efficiency and effectiveness of our proposed Join-iDS approach on both real and synthetic data sets.
Article
Big data has exponentially grown in the last decade; it is expected to grow at a faster rate in the next years as a result of the advances in the technologies for data generation and ingestion. For instance, in the biomedical domain, a wide variety of methods are available for data ingestion, e.g., liquid biopsies and medical imaging, and the collected data can be represented using myriad formats, e.g., FASTQ and Nifti. In order to extract and manage valuable knowledge and insights from big data, the problem of data integration from structured and unstructured data needs to be effectively solved. In this paper, we devise a knowledge-driven approach able to transform disparate data into knowledge from which actions can be taken. The proposed framework resorts to computational extraction methods for mining knowledge from data sources, e.g., clinical notes, images, or scientific publications. Moreover, controlled vocabularies are utilized to annotate entities and a unified schema describes the meaning of these entities in a knowledge graph; entity linking methods discover links to existing knowledge graphs, e.g., DBpedia and Bio2RDF. A federated query engine enables the exploration of the linked knowledge graphs while knowledge discovery methods allow for uncovering patterns in the knowledge graphs. The proposed framework is used in the context of the EU H2020 funded project iASiS with the aim of paving the way for accurate diagnostics and personalized treatments.
Article
Full-text available
Set similarity joins compute all pairs of similar sets from two collections of sets. We conduct extensive experiments on seven state-of-the-art algorithms for set similarity joins. These algorithms adopt a filter-verification approach. Our analysis shows that verification has not received enough attention in previous works. In practice, efficient verification inspects only a small, constant number of set elements and is faster than some of the more sophisticated filter techniques. Although we can identify three winners, we find that most algorithms show very similar performance. The key technique is the prefix filter, and AllPairs, the first algorithm adopting this techniques is still a relevant competitor. We repeat experiments from previous work and discuss diverging results. All our claims are supported by a detailed analysis of the factors that determine the overall runtime.
Conference Paper
Full-text available
RDF streams are sequences of timestamped RDF statements or graphs, which can be generated by several types of data sources (sensors, social networks, etc.). They may provide data at high volumes and rates, and be consumed by applications that require real-time responses. Hence it is important to publish and interchange them efficiently. In this paper, we exploit a key feature of RDF data streams, which is the regularity of their structure and data values, proposing a compressed, efficient RDF interchange (ERI) format, which can reduce the amount of data transmitted when processing RDF streams. Our experimental evaluation shows that our format produces state-of-the-art streaming compression, remaining efficient in performance.
Article
Full-text available
String similarity search and its variants are fundamental problems with many applications in areas such as data integration, data quality, computational linguistics, or bioinformatics. A plethora of methods have been developed over the last decades. Obtaining an overview of the state-of-the-art in this field is difficult, as results are published in various domains without much cross-talk, papers use different data sets and often study subtle variations of the core problems, and the sheer number of proposed methods exceeds the capacity of a single research group. In this paper, we report on the results of the probably largest benchmark ever performed in this field. To overcome the resource bottleneck, we organized the benchmark as an international competition, a workshop at EDBT/ICDT 2013. Various teams from different fields and from all over the world developed or tuned programs for two crisply defined problems. All algorithms were evaluated by an external group on two machines. Altogether, we compared 14 different programs on two string matching problems (k-approximate search and k-approximate join) using data sets of increasing sizes and with different characteristics from two different domains. We compare programs primarily by wall clock time, but also provide results on memory usage, indexing time, batch query effects and scalability in terms of CPU cores. Results were averaged over several runs and confirmed on a second, different hardware platform. A particularly interesting observation is that disciplines can and should learn more from each other, with the three best teams rooting in computational linguistics, databases, and bioinformatics, respectively.
Conference Paper
A vast amount of information about various types of entities is spread across the Web, e.g., people or organizations on the Social Web, product offers on the Deep Web or on the Dark Web. These data sources can comprise heterogeneous data and are equipped with different search capabilities e.g., Search API. End users such as investigators from law enforcement institutions searching for traces and connections of organized crime have to deal with these interoperability problems not only during search time but also while merging data collected from different sources. We devise FuhSen, a keyword-based federated engine that exploits the search capabilities of heterogeneous sources during query processing and generates knowledge graphs on-demand applying an RDF-Molecule integration approach in response to keyword-based queries. The resulting knowledge graph describes the semantics of entities collected from the integrated sources, as well as relationships among these entities. Furthermore, FuhSen utilizes ontologies to describe the available sources in terms of content and search capabilities and exploits this knowledge to select the sources relevant for answering a keyword-based query. We conducted a user evaluation where FuhSen is compared to traditional search engines. FuhSen semantic search capabilities allow users to complete search tasks that could not be accomplished with traditional Web search engines during the evaluation study.
Article
Similarity join is a fundamental operation in data cleaning and integration. Existing similarity-join methods utilize the string similarity to quantify the relevance but neglect the knowledge behind the data, which plays an important role in understanding the data. Thanks to public knowledge bases, e.g., Freebase and Yago, we have an opportunity to use the knowledge to improve similarity join. To address this problem, we study knowledge-aware similarity join, which, given a knowledge hierarchy and two collections of objects (e.g., documents), finds all knowledge-aware similar object pairs. To the best of our knowledge, this is the first study on knowledge-aware similarity join. There are two main challenges. The first is how to quantify the knowledge-aware similarity. The second is how to efficiently identify the similar pairs. To address these challenges, we first propose a new similarity metric to quantify the knowledge-aware similarity using the knowledge hierarchy. We then devise a filter-and-verification framework to efficiently identify the similar pairs. We propose effective signature-based filtering techniques to prune large numbers of dissimilar pairs and develop efficient verification algorithms to verify the candidates that are not pruned in the filter step. Experimental results on real-world datasets show that our method significantly outperforms baseline algorithms in terms of both efficiency and effectiveness.
Conference Paper
Knowledge graphs encode semantics that describes resources in terms of several aspects, e.g., neighbors, class hierarchies, or node degrees. Assessing relatedness of knowledge graph entities is crucial for several data-driven tasks, e.g., ranking, clustering, or link discovery. However, existing similarity measures consider aspects in isolation when determining entity relatedness. We address the problem of similarity assessment between knowledge graph entities, and devise GADES. GADES relies on aspect similarities and computes a similarity measure as the combination of these similarity values. We empirically evaluate the accuracy of GADES on knowledge graphs from different domains, e.g., proteins, and news. Experiment results indicate that GADES exhibits higher correlation with gold standards than studied existing approaches. Thus, these results suggest that similarity measures should not consider aspects in isolation, but combinations of them to precisely determine relatedness.
Conference Paper
Data cleaning and integration found on duplicate record identification, which aims at detecting duplicate records that represent the same real-world entity. Similarity join is largely used in order to detect pairs of similar records in combination with a subsequent clustering algorithm meant for grouping together records that refer to the same entity. Unfortunately, the clustering algorithm is strictly used as a post-processing step, which slows down the overall performance, and final results are produced at the end of the whole process only. Inspired by this critical evidence, in this paper we propose and experimentally assess SjClust, a framework to integrate similarity join and clustering into a single operation. The basic idea of our proposal consists in introducing a variety of cluster representations that are smoothly merged during the set similarity task, carried out by the join algorithm. An optimization task is further applied on top of such framework. Experimental results, which are derived from an extensive experimental campaign, we retrieve are really surprising, as we are able to outperform the original set similarity join algorithm by an order of magnitude in most settings.
Article
Node similarity is a fundamental problem in graph analytics. However, node similarity between nodes in different graphs (inter-graph nodes) has not received a lot of attention yet. The inter-graph node similarity is important in learning a new graph based on the knowledge of an existing graph (transfer learning on graphs) and has applications in biological, communication, and social networks. In this paper, we propose a novel distance function for measuring inter-graph node similarity with edit distance, called NED. In NED, two nodes are compared according to their local neighborhood structures which are represented as unordered k-adjacent trees, without relying on labels or other assumptions. Since the computation problem of tree edit distance on unordered trees is NP-Complete, we propose a modified tree edit distance, called TED*, for comparing neighborhood trees. TED* is a metric distance, as the original tree edit distance, but more importantly, TED* is polynomially computable. As a metric distance, NED admits efficient indexing, provides interpretable results, and shows to perform better than existing approaches on a number of data analysis tasks, including graph de-anonymization. Finally, the efficiency and effectiveness of NED are empirically demonstrated using real-world graphs.
Article
Graphs have been widely used for complex data representation in many real applications, such as social networks, bioinformatics, and computer vision. Therefore, graph similarity join has become imperative for integrating noisy and inconsistent data from multiple data sources. The edit distance is commonly used to measure the similarity between graphs. The graph similarity join problem studied in this paper is based on graph edit distance constraints. To accelerate the similarity join based on graph edit distance, we use a preprocessing strategy to remove the mismatching graph pairs with significant differences. Then a novel method of building indexes for each graph is proposed by grouping the nodes which can be reached in k hops for each key node with structure conservation, which is the k-hop tree based indexing method. For each candidate pair, we propose a similarity computation algorithm with boundary filtering, which can be applied with good efficiency and effectiveness. Experiments on real and synthetic graph databases also confirm that our method can achieve good join quality in graph similarity join. Besides, the join process can be finished in polynomial time. © 2015 Higher Education Press and Springer-Verlag Berlin Heidelberg
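The abstract does not say which preprocessing filter is used to discard graph pairs with significant differences. One standard option, shown here purely as an assumption, is a size-based filter: under unit-cost edit operations, each operation changes the number of vertices or the number of edges by at most one, so the difference in vertex counts plus the difference in edge counts is a lower bound on the graph edit distance. The graph representation and function names below are hypothetical.

```python
def ged_lower_bound(g1, g2):
    """Size-based lower bound on unit-cost graph edit distance.

    Each graph is a (vertex_set, edge_set) pair (hypothetical format).
    """
    (v1, e1), (v2, e2) = g1, g2
    return abs(len(v1) - len(v2)) + abs(len(e1) - len(e2))


def candidate_pairs(graphs_r, graphs_s, tau):
    """Keep only pairs whose lower bound does not already exceed tau."""
    return [
        (i, j)
        for i, g in enumerate(graphs_r)
        for j, h in enumerate(graphs_s)
        if ged_lower_bound(g, h) <= tau
    ]


triangle = ({1, 2, 3}, {(1, 2), (2, 3), (1, 3)})
path = ({1, 2, 3}, {(1, 2), (2, 3)})
star = ({1, 2, 3, 4, 5}, {(1, 2), (1, 3), (1, 4), (1, 5)})

candidates = candidate_pairs([triangle], [path, star], tau=1)
```

Pairs that survive this filter still need an exact or bounded edit distance computation; the filter only avoids that cost for pairs that cannot possibly satisfy the threshold.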
Article
In this paper we present algorithms for the solution of the general assignment and transportation problems. In Section 1, a statement of the algorithm for the assignment problem appears, along with a proof for the correctness of the algorithm. The remarks which constitute the proof are incorporated parenthetically into the statement of the algorithm. Following this appears a discussion of certain theoretical aspects of the problem. In Section 2, the algorithm is generalized to one for the transportation problem. The algorithm of that section is stated as concisely as possible, with theoretical remarks omitted. 1. THE ASSIGNMENT PROBLEM. The personnel-assignment problem is the problem of choosing an optimal assignment of n men to n jobs, assuming that numerical ratings are given for each man’s performance on each job. An optimal assignment is one which makes the sum of the men’s ratings for their assigned jobs a maximum. There are n! possible assignments (of which several may be optimal), so that it is physically impossible, except