Efficient Multidimensional Blocking for Link Discovery
without losing Recall
Robert Isele, Anja Jentzsch, and Christian Bizer
Freie Universität Berlin, Web-based Systems Group
Garystr. 21, 14195 Berlin, Germany
mail@robertisele.com, mail@anjajentzsch.de, chris@bizer.de
ABSTRACT
Over the last three years, an increasing number of data
providers have started to publish structured data accord-
ing to the Linked Data principles on the Web. The resulting
Web of Data currently consists of over 28 billion RDF triples.
As the Web of Data grows, there is an increasing need for
link discovery tools which scale to very large datasets. In
record linkage, many partitioning methods have been pro-
posed which substantially reduce the number of required
entity comparisons. Unfortunately, most of these methods
either lead to a decrease in recall or only work on metric
spaces. We propose a novel blocking method called Multi-
Block which uses a multidimensional index in which similar
objects are located near each other. In each dimension the
entities are indexed by a different property increasing the ef-
ficiency of the index significantly. In addition, it guarantees
that no false dismissals can occur. Our approach works on
complex link specifications which aggregate several different
similarity measures. MultiBlock has been implemented as
part of the Silk Link Discovery Framework. The evalua-
tion shows a speedup factor of several 100 for large datasets
compared to the full evaluation without losing recall.
Keywords
Blocking, Link Discovery, Identity Resolution, Duplicate De-
tection, Record Linkage, Linked Data
1. INTRODUCTION
The Web of Data forms a single global data space for
the very reason that its data sources are connected by RDF
links [2]. While there are some fully automatic tools for link
discovery [6], most tools generate links semi-automatically
based on link specifications [20, 16, 11]. Link specifications
specify the conditions which must hold true for a pair of
entities for the link discovery tool to generate an RDF link
between them. Based on a link specification, the link dis-
covery tool compares entities and concludes to set links if
the aggregated similarity is above a given threshold. The
naive approach to compare all entities with each other does
not scale due to the computation of the Cartesian product
of all the entities.
As the Web of Data is growing fast there is an increas-
ing need for link discovery tools which scale to very large
datasets. A number of methods have been proposed to im-
prove the efficiency of link discovery by dismissing defini-
tive non-matches prior to comparison. The most well-known
method to achieve this is known as blocking [5]. Unfortu-
nately, standard blocking methods in general lead to a de-
crease of recall due to false dismissals [4].
We propose a novel blocking approach which maps enti-
ties to a multidimensional index, called MultiBlock. The
basic idea of the mapping function is that it preserves the
distances of the entities, i.e. similar entities will be located
near each other in the index space. While standard block-
ing techniques block in one dimension, MultiBlock concur-
rently blocks by multiple properties using multiple dimen-
sions increasing its efficiency significantly. MultiBlock has
been implemented and evaluated within the Silk Link Dis-
covery Framework. It works on complex link specifications
which aggregate multiple different similarity measures such
as string, geographic or date similarity and does not need
any additional configuration. MultiBlock is organized in
three phases:
1. In the index generation phase, an index is built for
each similarity measure. The basic idea of the indexing
method is that it preserves the distances of the entities
i.e. similar entities will be located near each other in
the index. The specific indexing method depends on the
employed similarity measure.
2. In the index aggregation phase, all indexes are aggre-
gated into one multidimensional index, preserving the
property of the indexing that the indexes of two entities
within a given distance share the same index.
3. Finally, the comparison pair generation employs the
index to generate the set of entity pairs which are po-
tential links. These pairs are then evaluated using the
link specification to compute the exact similarity and
determine the actual links.
We illustrate the indexing by looking at the simple exam-
ple of interlinking geographical entities based on their label
and geographic coordinates: In that case the index genera-
tion phase would generate 2 indices: A 1-dimensional index
of the labels and a 2-dimensional index of the coordinates.
The index aggregation phase would then aggregate both in-
dexes into a single 3-dimensional index. Figure 1 visualises
the index generation and aggregation in this example. Note
that each similarity measure may create multiple indexes for
a single entity, which for simplicity is not considered in the
Figure.
Figure 1: Aggregating a geographic and a string sim-
ilarity
Figure 2 shows the aggregated index for 1,000 cities in
DBpedia.
Figure 2: Index of 1,000 cities in DBpedia
This paper is structured as follows: In the next Section we
discuss related work. In Section 3, we outline how our work
contributes to the current state of the art. Section 4 introduces
some preliminaries on the process of link discovery. Based on
these foundations, Section 5 explains the general framework
of our approach which is independent of a specific similar-
ity measure or similarity aggregation. At the same time,
it specifies which properties a similarity measure or aggre-
gation must adhere to in order to be used in our approach.
Subsequently, Section 6 specifies various similarity measures
and aggregations which can be plugged into the framework
in order to provide a complete blocking method. Our imple-
mentation as part of the Silk Link Discovery Framework [16]
is discussed in Section 7. Finally, Section 8 reports on the
results of the evaluation.
2. RELATED WORK
The problem of link discovery is very similar to record
linkage. In record linkage [5] a number of methods to im-
prove the efficiency by reducing the number of required com-
parisons are often applied:
Traditional blocking methods work by partitioning the
entities into blocks based on the value of a specific prop-
erty [1]. Then only entities from the same block are com-
pared reducing the number of comparisons significantly at
the cost of a loss in accuracy. Especially in cases where the
data is noisy, as is often the case in Linked Data, similar
entities might be assigned to different blocks and thus not
be compared in the subsequent comparison phase. In order to
reduce the number of false dismissals, a multi-pass approach
has been proposed [13]. In a multi-pass approach the block-
ing is run several times, each time with a different blocking
key.
The Sorted-Neighborhood method has been proposed
in order to improve the handling of fuzzy data [12]. The
Sorted-Neighborhood method works on the list of entities
which has been sorted according to a user-defined key. The
entities which are selected for comparison are determined
by sliding a fixed-size window along the sorted list. Only
entities inside the window are selected for comparison. The
biggest problem of the Sorted-Neighborhood method lies in
the choice of the window size: a small window may miss matches
if many similar entities share the same key, while a large
window leads to a decrease in efficiency. A solution for this is to adapt
the window size while sliding through the list [24].
The Sorted Blocks [4] method generalizes Blocking and
Sorted-Neighborhood in order to overcome some of their in-
dividual disadvantages. It uses overlapping blocks and is
both easy to parallelize and more stable in the presence of
noise.
While Blocking and Sorted-Neighborhood methods usu-
ally map the property key to a single dimensional index,
some methods have been developed which map the similar-
ity space to a multidimensional Euclidean space. The fun-
damental idea of these methods is to preserve the distance
of the entities i.e. after mapping, similar entities are located
close to each other in the Euclidean space. Techniques which
use this approach include FastMap [8], MetricMap [21],
SparseMap [15] and StringMap [17]. Unfortunately, in
general, these methods do not guarantee that no false dis-
missals will occur [14]. The only exception is SparseMap for
which variants have been proposed which guarantee no false
dismissals [14]. All of these approaches require the similar-
ity space to form a metric space i.e. the similarity measure
must respect the triangle inequality. This implies that they
cannot be used with non-metric similarity measures such as
Jaro-Winkler [22].
Another approach which uses the characteristics of met-
ric spaces, in particular the triangle inequality, to reduce the
number of similarity computations, has been implemented
in LIMES [20]. To the best of our knowledge no other tool in
link discovery, besides Silk and LIMES, makes use of block-
ing techniques.
3. CONTRIBUTIONS
Our main contribution to the current state of the art is that,
while significantly reducing the number of comparisons, Multi-
Block guarantees that no false dismissals and thus no loss
of recall can occur and does not require the similarity space
to form a metric space. In addition, it uses a multidimen-
sional index increasing its efficiency significantly. Another
advantage of MultiBlock is that it can work on a stream
of entities as it does not require preprocessing the whole
dataset. Table 1 summarizes how MultiBlock compares to
existing methods.
Method                 Lossless   Non-Metrics   Streaming
Traditional Blocking   no         yes           yes
Sorted-Neighborhood    no         yes           no
Sorted Blocks          no         yes           no
FastMap                no         no            no
MetricMap              no         no            no
SparseMap              no         no            no
StringMap              no         no            no
Modified SparseMap     yes        no            no
MultiBlock             yes        yes           yes

Table 1: Comparison of different blocking methods
4. PRELIMINARIES
The general problem of link discovery can be formalized
as follows [9]:
Given two datasets A and B, find the subset of all matching pairs:

    M = {(a, b) | sim_s(a, b) ≥ θ, a ∈ A, b ∈ B}    (1)

where sim_s assigns a similarity value to each pair of entities:

    sim_s : A × B → [0, 1]    (2)

If the similarity between two entities exceeds a threshold θ,
a link between these two entities is generated. sim_s is computed
by evaluating a link specification s (in record linkage
typically called a linkage decision rule [23]) which specifies the
conditions two entities must fulfill in order to be interlinked.
A link specification typically applies one or more similar-
ity measures to the values of a set of property paths. While
in the case of database records this is typically achieved by
selecting a subset of the record fields, in Linked Data the
features are selected by specifying a set of property paths
through the graph. If the data sources use different data
types, the property values may be normalized by applying a
transformation prior to the comparison. Due to the rea-
son that in most cases the similarity of two entities can
not be determined by evaluating a single property, a typi-
cal link specification aggregates multiple similarity measures
into one compound similarity value. Finally, a threshold is
applied to determine if the entities should be interlinked.
In order to compute the set of all matching pairs M, the
naive approach is to evaluate the link specification for each
pair of the Cartesian product A × B, which results in |A| × |B|
comparisons. Because this is infeasible for all but very small
datasets, we define a blocking function which assigns, based
on the link specification, a block to each entity:

    block_s : A ∪ B → ℕ    (3)

The fundamental idea behind blocking is to assign the same
block to similar entities so that comparisons only have to be
made between the entities in each block. This decreases the
number of required comparisons considerably by reducing
the set of pairs which have to be evaluated to:

    R = {(a, b) | block_s(a) = block_s(b), a ∈ A, b ∈ B}    (4)
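To make the effect of (3) and (4) concrete, the following minimal Scala sketch
(illustrative names, not Silk's API) groups the target entities by a toy
blocking key and only pairs entities that fall into the same block:

    object BlockingSketch {
      // Toy blocking key: the first letter of a label (a real key would be
      // derived from the link specification).
      def block(label: String): Int = label.headOption.map(_.toInt).getOrElse(0)

      // Candidate pairs R from Equation (4): only entities sharing a block are paired.
      def candidatePairs(a: Seq[String], b: Seq[String]): Seq[(String, String)] = {
        val targetByBlock = b.groupBy(block)                     // block -> target entities
        for {
          ea <- a
          eb <- targetByBlock.getOrElse(block(ea), Seq.empty)    // same block only
        } yield (ea, eb)
      }

      def main(args: Array[String]): Unit = {
        val source = Seq("Berlin", "Boston", "Athens")
        val target = Seq("Berlin", "Bern", "Amsterdam")
        println(candidatePairs(source, target))                  // 5 pairs instead of 9
      }
    }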
5. APPROACH
In this section, we lay down the general framework of our
approach which is independent of a specific similarity mea-
sure or aggregation. At the same time, we define which
properties a similarity measure or aggregation must adhere
to in order to be used in our approach. The subsequent
Section will specify various similarity measures and aggre-
gations which can be plugged into the framework in order
to provide a complete blocking method.
Our approach is organized in three phases: At first, for
each similarity measure, an index is generated. Afterwards,
all similarity values are aggregated into a single multidimen-
sional index. Finally, the comparison pairs are generated
from the index.
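The three phases can be summarized by the following high-level sketch in Scala,
the language Silk is implemented in; all type and function names are
illustrative assumptions, not the actual Silk interfaces:

    object MultiBlockPipeline {
      type Entity      = Map[String, String]   // simplified: property -> value
      type BlockVector = Vector[Int]           // one coordinate per index dimension
      type Index       = Set[BlockVector]      // an entity may receive several block vectors

      // Phase 1: one index per similarity measure (Section 5.1).
      def generateIndexes(e: Entity, measures: Seq[Entity => Index]): Seq[Index] =
        measures.map(m => m(e))

      // Phase 2: aggregate the per-measure indexes into one multidimensional
      // index (Section 5.2); assumes at least one measure is configured.
      def aggregate(indexes: Seq[Index], agg: (Index, Index) => Index): Index =
        indexes.reduce(agg)

      // Phase 3: entities sharing at least one aggregated block vector become
      // comparison pairs (Section 5.3). A real implementation groups the
      // entities by block vector instead of testing every pair.
      def comparisonPairs(source: Seq[(Entity, Index)],
                          target: Seq[(Entity, Index)]): Seq[(Entity, Entity)] =
        for {
          (a, ia) <- source
          (b, ib) <- target
          if (ia intersect ib).nonEmpty
        } yield (a, b)
    }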
5.1 Index Generation
For each similarity measure in the link specification, an
index is built which consists of a set of vectors which de-
fine locations in the Euclidean space. The basic idea of the
indexing method is that it preserves the distances of the en-
tities i.e. similar entities will be located near each other in
the index. The index generation is not restricted to a spe-
cific similarity measure. In order to be used for MultiBlock,
each similarity measure must define the following functions:
1. A function which computes a similarity value for a pair of entities:

       sim_s : A × B → [0, 1]    (5)

2. A blocking function which generates the index for a single entity:

       index_s : (A ∪ B) × [0, 1] → P(ℕ^n)    (6)

   where P denotes the power set (i.e. it might generate multiple blocks for a
   single entity) and n is the dimension of the index. The first argument
   denotes the entity to be blocked, which may be either in the source or
   target set. The second argument denotes the similarity threshold.
   index_s includes two modifications of the standard block function presented
   in the preliminaries. Firstly, it does not map each entity to a
   one-dimensional block, but to a multi-dimensional block. This increases
   the efficiency as the entities are distributed over multiple dimensions.
   Secondly, it does not map each entity to a single block, but to multiple
   blocks at once, similarly to multi-pass blocking. This avoids losing recall
   if an entity cannot be mapped to a definite block, as is the case for
   string similarity measures (see Section 6.1).
The index_s function must adhere to the property that
two entities whose similarity according to sim_s reaches the
threshold must share a block. More formally, given two entities
e1, e2 and a threshold θ, sim_s and index_s must be related as follows:

    sim_s(e1, e2) ≥ θ  ⟹  index_s(e1) ∩ index_s(e2) ≠ ∅    (7)
Section 6.1 gives an overview over the most common sim-
ilarity measures which can be used in MultiBlock.
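The contract a similarity measure has to fulfill can be written down as a small
interface; the following is a sketch with hypothetical names, not Silk's actual
trait:

    // Sketch of the contract from Section 5.1; names are illustrative.
    trait BlockableMeasure[T] {
      /** sim_s: similarity of two values in [0, 1]. */
      def similarity(a: T, b: T): Double

      /** index_s: the (possibly several) n-dimensional blocks of a single value. */
      def index(value: T, threshold: Double): Set[Vector[Int]]

      /** Property (7): values whose similarity reaches the threshold share a block. */
      def respectsProperty(a: T, b: T, threshold: Double): Boolean =
        similarity(a, b) < threshold ||
          (index(a, threshold) intersect index(b, threshold)).nonEmpty
    }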
5.2 Index Aggregation
In the index aggregation phase, all indexes which have
been built in the index generation phase are aggregated into
one compound index. The aggregation function preserves
the property of the index that two entities within a given
distance share the same index vector. Generally, aggregat-
ing the indexes of multiple similarity measures will lead to
an increase in dimensionality, but the concrete aggregation
function depends on the specific aggregation type. Section
6.2 outlines the concrete aggregation functions for the most
common aggregation types.
In order to be used for MultiBlock, each aggregation must
define the following functions:
1. A function which aggregates multiple similarity values:

       aggSim_a : [0, 1]^n × [0, 1]^n → [0, 1]    (8)

   where n is the number of operators to be aggregated. The first argument
   denotes the similarity values to be aggregated. As an aggregation may weight
   the results of the underlying operators, e.g. a weighted average, the second
   argument denotes the weights of the specific operators. Each weight is a
   number between 0 and 1, and all weights sum to 1.

2. A function which aggregates multiple blocks:

       aggIndex_a : P(ℕ^n) × P(ℕ^n) → P(ℕ^n)    (9)

   where P denotes the power set and n is the dimension of the index. Note
   that while aggIndex_a only aggregates two sets of blocks at once, it can
   also be used to aggregate multiple sets by calling it repeatedly.

3. A function which updates the threshold of the underlying operators in order
   to retain the condition that two entities within the threshold share a block:

       threshold_a : [0, 1] × [0, 1] → [0, 1]    (10)

   The first argument denotes the threshold on the aggregation. As in the
   similarity aggregation function, the second argument denotes the weight of
   the specific operator.
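Analogously, the aggregation contract can be sketched as an interface with the
three functions above (hypothetical names, not Silk's actual trait):

    // Sketch of the aggregation contract from Section 5.2; names are illustrative.
    trait BlockableAggregation {
      /** aggSim_a (8): combine the similarity values, given one weight per operator. */
      def aggregateSimilarity(values: Seq[Double], weights: Seq[Double]): Double

      /** aggIndex_a (9): combine two sets of block vectors into one index. */
      def aggregateIndex(a: Set[Vector[Int]], b: Set[Vector[Int]]): Set[Vector[Int]]

      /** threshold_a (10): the threshold the underlying operator must use so that
        * two entities within the aggregated threshold still share a block. */
      def updateThreshold(threshold: Double, weight: Double): Double
    }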
5.3 Comparison Pair Generation
Finally, the comparison pair generation employs the index
to generate the set of entity pairs which are potential links.
For each two entities which share a block, a comparison pair
is generated. These pairs are then evaluated using the link
specification to compute the exact similarity and determine
the actual links.
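A simple way to enumerate these pairs without touching the Cartesian product is
to invert the aggregated index, i.e. to group the target entities by their block
vectors; the sketch below uses assumed types and is not the actual Silk
implementation:

    object PairGeneration {
      type Entity = String                       // simplified: an entity is just an identifier
      type Index  = Set[Vector[Int]]

      /** Entities sharing at least one block vector become comparison pairs. */
      def comparisonPairs(source: Seq[(Entity, Index)],
                          target: Seq[(Entity, Index)]): Set[(Entity, Entity)] = {
        // Invert the target index: block vector -> entities located in that block.
        val targetByBlock: Map[Vector[Int], Seq[Entity]] =
          target
            .flatMap { case (e, idx) => idx.map(v => v -> e) }
            .groupBy(_._1)
            .map { case (v, pairs) => v -> pairs.map(_._2) }

        // For every source entity, look up only the blocks it occupies.
        (for {
          (a, idx) <- source
          v        <- idx
          b        <- targetByBlock.getOrElse(v, Seq.empty)
        } yield (a, b)).toSet                    // deduplicate pairs that share several blocks
      }
    }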
6. EMPLOYED SIMILARITY MEASURES
AND AGGREGATIONS
This section provides an overview of various similarity
measures and aggregations which can be used in conjunc-
tion with our approach. For each similarity measure, we
define the required similarity function and the correspond-
ing blocking function. Likewise, for each aggregation, we
define the required similarity aggregation function as well
as the blocking aggregation function.
6.1 Similarity Measures
As RDF datasets typically make use of a variety of dif-
ferent data types, many similarity measures have been pro-
posed to match their values [7]. As the most common data
types are plain string literals, most of these techniques han-
dle approximate string matching. Apart from that, similar
techniques have been proposed for numeric data or special
purpose data types such as dates or geographic coordinates.
In this section, we show how similarity measures for various
data types can be integrated into our proposed approach.
String Similarity
A number of string similarity measures have been developed
in the literature [19, 3]. We use the Levenshtein distance [18]
for approximate string comparisons in this paper.
Given a finite alphabet Σ and two strings σ1 and σ2 over Σ,
we define the similarity function to compute the normalized
Levenshtein distance as:

    sim_s(σ1, σ2) := levenshtein(σ1, σ2) / max(|σ1|, |σ2|)    (11)
The basic problem of blocking string values under the
presence of typographical errors is the potential loss of recall.
We define a blocking function which avoids false dismissals
by indexing multiple q-Grams of the given input string. For
this purpose, we first define a function which assigns a single
block to a given q-Gram σ^q ∈ Σ^q:

    block_q(σ^q) := ∑_{i=0}^{q} |Σ|^i · σ^q(i)    (12)

block_q assigns each possible letter combination of the q-Gram
to a different block.
In order to increase the efficiency, we do not want to
block all q-Grams of a given string, but just as many as needed
to avoid any false dismissals. We can make the following
observation [10]: Given a maximum Levenshtein distance
k between two strings, they differ by at most k · q + 1 q-Grams,
wherein the maximum Levenshtein distance is given
as k := max(|str1|, |str2|) · (1.0 − θ). Consequently, the minimal
number of q-Grams which must be blocked in order to
avoid false dismissals is:

    c(θ) := max(|str1|, |str2|) · (1.0 − θ) · q + 1    (13)
By combining both functions, we can define the blocking
function as:
    index_s(σ, θ) := {block_q(σ^q) | σ^q ∈ qgrams(σ)[0 . . . c(θ)]}    (14)
The function starts with decomposing the given string into
its q-Grams. From this set, it takes as many q-Grams as
needed to avoid false dismissals and assigns a block to each.
Finally, it returns the set of all blocks.
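A compact version of this q-Gram indexing could look as follows. The sketch
hashes each q-Gram instead of using the exact positional encoding of Equation
(12) (equal q-Grams still receive equal blocks, so the no-false-dismissals
argument is unaffected) and, since only one string is known at indexing time,
uses the length of that string in place of max(|str1|, |str2|); both are
simplifying assumptions:

    object QGramBlocking {
      val q = 3

      /** All q-grams of a string, padded so that short strings still produce q-grams. */
      def qGrams(s: String): Seq[String] = {
        val padded = ("#" * (q - 1)) + s.toLowerCase + ("#" * (q - 1))
        padded.sliding(q).toSeq
      }

      /** Block of a single q-gram. A hash stands in for the positional encoding of
        * Equation (12); equal q-grams still map to equal blocks. */
      def blockOf(gram: String): Int = gram.hashCode

      /** c(θ) from Equation (13), using the length of the string being indexed
        * as an approximation of max(|str1|, |str2|). */
      def requiredGrams(s: String, threshold: Double): Int =
        (s.length * (1.0 - threshold) * q).toInt + 1

      /** index_s for the string measure: blocks of the first c(θ) q-grams. */
      def index(s: String, threshold: Double): Set[Vector[Int]] =
        qGrams(s)
          .take(requiredGrams(s, threshold))
          .map(g => Vector(blockOf(g)))
          .toSet
    }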
Numeric Similarity
The similarity of two numbers is computed with:
    sim_s(d1, d2) := |d1 − d2| / (d_max − d_min)   where d1, d2 ∈ [d_min, d_max]    (15)
Using standard blocking without overlapping blocks may
lead to false dismissals. For that reason, we define an over-
lapping factor overlap, which we set to 0.5 by default. The
overlapping factor specifies to what extent the blocks overlap.
Using the overlapping factor, the maximum number of
blocks which does not lead to false dismissals can be computed as:

    size_s(θ) := 1 / (θ · overlap)    (16)
Based on the block count we can define the blocking func-
tion as follows:
    index_s(d, θ) :=                                                    (17)
        {0}                    if scaled(d) ≤ 0.5
        {size_s(θ) − 1}        if scaled(d) ≥ size_s(θ) − 0.5
        {i(d), i(d) − 1}       if scaled(d) − i(d) < overlap
        {i(d), i(d) + 1}       if scaled(d) − i(d) + 1 < overlap
        {i(d)}                 otherwise

    with scaled(d) := size_s(θ) · (d − d_min) / d_max
    and  i(d)      := ⌊(d − d_min) / d_max⌋
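The overlapping numeric blocks can be sketched as follows; the boundary handling
follows the intent of Equations (16) and (17), but the exact scaling is an
assumed reconstruction rather than the original Silk code:

    object NumericBlocking {
      val overlap = 0.5                                   // default overlapping factor

      /** index_s for a number d in [dMin, dMax]; values near a block border also
        * receive the neighbouring block so that no pair within the threshold is dismissed. */
      def index(d: Double, threshold: Double, dMin: Double, dMax: Double): Set[Vector[Int]] = {
        val size   = math.max(1, (1.0 / (threshold * overlap)).toInt)  // cf. Equation (16)
        val scaled = size * (d - dMin) / (dMax - dMin)                 // position in block space
        val i      = math.min(size - 1, math.max(0, scaled.toInt))     // primary block
        val frac   = scaled - i                                        // position inside that block

        val blocks =
          if (frac < overlap && i > 0)                   Set(i, i - 1) // close to the lower border
          else if (frac > 1.0 - overlap && i < size - 1) Set(i, i + 1) // close to the upper border
          else                                           Set(i)
        blocks.map(b => Vector(b))
      }
    }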
Geographic Similarity
Blocking geographic coordinates can be reduced to index-
ing numbers, by using the numeric similarity functions on
both the latitude and the longitude of the coordinate which
results in 2-dimensional blocks.
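A minimal sketch of this reduction, parameterized by any 1-dimensional numeric
blocking function such as the one above (the coordinate ranges are assumed):

    object GeoBlocking {
      /** 2-dimensional blocks for a coordinate: one dimension per axis. The numeric
        * blocking function is passed in; the latitude and longitude ranges are assumed. */
      def indexCoordinate(lat: Double, lon: Double, threshold: Double,
                          numericIndex: (Double, Double, Double, Double) => Set[Int]): Set[Vector[Int]] =
        for {
          latBlock <- numericIndex(lat, threshold, -90.0, 90.0)
          lonBlock <- numericIndex(lon, threshold, -180.0, 180.0)
        } yield Vector(latBlock, lonBlock)
    }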
6.2 Aggregations
In this section we focus on the most common aggrega-
tions: Computing the average similarity and selecting the
minimum or maximum similarity value.
Average Aggregation
The average aggregation computes the weighted arithmetic
mean of all provided similarity values.
    aggSim_a(v, w) := (v0·w0 + v1·w1 + ... + vn·wn) / n    (18)
Two indexes are combined by concatenating their index
vectors:
    aggIndex_a(A, B) := {(a1, ..., an, b1, ..., bm) | a ∈ A, b ∈ B}    (19)
In order to preserve the condition that two entities within
the given threshold share a block, the local threshold of the
underlying similarity measures is modified according to:
    threshold_a(θ, w) := 1 − (1 − θ) / w    (20)
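A sketch of the three functions for the average aggregation, following
Equations (18)-(20); the names are hypothetical:

    // Sketch of the average aggregation; names are illustrative.
    object AverageAggregation {
      /** Equation (18): weighted similarity values, averaged. */
      def aggregateSimilarity(values: Seq[Double], weights: Seq[Double]): Double =
        values.zip(weights).map { case (v, w) => v * w }.sum / values.size

      /** Equation (19): the block vectors of both indexes are concatenated. */
      def aggregateIndex(a: Set[Vector[Int]], b: Set[Vector[Int]]): Set[Vector[Int]] =
        for { va <- a; vb <- b } yield va ++ vb

      /** Equation (20): tightened threshold for an operator with weight w. */
      def updateThreshold(theta: Double, w: Double): Double =
        1.0 - (1.0 - theta) / w
    }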
Minimum/Maximum Aggregation
The minimum and maximum aggregations simply select the
minimum/maximum value:
    aggSim_a(v, w) := min/max(v0, v1, ..., vn)    (21)
In this case, we cannot just aggregate the blocks into separate
dimensions in the same way as in the average aggregator.
The reason for this is that if one similarity value exceeds
the threshold, the remaining similarity values may be arbitrarily
low while the entities are still considered as matches.
For this reason, the index vectors of all indexes are mapped
into the same index space:
    aggIndex_a(A, B) := {(a1, ..., an), (b1, ..., bn) | a ∈ A, b ∈ B}    (22)
In case the dimensionality of the two indexes does not match,
the vectors of the lower dimensional index are expanded by
setting the values in the additional dimensions to zero.
For minimum and maximum aggregations we can leave
the threshold unchanged:
    threshold_a(θ, w) := θ    (23)
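A corresponding sketch for the minimum/maximum aggregation, following
Equations (21)-(23); the zero-padding of shorter vectors implements the
dimensionality rule described above (hypothetical names):

    // Sketch of the minimum/maximum aggregation; names are illustrative.
    object MinMaxAggregation {
      /** Equation (21): simply the smallest (or largest) similarity value. */
      def aggregateSimilarity(values: Seq[Double], useMin: Boolean): Double =
        if (useMin) values.min else values.max

      /** Equation (22): both indexes are mapped into the same index space;
        * shorter vectors are padded with zeros so that the dimensions match.
        * Assumes that at least one of the indexes is non-empty. */
      def aggregateIndex(a: Set[Vector[Int]], b: Set[Vector[Int]]): Set[Vector[Int]] = {
        val dim = (a ++ b).map(_.size).max
        def pad(v: Vector[Int]) = v ++ Vector.fill(dim - v.size)(0)
        (a ++ b).map(pad)
      }

      /** Equation (23): the operator threshold is left unchanged. */
      def updateThreshold(theta: Double, w: Double): Double = theta
    }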
7. IMPLEMENTATION
The Silk Link Discovery Framework generates RDF links
between data items based on user-provided link specifica-
tions which are expressed using the Silk Link Specification
Language (Silk-LSL). The Silk Link Discovery Framework
is implemented in Scala1and can be downloaded from the
project homepage2under the terms of the Apache Software
License.
Until version 2.2, Silk supported basic blocking with overlapping
blocks. It provided a separate configuration directive
to configure the property paths which are used as blocking
keys as well as the number of blocks and the overlapping
factor.
1 http://scala-lang.org
2 http://www4.wiwiss.fu-berlin.de/bizer/silk/

Method                   Comparisons        Runtime    Links
Full evaluation          108,301,460,054    305,188s   70,037
Blocking, 100 blocks     3,349,755,846      22,453s    69,403
Blocking, 1,000 blocks   1,049,015,356      7,909s     60,025
MultiBlock               37,667,462         420s       70,037

Table 2: Results of experiment 1

The current version of Silk includes the MultiBlock method
as specified in this paper. The blocking is configured by the
link specification and does not need any further configuration.
The blocking function has been implemented for all
common similarity measures. For similarity measures which
do not define a blocking function yet, the fallback is to use a
single block for this specific similarity measure. In this case,
Silk still blocks by the remaining similarity measures.
8. EVALUATION
MultiBlock has been evaluated regarding scalability and
effectiveness.
This section reports on the results of two experiments in
which we used Silk with MultiBlock to generate RDF links
between different Linked Data sets. The link specifications
used for the experiments are included in the current release
of Silk.
All experiments have been run on a 3GHz Intel(R) Core
i7 CPU with 4 cores and 8GB of RAM.
8.1 Scalability
In order to be used for link discovery in Linked Data,
MultiBlock must be able to scale to very large datasets.
Thus, it is essential that MultiBlock reduces the number
of comparisons drastically without dismissing correct pairs.
We evaluated the scalability of MultiBlock by applying it to
interlink two large geographic datasets. For this purpose,
we used a dataset consisting of 204,109 settlements from
DBpedia3and 530,606 settlements from LinkedGeoData4.
First, we interlinked both datasets by evaluating the com-
plete Cartesian product without the use of any blocking
method. As this results in over 100 billion comparisons it is
a clear case when matching the complete Cartesian product
is not reasonable anymore. The link generation took about
85 hours and generated 70,037 links. The generated links
have been spot-checked for correctness.
After that, we evaluated how standard blocking reduces
the number of comparisons. We used the labels of the en-
tities as blocking keys. We ran the blocking multiple times
with different parameters in each run. In order to reduce the
loss of recall we used overlapping blocks. An overlapping fac-
tor of 0.2 was chosen for that purpose as it provides a good
trade-off between efficiency and minimizing the number of
false dismissals.
Finally, we evaluated how MultiBlock compares to stan-
dard blocking. By blocking in 3 dimensions instead of a
single dimension, MultiBlock was able to further reduce the
number of comparisons to 37,667,462. Furthermore, it gen-
erated the identical 70,037 links as in the full evaluation, but
ran in only 420 seconds.
Table 2 summarizes the results. The evaluation shows
that MultiBlock reduces the number of comparisons by a
factor of 2,875 and is over 700 times faster than evaluat-
ing the complete Cartesian product. It is also almost 20
3http://dbpedia.org
4http://linkedgeodata.org
Phase Time
Build index 14 %
Generate comparison pairs 41 %
Similarity comparison 45 %
Table 3: The runtimes of the different phases
Setting Comparisons Runtime Links
Full comparison 22,242,292 430s 1,403
Blocking, 100 blocks 906,314 44s 1,349
Blocking, 1,000 blocks 322,573 14s 1,287
MultiBlock 122,630 6s 1,403
Table 4: Results of experiment 2
times faster than standard blocking with 1,000 blocks, but,
in contrast to it, yields all correct links. We can also observe
that due to the usage of a multidimensional index,
the overhead of MultiBlock is higher than that of standard
blocking. This can be concluded from the fact that MultiBlock
reduces the number of comparisons by a factor of 28 compared
to standard blocking with 1,000 blocks, while its runtime only
improves by a factor of 19.
In order to determine which phase of MultiBlock is re-
sponsible for the overhead, we evaluated the runtimes of the
different phases of the matching process. As we can see in
Table 3, a big part of the runtime is spent on generating the comparison pairs.
For this reason future improvements will focus on using a
more efficient algorithm for the comparison pair generation.
8.2 Effectiveness
In order to be applicable for discovering links between arbitrary
data sources, MultiBlock must be flexible enough to be
applied even to complex data sets without losing recall. For
this purpose we employed Silk with MultiBlock to interlink
drugs in DBpedia and DrugBank5. Here it is not sufficient
to compare the drug names alone; it is also necessary to take
the various unique bio-chemical identifiers, e.g. the CAS number,
into consideration. Therefore the corresponding Silk
link specification compares the drug names and their synonyms
as well as a list of well-known identifiers, not all of
which have to be present on an entity.
The employed Silk link specification results in 1,403 links.
Table 4 shows the runtimes of Silk with and without
MultiBlock. Using Silk with MultiBlock, we achieved a
speedup factor of 71 with full recall. The gain in this exam-
ple is smaller than in the previous one because the dataset
here is much smaller and the link specification is more com-
plicated.
9. REFERENCES
[1] R. Baxter, P. Christen, and C. F. Epidemiology. A
comparison of fast blocking methods for record linkage.
2003.
[2] C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the
story so far. Int. J. Semantic Web Inf. Syst, 5(3):1–22,
2009.
[3] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A
comparison of string distance metrics for name-matching
tasks. 2003.
5DrugBank is a large repository of almost 5,000 FDA-
approved drugs and has been published as Linked Data on
http://www4.wiwiss.fu-berlin.de/drugbank/
[4] U. Draisbach and F. Naumann. A comparison and
generalization of blocking and windowing algorithms for
duplicate detection. 2009.
[5] A. K. Elmagarmid, P. G. Ipeirotis, et al. Duplicate record
detection: A survey. IEEE Trans. on Knowl. and Data
Eng., 19(1), 2007.
[6] J. Euzenat, A. Ferrara, C. Meilicke, et al. First Results of
the Ontology Alignment Evaluation Initiative 2010.
Ontology Matching, page 85, 2010.
[7] J. Euzenat and P. Shvaiko. Ontology matching.
Springer-Verlag, Heidelberg (DE), 2007.
[8] C. Faloutsos and K. Lin. FastMap: A fast algorithm for
indexing, data-mining and visualization of traditional and
multimedia datasets. In Proceedings of the 1995 ACM
SIGMOD international conference on Management of
data. ACM, 1995.
[9] I. P. Fellegi and A. B. Sunter. A Theory for Record
Linkage. Journal of the American Statistical Association,
64(328), 1969.
[10] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, et al.
Approximate string joins in a database (almost) for free. In
Proceedings of the 27th International Conference on Very
Large Data Bases, VLDB ’01, San Francisco, USA, 2001.
[11] O. Hassanzadeh, R. Xin, R. J. Miller, et al. Linkage query
writer, 2009.
[12] M. A. Hernández and S. J. Stolfo. The merge/purge
problem for large databases. In Proceedings of the 1995
ACM SIGMOD international conference on Management
of data, SIGMOD ’95, New York, NY, USA, 1995. ACM.
[13] M. A. Hernández and S. J. Stolfo. Real-world data is dirty:
Data cleansing and the merge/purge problem. Data Min.
Knowl. Discov., 2, 1998. Introduced the multi-pass
approach for blocking.
[14] G. R. Hjaltason and H. Samet. Properties of embedding
methods for similarity searching in metric spaces. PAMI,
25, 2003.
[15] G. Hristescu and M. Farach-Colton. Cluster-preserving
embedding of proteins, 1999.
[16] A. Jentzsch, R. Isele, and C. Bizer. Silk - Generating RDF
Links while publishing or consuming Linked Data. In
Poster at the International Semantic Web Conference
(ISWC2010), Shanghai, 2010.
[17] L. Jin, C. Li, and S. Mehrotra. Efficient Record Linkage in
Large Data Sets. In Database Systems for Advanced
Applications (DASFAA 2003). IEEE, 2003.
[18] V. Levenshtein. Binary codes capable of correcting
deletions, insertions, and reversals. In Soviet Physics
Doklady, volume 10, 1966.
[19] G. Navarro. A guided tour to approximate string matching.
ACM Comput. Surv., 33, 2001.
[20] A.-C. N. Ngomo and S. Auer. Limes - a time-efficient
approach for large-scale link discovery on the web of data.
[21] J. Wang, X. Wang, K. Lin, et al. Evaluating a class of
distance-mapping algorithms for data mining and
clustering. In Proceedings of the 5th ACM SIGKDD
international conference on Knowledge discovery and data
mining. ACM, 1999.
[22] W. Winkler. String Comparator Metrics and Enhanced
Decision Rules in the Fellegi-Sunter Model of Record
Linkage. In Proceedings of the Section on Survey Research
Methods, American Statistical Association., 1990.
[23] W. E. Winkler. Matching and Record Linkage. In Business
Survey Methods, pages 355–384, 1995.
[24] S. Yan, D. Lee, M.-Y. Kan, et al. Adaptive sorted
neighborhood methods for efficient record linkage. In
Proceedings of the 7th ACM/IEEE-CS joint conference on
Digital libraries, JCDL ’07, USA, 2007.