Scaling Entity Resolution to Large,
Heterogeneous Data with Enhanced Meta-blocking
George Papadakis$, George Papastefanatos#, Themis Palpanas^, Manolis Koubarakis$
^Paris Descartes University, France themis@mi.parisdescartes.fr
#IMIS, Research Center “Athena”, Greece gpapas@imis.athena-innovation.gr
$Dep. of Informatics & Telecommunications, Uni. Athens, Greece {gpapadis, koubarak}@di.uoa.gr
ABSTRACT
Entity Resolution constitutes a quadratic task that typically scales
to large entity collections through blocking. The resulting blocks
can be restructured by Meta-blocking in order to significantly in-
crease precision at a limited cost in recall. Yet, its processing can
be time-consuming, while its precision remains poor for configura-
tions with high recall. In this work, we propose new meta-blocking
methods that improve precision by up to an order of magnitude at
a negligible cost to recall. We also introduce two efficiency tech-
niques that, when combined, reduce the overhead time of Meta-
blocking by more than an order of magnitude. We evaluate our
approaches through an extensive experimental study over 6 real-
world, heterogeneous datasets. The outcomes indicate that our new
algorithms outperform all meta-blocking techniques as well as the
state-of-the-art methods for block processing in all respects.
1. INTRODUCTION
A common task in the context of Web Data is Entity Resolution
(ER), i.e., the identification of different entity profiles that pertain
to the same real-world object. ER suffers from low efficiency, due
to its inherently quadratic complexity: every entity profile has to
be compared with all others. This problem is accentuated by the
continuously increasing volume of heterogeneous Web Data; LOD-
Stats1 recorded around 1 billion triples for Linked Open Data in De-
cember, 2011, which had grown to 85 billion by September, 2015.
Typically, ER scales to these volumes of data through blocking [4].
The goal of blocking is to boost precision and time efficiency
at a controllable cost in recall [4, 5, 21]. To this end, it groups
similar profiles into clusters (called blocks) so that it suffices to
compare the profiles within each block [7, 8]. Blocking methods
for Web Data are confronted with high levels of noise, not only in
attribute values, but also in attribute names. In fact, they involve an
unprecedented schema heterogeneity: Google Base2 alone encom-
passes 100,000 distinct schemata that correspond to 10,000 entity
types [17]. Most blocking methods deal with these high levels of
1http://stats.lod2.eu
2http://www.google.com/base
© 2016, Copyright is with the authors. Published in Proc. 19th Inter-
national Conference on Extending Database Technology (EDBT), March
15-18, 2016 - Bordeaux, France: ISBN 978-3-89318-070-7, on OpenPro-
ceedings.org. Distribution of this paper is permitted under the terms of the
Creative Commons license CC-by-nc-nd 4.0
[Graphic not reproduced: entity profiles p1–p6 with heterogeneous attribute names, and the blocks b1(Jack), b2(Miller), b3(Erick), b4(Green), b5(vendor), b6(seller), b7(Lloyd), b8(car).]
Figure 1: (a) A set of entity profiles, and (b) the corresponding
blocks produced by Token Blocking.
schema heterogeneity through a schema-agnostic functionality that
completely disregards schema information and semantics [5]. They
also rely on redundancy, placing every entity profile into multiple
blocks so as to reduce the likelihood of missed matches [4, 22].
The simplest method of this type is Token Blocking [21]. In
essence, it splits the attribute values of every entity profile into to-
kens based on whitespace; then, it creates a separate block for every
token that appears in at least two profiles. To illustrate its function-
ality, consider the entity profiles in Figure 1(a), where p1 and p2
match with p3 and p4, respectively; Token Blocking clusters them
in the blocks of Figure 1(b). Despite the schema heterogeneity and
the noisy values, both pairs of duplicates co-occur in at least one
block. Yet, the total cost is 13 comparisons, which is rather high,
given that the brute-force approach executes 15 comparisons.
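To make this functionality concrete, the following Java sketch implements schema-agnostic Token Blocking over profiles represented as maps of attribute names to values; the class and method names are illustrative assumptions and do not come from the publicly released implementation.

    import java.util.*;

    // Illustrative sketch of Token Blocking: a block is created for every token
    // that appears in the attribute values of at least two profiles; attribute
    // names are ignored (schema-agnostic functionality).
    final class TokenBlockingSketch {
        static Map<String, Set<Integer>> tokenBlocking(List<Map<String, String>> profiles) {
            Map<String, Set<Integer>> blocks = new HashMap<>();
            for (int i = 0; i < profiles.size(); i++)
                for (String value : profiles.get(i).values())
                    for (String token : value.toLowerCase().split("\\s+"))
                        if (!token.isEmpty())
                            blocks.computeIfAbsent(token, t -> new HashSet<>()).add(i);
            blocks.values().removeIf(b -> b.size() < 2);   // drop blocks with a single profile
            return blocks;
        }
    }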
This is a general trait of redundancy-based blocking methods:
in their effort to achieve high recall in the context of noisy and
heterogeneous data, they produce a large number of unnecessary
comparisons. These come in two forms [22, 23]: the redundant
ones repeatedly compare the same entity profiles across different
blocks, while the superfluous ones compare non-matching profiles.
In our example, b2 and b4 contain one redundant comparison each,
which are repeated in b1 and b3, respectively; all other blocks en-
tail superfluous comparisons between non-matching entity profiles,
except for the redundant comparison p3-p5 in b8 (it is repeated in
b6). In total, the blocks of Figure 1(b) involve 3 redundant and 8
superfluous out of the 13 comparisons.
Block Processing. To improve the quality of redundancy-based
blocks, methods such as Meta-blocking [5, 7, 22], Comparison
Propagation [21] and Iterative Blocking [27] aim to process them
in the optimal way (see Section 2 for more details). Among these
methods, Meta-blocking achieves the best balance between preci-
[Graphic not reproduced: the blocking graph with its JS edge weights, a pruned version of it, and the blocks b′1–b′5 derived from the retained edges.]
Figure 2: (a) A blocking graph extracted from the blocks in
Figure 1(b), (b) one of the possible edge-centric pruned block-
ing graphs, and (c) the new blocks derived from it.
sion and recall [22, 23], and is the focus of this work.
Meta-blocking restructures a block collection B into a new one
B′ that contains a significantly lower number of unnecessary com-
parisons, while detecting almost the same number of duplicates.
It operates in two steps [7, 22, 23]: first, it transforms B into the
blocking graph GB, which contains a vertex vi for every entity pro-
file pi in B, and an edge ei,j for every pair of co-occurring profiles pi
and pj (i.e., entity profiles sharing at least one block). Figure 2(a)
depicts the graph for the blocks in Figure 1(b). As no parallel edges
are constructed, every pair of entities is compared at most once,
thus eliminating all redundant comparisons.
Second, it annotates every edge with a weight analogous to the
likelihood that the incident entities are matching. For instance,
the edges in Figure 2(a) are weighted with the Jaccard similarity
of the lists of blocks associated with their incident entity profiles.
The lower the weight of an edge, the more likely it is to connect
non-matching entities. Therefore, Meta-blocking discards most su-
perfluous comparisons by pruning the edges with low weights. A
possible approach is to discard all edges with a weight lower than
the overall mean weight (1/4). This yields the pruned graph in Fig-
ure 2(b). The restructured block collection B′ is formed by creating
a new block for every retained edge – as depicted in Figure 2(c).
Note that B′ maintains the original recall, while reducing the com-
parisons from 13 to just 5.
Open issues. Despite the significant enhancements in eciency,
Meta-blocking suffers from two drawbacks:
(i) There is plenty of room for raising its precision, especially
for the configurations that are more robust to recall. The reason is
that they retain a considerable portion of redundant and superfluous
comparisons. This is illustrated in our example, where the restruc-
tured blocks of Figure 2(c) contain 3 superfluous comparisons in
b′3, b′4 and b′5.
(ii) The processing of voluminous datasets involves a significant
overhead. The corresponding blocking graphs comprise millions of
nodes that are strongly connected with billions of edges. Inevitably,
the pruning of such graphs is very time-consuming; for example, a
graph with 3.3 million nodes and 35.8 billion edges requires 16
hours, on average, on commodity hardware (see Section 6.3).
Proposed Solution. In this paper, we describe novel techniques
for overcoming both weaknesses identified above.
First, we speed up Meta-blocking in two ways:
(i) We introduce Block Filtering, which intelligently removes
profiles from blocks, in which their presence is unnecessary. This
acts as a pre-processing technique that shrinks the blocking graph,
discarding more than half of its unnecessary edges, on average. As
a result, the running time is also reduced to half, on average.
(ii) We accelerate the creation and the pruning of the blocking
graph by minimizing the computational cost for edge weighting,
which is the bottleneck of Meta-blocking. Our approach reduces
its running time by 30% to 70%.
In combination, these two techniques restrict drastically the over-
head of Meta-blocking even on commodity hardware. For example,
the blocking graph mentioned earlier is now processed within just
3 hours, instead of 16.
Second, we enhance the precision of Meta-blocking in two ways:
(i) We redefine two pruning algorithms so that they produce re-
structured blocks with no redundant comparisons. On average, they
save 30% more comparisons for the same recall.
(ii) We introduce two new pruning algorithms that rely on a
generic property of the blocking graph: the reciprocal links. That
is, our algorithms retain only the edges that are important for both
incident profiles. Their recall is slightly lower than that of the existing
techniques, but their precision rises by up to an order of magnitude.
We analytically examine the performance of our methods using
6 real-world established benchmarks, which range from a few thou-
sand to several million entities. Our experimental results indicate
that our algorithms consistently exhibit the best balance between
recall, precision and run-time for the main types of ER applica-
tions among all meta-blocking techniques. They also outperform
the best relevant methods in the literature to a significant extent.
Contributions & Paper Organization. In summary, we make
the following contributions:
We improve the running time of Meta-blocking by an order of
magnitude in two complementary ways: by cleaning the blocking
graph from most of its noisy edges, and by accelerating the estima-
tion of edge weights.
We present four new pruning algorithms that raise precision by
30% to 100% at a small (if any) cost in recall.
We experimentally verify the superior performance of our new
methods through an extensive study over 6 datasets with different
characteristics. In this way, our experimental results provide in-
sights into the best configuration for Meta-blocking, depending on
the data and the application at hand. The code and the data of our
experiments are publicly available for any interested researcher.3
The rest of the paper is structured as follows: Section 2 delves
into the most relevant works in the literature, while Section 3 elab-
orates on the main notions of Meta-blocking. In Section 4, we in-
troduce two methods for minimizing its running time, and in Sec-
tion 5, we present new pruning algorithms that boost the precision
of Meta-blocking at no or limited cost in recall. Section 6 presents
our thorough experimental evaluation, while Section 7 concludes
the paper along with directions for future work.
2. RELATED WORK
Entity Resolution has been the focus of numerous works that
aim to tame its quadratic complexity and scale it to large volumes
of data [4, 8]. Blocking is the most popular among the proposed ap-
proximate techniques [5, 7]. Some blocking methods produce dis-
joint blocks, such as Standard Blocking [9]. Their majority, though,
yields overlapping blocks with redundant comparisons in an effort
to achieve high recall in the context of noisy and heterogeneous
data [4]. Depending on the interpretation of redundancy, blocking
methods are distinguished into three categories [22]:
(i) The redundancy-positive methods ensure that the more blocks
two entity profiles share, the more likely they are to be matching.
In this category fall the Suffix Arrays [1], Q-grams Blocking [12],
Attribute Clustering [21] and Token Blocking [21].
(ii) The redundancy-negative methods ensure that the most sim-
ilar entity profiles share just one block. In Canopy Clustering [19],
for instance, the entity profiles that are highly similar to the cur-
3See http://sourceforge.net/projects/erframework for
both the code and the datasets.
rent seed are removed from the pool of candidate matches and are
exclusively placed in its block.
(iii) The redundancy-neutral methods yield overlapping blocks,
but the number of common blocks between two profiles is irrelevant
to their likelihood of matching. As such, consider the single-pass
Sorted Neighborhood [13]: all pairs of profiles co-occur in the same
number of blocks, which is equal to the size of the sliding window.
Another line of research focuses on developing techniques that
optimize the processing of an existing block collection, called block
processing methods. In this category falls Meta-blocking [7], which
operates exclusively on top of redundancy-positive blocking meth-
ods [20, 21]. Its pruning can be either unsupervised [22] or su-
pervised [23]. The latter achieves higher accuracy than the former,
due to the composite pruning rules that are learned by a classifier
trained over a set of labelled edges. In practice, though, its utility
is limited, as there is no effective and efficient way for extracting
the required training set from the input blocks. For this reason, we
exclusively consider unsupervised Meta-blocking in this work.
Other prominent block processing methods are the following:
(i) Block Purging [21] aims to discard oversized blocks that
are dominated by redundant and superfluous comparisons. It auto-
matically sets an upper limit on the comparisons that can be con-
tained in a valid block and purges those blocks that exceed it. Its
functionality is coarser and, thus, less accurate than Meta-blocking,
because it targets entire blocks instead of individual comparisons.
However, it is complementary to Meta-blocking and is frequently
used as a pre-processing step [22, 23].
(ii) Comparison Propagation [21] discards all redundant com-
parisons from a block collection without any impact on recall. In a
small scale, this can be accomplished directly, using a central data
structure H that hashes all executed comparisons; then, a compar-
ison is executed only if it is not contained in H. Yet, in the scale
of billions of comparisons, Comparison Propagation can only be
accomplished indirectly: the input blocks are enumerated accord-
ing to their processing order and the Entity Index is built. This is
an inverted index that points from entity ids to block ids. Then, a
comparison pi-pj in block bk is executed (i.e., non-redundant) only
if it satisfies the Least Common Block Index condition (LeCoBI
for short). That is, if the id k of the current block bk equals the
least common block id of the profiles pi and pj. Comparison Prop-
agation is competitive to Meta-blocking, but targets only redundant
comparisons; a sketch of the LeCoBI check appears after this list.
We compare their performance in Section 6.4.
(iii) Iterative Blocking [27] propagates all identified duplicates
to the subsequently processed blocks so as to save repeated com-
parisons and to detect more duplicates. Hence, it improves both
precision and recall. It is competitive to Meta-blocking, too, but it
targets exclusively redundant comparisons between matching pro-
files. We employ it as our second baseline method in Section 6.4.
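As promised above, the following Java sketch illustrates the LeCoBI check used by Comparison Propagation, assuming that the Entity Index provides the block ids of each profile sorted in ascending order; the method name is an illustrative assumption.

    // Illustrative sketch of the LeCoBI condition: the comparison pi-pj in block k is
    // executed only if k equals the least common block id of pi and pj.
    // blocksOfI and blocksOfJ are the block lists of pi and pj, sorted in ascending order.
    static boolean satisfiesLeCoBI(int[] blocksOfI, int[] blocksOfJ, int k) {
        int a = 0, b = 0;
        while (a < blocksOfI.length && b < blocksOfJ.length) {
            if (blocksOfI[a] < blocksOfJ[b]) a++;
            else if (blocksOfJ[b] < blocksOfI[a]) b++;
            else return blocksOfI[a] == k;   // first common id = least common block id
        }
        return false;                        // no common block
    }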
3. PRELIMINARIES
Entity Resolution. An entity profile, p, is defined as a uniquely
identified collection of name-value pairs that describe a real-world
object. A set of profiles is called entity collection, E. Given E,
the goal of ER is to identify all profiles that describe the same real-
world object; two such profiles, pi and pj, are called duplicates
(pi ≡ pj) and their comparison is called matching. The set of all
duplicates in the input entity collection E is denoted by D(E), with
|D(E)| symbolizing its size (i.e., the number of existing duplicates).
Depending on the input entity collection(s), we identify two ER
tasks [4, 21, 22]: (i) Dirty ER takes as input a single entity col-
lection with duplicates and produces as output a set of equivalence
clusters. (ii) Clean-Clean ER receives two duplicate-free, but over-
lapping entity collections, E1and E2, and identifies the matching
entity profiles between them. In the context of Databases, the for-
mer task is called Deduplication and the latter Record Linkage [4].
Blocking improves the run-time of both ER tasks by grouping
similar entity profiles into blocks so that comparisons are limited
between co-occurring profiles. Placing an entity profile into a block
is called block assignment. Two profiles, pi and pj, assigned to the
same block are called co-occurring and their comparison is denoted
by ci,j. An individual block is symbolized by b, with |b| denoting
its size (i.e., number of profiles) and ||b|| denoting its cardinality
(i.e., number of comparisons). A set of blocks B is called a block
collection, with |B| denoting its size (i.e., number of blocks) and ||B||
its cardinality (i.e., total number of comparisons): ||B|| = Σb∈B ||b||.
Performance Measures. To assess the effectiveness of a block-
ing method, we follow the best practice in the literature, which
treats entity matching as an orthogonal task [4, 5, 7]. We assume
that two duplicate profiles can be detected using any of the avail-
able matching methods as long as they co-occur in at least one
block. D(B) stands for the set of co-occurring duplicate profiles
and |D(B)| for its size (i.e., the number of detected duplicates).
In this context, the following measures are typically used for es-
timating the effectiveness of a block collection B that is extracted
from the input entity collection E [4, 5, 22]:
(i) Pairs Quality (PQ) corresponds to precision, assessing the
portion of comparisons that involve a non-redundant pair of dupli-
cates. In other words, it considers as true positives the matching
comparisons and as false positives the superfluous and the redun-
dant ones (given that some of the redundant comparisons involve
duplicate profiles, PQ offers a pessimistic estimation of precision).
More formally, PQ = |D(B)|/||B||. PQ takes values in the interval
[0,1], with higher values indicating higher precision for B.
(ii) Pairs Completeness (PC) corresponds to recall, assessing the
portion of existing duplicates that can be detected in B. More for-
mally, PC = |D(B)|/|D(E)|. PC is defined in the interval [0,1], with
higher values indicating higher recall.
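As a concrete illustration, for the blocks of Figure 1(b) both pairs of duplicates co-occur in at least one block and 13 comparisons are executed in total, so PC = 2/2 = 1.0, while PQ = 2/13 ≈ 0.15.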
The goal of blocking is to maximize both PC and PQ so that the
overall effectiveness of ER exclusively depends on the accuracy of
the entity matching method. This requires that |D(B)| is maximized,
while ||B|| is minimized. However, there is a clear trade-off between
PC and PQ: the more comparisons are executed (higher ||B||), the
more duplicates are detected (higher |D(B)|), thus increasing PC;
given, though, that ||B|| increases quadratically for a linear increase
in |D(B)| [10, 11], PQ is reduced. Hence, a blocking method is
effective if it achieves a good balance between PC and PQ.
To assess the time efficiency of a block collection B, we use two
measures [22, 23]:
(i) Overhead Time (OTime) measures the time required for ex-
tracting B either from the input entity collection E or from another
block collection B′.
(ii) Resolution Time (RTime) is equal to OTime plus the time
required to apply an entity matching method to all comparisons in
the restructured blocks. As such, we use the Jaccard similarity of
all tokens in the values of two entity profiles for entity matching –
this approach does not affect the relative efficiency of the examined
methods and is merely used for demonstration purposes.
For both measures, the lower their value, the more efficient is the
corresponding block collection.
Meta-blocking. The redundancy-positive block collections place
every entity profile into multiple blocks, emphasizing recall at the
cost of very low precision. Meta-blocking aims to improve this
balance by restructuring a redundancy-positive block collection B
into a new one B′ that contains a small part of the original unneces-
sary comparisons, while retaining practically the same recall [22].
More formally, PC(B′) ≈ PC(B) and PQ(B′) ≫ PQ(B).
Weighting Schemes:                          Pruning Algorithms:
1) Aggregate Reciprocal Comparisons (ARCS)  1) Cardinality Edge Pruning (CEP)
2) Common Blocks (CBS)                      2) Cardinality Node Pruning (CNP)
3) Enhanced Common Blocks (ECBS)            3) Weighted Edge Pruning (WEP)
4) Jaccard Similarity (JS)                  4) Weighted Node Pruning (WNP)
5) Enhanced Jaccard Similarity (EJS)
Figure 3: All configurations for the two parameters of Meta-
blocking: the weighting scheme and the pruning algorithm.
Aggregate Reciprocal Comparisons Scheme: ARCS(pi, pj, B) = Σbk∈Bi,j 1/||bk||
Common Blocks Scheme: CBS(pi, pj, B) = |Bi,j|
Enhanced Common Blocks Scheme: ECBS(pi, pj, B) = CBS(pi, pj, B) · log(|B|/|Bi|) · log(|B|/|Bj|)
Jaccard Scheme: JS(pi, pj, B) = |Bi,j| / (|Bi| + |Bj| − |Bi,j|)
Enhanced Jaccard Scheme: EJS(pi, pj, B) = JS(pi, pj, B) · log(|EB|/|vi|) · log(|EB|/|vj|)
Figure 4: The formal definition of the five weighting schemes.
Bi ⊆ B denotes the set of blocks containing pi, Bi,j ⊆ B the set of
blocks shared by pi and pj, and |vi| the degree of node vi.
Central to this procedure is the blocking graph GB, which cap-
tures the co-occurrences of profiles within the blocks of B. Its
nodes correspond to the profiles in B, while its undirected edges
connect the co-occurring profiles. The number of edges in GB is
called graph size (|EB|) and the number of nodes graph order (|VB|).
Meta-blocking prunes the edges of the blocking graph in a way
that leaves the matching profiles connected. Its functionality is con-
figured by two parameters: (i) the scheme that assigns weights to
the edges, and (ii) the pruning algorithm that discards the edges
that are unlikely to connect duplicate profiles. The two parameters
are independent in the sense that every configuration of the one is
compatible with any configuration of the other (see Figure 3).
In more detail, five schemes have been proposed for weighting
the edges of the blocking graph [22]. Their formal definitions are
presented in Figure 4. They all normalize their weights to [0,1] so
that the higher values correspond to edges that are more likely to
connect matching profiles. The rationale behind each scheme is the
following: ARCS captures the intuition that the smaller the blocks
two profiles share, the more likely they are to be matching; CBS
expresses the fundamental property of redundancy-positive block
collections that two profiles are more likely to match, when they
share many blocks; ECBS improves CBS by discounting the effect
of the profiles that are placed in a large number of blocks; JS esti-
mates the portion of blocks shared by two profiles; EJS improves
JS by discounting the effect of profiles involved in too many non-
redundant comparisons (i.e., they have a high node degree).
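As an illustration of how such weights can be derived from the block lists alone, the following Java sketch estimates the JS and ARCS weights of an edge; the identifiers are illustrative assumptions rather than the released implementation.

    import java.util.*;

    // Illustrative sketch of two weighting schemes. Bi and Bj are the sets of block
    // ids of pi and pj; blockCardinality maps a block id to its cardinality ||b||.
    final class WeightingSketch {
        static double jaccardWeight(Set<Integer> Bi, Set<Integer> Bj) {
            Set<Integer> common = new HashSet<>(Bi);
            common.retainAll(Bj);                      // B_{i,j}: blocks shared by pi and pj
            return common.size() / (double) (Bi.size() + Bj.size() - common.size());
        }

        static double arcsWeight(Set<Integer> Bi, Set<Integer> Bj, Map<Integer, Long> blockCardinality) {
            Set<Integer> common = new HashSet<>(Bi);
            common.retainAll(Bj);
            double weight = 0.0;
            for (int blockId : common)                 // smaller shared blocks contribute more
                weight += 1.0 / blockCardinality.get(blockId);
            return weight;
        }
    }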
Based on these weighting schemes, Meta-blocking discards part
of the edges of the blocking graph using an edge- or a node-centric
pruning algorithm. The former iterates over the edges of the block-
ing graph and retains the globally best ones, as in Figure 2(b); the
latter iterates over the nodes of the blocking graph and retains the
locally best edges. An example of node-centric pruning appears
in Figure 5(a); for each node in Figure 2(a), it has retained the in-
cident edges that exceed the average weight of the neighborhood.
For clarity, the retained edges are directed and outgoing, since they
might be preserved in the neighborhoods of both incident profiles.
Again, every retained edge forms a new block, yielding the restruc-
tured block collection in Figure 5(b).
Every pruning algorithm relies on a pruning criterion. Depend-
ing on its scope, this can be either a global criterion, which applies
to the entire blocking graph, or a local one, which applies to an
[Graphic not reproduced: a node-centric pruned blocking graph and the blocks b′1–b′9 derived from it.]
Figure 5: (a) One of the possible node-centric pruned blocking
graphs for the graph in Figure 2(a). (b) The new blocks derived
from the pruned graph.
individual node neighborhood. With respect to its functionality,
the pruning criterion can be a weight threshold, which specifies the
minimum weight of the retained edges, or a cardinality threshold,
which determines the maximum number of retained edges.
Every combination of a pruning algorithm with a pruning crite-
rion is called pruning scheme. The following four pruning schemes
were proposed in [22] and were experimentally verified to achieve
a good balance between PC and PQ:
(i) Cardinality Edge Pruning (CEP) couples the edge-centric
pruning with a global cardinality threshold, retaining the top-K
edges of the entire blocking graph, where K = ⌊Σb∈B |b| / 2⌋.
(ii) Cardinality Node Pruning (CNP) combines the node-centric
pruning with a global cardinality threshold. For each node, it re-
tains the top-k edges of its neighborhood, with k = ⌊Σb∈B |b| / |E| − 1⌋.
(iii) Weighted Edge Pruning (WEP) couples edge-centric prun-
ing with a global weight threshold equal to the average edge weight
of the entire blocking graph.
(iv) Weighted Node Pruning (WNP) combines the node-centric
pruning with a local weight threshold equal to the average edge
weight of every node neighborhood.
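For a concrete instance of the cardinality thresholds, consider the blocks of Figure 1(b): seven blocks contain two profiles and b8 contains four (which yields the 13 comparisons mentioned in Section 1), so Σb∈B |b| = 18 and, with |E| = 6, CEP retains the K = ⌊18/2⌋ = 9 top-weighted edges of the entire graph, while CNP retains the top k = ⌊18/6 − 1⌋ = 2 edges per node neighborhood.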
The weight-based schemes, WEP and WNP, discard the edges
that do not exceed their weight threshold and typically perform a
shallow pruning that retains high recall [22]. The cardinality-based
schemes, CEP and CNP, rank the edges of the blocking graph in
descending order of weight and retain a specific number of the top
ones. For example, if CEP retained the 4 top-weighted edges of
the graph in Figure 2(a), it would produce the pruned graph of Fig-
ure 2(b), too. Usually, CEP and CNP perform deeper pruning than
WEP and WNP, trading higher precision for lower recall [22].
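To ground the weight-based pruning, the following Java sketch shows the core of WEP over an explicit list of weighted edges; it is purely illustrative, since in practice the blocking graph is never materialized in memory (see Section 4.2), and all identifiers are assumptions.

    import java.util.*;

    // Illustrative sketch of Weighted Edge Pruning (WEP): discard every edge whose
    // weight is below the average edge weight of the entire blocking graph.
    final class WepSketch {
        static final class Edge {
            final int i, j; final double weight;
            Edge(int i, int j, double weight) { this.i = i; this.j = j; this.weight = weight; }
        }

        static List<Edge> weightedEdgePruning(List<Edge> edges) {
            double sum = 0.0;
            for (Edge e : edges) sum += e.weight;
            double mean = edges.isEmpty() ? 0.0 : sum / edges.size();   // global weight threshold
            List<Edge> retained = new ArrayList<>();
            for (Edge e : edges)
                if (e.weight >= mean)                                   // keep edges at or above the mean
                    retained.add(e);
            return retained;
        }
    }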
Applications of Entity Resolution. Based on their performance
requirements, we distinguish ER applications into two categories:
(i) The efficiency-intensive applications aim to minimize the re-
sponse time of ER, while detecting the vast majority of the dupli-
cates. More formally, their goal is to maximize precision (PQ) for
a recall (PC) that exceeds 0.80. To this category belong real-time
applications or applications with limited temporal resources, such
as Pay-as-you-go ER [26], entity-centric search [25] and crowd-
sourcing ER [6]. Ideally, their goal is to identify a new pair of
duplicate entities with every executed comparison.
(ii) The effectiveness-intensive applications can afford a higher
response time in order to maximize recall. At a minimum, recall
(PC) should not fall below 0.95. Most of these applications corre-
spond to off-line batch processes like data cleaning in data ware-
houses, which practically call for almost perfect recall [2]. Yet,
higher precision (PQ) is pursued even in off-line applications so as
to ensure that they scale to voluminous datasets.
Meta-blocking accommodates the effectiveness-intensive appli-
cations through the weight-based pruning schemes (WEP, WNP)
and the efficiency-intensive applications through the cardinality-
based schemes (CEP, CNP).
[Graphic not reproduced: the filtered blocks b′1(Jack)–b′5(seller), the corresponding blocking graph, the WEP-pruned graph, and the final blocks b′′1 = {p1, p3} and b′′2 = {p2, p4}.]
Figure 6: (a) The block collection produced by applying Block
Filtering to the blocks in Figure 1(b), (b) the corresponding
blocking graph, (c) the pruned blocking graph produced by
WEP, and (d) the corresponding restructured block collection.
4. TIME EFFICIENCY IMPROVEMENTS
We now propose two methods for accelerating the processing of
Meta-blocking, minimizing its OTime: (i) Block Filtering, which
operates as a pre-processing step that reduces the size of the block-
ing graph, and (ii) Optimized Edge Weighting, which minimizes
the computational cost for the weighting of individual edges.
4.1 Block Filtering
This approach is based on the idea that each block has a differ-
ent importance for every entity profile it contains. For example, a
block with thousands of profiles is usually superfluous for most of
them, but it may contain a couple of matching entity profiles that
do not co-occur in another block; for them, this particular block
is indispensable. Based on this principle, Block Filtering restruc-
tures a block collection by removing profiles from blocks, in which
their presence is not necessary. The importance of a block bk for
an individual entity profile pi ∈ bk is implicitly determined by the
maximum number of blocks pi participates in.
Continuing our example with the blocks in Figure 1(b), assume
that their importance is inversely proportional to their id; that is,
b1 and b8 are the most and the least important blocks, respectively.
A possible approach to Block Filtering would be to remove every
entity profile from the least important of its blocks, i.e., the one
with the largest block id. The resulting block collection appears in
Figure 6(a). We can see that Block Filtering reduces the 15 origi-
nal comparisons to just 5. Yet, there is room for further improve-
ments, due to the presence of 2 redundant comparisons, one in b′2
and another one in b′4, and 1 superfluous comparison in block b′5. Using the
JS weighting scheme, the graph corresponding to these blocks is
presented in Figure 6(b) and the pruned graph produced by WEP
appears in Figure 6(c). In the end, we get the 2 matching compar-
isons in Figure 6(d). This is a significant improvement over the 5
comparisons in Figure 2(c), which were produced by applying the
same pruning scheme directly to the blocks of Figure 1(b).
In more detail, the functionality of Block Filtering is outlined in
Algorithm 1. First, it orders the blocks of the input collection B
in descending order of importance (Line 3). Then, it determines
the maximum number of blocks per entity profile (Line 4). This
requires an iteration over all blocks in order to count the block
assignments per entity profile. Subsequently, it iterates over all
blocks in the specified order (Line 5) and over all profiles in each
block (Line 6). The profiles that have more block assignments than
their threshold are discarded, while the rest are retained in the cur-
rent block (Lines 7-10). In the end, the current block is retained
only if it still contains at least two entity profiles (Lines 11-12).
The time complexity of this procedure is dominated by the sort-
ing of blocks, i.e., O(|B|·log|B|). Its space complexity is linear with
respect to the size of the input, O(|E|), because it maintains a thresh-
old and a counter for every entity profile.
Algorithm 1: Block Filtering.
Input: B the input block collection
Output: B′ the restructured block collection
1  B′ ← {};
2  counter[] ← {};                 // count blocks per profile
3  orderBlocks(B);                 // sort in descending importance
4  maxBlocks[] ← getThresholds(B); // limit per profile
5  foreach bk ∈ B do               // check all blocks
6    foreach pi ∈ bk do            // check all profiles
7      if counter[i] > maxBlocks[i] then
8        bk ← bk \ pi;             // remove profile
9      else
10       counter[i]++;             // increment counter
11   if |bk| > 1 then              // retain blocks with
12     B′ ← B′ ∪ {bk};             // at least 2 profiles
13 return B′;
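A compact Java rendering of Algorithm 1 is sketched below, using block size as the importance criterion and deriving the per-profile limit from the filtering ratio r discussed below; the rounding of the limit and all identifiers are illustrative assumptions.

    import java.util.*;

    // Illustrative sketch of Block Filtering. Blocks are lists of profile ids,
    // sorted from smallest to largest (i.e., in descending importance).
    final class BlockFilteringSketch {
        static List<List<Integer>> blockFiltering(List<List<Integer>> blocks, int numProfiles, double r) {
            blocks.sort(Comparator.comparingInt(List::size));            // ascending cardinality
            int[] assignments = new int[numProfiles];                    // blocks per profile
            for (List<Integer> b : blocks)
                for (int p : b) assignments[p]++;
            int[] maxBlocks = new int[numProfiles];                      // limit per profile
            for (int p = 0; p < numProfiles; p++)
                maxBlocks[p] = (int) Math.ceil(r * assignments[p]);      // rounding is an assumption
            int[] counter = new int[numProfiles];
            List<List<Integer>> restructured = new ArrayList<>();
            for (List<Integer> b : blocks) {
                List<Integer> kept = new ArrayList<>();
                for (int p : b)
                    if (counter[p] < maxBlocks[p]) { kept.add(p); counter[p]++; }
                if (kept.size() > 1) restructured.add(kept);             // keep blocks with 2+ profiles
            }
            return restructured;
        }
    }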
[Graphic not reproduced: two workflows, (a) Block Filtering followed by Meta-blocking, and (b) Block Filtering followed by Comparison Propagation, both turning a redundancy-positive block collection B into a restructured collection B′.]
Figure 7: (a) Using Block Filtering for pre-processing the
blocking graph of Meta-blocking, and (b) using Block Filter-
ing as a graph-free Meta-blocking method.
The performance of Block Filtering is determined by two factors:
(i) The criterion that specifies the importance of a block bi. This
can be defined in various ways and ideally should be different for
every profile in bi. For higher efficiency, though, we use a criterion
that is common for all profiles in bi. It is also generic, applying
to any block collection, independently of the underlying ER task
or the schema heterogeneity. This criterion is the cardinality of
bi, ||bi||, presuming that the fewer comparisons a block contains, the
more important it is for its entities. Thus, Block Filtering sorts a
block collection from the smallest block to the largest one.
(ii) The filtering ratio (r) that determines the maximum number
of block assignments per profile. It is defined in the interval [0,1]
and expresses the portion of blocks that are retained for each pro-
file. For example, r=0.5 means that each profile remains in the first
half of its associated blocks, after sorting them in ascending cardi-
nality. We experimentally fine-tune this parameter in Section 6.2.
Instead of a local threshold per entity profile, we could apply
the same global threshold to all profiles. Preliminary experiments,
though, demonstrated that this approach exhibits low performance,
as the number of blocks associated with every profile varies largely,
depending on the quantity and the quality of information it con-
tains. This is particularly true for Clean-Clean ER, where E1 and
E2 usually differ largely in their characteristics. Hence, it is difficult
to identify the break-even point for a global threshold that achieves
a good balance between recall and precision for all profiles.
Finally, it is worth noting that Block Filtering can be used in two
fundamentally different ways, which are compared in Section 6.4.
(i) As a pre-processing method that prunes the blocking graph
before applying Meta-blocking – see Figure 7(a).
(ii) As a graph-free Meta-blocking method that is combined only
with Comparison Propagation – see Figure 7(b).
The latter workflow skips the blocking graph, operating on the
level of individual profiles instead of profile pairs. Thus, it is ex-
pected to be significantly faster than all graph-based algorithms. If
it achieves higher precision, as well, it outperforms the graph-based
workflow in all respects, rendering the blocking graph unnecessary.
Algorithm 2: Original Edge Weighting.
Input: B the input block collection
Output: W the set of edge weights
1  W ← {};
2  EI ← buildEntityIndex(B);
3  foreach bk ∈ B do                  // check all blocks
4    foreach ci,j ∈ bk do             // check all comparisons
5      Bi ← EI.getBlockList(pi);
6      Bj ← EI.getBlockList(pj);
7      commonBlocks ← 0;
8      foreach m ∈ Bi do
9        foreach n ∈ Bj do
10         if m < n then break;       // repeat until
11         if n < m then continue;    // finding common id
12         if commonBlocks = 0 then   // 1st common id
13           if m ≠ k then            // it violates LeCoBI
14             break to next comparison;
15         commonBlocks++;
16     wi,j ← calculateWeight(commonBlocks, Bi, Bj);
17     W ← W ∪ {wi,j};
18 return W;
4.2 Optimized Edge Weighting
A complementary way of speeding up Meta-blocking is to accel-
erate its bottleneck, i.e., the estimation of edge weights. Intuitively,
we want to minimize the computational cost of the procedure that
derives the weight from every individual edge. Before we explain
in detail our solution, we give some background, by describing how
the existing Edge Weighting algorithm operates.
The blocking graph cannot be materialized in memory in the
scale of million nodes and billion edges. Instead, it is implemented
implicitly. The key idea is that every edge ei,j in the blocking graph
GB corresponds to a non-redundant comparison ci,j in the block
collection B. In other words, a comparison ci,j in bk ∈ B defines an
edge ei,j in GB as long as it satisfies the LeCoBI condition (see Sec-
tion 2). The condition is checked with the help of the Entity Index
during the core process that derives the blocks shared by pi and pj.
In more detail, the original implementation of Edge Weighting
is outlined in Algorithm 2. Note that Bi stands for the block list
of pi, i.e., the set of block ids associated with pi, sorted from the
smallest to the largest one. The core process appears in Lines 7-15
and relies on the established process of Information Retrieval for
intersecting the posting lists of two terms while answering a key-
word query [18]: it iterates over the block lists of two co-occurring
profiles in parallel, incrementing the counter of common blocks for
every id they share (Line 15). This process is terminated in case the
first common block id does not coincide with the id of the current
block bk, thus indicating a redundant comparison (Lines 12-14).
Our observation is that since this procedure is repeated for ev-
ery comparison in B, a more efficient implementation would sig-
nificantly reduce the run time of Meta-blocking. To this end, we
develop a filtering technique inspired by similarity joins [14].
Prefix Filtering [3, 14] is a prominent method, which prunes dis-
similar pairs of strings with the help of the minimum similarity
threshold t that is determined a priori; t can be defined with re-
spect to various similarity metrics that are essentially equivalent,
due to a set of transformations [14]. Without loss of generality, we
assume in the following that t is normalized in [0,1], just like the
edge weights, and that it expresses a Jaccard similarity threshold.
Adapted to edge weighting, Prefix Filtering represents every pro-
file pi by the ⌊(1 − t)·|Bi|⌋ + 1 smallest blocks of Bi. The idea is that
pairs having disjoint representations cannot exceed the similarity
threshold t.
Algorithm 3: Optimized Edge Weighting.
Input: B the input block collection, E the input entity collection
Output: W the set of edge weights
1  W ← {}; commonBlocks[] ← {}; flags[] ← {};
2  EI ← buildEntityIndex(B);
3  foreach pi ∈ E do                  // check all profiles
4    Bi ← EI.getBlockList(pi);
5    neighbors ← {};                  // set of co-occurring profiles
6    foreach bk ∈ Bi do               // check all associated blocks
7      foreach pj (≠ pi) ∈ bk do      // co-occurring profile
8        if flags[j] ≠ i then
9          flags[j] ← i;
10         commonBlocks[j] ← 0;
11         neighbors ← neighbors ∪ {pj};
12       commonBlocks[j]++;
13   foreach pj ∈ neighbors do
14     Bj ← EI.getBlockList(pj);
15     wi,j ← calculateWeight(commonBlocks[j], Bi, Bj);
16     W ← W ∪ {wi,j};
17 return W;
For example, for t=0.8 an edge ei,j could be pruned
using 1/5 of Bi and Bj, thus speeding up the nested loops in Lines
8-9 of Algorithm 2. Yet, there are 3 problems with this approach:
(i) For the weight-based algorithms, the pruning criterion t can
only be determined a posteriori – after averaging all edge weights
in the entire graph (WEP), or in a node neighborhood (WNP). As a
result, the optimizations of Prefix Filtering apply only to the prun-
ing phase of WEP and WNP and not to the initial construction of
the blocking graph.
(ii) For the cardinality-based algorithms CEP and CNP, t equals
the minimum edge weight in the sorted stack with the top-weighted
edges. Thus, its value is continuously modified and cannot be used
for building optimized entity representations a priori.
(iii) Preliminary experiments demonstrated that t invariably takes
very low values, below 0.1, for all combinations of pruning algo-
rithms and weighting schemes. These low thresholds force all ver-
sions of Prefix Filtering to consider the entire block lists Bi and Bj
as entity representations, thus ruining their optimizations.
For these reasons, we propose a novel implementation that is in-
dependent of the similarity threshold t. Our approach is outlined
in Algorithm 3. Instead of iterating over all comparisons in B, it
iterates over all input profiles in E (Line 3). The core procedure in
Lines 6-12 works as follows: for every profile pi, it iterates over
all co-occurring profiles in the associated blocks and records their
frequency in an array. At the end of the process, commonBlocks[j]
indicates the number of blocks shared by pi and pj. This informa-
tion is then used in Lines 13-16 for estimating the weight wi,j. This
method is reminiscent of ScanCount [16]. Note that the array flags
helps us avoid reallocating memory for commonBlocks in every
iteration, a procedure that would be costly, due to its size, |E| [16];
neighbors is a hash set that stores the unique profiles that co-occur
with pi, gathering the distinct neighbors of node vi in the blocking
graph without evaluating the LeCoBI condition.
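The core of this procedure can be sketched in Java as follows; for brevity, a hash map replaces the flags and commonBlocks arrays of Algorithm 3, and all identifiers are illustrative assumptions.

    import java.util.*;

    // Illustrative ScanCount-style sketch: for one profile pi, count the blocks it
    // shares with every co-occurring profile in a single pass over pi's blocks,
    // without intersecting block lists per comparison.
    final class OptimizedWeightingSketch {
        // blocks: block id -> profile ids; entityIndex: profile id -> block ids (Entity Index)
        static Map<Integer, Integer> commonBlocksWith(int pi, List<int[]> blocks, List<int[]> entityIndex) {
            Map<Integer, Integer> commonBlocks = new HashMap<>();       // pj -> #blocks shared with pi
            for (int blockId : entityIndex.get(pi))
                for (int pj : blocks.get(blockId))
                    if (pj != pi)
                        commonBlocks.merge(pj, 1, Integer::sum);
            return commonBlocks;                                        // feed into the weight calculation
        }
    }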
4.3 Discussion
The average time complexity of Algorithm 2 is O(2·BPE·||B||),
where BPE(B) = Σb∈B |b| / |E| is the average number of blocks as-
sociated with every profile in B; 2·BPE corresponds to the aver-
age computational cost of the nested loops in Lines 8-9, while ||B||
stems from the nested loops in Lines 3-4, which iterate over all
comparisons in B.
Block Filtering improves this time complexity in two ways:
(i) It reduces ||B|| by discarding a large part of the redundant and
superfluous comparisons in B.
(ii) It reduces 2·BPE to 2·r·BPE by removing every profile from
(1 − r)·100% of its associated blocks, where r is the filtering ratio.4
The computational cost of Algorithm 3 is determined by two pro-
cedures that yield an average time complexity of O(||B|| + |v̄|·|E|):
(i) The three nested loops in Lines 3-7. For every block b, these
loops iterate over |b|-1 of its entity profiles (i.e., over all profiles
except pi) for |b| times – once for each entity profile. Therefore,
the process in Lines 8-12 is repeated |b|·(|b|-1) times per block and the
overall complexity of the three nested loops is O(||B||).
(ii) The loop in Lines 13-16. Its cost is analogous to the average
node degree |v̄|, i.e., the average number of neighbors per profile. It
is repeated for every profile and, thus, its overall cost is O(|v̄|·|E|).
Comparing the two algorithms, we observe that the optimized
implementation minimizes the computational cost of the process
that is applied to each comparison: instead of intersecting the asso-
ciated block lists, it merely updates 2-3 cells in 2 arrays and adds
an entity id to the set of neighboring profiles. The former process
involves two nested loops with an average cost of O(2·BPE), while
the latter processes have a constant complexity, O(1). Note that
Algorithm 3 incorporates the loop in Lines 13-16, which has a com-
plexity of O(|v̄|·|E|). In practice, though, this is considerably lower
than both O(2·BPE·||B||) and O(||B||), as we show in Section 6.3.
5. PRECISION IMPROVEMENTS
We now introduce two new pruning algorithms that significantly
enhance the effectiveness of Meta-blocking, increasing precision
for similar levels of recall: (i) Redefined Node-centric Pruning,
which removes all redundant comparisons from the restructured
blocks of CNP and WNP, and (ii) Reciprocal Pruning, which in-
fers the most promising matches from the reciprocal links of the
directed pruned blocking graphs of CNP and WNP.
There are several reasons why we exclusively focus on improv-
ing the precision of CNP and WNP:
(i) Node-centric algorithms are quite robust to recall, since they
retain the most likely matches for each node and, thus, guarantee
to include every profile in the restructured blocks. Edge-centric
algorithms do not provide such guarantees, because they retain the
overall best edges from the entire graph.
(ii) Node-centric algorithms are more flexible, in the sense that
their performance can be improved in generic, algorithmic ways.
Instead, edge-centric algorithms improve their performance only
with threshold fine-tuning. This approach, though, is application-
specific, as it is biased by the characteristics of the block collection
at hand. Another serious limitation is that some superfluous com-
parisons cannot be pruned without discarding part of the matching
ones. For instance, the superfluous edge e5,6in Figure 2(a) has a
higher weight than both matching edges e1,3and e2,4.
(iii) There is more room for improvement in node-centric algo-
rithms, because they exhibit lower precision than the edge-centric
ones. They process every profile independently of the others and
they often retain the same edge for both incident profiles, thus
yielding redundant comparisons. They also retain more superfluous
comparisons than the edge-centric algorithms in most cases [22].
As an example, consider Clean-Clean ER: every profile from both
entity collections retains its connections with several incident nodes,
even though only one of them is matching with it.
4This seems similar to the effect of Prefix Filtering, but there are
fundamental differences: (i) Prefix Filtering does not reduce the
number of executed comparisons; it just accelerates their pruning.
(ii) Prefix Filtering relies on a similarity threshold for pairs of pro-
files, while the filtering ratio r pertains to individual profiles.
[Graphic not reproduced: the undirected pruned graph and the blocks b′1–b′5 derived from it.]
Figure 8: (a) The undirected pruned blocking graph corre-
sponding to the directed one in Figure 5(a), and (b) the corre-
sponding block collection.
Nevertheless, our experiments in Section 6.4 demonstrate that
our new node-centric algorithms outperform the edge-centric ones,
as well. They also cover both efficiency- and effectiveness-intensive
ER applications, enhancing both CNP and WNP.
5.1 Redefined Node-centric Pruning
We can enhance the precision of both CNP and WNP without
any impact on recall by discarding all the redundant comparisons
they retain. Assuming that the blocking graph is materialized, a
straightforward approach is to convert the directed pruned graph
into an undirected one by connecting every pair of neighboring
nodes with a single undirected edge – even if they are reciprocally
linked. In the extreme case where every retained edge has a recip-
rocal link, this saves 50% more comparisons and doubles precision.
As an example, the directed pruned graph in Figure 5(a) can be
transformed into the undirected pruned graph in Figure 8(a); the
resulting blocks, which are depicted in Figure 8(b), reduce the re-
tained comparisons from 9 to 5, while maintaining the same recall
as the blocks in Figure 5(b): p1-p3 co-occur in b′1 and p2-p4 in b′4.
Yet, it is impossible to materialize the blocking graph in mem-
ory in the scale of billions of edges. Instead, the graph is implicitly
implemented as explained in Section 4.2. In this context, a straight-
forward solution for improving CNP and WNP is to apply Compar-
ison Propagation to their output. This approach, though, entails a
significant overhead, as it evaluates the LeCoBI condition for every
retained comparison; on average, its total cost is O(2·BPE·||B′||).
The best solution is to redefine CNP and WNP so that Com-
parison Propagation is integrated into their functionality. The new
implementations are outlined in Algorithms 4 and 5, respectively.
In both cases, the processing consists of two phases:
(i) The first phase involves a node-centric functionality that goes
through the nodes of the blocking graph and derives the pruning
criterion from their neighborhood. A central data structure stores
the top-k nearest neighbors (CNP) or the weight threshold (WNP)
per node neighborhood. In total, this phase iterates twice over every
edge of the blocking graph – once for each incident node.
(ii) The second phase operates in an edge-centric fashion that
goes through all edges, retaining those that satisfy the pruning cri-
terion for at least one of the incident nodes. Thus, every edge is re-
tained at most once, even if it is important for both incident nodes.
In more detail, Redefined CNP iterates over all nodes of the
blocking graph to extract their neighborhood and calculate the cor-
responding cardinality threshold k (Lines 2-4 in Algorithm 4). Then
it iterates over the edges of the current neighborhood and places the
top-k weighted ones in a sorted stack (Lines 5-8). In the second
phase, it iterates over all edges and retains those contained in the
sorted stack of either of the incident profiles (Lines 10-13).
Similarly, Redefined WNP first iterates over all nodes of the
blocking graph to extract their neighborhood and to estimate the
Algorithm 4: Redefined Cardinality Node Pruning.
Input: (i) GinB the blocking graph, and
       (ii) ct the function defining the local cardinality thresholds.
Output: GoutB the pruned blocking graph
1  SortedStacks[] ← {};                 // sorted stack per node
2  foreach vi ∈ VB do                   // for every node
3    Gvi ← getNeighborhood(vi, GB);
4    k ← ct(Gvi);                       // get local cardinality threshold
5    foreach ei,j ∈ Evi do              // add every adjacent edge
6      SortedStacks[i].push(ei,j);      // in sorted stack
7      if k < SortedStacks[i].size() then
8        SortedStacks[i].pop();         // remove last edge
9  EoutB ← {};                          // the set of retained edges
10 foreach ei,j ∈ EB do                 // for every edge
11   if ei,j ∈ SortedStacks[i]
12      OR ei,j ∈ SortedStacks[j] then  // retain if in
13     EoutB ← EoutB ∪ {ei,j};          // top-k for either node
14 return GoutB = {VB, EoutB, WS};
corresponding weight threshold (Lines 2-4 in Algorithm 5). Then,
it iterates once over all edges and retains those exceeding the weight
thresholds of either of the incident nodes (Lines 6-9).
For both algorithms, the function getNeighborhood in Line 3
implements the Lines 4-16 of Algorithm 3. Note also that both
algorithms use the same configuration as their original implemen-
tations: k = ⌊Σb∈B |b| / |E| − 1⌋ for Redefined CNP and the average
weight of each node neighborhood for Redefined WNP. Their time
complexity is O(|VB|·|EB|) in the worst case of a complete blocking
graph, and O(|EB|) in the case of a sparse graph, which typically
appears in practice. Their space complexity is dominated by the
requirements of the Entity Index and the number of retained compar-
isons, i.e., O(BPE·|VB| + ||B′||), on average.
5.2 Reciprocal Node-centric Pruning
This approach treats the redundant comparisons retained by CNP
and WNP as strong indications for profile pairs with high chances
of matching. As explained above, these comparisons correspond
to reciprocal links in the blocking graph. For example, the edges
e1,3 and e3,1 in Figure 5(a) indicate that p1 is highly likely to match
with p3 and vice versa, thus reinforcing the likelihood that the two
profiles are duplicates. Based on this rationale, Reciprocal Prun-
ing retains one comparison for every pair of profiles that are re-
ciprocally connected in the directed pruned blocking graph of the
original node-centric algorithms; profiles that are connected with a
single edge, are not compared in the restructured block collection.
In our example, Reciprocal Pruning converts the directed pruned
blocking graph in Figure 5(a) into the undirected pruned block-
ing graph in Figure 9(a). The corresponding restructured blocks
in Figure 9(b) contain just 4 comparisons, one less than the blocks
in Figure 8(b). Compared to the blocks in Figure 5(b), the overall
efficiency is significantly enhanced at no cost in recall.
In general, Reciprocal Pruning yields restructured blocks with
no redundant comparisons and fewer superfluous ones than both the
original and the redefined node-centric pruning. In the worst case,
all pairs of nodes are reciprocally linked and Reciprocal Pruning
coincides with Redefined Node-centric Pruning. In all other cases,
its precision is much higher, while its impact on recall depends on
the strength of co-occurrence patterns in the blocking graph.
Reciprocal Pruning yields two new node-centric algorithms: Re-
ciprocal CNP and Reciprocal WNP. Their functionality is almost
Algorithm 5: Redefined Weighted Node Pruning.
Input: (i) GinB the blocking graph, and
       (ii) wt the function defining the local weight thresholds.
Output: GoutB the pruned blocking graph
1  weights[] ← {};                      // thresholds per node
2  foreach vi ∈ VB do                   // for every node
3    Gvi ← getNeighborhood(vi, GB);
4    weights[i] ← wt(Gvi);              // get local threshold
5  EoutB ← {};                          // the set of retained edges
6  foreach ei,j ∈ EB do                 // for every edge
7    if weights[i] ≤ ei,j.weight
8       OR weights[j] ≤ ei,j.weight then  // retain if it
9      EoutB ← EoutB ∪ {ei,j};          // exceeds either threshold
10 return GoutB = {VB, EoutB, WS};
[Graphic not reproduced: the reciprocally pruned graph and the blocks b′1–b′4 derived from it.]
Figure 9: (a) The pruned blocking graph produced by apply-
ing Reciprocal Pruning to the graph in Figure 5(a), and (b) the
restructured blocks.
identical to Redefined CNP and Redefined WNP, respectively.
The only difference is that they use conjunctive conditions instead
of disjunctive ones: the operator OR in Lines 11-12 and 7-8 of Al-
gorithms 4 and 5, respectively, is replaced by the operator AND.
Both algorithms use the same pruning criteria as the redefined meth-
ods, while sharing the same average time and space complexities:
O(|EB|) and O(BPE·|VB| + ||B′||), respectively.
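The contrast between the two variants boils down to a single boolean condition per edge, as the following illustrative Java fragment shows (method and parameter names are assumptions).

    // Given whether edge e_{i,j} is among the locally best edges of each incident node,
    // Redefined CNP/WNP retain it disjunctively, Reciprocal CNP/WNP conjunctively.
    static boolean retainRedefined(boolean bestForI, boolean bestForJ) {
        return bestForI || bestForJ;      // retained if important for either node
    }

    static boolean retainReciprocal(boolean bestForI, boolean bestForJ) {
        return bestForI && bestForJ;      // retained only for reciprocal links
    }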
6. EXPERIMENTAL EVALUATION
We now examine the performance of our techniques through a
series of experiments. We present their setup in Section 6.1 and in
Section 6.2, we fine-tune Block Filtering, assessing its impact on
blocking effectiveness. Its impact on time efficiency is measured
in Section 6.3 together with that of Optimized Edge Weighting.
Section 6.4 assesses the effectiveness of our new pruning algorithms
and compares them to state-of-the-art block processing methods.
6.1 Setup
We implemented our approaches in Java 8 and tested them on a
desktop machine with Intel i7-3770 (3.40GHz) and 16GB RAM,
running Lubuntu 15.04 (kernel version 3.19.0). We repeated all
time measurements 10 times and report the average value so as to
minimize the effect of external parameters.
Datasets. In our experiments, we use 3 real-world entity col-
lections. They are established benchmarks [15, 21, 24] with sig-
nificant variations in their size and characteristics. They pertain
to Clean-Clean ER, but are used for Dirty ER, as well; we simply
merge their clean entity collections into a single one that contains
duplicates in itself. In total, we have 6 block collections that lay the
ground for a thorough experimental evaluation of our techniques.
To present the technical characteristics of the entity collections,
we use the following notation: |E| stands for the number of pro-
files they contain, |D(E)| for the number of existing duplicates, |N|
for the number of distinct attribute names, |P| for the total number
of name-value pairs, |p̄| for the mean number of name-value pairs
(a) Original Block Collections
            D1C         D2C         D3C          D1D         D2D         D3D
|B|         6,877       40,732      1,239,424    44,097      76,327      1,499,534
||B||       1.92·10^6   8.11·10^7   4.23·10^10   9.49·10^7   5.03·10^8   8.00·10^10
BPE         4.65        28.17       17.56        10.67       32.86       14.79
PC(B)       0.994       0.980       0.999        0.997       0.981       0.999
PQ(B)       1.19·10^-3  2.76·10^-4  2.11·10^-5   2.43·10^-5  4.46·10^-5  1.12·10^-5
RR          0.988       0.873       0.984        0.953       0.610       0.986
|V_B|       61,399      50,720      3,331,647    63,855      50,765      3,333,356
|E_B|       1.83·10^6   6.75·10^7   3.58·10^10   7.98·10^7   2.70·10^8   6.65·10^10
OTime(B)    2.1 sec     5.6 sec     4 min        2.2 sec     5.7 sec     5 min
RTime(B)    19 sec      65 min      350 hrs      13 min      574 min     660 hrs

(b) After Block Filtering
            D1C         D2C         D3C          D1D         D2D         D3D
|B|         6,838       40,708      1,239,066    44,096      76,317      1,499,267
||B||       6.98·10^5   2.77·10^7   1.30·10^10   2.38·10^7   1.37·10^8   2.31·10^10
BPE         3.63        22.54       14.05        8.54        26.29       11.83
PC(B)       0.990       0.976       0.998        0.994       0.976       0.997
PQ(B)       3.28·10^-3  8.06·10^-4  6.86·10^-5   9.62·10^-5  1.62·10^-4  3.86·10^-5
RR          0.637       0.659       0.693        0.749       0.727       0.711
|V_B|       60,464      50,720      3,331,641    63,855      50,765      3,333,355
|E_B|       6.69·10^5   2.52·10^7   1.14·10^10   2.11·10^7   9.76·10^7   2.00·10^10
OTime(B)    2.3 sec     6.4 sec     5 min        2.5 sec     6.5 sec     6 min
RTime(B)    9 sec       24 min      110 hrs      3 min       174 min     190 hrs

Table 1: Technical characteristics of (a) the original block collections, and (b) the ones restructured by Block Filtering with r=0.80.
       |E|        |D(E)|   |N|     |P|        |p̄|   ||E||       RT(E)
D1C    2,516      2,308    4       1.01·10^4  4.0   1.54·10^8   26 min
       61,353              4       1.98·10^5  3.2
D2C    27,615     22,863   4       1.55·10^5  5.6   6.40·10^8   533 min
       23,182              7       8.16·10^5  35.2
D3C    1,190,733  892,579  30,688  1.69·10^7  14.2  2.58·10^12  21,000 hrs
       2,164,040           52,489  3.50·10^7  16.2
(a) Entity Collections for Clean-Clean ER

       |E|        |D(E)|   |N|     |P|        |p̄|   ||E||       RT(E)
D1D    63,869     2,308    4       2.08·10^5  3.3   2.04·10^9   272 min
D2D    50,797     22,863   10      9.71·10^5  19.1  1.29·10^9   1,505 min
D3D    3,354,773  892,579  58,590  5.19·10^7  15.5  5.63·10^12  47,000 hrs
(b) Entity Collections for Dirty ER

Table 2: Technical characteristics of the entity collections. For D3C and D3D, RT(E) was estimated from the average time required for comparing two of its entity profiles: 0.03 msec.
per profile, ||E|| for the number of comparisons executed by the
brute-force approach, and RT(E) for its resolution time; in line with
RTime(B), RT(E) is computed using the Jaccard similarity of all
tokens in the values of two profiles as the entity matching method.
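For reference, a minimal Java sketch of such a Jaccard-based matching function is given below; the tokenization details (lower-casing, whitespace splitting) are our own assumptions rather than the exact setup of the experiments.

import java.util.HashSet;
import java.util.Set;

// Hedged sketch of the matching function assumed for RT(E) and RTime(B):
// the Jaccard similarity of the token sets of two profiles' attribute values.
final class JaccardMatcherSketch {

    // Collects the distinct tokens appearing in a profile's attribute values.
    static Set<String> tokens(Iterable<String> values) {
        Set<String> tokens = new HashSet<String>();
        for (String value : values) {
            for (String token : value.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        return tokens;
    }

    // Jaccard similarity: |intersection| / |union| of the two token sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }
}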
Tables 2(a) and (b) present the technical characteristics of the
real entity collections for Clean-Clean and Dirty ER, respectively.
D1C contains bibliographic data from DBLP (www.dblp.org) and
Google Scholar (http://scholar.google.gr) that were matched
manually [15, 24]. D2C matches movies from IMDB (imdb.com)
and DBPedia (http://dbpedia.org) based on their ImdbId [22,
23]. D3C involves profiles from two snapshots of English Wikipedia
(http://en.wikipedia.org) Infoboxes, which were automatically
matched based on their URL [22, 23]. For Dirty ER, the datasets
DxD with x∈[1,3] were derived by merging the profiles of the indi-
vidually clean collections that make up DxC, as explained above.
Measures. To assess the effectiveness of a restructured block
collection B′, we use four established measures [4, 5, 22]: (i) its
cardinality ||B′||, i.e., its total number of comparisons, (ii) its recall
PC(B′), (iii) its precision PQ(B′), and (iv) its Reduction Ratio (RR),
which expresses the relative decrease in its cardinality in compari-
son with the original block collection B: RR(B, B′) = 1 − ||B′||/||B||.
The last three measures take values in the interval [0,1], with higher
values indicating better effectiveness; the opposite is true for ||B′||,
as effectiveness is inversely proportional to its value (cf. Section 3).
To assess the time efficiency of a restructured block collection
B′, we use the two measures that were defined in Section 3: (i) its
Overhead Time OTime(B′), which measures the time required by
Meta-blocking to derive it from the input block collection, and (ii)
its Resolution Time RTime(B′), which adds to OTime(B′) the time
taken to apply an entity matching method to all comparisons in B′.
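As a compact reference for these definitions, the following Java sketch derives the measures from aggregate counts, assuming the standard formulas for PC, PQ and RR; the variable names are illustrative.

// Hedged sketch of the evaluation measures for a restructured block collection B'.
final class BlockingMeasuresSketch {

    /**
     * @param comparisons         ||B'||: total number of comparisons in B'
     * @param detectedDuplicates  duplicate pairs that share at least one block of B'
     * @param existingDuplicates  |D(E)|: all duplicate pairs in the entity collection
     * @param originalComparisons ||B||: total number of comparisons in the original B
     */
    static double[] measures(long comparisons, long detectedDuplicates,
                             long existingDuplicates, long originalComparisons) {
        double pc = (double) detectedDuplicates / existingDuplicates;   // recall (PC)
        double pq = (double) detectedDuplicates / comparisons;          // precision (PQ)
        double rr = 1.0 - (double) comparisons / originalComparisons;   // reduction ratio (RR)
        return new double[] { pc, pq, rr };
    }
}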
6.2 Block Collections
Original Blocks. From all datasets, we extracted a redundancy-
positive block collection by applying Token Blocking [21]. We also
applied Block Purging [21] in order to discard those blocks that
Figure 10: The effect of Block Filtering’s ratio r on the blocks of D2C and D2D with respect to RR and PC.
contained more than half of the input entity profiles. The technical
characteristics of the resulting block collections are presented in
Table 1(a). Remember that |V_B| and |E_B| stand for the order and
the size of the corresponding blocking graph, respectively. RR has
been computed with respect to ||E|| in Table 2, i.e., RR = 1 − ||B||/||E||.
We observe that all block collections exhibit nearly perfect re-
call, as their PC consistently exceeds 0.98. They also convey sig-
nificant gains in efficiency, executing an order of magnitude fewer
comparisons than the brute-force approach (RR>0.9) in most cases.
They reduce the resolution time to a similar extent, due to their very
low OTime. Still, their precision is significantly lower than 0.01 in
all cases. This means that, on average, more than 100 comparisons
have to be executed in order to identify a new pair of duplicates.
The corresponding blocking graphs vary significantly in size, rang-
ing from tens of thousands of edges to tens of billions, whereas their
order ranges from a few thousand nodes to a few million.
Note that we experimented with additional redundancy-positive
blocking methods, such as Q-grams Blocking. All of them involved
a schema-agnostic functionality that effectively tackles schema
heterogeneity. They all produced blocks with characteristics simi-
lar to those of Token Blocking and are omitted for brevity. In general,
the outcomes of our experiments are independent of the schema-
agnostic, redundancy-positive methods that yield the input blocks.
Block Filtering. Before using Block Filtering, we have to fine-
tune its filtering ratio r, which determines the portion of the most
important blocks that are retained for each profile. To examine
its effect, we measured the performance of the restructured blocks
using all values of r in [0.05, 1] with a step of 0.05. We consider
two evaluation measures: recall (PC) and reduction ratio (RR).
Figure 10 presents the evolution of both measures over the orig-
inal blocks of D2C and D2D; the other datasets exhibit similar pat-
terns and are omitted for brevity. We observe that there is a clear
trade-off between RR and PC: the smaller the value of r, the fewer
blocks are retained for each profile and the lower the total cardi-
nality of the restructured blocks ||B′||, thus increasing RR; this re-
duces the number of detected duplicates, thus decreasing PC. The
opposite is true for large values of r. Most importantly, Block Fil-
tering exhibits a robust performance with respect to r, with small
                     Original Block Collections                                                              After Block Filtering
          D1C         D2C         D3C         D1D         D2D         D3D          D1C         D2C         D3C         D1D         D2D         D3D
(a) Cardinality Edge Pruning (CEP)
||B′||    1.43·10^5   7.14·10^5   2.63·10^7   3.41·10^5   8.34·10^5   2.47·10^7    1.10·10^5   5.72·10^5   2.11·10^7   2.73·10^5   6.67·10^5   1.97·10^7
PC(B′)    0.966       0.860       0.724       0.765       0.522       0.535        0.948       0.871       0.748       0.823       0.641       0.566
PQ(B′)    0.016       0.028       0.025       0.005       0.014       0.019        0.020       0.035       0.032       0.007       0.022       0.026
OT(B′)    395 ms      49 sec      9.4 hrs     22 sec      16 min      17.2 hrs     142 ms      12 sec      2.0 hrs     5 sec       4 min       3.7 hrs
(b) Cardinality Node Pruning (CNP)
||B′||    2.34·10^5   1.41·10^6   4.95·10^7   6.38·10^5   1.62·10^6   4.63·10^7    1.69·10^5   1.10·10^6   3.95·10^7   5.10·10^5   1.31·10^6   3.63·10^7
PC(B′)    0.975       0.946       0.973       0.949       0.880       0.951        0.955       0.942       0.965       0.936       0.888       0.942
PQ(B′)    0.010       0.015       0.018       0.003       0.012       0.018        0.013       0.020       0.022       0.004       0.016       0.023
OT(B′)    899 ms      83 sec      18.6 hrs    58 sec      17 min      35.2 hrs     310 ms      25 sec      4.7 hrs     13 sec      6 min       8.6 hrs
(c) Weighted Edge Pruning (WEP)
||B′||    4.32·10^5   1.48·10^7   6.64·10^9   1.38·10^7   7.81·10^7   1.19·10^10   1.63·10^5   5.50·10^6   2.11·10^9   3.99·10^6   2.26·10^7   4.06·10^9
PC(B′)    0.977       0.963       0.977       0.987       0.970       0.973        0.953       0.947       0.967       0.964       0.944       0.965
PQ(B′)    1.08·10^-2  2.62·10^-3  6.66·10^-4  2.99·10^-4  7.30·10^-4  3.54·10^-4   2.92·10^-2  6.54·10^-3  1.49·10^-3  9.51·10^-4  1.62·10^-3  7.27·10^-4
OT(B′)    588 ms      92 sec      17.1 hrs    40 sec      26 min      31.7 hrs     193 ms      23 sec      3.7 hrs     8 sec       7 min       6.9 hrs
(d) Weighted Node Pruning (WNP)
||B′||    1.11·10^6   2.81·10^7   1.60·10^10  3.41·10^7   1.54·10^8   3.00·10^10   4.64·10^5   1.05·10^7   5.38·10^9   9.84·10^6   5.29·10^7   9.49·10^9
PC(B′)    0.988       0.972       0.997       0.993       0.971       0.995        0.979       0.964       0.997       0.979       0.959       0.992
PQ(B′)    2.32·10^-3  1.14·10^-3  1.44·10^-4  1.13·10^-4  3.11·10^-4  7.63·10^-5   5.13·10^-3  3.01·10^-3  3.56·10^-4  3.42·10^-4  6.79·10^-4  1.94·10^-4
OT(B′)    862 ms      85 sec      18.6 hrs    55 sec      16 min      35.0 hrs     303 ms      24 sec      4.7 hrs     13 sec      5 min       9.0 hrs
Table 3: Performance of the existing pruning schemes, averaged across all weighting schemes, before and after Block Filtering.
variations in its value leading to small differences in RR and PC.
To use Block Filtering as a pre-processing method, we should
set its ratio to a value that increases precision at a low cost in re-
call. We quantify this constraint by requiring that r decreases PC
by less than 0.5%, while maximizing RR and, thus, PQ. The ra-
tio that satisfies this constraint across all datasets is r=0.80. Ta-
ble 1(b) presents the characteristics of the restructured block col-
lections corresponding to this configuration. Note that RR has been
computed with respect to the cardinality of the original blocks.
We observe that the number of blocks is almost the same as in
Table 1(a). Yet, their total cardinality is reduced by 64% to 75%,
while recall is reduced by less than 0.5% in most cases. As a result,
PQ rises by 2.7 to 4.0 times, but still remains far below 0.01.
There is an insignificant increase in OTime and, thus, RTime de-
creases to the same extent as RR. The same applies to the size of
the blocking graph |E_B|, while its order |V_B| remains almost the same.
Finally, BPE is reduced by (1-r)·100%=20% across all datasets.
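For illustration, a minimal Java sketch of Block Filtering follows; it assumes that block importance is measured by block cardinality (smaller blocks being more important) and that the number of retained blocks per profile is rounded up. All names are illustrative.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch of Block Filtering: every profile is retained only in the
// ceil(r * n_p) most important of the n_p blocks that contain it.
final class BlockFilteringSketch {

    // blocks.get(b) lists the ids of the profiles placed in block b.
    static Map<Integer, List<Integer>> filter(List<List<Integer>> blocks, double r) {
        // Gather, for every profile, the ids of the blocks that contain it.
        Map<Integer, List<Integer>> profileToBlocks = new HashMap<Integer, List<Integer>>();
        for (int b = 0; b < blocks.size(); b++) {
            for (int profile : blocks.get(b)) {
                profileToBlocks.computeIfAbsent(profile, k -> new ArrayList<Integer>()).add(b);
            }
        }
        // Retain each profile in the ceil(r * n_p) blocks with the smallest cardinality.
        Map<Integer, List<Integer>> retained = new HashMap<Integer, List<Integer>>();
        for (Map.Entry<Integer, List<Integer>> entry : profileToBlocks.entrySet()) {
            List<Integer> blockIds = new ArrayList<Integer>(entry.getValue());
            blockIds.sort(Comparator.comparingInt(id -> blocks.get(id).size()));
            int keep = (int) Math.ceil(r * blockIds.size());
            retained.put(entry.getKey(), new ArrayList<Integer>(blockIds.subList(0, keep)));
        }
        return retained;
    }
}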
6.3 Time Efficiency Improvements
Table 3 presents the performance of the four existing pruning
schemes, averaged across all weighting schemes. Their original
performance appears in the left part, while the right part presents
their performance after Block Filtering.
Original Performance. We observe that CEP reduces the exe-
cuted comparisons by 1 to 3 orders of magnitude for the smallest
and the largest datasets, respectively. It increases precision (PQ) to
a similar extent at the cost of much lower recall in most of the cases.
This applies particularly to Dirty ER, where PC drops consistently
below 0.80, the minimum acceptable recall of efficiency-intensive
ER applications. The reason is that Dirty ER is more difficult than
Clean-Clean ER, involving much larger blocking graphs with many
more noisy edges between non-matching entity profiles.
CNP is more robust to recall than CEP, as its PC lies well over
0.80 across all datasets. Its robustness stems from its node-centric
functionality, which retains the best edges per node, instead of the
globally best ones. This robustness, though, comes at a much higher
computational cost: its overhead time is larger than that of CEP
by 44%, on average. Further, CNP retains almost twice as many
comparisons and yields a slightly lower precision than CEP.
For WEP, we observe that its recall consistently exceeds 0.95,
the minimum acceptable PC of effectiveness-intensive ER appli-
cations. At the same time, it executes almost an order of magni-
tude fewer comparisons than the original blocks in Table 1(a) and en-
hances PQ to a similar extent. These patterns apply to all datasets.
Finally, WNP saves 60% of the brute-force comparisons, on av-
erage, retaining twice as many comparisons as WEP. Its recall re-
mains well over 0.95 across all datasets, exceeding that of WEP to
a minor extent. As a result, its precision is half that of WEP, while
its overhead time is slightly higher.
In summary, these experiments verify previous findings about
the relative performance of pruning schemes [22]: the cardinality-
based ones excel in precision, being more suitable for efficiency-
intensive ER applications, while the weight-based schemes excel in
recall, being more suitable for effectiveness-intensive applications.
In both cases, the node-centric algorithms trade higher recall for
lower precision and higher overhead.
Block Filtering. Examining the effect of Block Filtering in the
right part of Table 3, we observe two patterns:
(i) Its impact on overhead time depends on the dataset at hand,
rather than the pruning scheme applied on top of it. In fact, OTime
is reduced by 65% (D1C) to 78% (D1D), on average, across all prun-
ing schemes. This is close to RR and to the reduction in the size of
the blocking graph |E_B|, but higher than both, because Block Fil-
tering additionally reduces BPE by 20%.
(ii) Its impact on blocking effectiveness depends on the type
of the pruning criterion used by Meta-blocking. For cardinality
thresholds, Block Filtering conveys a moderate decrease in the re-
tained comparisons, with ||B′|| dropping by 20%, on average. The
reason is that both CEP and CNP use thresholds that are propor-
tional to BPE, which is reduced by (1-r)·100%. At the same time,
their recall is either reduced to a minor extent (<2%), or increases
by up to 10%. The latter case appears in half the datasets and in-
dicates that Block Filtering cleans the blocking graph from noisy
edges, enabling CEP and CNP to identify more duplicates with
fewer retained comparisons.
For weight thresholds, Block Filtering reduces the number of
retained comparisons to a large extent: ||B′|| drops by 62% to 71%
for both WEP and WNP. The reason is that their pruning criteria
depend directly on the size and the structure of the blocking graph.
At the same time, their recall gets lower by less than 3% in all
cases, an affordable reduction that is caused by two factors: (i)
Block Filtering discards some matching edges itself, and (ii) Block
Filtering reduces the extent of co-occurrence for some matching
entity profiles, thus lowering the weight of their edges.
                     Redefined Node-centric Pruning                                                          Reciprocal Node-centric Pruning
          D1C         D2C         D3C         D1D         D2D         D3D          D1C         D2C         D3C         D1D         D2D         D3D
(a) Redefined Cardinality Node Pruning / (b) Reciprocal Cardinality Node Pruning
||B′||    1.63·10^5   8.52·10^5   3.36·10^7   3.91·10^5   1.02·10^6   2.92·10^7    6.54·10^3   2.50·10^5   5.88·10^6   1.19·10^5   2.86·10^5   7.12·10^6
PC(B′)    0.955       0.942       0.965       0.936       0.888       0.942        0.880       0.886       0.912       0.847       0.650       0.868
PQ(B′)    0.014       0.025       0.026       0.006       0.020       0.029        0.312       0.084       0.142       0.017       0.057       0.111
OT(B′)    339 ms      5 sec       2.1 hrs     5 sec       23 sec      4.9 hrs      329 ms      5 sec       2.1 hrs     5 sec       23 sec      4.9 hrs
(c) Redefined Weighted Node Pruning / (d) Reciprocal Weighted Node Pruning
||B′||    3.72·10^5   7.52·10^6   3.96·10^9   7.23·10^6   3.26·10^7   6.96·10^9    9.30·10^4   2.96·10^6   1.41·10^9   2.61·10^6   2.02·10^7   2.54·10^9
PC(B′)    0.979       0.964       0.994       0.979       0.959       0.992        0.953       0.949       0.977       0.964       0.924       0.971
PQ(B′)    6.53·10^-3  4.79·10^-3  5.01·10^-4  4.91·10^-4  1.09·10^-3  2.74·10^-4   3.16·10^-2  8.78·10^-3  1.47·10^-3  1.14·10^-3  1.78·10^-3  7.88·10^-4
OT(B′)    582 ms      10 sec      5.6 hrs     9 sec       45 sec      10.7 hrs     576 ms      10 sec      5.4 hrs     9 sec       45 sec      10.5 hrs
Table 4: Performance of the new pruning schemes on top of Block Filtering over all datasets, averaged across all weighting schemes.
        D1C       D2C      D3C       D1D      D2D      D3D
CEP     117 ms    4 sec    1.5 hrs   3 sec    14 sec   1.8 hrs
CNP     246 ms    6 sec    2.2 hrs   6 sec    25 sec   4.3 hrs
WEP     150 ms    6 sec    2.7 hrs   5 sec    25 sec   3.8 hrs
WNP     257 ms    8 sec    4.4 hrs   7 sec    33 sec   7.5 hrs
Table 5: OTime of Optimized Edge Weighting for each pruning scheme, averaged across all weighting schemes, over the datasets in Table 1(b), i.e., after Block Filtering.
We can conclude that Block Filtering enhances the scalability of
Meta-blocking to a significant extent, accelerating the processing of
all pruning schemes by almost 4 times, on average. It also achieves
much higher precision, while its impact on recall is either negligible
or beneficial. Thus, it constitutes an indispensable pre-processing
step for Meta-blocking. For this reason, the following experiments
are carried out on top of Block Filtering.
Optimized Edge Weighting. Table 5 presents the overhead time
of the four pruning schemes when combined with Block Filtering
and Optimized Edge Weighting (cf. Algorithm 3). Comparing it
with OTime in the right part of Table 3, we observe significant
enhancements in efficiency. Again, they depend on the dataset at
hand, rather than the pruning scheme. In fact, the higher the BPE
of a dataset after Block Filtering, the larger the reduction in over-
head time: OTime is reduced by 19% for D1C, where BPE takes the
lowest value across all datasets (3.63), and by 92% for D2D, where
BPE takes the highest one (26.29). The reason is that Optimized
Edge Weighting minimizes the computational cost of the process
that is applied to every comparison by Original Edge Weighting
(cf. Algorithm 2) from O(2·BPE) to O(1).
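As an illustration of this optimization, the following hedged Java sketch shows one way to obtain all edge weights of a node's neighborhood in a single pass over its blocks, so that each comparison contributes O(1) work; it assumes a CBS-style weight (number of common blocks) and illustrative names, and is not a transcription of Algorithm 3.

import java.util.ArrayList;
import java.util.List;

// Hedged sketch: per-node edge weighting without per-edge block-list intersections.
final class EdgeWeightingSketch {

    // blocks.get(b) lists the profile ids in block b; profileBlocks.get(p) lists the
    // block ids containing profile p; numProfiles bounds the profile ids.
    static List<int[]> weighNeighborhood(int i, List<List<Integer>> blocks,
                                         List<List<Integer>> profileBlocks,
                                         int numProfiles) {
        int[] commonBlocks = new int[numProfiles];      // counters, one per profile
        List<Integer> neighbors = new ArrayList<Integer>();
        for (int b : profileBlocks.get(i)) {
            for (int j : blocks.get(b)) {
                if (j == i) {
                    continue;                           // skip the profile itself
                }
                if (commonBlocks[j] == 0) {
                    neighbors.add(j);                   // first co-occurrence with j
                }
                commonBlocks[j]++;                      // one more shared block
            }
        }
        List<int[]> weightedEdges = new ArrayList<int[]>();
        for (int j : neighbors) {
            weightedEdges.add(new int[] { j, commonBlocks[j] });  // (neighbor, weight)
            commonBlocks[j] = 0;                        // reset the counter for the next node
        }
        return weightedEdges;
    }
}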
Also interesting is the comparison between OTime in Table 5 and
OTime in the left part of Table 3, i.e., before applying Block Filter-
ing to the input blocks. On average, across all datasets and pruning
schemes, OTime is reduced by 87%, which is almost an order of
magnitude. Again, the lowest (72%) and the highest (98%) average
reductions correspond to D1C and D2D, respectively, which exhibit
the minimum and the maximum BPE before Block Filtering. We
can conclude, therefore, that the two efficiency optimizations are
complementary and indispensable for scalable Meta-blocking.
6.4 Precision Improvements
To estimate the performance of Redefined and Reciprocal Node-
centric Pruning, we applied the four pruning schemes they yield to
the datasets in Table 1(b). Their performance appears in Table 4.
Cardinality-based Pruning. Starting with Redefined CNP, we
observe that it maintains the recall of the original CNP, while con-
veying a moderate increase in efficiency. On average, across all
weighting schemes and datasets, it retains 18% fewer comparisons,
increasing PQ by 1.2 times. With respect to OTime, there is no
clear winner, as the implementation of both algorithms is highly
similar, relying on Optimized Edge Weighting. Hence, the origi-
nal CNP is just 2% faster, on average, because it does not store all
retained edges per node in memory.
For Reciprocal CNP, OTime is practically identical with that of
Redefined CNP, as they only differ in a single operator. Yet, Re-
ciprocal CNP consistently achieves the highest precision among
all versions of CNP at the cost of the lowest recall. On average, it
retains 82% and 78% fewer comparisons than CNP and Redefined
CNP, respectively, while increasing precision by 7.9 and 6.9 times,
respectively; it also decreases recall by 11%, but exceeds the mini-
mum acceptable PC for efficiency-intensive applications (0.80) to a
significant extent in most cases. The only exception is D2D, where
PC drops below 0.80 for all weighting schemes. Given that the
corresponding Clean-Clean ER dataset (D2C) exhibits much higher
PC, the poor recall for D2D is attributed to highly similar (i.e.,
noisy) entity profiles in one of the duplicate-free entity collections.
It is worth comparing at this point the new pruning schemes with
their edge-centric counterpart: CEP coupled with Block Filtering
(see the right part of Table 3). We observe three patterns: (i) On aver-
age, CEP keeps 33% fewer comparisons than Redefined CNP, but
lowers recall by 18%. Its recall is actually so low in most datasets
that Redefined CNP achieves higher precision in all cases except
D1C and D2C. (ii) Reciprocal CNP typically outperforms CEP in
all respects of effectiveness. On average, it executes 67% fewer com-
parisons, while increasing recall by 8%. D1C is the exception that
proves this rule: Reciprocal CNP saves more than an order of mag-
nitude more comparisons, but CEP achieves slightly higher recall.
(iii) The overhead time of CEP is lower by 44%, on average, than
that of both algorithms, because it iterates once instead of twice over
all edges in the blocking graph.
On the whole, we can conclude that Reciprocal CNP offers the
best choice for efficiency-intensive applications, as it consistently
achieves the highest precision among all cardinality-based pruning
schemes with PC ≥ 0.8. However, in datasets with high levels of
noise, where many entity profiles share the same information, Re-
defined CNP should be preferred; it is more robust to recall than
CEP and Reciprocal CNP, while saving 30% more comparisons
than CNP. In any case, Block Filtering is indispensable.
Weight-based Pruning. As expected, Redefined WNP main-
tains the same recall as the original implementation of WNP, while
conveying major enhancements in efficiency. On average, across
all weighting schemes and datasets, it reduces the retained com-
parisons by 28% and increases precision by 1.5 times. It is also
faster than WNP by 7%, but in practice, the method with the lowest
OTime varies across the datasets.
Reciprocal WNP performs a deeper pruning that consistently
trades slightly lower recall for a much lower number of executed comparisons.
On average, it reduces the total cardinality of WNP by 72% and the
recall by 2%. Its mean PC drops slightly below 0.95 in D2C, but this
does not apply to all weighting schemes; two of them exceed the
minimum acceptable recall of effectiveness-intensive applications
even for this dataset. As a result, Reciprocal WNP enhances the
precision of WNP by 3.9 times. Its overhead is slightly lower than
             D1C         D2C         D3C         D1D         D2D         D3D
||B′||       4.22·10^4   7.47·10^5   1.60·10^8   4.53·10^5   2.10·10^6   3.06·10^8
PC(B′)       0.870       0.862       0.940       0.862       0.804       0.928
PQ(B′)       0.048       0.026       0.005       0.004       0.009       0.003
OTime(B′)    24 msec     86 msec     1 min       62 msec     150 msec    2 min
(a) Efficiency-intensive Graph-free Meta-blocking (r=0.25)
||B′||       2.21·10^5   6.16·10^6   2.46·10^9   4.73·10^6   2.36·10^7   4.27·10^9
PC(B′)       0.973       0.959       0.993       0.980       0.954       0.991
PQ(B′)       1.02·10^-2  3.56·10^-3  3.60·10^-4  4.78·10^-4  9.24·10^-4  2.07·10^-4
OTime(B′)    30 msec     221 msec    6 min       146 msec    650 msec    25 min
(b) Effectiveness-intensive Graph-free Meta-blocking (r=0.55)
||B′||       1.76·10^6   1.32·10^7   2.34·10^10  9.07·10^7   4.08·10^8   4.81·10^10
PC(B′)       0.994       0.980       0.999       0.997       0.981       0.999
PQ(B′)       1.31·10^-3  1.70·10^-3  3.81·10^-5  2.54·10^-5  5.49·10^-5  1.85·10^-5
OTime(B′)    76 msec     2 sec       1.9 hrs     5 sec       1 min       11.1 hrs
(c) Iterative Blocking
Table 6: Performance of the baseline methods over all datasets.
Redefined WNP, because it retains fewer comparisons in memory.
Thus, it is faster than WNP by 9%, on average.
Compared to WEP, Redefined WNP exhibits lower precision,
but is more robust to recall, maintaining PC well above 0.95 under
all circumstances. In contrast, WEP violates this constraint in all
datasets for at least one weighting scheme. Reciprocal WNP scores
higher precision than WEP, while being more robust to recall, as
well. For this reason, it is the optimal choice for effectiveness-
intensive applications. Again, for datasets with high levels of noise,
Redefined WNP should be preferred.
Baseline Methods. We now compare our techniques with the
state-of-the-art block processing method Iterative Blocking [27]
and with Graph-free Meta-blocking (see end of Section 4.1). The
functionality of the former was optimized by ordering the blocks
from the smallest to the largest cardinality; it was further optimized
for Clean-Clean ER by assuming the ideal case where two matching
entities are not compared to other co-occurring entities after their
detection. The configuration of Graph-free Meta-blocking was also
optimized by setting r to the smallest value in [0.05, 1.0], with a
step of 0.05, that ensures a recall higher than 0.80 and 0.95 across
all real datasets; this resulted in r=0.25 and r=0.55 for efficiency-
and effectiveness-intensive applications, respectively. Table 6
presents the performance of the two methods.
Juxtaposing Efficiency-intensive Graph-free Meta-blocking and
Reciprocal CNP, we notice a clear trade-off between precision
and recall. The latter approach emphasizes PQ, retaining 85%
fewer comparisons at the cost of 5% lower PC, on average. The
only advantage of Graph-free Meta-blocking is its minimal over-
head: its lightweight functionality is able to process datasets with
millions of entities within a few minutes even on commodity hard-
ware. Similar patterns arise when comparing Reciprocal WNP
with Effectiveness-intensive Graph-free Meta-blocking: the former
executes 42% fewer comparisons at the cost of a 4.4% decrease in
recall, thus achieving higher precision. This behaviour can be ex-
plained by the fine-grained functionality of Reciprocal Pruning: un-
like Graph-free Meta-blocking, which operates on the level of indi-
vidual entities, it considers pairwise comparisons, thus being more
accurate in their pruning.
Compared to Iterative Blocking, Reciprocal WNP retains an or-
der of magnitude fewer comparisons at the cost of slightly lower
recall. Thus, it achieves significantly higher precision. The
overhead of Iterative Blocking is significantly lower in most cases,
but it does not scale well to large datasets, even though it does not
involve a blocking graph. The reason is that
it goes through the input block collection several times, updating
the representation of the duplicate entities in all blocks that contain
them, whenever a new match is detected.
7. CONCLUSIONS
In this paper, we introduced two techniques for boosting the effi-
ciency of Meta-blocking along with two techniques for enhancing
its effectiveness. Our thorough experimental analysis verified that,
in combination, our methods go well beyond the existing Meta-
blocking techniques in all respects and simplify its configuration,
depending on the data and the application at hand. For efficiency-
intensive ER applications, Reciprocal CNP processes a large het-
erogeneous dataset with 3 million entities and 80 billion compar-
isons within 2 hours even on commodity hardware; it also man-
ages to retain recall and precision above 0.8 and 0.1, respectively.
For effectiveness-intensive ER applications, Reciprocal WNP pro-
cesses the same voluminous dataset within 5 hours on commodity
hardware, while retaining recall above 0.95 and precision close to
0.01. In both cases, Block Filtering and Optimized Edge Weighting
are indispensable. In the future, we plan to adapt our techniques for
Enhanced Meta-blocking to Incremental Entity Resolution.
References
[1] A. N. Aizawa and K. Oyama. A fast linkage detection scheme for multi-source
information integration. In WIRI, pages 30–39, 2005.
[2] Y. Altowim, D. V. Kalashnikov, and S. Mehrotra. Progressive approach to rela-
tional entity resolution. PVLDB, 7(11):999–1010, 2014.
[3] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins
in data cleaning. In ICDE, page 5, 2006.
[4] P. Christen. A survey of indexing techniques for scalable record linkage and
deduplication. IEEE Trans. Knowl. Data Eng., 24(9):1537–1555, 2012.
[5] V. Christophides, V. Efthymiou, and K. Stefanidis. Entity Resolution in the Web
of Data. Morgan & Claypool Publishers, 2015.
[6] G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. Large-scale linked data in-
tegration using probabilistic reasoning and crowdsourcing. VLDB J., 22(5):665–
687, 2013.
[7] X. L. Dong and D. Srivastava. Big Data Integration. Synthesis Lectures on Data
Management. Morgan & Claypool Publishers, 2015.
[8] A. Elmagarmid, P. Ipeirotis, and V. Verykios. Duplicate record detection: A
survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16, 2007.
[9] I. Fellegi and A. Sunter. A theory for record linkage. Journal of American
Statistical Association, pages 1183–1210, 1969.
[10] L. Getoor and A. Machanavajjhala. Entity resolution: Theory, practice &
open challenges. PVLDB, 5(12):2018–2019, 2012.
[11] L. Getoor and A. Machanavajjhala. Entity resolution for big data. In KDD, 2013.
[12] L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, and D. Sri-
vastava. Approximate string joins in a database (almost) for free. In VLDB,
pages 491–500, 2001.
[13] M. Hernández and S. Stolfo. The merge/purge problem for large databases. In
SIGMOD, pages 127–138, 1995.
[14] Y. Jiang, G. Li, J. Feng, and W. Li. String similarity joins: An experimental
evaluation. PVLDB, 7(8):625–636, 2014.
[15] H. Köpcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches
on real-world match problems. PVLDB, 3(1):484–493, 2010.
[16] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approxi-
mate string searches. In ICDE, pages 257–266, 2008.
[17] J. Madhavan, S. Cohen, X. L. Dong, A. Y. Halevy, S. R. Jeffery, D. Ko, and
C. Yu. Web-scale data integration: You can afford to pay as you go. In CIDR,
pages 342–350, 2007.
[18] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to information re-
trieval. Cambridge University Press, 2008.
[19] A. McCallum, K. Nigam, and L. Ungar. Efficient clustering of high-dimensional
data sets with application to reference matching. In KDD, pages 169–178, 2000.
[20] G. Papadakis, E. Ioannou, C. Niederée, T. Palpanas, and W. Nejdl. Beyond 100
million entities: large-scale blocking-based resolution for heterogeneous data.
In WSDM, pages 53–62, 2012.
[21] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederée, and W. Nejdl. A block-
ing framework for entity resolution in highly heterogeneous information spaces.
IEEE Trans. Knowl. Data Eng., 25(12):2665–2682, 2013.
[22] G. Papadakis, G. Koutrika, T. Palpanas, and W. Nejdl. Meta-blocking: Taking
entity resolution to the next level. IEEE Trans. Knowl. Data Eng., 2014.
[23] G. Papadakis, G. Papastefanatos, and G. Koutrika. Supervised meta-blocking.
PVLDB, 7(14):1929–1940, 2014.
[24] A. Thor and E. Rahm. Moma - a mapping-based object matching system. In
CIDR, pages 247–258, 2007.
[25] M. J. Welch, A. Sane, and C. Drome. Fast and accurate incremental entity reso-
lution relative to an entity knowledge base. In CIKM, pages 2667–2670, 2012.
[26] S. E. Whang, D. Marmaros, and H. Garcia-Molina. Pay-as-you-go entity resolu-
tion. IEEE Trans. Knowl. Data Eng., 25(5):1111–1124, 2013.
[27] S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina.
Entity resolution with iterative blocking. In SIGMOD, pages 219–232, 2009.