Record Linkage Approaches in Big Data: A State Of Art Study
Randa M. Abd El-Ghafar, Mervat H. Gheith
Computer Science Department
Institute of Statistical Studies and Research, Cairo University, Cairo, Egypt
randa_mohamed_cs@yahoo.com, mervat_gheith@yahoo.com

Ali H. El-Bastawissy
Faculty of Computer Science
Modern Sciences and Arts University, Cairo, Egypt
aelbastawissy@msa.eun.eg

Eman S. Nasr
Independent Researcher, Cairo, Egypt
nasr.eman.s@gmail.com
Abstract—Record Linkage aims to find records in a dataset that represent the same real-world entity across many different data sources. It is a crucial task for data quality. With the evolution of Big Data, new difficulties appeared, mainly related to the 5Vs of Big Data properties, i.e. Volume, Variety, Velocity, Value, and Veracity. Therefore, Record Linkage in Big Data is more challenging. This paper investigates ways to apply Record Linkage algorithms that handle the Volume property of Big Data. Our investigation revealed four major issues. First, the techniques used to resolve the Volume property of Big Data mainly depend on partitioning the data into a number of blocks. The processing of those blocks is distributed in parallel among many executors. Second, MapReduce is the most famous programming model designed for parallel processing of Big Data. Third, a blocking key is usually used for partitioning the big dataset into smaller blocks; it is often created by the concatenation of the prefixes of chosen attributes. Partitioning using a blocking key may lead to unbalanced blocks, a situation known as data skew, where data is not evenly distributed among blocks. An uneven distribution of data degrades the performance of the overall execution of the MapReduce model. Fourth, to the best of our knowledge, only a small number of studies have been conducted so far to balance the load between data blocks in a MapReduce framework. Hence, more work should be dedicated to balancing the load between the distributed blocks.
Keywords—Big Data; Big Data Integration; blocking; entity
matching; entity resolution; Hadoop; machine learning;
MapReduce; Record Linkage.
I. INTRODUCTION
The term Big in Big Data does not only refer to the size of
the data, but it also indicates many characteristics known as the
five dimensions of Big Data, or 5Vs of Big Data, as shown in
Fig. 1. Those 5Vs are: Volume, which refers to the "Big" in the term Big Data; Variety, which refers to data that come from a variety of sources and have many formats (structured, as in traditional databases; semi-structured, as in XML files; or unstructured, as in images); Velocity, which refers to how quickly the data can arrive, be stored, and be retrieved; Veracity, which represents the reliability of the data; and Value, which is the high value that can be obtained by analyzing a huge amount of data. Actually, the majority of data (about 80%) is unstructured, which explains why businesses are mainly concerned with managing unstructured data [1].
Data Integration is the process of combining data from
many different sources and providing a user with one unified
view of these data. Big Data Integration differs from traditional
Data Integration in the three basic dimensions of Big Data
known as 3Vs (Volume, Variety, Velocity), as they make the
process of Big Data Integration more challenging and
complicated. For the Volume, the number of data sources has
grown to be millions for a single domain [2]. For the Velocity,
many of the data sources are very dynamic and rapidly
exploding as a direct result of the rate at which data is moving
and being collected from heterogeneous data sources [2]. For
Variety, Big Data projects include data sources from different
domains, which are naturally diverse as they refer to different
types of entities and relationships; these different types of
sources need to be integrated into a unified view of the data [2].
Fig. 1. Five dimensions of Big Data.
The architecture of Data Integration is composed of three steps: schema alignment, Record Linkage, and data fusion. The three steps aim to address the challenges of integrating data from multiple sources, such as semantic ambiguity, instance representation ambiguity, and data inconsistency [2], as shown in Fig. 2.
Schema alignment addresses the semantic ambiguity challenge and aims to recognize which attributes have the same meaning and which do not [2]. Record Linkage addresses the challenge of instance representation ambiguity and aims to recognize which elements represent the same real-world entity and which do not [2]. Data fusion addresses the challenges of data quality and aims to know which values to use when conflicting attribute values arise from many sources [2].
This paper presents a survey of Record Linkage techniques
found in the literature to handle the increasing volume of data.
Adding the volume dimension to traditional Record Linkage makes the task more complicated and challenging, as each pair of records is compared using one or more similarity techniques, which makes Record Linkage an expensive task that could take days to accomplish.
The rest of this paper is organized as follows. Section II
presents the state of the art of Big Data Integration techniques.
Section III explains blocking techniques and MapReduce
programming model. Finally, the conclusion and future work
are given in section IV.
II. THE STATE OF THE ART OF BIG DATA INTEGRATION TECHNIQUES
The importance of Big Data Integration has resulted in an increasing amount of research in the fields of schema mapping, Record Linkage, and data fusion over the past few years to deal with the challenges associated with them. Table I shows a summary of these techniques [3]. This section addresses Record Linkage techniques that handle the Volume property of Big Data.
A. Record Linkage Definition and Importance
Preparing and cleaning any dataset for analysis is a very important step because of the concept of garbage in, garbage out. It is a costly, error-prone, and time-consuming process. Actually, it takes about 80% of the whole time spent on analytics. According to Gartner, a total of $44 billion was invested in 2014 alone for successfully preparing data obtained from many sources for use in data analysis [4].
Record Linkage, Entity Resolution, or Entity Matching is a crucial step in data cleaning which aims to find records (i.e., database tuples) that refer to the same real-world entity [5], [6]. Record Linkage is an expensive process that could take many hours or even days. The situation becomes more complicated, especially for large datasets, as it compares pairs of records using one or more similarity measures [7].
Fig. 2. Traditional Data Integration architecture [2].
B. Challenges Involved in Record Linkage of Big Data
Traditional Record Linkage depends on measuring the similarity between a static set of structured tuples that have the same schema. Record Linkage is not a simple and trivial task due to many reasons. First, typographical errors prevent records from being associated with the same individual. Second, choosing the fields that will be used in detecting the similarity between records. Third, determining the threshold that decides which records are considered duplicates and which are not (e.g., if the similarity between records is 80% or above, they will be considered duplicates). In Big Data Integration the situation is more complicated and challenging because data have very heterogeneous structures and come from many sources (structured, unstructured, and semi-structured), and data sources are continuously evolving and dynamic. This makes Record Linkage more challenging and non-trivial in Big Data Integration [3].
C. Record Linkage Techniques in Big Data
Traditional Record Linkage approaches become inefficient and ineffective when examining large datasets. New techniques have been proposed to address the challenges of the volume dimension by using MapReduce to efficiently and effectively parallelize the process of Record Linkage. Parallelized Record Linkage depends on effectively distributing the workload between many nodes, exploiting the techniques of adaptive blocking [2].
Applying Record Linkage from scratch with each update of large datasets is unaffordable, especially when the datasets are dynamic and continuously evolving. Incremental Record Linkage has been proposed to address the challenges of the velocity aspect. It allows efficient incremental Record Linkage in case of any update, insert, or delete of any record in the dataset [2]. To address the variety challenge of Big Data Integration, new techniques have been proposed to link or match structured data with unstructured or free text. Matching structured data with unstructured text can arise in many cases, such as matching unstructured offers to structured product descriptions, or matching people with tweets or blog posts about their shopping experience [8]. Finally, to address the veracity aspects, a variety of clustering and linking techniques have been proposed.
TABLE I. SUMMARY OF STATE OF BIG DATA INTEGRATION TECHNIQUES [3]

Big Data Property | Schema mapping                         | Record linkage                             | Data fusion
Volume            | Integrating Deep Web, Web tables/lists | Adaptive blocking, MapReduce-based linkage | Online fusion
Velocity          | -                                      | Incremental linkage                        | Fusion in a dynamic world
Variety           | Dataspace systems                      | Linking text to structured data            | Combining fusion with linkage
Veracity          | -                                      | Value-variety tolerant linkage             | Truth discovery
Those techniques focus on out-of-date values and how to deal with erroneous values effectively. Dealing with erroneous and out-of-date values is very important because they may prevent correct Record Linkage [2].
The techniques that handle the increasing volume of data
depend on parallel processing where the input data is
partitioned into a number of blocks to distribute the workload
between them. Then, these blocks are processed by a number of reduce tasks in the MapReduce programming model described below.
III. BLOCKING TECHNIQUES AND MAPREDUCE PROGRAMMING MODEL
This section discusses the blocking techniques used to partition a Big Dataset into a number of chunks, where similar records are more likely to be placed in the same chunk. After distributing the dataset into chunks, a similarity function is executed in parallel over those chunks utilizing the MapReduce programming model. The main idea of parallelism is to speed up the Record Linkage runtime and to exploit the ability to scale out.
A. Blocking Techniques
The full detection of duplicates in the Record Linkage process requires a complete scan of all records to compute the similarity between them. Performing a complete scan of all records requires executing a Cartesian product of similarity checks over the n records of the input dataset, with a complexity of O(n²). This makes the process of Record Linkage exhaustive, expensive, tedious, and prone to errors, especially with large datasets. Partitioning the data into blocks, where each block contains only records that are more likely to be similar, solves this problem by only comparing the records within the same block [9].
A blocking key is used to partition Big Datasets into smaller blocks depending on chosen entity attributes' values. The blocking key is often created by the concatenation of the prefixes of the chosen attributes [10]. Standard Blocking, Sorted Neighborhood, Q-gram Indexing, and Canopy Clustering with TF-IDF are the different blocking methods for Record Linkage listed by [11], [12], as illustrated in Table II. The Sorted Neighborhood method is one of the most widely used blocking methods, employed by many authors in [13], [14], [10], [15], [16], [17].
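To make prefix-based blocking keys concrete, the following Python sketch (the attribute names and prefix length are hypothetical choices, not taken from the cited works) builds a key by concatenating attribute prefixes and groups records into blocks, so that only records within one block would later be compared:

from collections import defaultdict

def blocking_key(record, attributes=("first_name", "city"), prefix_len=3):
    # Concatenate the prefixes of the chosen attributes to form the blocking key.
    return "".join(str(record.get(a, ""))[:prefix_len].lower() for a in attributes)

def standard_blocking(records):
    # Standard blocking: records sharing a blocking key value land in the same block.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks

records = [
    {"first_name": "Jonathan", "city": "Cairo"},
    {"first_name": "Jonathon", "city": "Cairo"},  # likely duplicate; same block
    {"first_name": "Mona", "city": "Giza"},
]
for key, block in standard_blocking(records).items():
    print(key, len(block))  # joncai 2, mongiz 1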
B. MapReduce Programming Model
MapReduce is a shared-nothing programming model. It is specially designed to handle exhaustive processing by distributing the workload in parallel between multiple clusters or multiple nodes. The cluster of nodes can easily be scaled out if needed [18]. In the MapReduce paradigm, data is partitioned as described before and placed in blocks; the data in each block is then represented by key-value pairs. The processing is moved to where the data is located using two user-defined functions, Map and Reduce. The Map function is used to partition the input dataset into chunks to be processed separately according to the required function. The Reduce function is used to sort and collect the output from each Map function and generate one final output.
TABLE II. SUMMARY OF THE BLOCKING TECHNIQUES

Standard Blocking:
- All records with the same blocking key value (BKV) will be in the same block.
- Only records within the same block will be compared to check the duplication between them.

Sorted Neighborhood:
- The database is sorted according to the BKV.
- A window of a fixed size is moved over the sorted records.
- Candidate records are generated only from records of the current window.

Q-Gram Indexing:
- All database records that have not only the same BKV but also a similar BKV are inserted into the same block.
- Each BKV is transformed into a list of q-grams.
- Combinations of these q-gram lists are then generated down to a minimum length, determined by a user threshold.
- A record identifier may therefore be inserted into more than one block.

Canopy Clustering with TF-IDF:
- The clustering is done by calculating the similarity between the BKVs using measures such as TF-IDF or Jaccard.
Fig. 3 illustrates a simple example of a MapReduce program. Suppose we want to calculate the frequency of each character in the input string (3 lines in this case, for simplicity) by utilizing the MapReduce programming model. First, the input data is split or partitioned into 3 chunks. Second, each chunk goes to a mapper, which is responsible for generating key-value pairs (<char>, <char frequency>). Third, the reduce task starts with a shuffle and sort step: it sorts the keys generated from the map phase into a larger data list so they can be easily grouped and aggregated in the reduce phase.
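The following is a minimal, single-machine Python simulation of this character-frequency example (the three input lines are made up); it only illustrates the map, shuffle-and-sort, and reduce phases and is not tied to any particular Hadoop API:

from collections import defaultdict

lines = ["abc", "aab", "bcc"]  # the 3 input splits, one per mapper

# Map phase: each mapper emits (char, 1) pairs for its split.
mapped = [(ch, 1) for line in lines for ch in line]

# Shuffle and sort: group all emitted values by key before the reduce phase.
grouped = defaultdict(list)
for ch, count in mapped:
    grouped[ch].append(count)

# Reduce phase: each reducer sums the counts for one key.
frequencies = {ch: sum(counts) for ch, counts in sorted(grouped.items())}
print(frequencies)  # {'a': 3, 'b': 3, 'c': 3}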
By applying the MapReduce model to the Record Linkage problem for Big Data, the Map function is used to read the input datasets in parallel from the connected nodes, partition them into small chunks, and redistribute them based on their blocking key to the reduce functions/tasks that are responsible for the Record Linkage process.
C. Blocking-based Record Linkage Using the MapReduce Programming Model
By utilizing the MapReduce model to solve the Record Linkage problem with Big Data, the Map function or mapper will read the input datasets in parallel and partition them into small chunks in the form of key-value pairs (blocking key, entity). The default hash partition function will use these key-value pairs to redistribute the blocks based on their blocking key to the reduce functions/tasks.

Fig. 3. Calculating the frequency of characters using the MapReduce paradigm.

The reducer function is responsible for the Record Linkage process using one or a combination of the similarity techniques. Entities that have the same blocking key will be processed in the same reducer. This guarantees that all input entities having the same blocking key, and thus residing in the same block, will be processed by the same reducer, as illustrated in the pseudocode depicted in Fig. 4. So, MapReduce offers a convenient programming model for scaling out the Record Linkage process by running Record Linkage techniques with parallelized blocking [19]. The use of the basic MapReduce programming model has the following limitations:
- It is susceptible to data skew because of unequal distributions of blocks.
- A single reduce task, or a few of them, may become the dominant factor in the execution time of the whole Record Linkage process.
- Some nodes of the MapReduce model may be idle waiting for others to complete their tasks.
Therefore, the basic MapReduce model with Big Data is not scalable if we do not solve the bottleneck that happens during the execution of large blocks, is not effective because of the time-consuming cost of running large blocks, and finally has difficulty handling fault tolerance.
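As a rough sketch of this basic (not yet load-balanced) workflow, the following Python simulation assumes a hypothetical name-prefix blocking key, a generic string similarity, and an illustrative threshold of 0.8; it mirrors the idea of the pseudocode in Fig. 4 rather than reproducing it:

from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def mapper(entity):
    # Map: emit (blocking key, entity); the key here is a hypothetical name prefix.
    return entity["name"][:3].lower(), entity

def reducer(key, entities, threshold=0.8):
    # Reduce: compare all pairs of entities that share the same blocking key.
    matches = []
    for a, b in combinations(entities, 2):
        sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
        if sim >= threshold:
            matches.append((a["id"], b["id"]))
    return matches

entities = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jonathon Smith"},
    {"id": 3, "name": "Mona Ali"},
]

# Shuffle: the default hash partitioning sends equal keys to the same reducer.
blocks = defaultdict(list)
for key, entity in map(mapper, entities):
    blocks[key].append(entity)

for key, block in blocks.items():
    # Block 'jon' yields the candidate pair (1, 2); block 'mon' yields none.
    print(key, reducer(key, block))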
Partitioning Big Datasets into smaller blocks using blocking key techniques may lead to unequally distributed blocks, which causes a problem known as data skew. Data skew happens when the data is not evenly distributed across the data partitions. An uneven distribution of data degrades the performance of the overall execution, as some nodes will sit idle waiting for others with larger blocks to finish their jobs. Data skew can occur in both the map and reduce phases. Solving data skew that happens in the map phase is easy, whereas it is a very complex and demanding issue in the reduce phase, because the majority of the execution happens in the reduce phase.
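To see why one large block dominates the reduce phase, note that a block of k entities requires k(k-1)/2 comparisons; a short Python calculation with made-up block sizes illustrates the imbalance:

block_sizes = {"b1": 1000, "b2": 100, "b3": 100, "b4": 100}  # skewed, hypothetical

comparisons = {b: k * (k - 1) // 2 for b, k in block_sizes.items()}
total = sum(comparisons.values())
print(comparisons)  # {'b1': 499500, 'b2': 4950, 'b3': 4950, 'b4': 4950}
print(f"largest block: {comparisons['b1'] / total:.0%} of all comparisons")  # ~97%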
A small number of studies consider the use of the MapReduce paradigm to solve the load-balancing problem, especially in Record Linkage. Hadoop is one of the most famous frameworks that implement the MapReduce programming model. An extreme improvement in the efficiency of the whole Record Linkage process is noticed by balancing the workload of the reducers in the matching step [19].
Unequal distribution of the blocks will cause many problems, such as load balancing and fault tolerance between the reducers in the MapReduce programming model. Many authors tried to solve the load balancing problem by utilizing the idea of blocking key distribution. Huang [20] introduced a partitioning solution that dynamically balances the workload of record comparisons to solve the problem of uneven workload that happens due to unequally partitioned blocks. Their evaluation achieves a significant improvement that outperforms the default partitioning of Hadoop for Record Linkage problems involving data skew.
Fig. 4. Pseudocode of Record Linkage of Big Data using the basic MapReduce model.

Kolb, et al. [21] addressed the skew problem by proposing a general load balancing approach named BlockSplit. The aim of BlockSplit is to reduce the search space of Record Linkage and evenly distribute the workload between the reducers in the MapReduce programming model. BlockSplit consists of two main jobs, each of which is carried out by a number of map and reduce tasks. The first one is the analysis job. This job receives the input partitions and performs blocking using the blocking key. The output of this job is the number of entities per block, known as the Block Distribution Matrix (BDM). The BDM is the input of the second job, known as the matching phase. Record Linkage is done between elements of the same block in this phase. The BlockSplit approach takes the size of the blocks into account and assigns each block to a number of reduce tasks after checking the load balancing constraints. Small blocks will be processed in one reducer. Large blocks are divided into a number of smaller sub-blocks based on the input partitions for better performance of the parallel matching that occurs within multiple reduce tasks. However, as mentioned by Kolb, et al. [22], this approach may still lead to an unbalanced reduce load because of the different sizes of the sub-blocks. Also, they used round robin as a balancing technique in the first (analysis) job in order to evenly distribute the blocks between the available reducers. Using the round-robin technique did not lead to a balanced number of elements between reducers, although this can be ignored because the time of this job is not the dominant factor.
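A much-simplified Python sketch of the BlockSplit idea (an illustration of the concept, not Kolb et al.'s exact algorithm): an analysis pass counts entities per blocking key and input partition (the BDM), and the assignment pass keeps small blocks whole while splitting blocks larger than the average reducer workload into per-partition sub-blocks:

from collections import Counter

def block_distribution_matrix(partitions):
    # Analysis job: count entities per (blocking key, input partition) -- the BDM.
    bdm = Counter()
    for p_idx, partition in enumerate(partitions):
        for key, _entity in partition:
            bdm[(key, p_idx)] += 1
    return bdm

def assign_blocks(bdm, num_reducers):
    # Matching job setup: keep small blocks whole, split large ones by input partition.
    block_sizes = Counter()
    for (key, _p), n in bdm.items():
        block_sizes[key] += n
    avg_load = sum(block_sizes.values()) / num_reducers
    assignment = []
    for key, size in block_sizes.items():
        if size <= avg_load:
            assignment.append((key, None))       # whole block -> one reduce task
        else:
            for (k, p), _n in bdm.items():
                if k == key:
                    assignment.append((key, p))  # one sub-block per input partition
    return assignment

partitions = [[("jon", "e1"), ("jon", "e2"), ("mon", "e3")],
              [("jon", "e4"), ("sar", "e5")]]
bdm = block_distribution_matrix(partitions)
print(assign_blocks(bdm, num_reducers=2))
# [('jon', 0), ('jon', 1), ('mon', None), ('sar', None)]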
Kolb, et al. [22] proposed and evaluated PairRange, a load balancing approach for MapReduce-based Entity Resolution. PairRange redistributes the entities between reducers such that each reducer computes nearly the same number of entity comparisons. PairRange consists of two main jobs: the first one calculates the BDM as described in [21]; the second job calculates the total number of comparisons from all blocks and redistributes them to all available reducers such that each reducer takes an almost equal range of entity comparisons. The approaches proposed by Kolb, et al. in [21] and [22] are scalable, as they scale with an increasing number of working nodes compared to the basic MapReduce model, are robust against data skew, and finally achieve a decrease in execution time compared to the basic MapReduce model as more nodes are added. The BlockSplit and PairRange approaches have only been applied to large-scale datasets and have not yet been explored on Big Datasets. No filtering or pruning strategy was implemented in either approach to limit the number of comparisons between elements of the same block. In the PairRange approach, elements are not equally distributed among the reducers, although the resulting comparisons are balanced between reducers. It is also not clear in the PairRange approach how a reducer skips the element comparisons that are carried out in other reducers.
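The PairRange idea can likewise be sketched by numbering all pairwise comparisons globally and handing each reducer a nearly equal, contiguous range of pair indices (again a simplified illustration with hypothetical block sizes, not the published algorithm):

def pair_ranges(block_sizes, num_reducers):
    # Give each reducer an almost equal, contiguous range of global pair indices.
    total = sum(k * (k - 1) // 2 for k in block_sizes)
    chunk = -(-total // num_reducers)  # ceiling division
    return [(r * chunk, min((r + 1) * chunk, total)) for r in range(num_reducers)]

# Hypothetical block sizes, as produced by a BDM-like analysis job.
print(pair_ranges([1000, 100, 100, 100], num_reducers=4))
# [(0, 128588), (128588, 257176), (257176, 385764), (385764, 514350)]
# Each reducer handles about 128,588 comparisons instead of one reducer doing 499,500.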
Chen, et al. [23] introduced a methodology to carry out Record Linkage without having to perform pair-wise matching. They used matching keys based on the MapReduce framework. This methodology consists of three main parts: standardization, match key generation, and transitive closure. They overcome the limitation of transitive closure by designing a new iterative transitive closure that applies the method on multiple match keys in a MapReduce scenario.
Papadakis, et al. [24] proposed a method to remove redundant comparisons. They also introduced a measure for quantifying the redundancy entailed by a blocking method and explained how to use it to tune the process of comparison pruning. They applied the proposed blocking techniques on two large datasets. The results showed a remarkable increase in the efficiency of the comparison and blocking process.
Jin, et al. [25] proposed DiSC, a distributed algorithm for single-linkage hierarchical clustering based on the MapReduce programming model. They introduced an analysis of the algorithm, including an upper bound on the computation cost. The algorithm was flexible enough to run on a big dataset by configuring some parameters and showed a scalable speedup of up to 160 on 190 computer cores.
Hsueh, et al. [19] applied multiple key distributions. They tried to solve the Record Linkage problem for a large dataset by proposing a MapReduce algorithm that uses multiple keys. The proposed algorithm consists of two phases: the first is combination-based blocking and the second is load-balanced matching. They used the two keys in the combination-based blocking to filter out unnecessary entity pairs. In the matching step, load balancing is done by obtaining statistics of the total number of computations required for each block and then evenly distributing the number of comparisons between all the reducers. They only used one input partition; solving the load balancing problem in this case is much easier than having more than one input partition as in [21] and [22].
The current state of the art employs supervised or unsupervised learning-based approaches to the Record Linkage problem, where supervised learning aims to classify a pair of records as matched or non-matched. The classifier is learned or trained on sets of record pairs labeled as matched or non-matched and is then tested until it reaches an acceptable accuracy. The goal of learning-based Record Linkage is to classify new record pairs as matched or non-matched based on the training data.
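A minimal Python sketch of this learning-based formulation, using per-attribute similarity scores as features and scikit-learn's logistic regression as a generic classifier (the attributes, toy training pairs, and labels are all illustrative assumptions):

from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(a, b):
    # The similarity of each common attribute serves as one feature of the pair.
    return [SequenceMatcher(None, a[f], b[f]).ratio() for f in ("name", "city")]

# Toy labeled record pairs: 1 = matched, 0 = non-matched.
pairs = [
    (({"name": "Jonathan Smith", "city": "Cairo"},
      {"name": "Jonathon Smith", "city": "Cairo"}), 1),
    (({"name": "Ali Hassan", "city": "Cairo"},
      {"name": "Aly Hasan", "city": "Cairo"}), 1),
    (({"name": "Mona Ali", "city": "Giza"},
      {"name": "Sara Omar", "city": "Luxor"}), 0),
    (({"name": "Omar Farouk", "city": "Aswan"},
      {"name": "Hassan Omar", "city": "Cairo"}), 0),
]
X = [features(a, b) for (a, b), _ in pairs]
y = [label for _, label in pairs]

clf = LogisticRegression().fit(X, y)  # train the match/non-match classifier

# Classify a new, unseen pair as matched (1) or non-matched (0).
new_pair = ({"name": "Jonathan Smyth", "city": "Cairo"},
            {"name": "Jonathan Smith", "city": "Cairo"})
print(clf.predict([features(*new_pair)]))  # likely [1], i.e. matched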
Many authors considered the use of machine learning in the Record Linkage problem. Kolb, et al. [17] discussed the use of the MapReduce programming model to parallelize the workload distribution between many reducers using variations of the Sorted Neighborhood Method (SNM) with a varying window size. They proposed two different strategies for calculating similarity in the case of the Cartesian product of two input sources. The first one, named MapSide, depends only on computing similarity in the Map phase, while the second one, named ReduceSplit, evenly distributes entity pairs of the two input sources across all available reduce tasks, which finally apply the similarity computation in each reducer. Each pair-wise similarity between the common properties in the two data sources serves as a feature of the classification model. Computing similarities is the dominant factor in the classification model because it consumes about 95% of the time. No blocking techniques or load balancing strategies were employed to handle the computational skew that results from unbalanced blocks. No pruning of similarity computations was developed to reduce the overall number of comparisons of entities that are likely to be non-matched.
Kolb, et al. [26] developed a tool called Dedoop (Deduplication with Hadoop), which supports MapReduce-based Entity Resolution for large datasets. Dedoop has a GUI for its workflow, as shown in Fig. 5, which consists of three steps: blocking, similarity computation, and the matching decision for the input blocks depending on their similarity value. The final step (match classification) can depend on a classifier trained with a machine learning algorithm on a training dataset. Several blocking techniques can be used in the blocking, similarity computation, and match classification steps. Dedoop does not offer any pruning step to reduce the number of unnecessary comparisons.
Cao, et al. [9] proposed a new algorithm for the blocking phase of the Record Linkage problem. The proposed algorithm learns the blocking schema from both labeled and unlabeled data. Experiments showed that using unlabeled data in the learning process could decrease the number of candidate matches while at the same time maintaining the same level of true matches [9].
Moir, et al. [27] introduced the Generic Entity Resolution (GER) framework to classify pairs of entities as matching or non-matching based on their features and the semantic relationships between them. The proposed approach applies supervised machine learning to determine the features of the attributes and their semantic relationships used in the classification process. Applying the proposed approach results in increased classification accuracy.
Fig. 5. Overview of Dedoop [26].
IV. CONCLUSION AND FUTURE WORK
Record Linkage techniques face many challenges in Big Data due to its high volume. This paper surveys the Record Linkage techniques used to handle the Volume property of Big Data. Most techniques use the MapReduce model for distributed computing and parallel processing. All solutions depend on partitioning the data into blocks using a chosen key in the map function and then resolving the duplicates using one of the duplicate resolution techniques in the reduce function. Partitioning input data into blocks may lead to data skew, which results in an unbalanced workload that can be addressed by an additional MapReduce job that determines the blocking key distribution.
Yet not much effort has been spent on load balancing of MapReduce techniques, so more work should be dedicated to it because it has not been actively explored. More attention should also be paid to pruning the unnecessary comparisons of entities that are likely to be non-matched during the similarity computations. In addition, using machine learning in the field of Record Linkage is still in its infancy and needs more effort to be exerted on it. In future work, we plan to find an adaptive method to balance the load between the data blocks in the MapReduce programming model. We will try to execute the mentioned load balancing approaches (BlockSplit and PairRange) on Big Data, as they have only been explored on a large data scale. We will exploit machine learning to determine the best attributes that could be used for blocking. In addition, we will try to apply some pruning methods to avoid unnecessary comparisons between the records located in the same blocks.
REFERENCES
[1] M. Dhavapriya and N. Yasodha, "Big Data Analytics Challenges and
Solutions Using Hadoop, Map Reduce and Big Table," International
Journal of Computer Science Trends and Technology (IJCST), vol. 4,
no. 1, 2016.
[2] X. L. Dong and D. Srivastava, Big Data Integration, Morgan & Claypool
Publishers, 2015.
[3] X. L. Dong and D. Srivastava, "Big data integration," in ICDE, 2013.
[4] F. Castanedo, Data Preparation in the Big Data Era, USA: O’Reilly
Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA, 2015.
[5] L. Getoor and A. Machanavajjhala, "Entity Resolution: Theory, Practice
& Open Challenges," in VLDB Endowment 5(12), 2012.
[6] C. Kong, M. Gao, C. Xu, W. Qian and A. Zhou, "Entity Matching
Across Multiple Heterogeneous Data Sources," in International
Conference on Database Systems for Advanced Applications, Cham,
2016.
[7] H. Köpcke, A. Thor and E. Rahm, "Evaluation of entity resolution approaches on real-world match problems," in Proceedings of the VLDB Endowment, 2010.
[8] A. Kannan, R. Agrawal and A. Fuxman, "Matching unstructured product offers to structured product specifications," in 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011.
[9] Y. Cao, Z. Chen, J. Zhu, P. Yue, C.-Y. Lin and Y. Yu, "Leveraging Unlabeled Data to Scale Blocking for Record Linkage," in Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), 2011.
[10] L. Kolb, A. Thor and E. Rahm, "Parallel Sorted Neighborhood Blocking
with MapReduce," in Proc. Conf. Datenbanksysteme in Buro, Technik
und Wissenschaft, 2011.
[11] R. Baxter, P. Christen and T. Churches, "A Comparison of Fast Blocking Methods for Record Linkage," in ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003.
[12] P. Christen, "A Survey of Indexing Techniques for Scalable Record
Linkage and Deduplication," in IEEE Transactions on Knowledge and
Data Engineering X(Y), 2011.
[13] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederée and W. Nejdl, "A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces," in IEEE Transactions on Knowledge and Data Engineering, 2013.
[14] G. Papadakis, E. Ioannou, C. Niederée, T. Palpanas and W. Nejdl, "Eliminating the Redundancy in Blocking-based Entity Resolution Methods," in Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, 2011.
[15] L. Kolb, A. Thor and E. Rahm, "Multi-pass Sorted Neighborhood Blocking with MapReduce," in Computer Science - Research and Development, 2012.
[16] D. G. Mestre and C. E. Pires, "An Adaptive Blocking Approach for
Entity Matching with MapReduce," in SBBD, 2013.
[17] L. Kolb, H. Köpcke, A. Thor and E. Rahm, "Learning-based Entity Resolution with MapReduce," in CloudDB, 2011.
[18] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in the 6th Conference on Symposium on Operating Systems Design & Implementation, Berkeley, CA, USA, 2004.
[19] S.-C. Hsueh, M.-Y. Lin and Y.-C. Chiu, "A Load-Balanced MapReduce
Algorithm for Blocking-based Entity-resolution with Multiple Keys," in
Parallel and Distributed Computing, 2014.
[20] Y. Huang, "Record linkage in an Hadoop environment," School of Computing, National University of Singapore, 2011.
[21] L. Kolb, A. Thor and E. Rahm, "Block-based Load Balancing for Entity
Resolution with MapReduce," in Proceedings of the 20th ACM
international conference on Information and knowledge management,
2011.
[22] L. Kolb, A. Thor and E. Rahm, "Load Balancing for MapReduce-based Entity Resolution," in International Conference on Data Engineering (ICDE), IEEE, Leipzig, Germany, 2012.
[23] C. Chen, D. Pullen, R. H. Petty and J. R. Talburt, "Methodology for
Large-Scale Entity Resolution Without Pairwise Matching," in IEEE
International Conference on Data Mining Workshop (ICDMW), 2015.
[24] G. Papadakis, E. Ioannou, C. Niederée, T. Palpanas and W. Nejdl, "To Compare or Not to Compare: Making Entity Resolution more Efficient," in Proceedings of the International Workshop on Semantic Web Information Management, 2011.
[25] C. Jin, M. M. A. Patwary, A. Agrawal, W. Hendrix, W.-k. Liao and A.
Choudhary, "DiSC: A Distributed Single-Linkage Hierarchical
Clustering Algorithm using MapReduce," in 4th International SC
Workshop on Data Intensive Computing in the Clouds (DataCloud),
2013.
[26] L. Kolb, A. Thor and E. Rahm, "Dedoop: Efficient Deduplication with Hadoop," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 1878-1881, 2012.
[27] C. Moir and J. Dean, "A Machine Learning approach to Generic Entity Resolution in support of Cyber Situation Awareness," in Proceedings of the 38th Australasian Computer Science Conference (ACSC 2015), 2015.