VLDB Journal manuscript No.
(will be inserted by the editor)
Creating Probabilistic Databases from Duplicated Data
Oktie Hassanzadeh · Renée J. Miller
Received: 14 September 2008 / Revised: 1 April 2009 / Accepted: 26 June 2009
Abstract A major source of uncertainty in databases
is the presence of duplicate items, i.e., records that refer
to the same real world entity. However, accurate dedu-
plication is a difficult task and imperfect data cleaning
may result in loss of valuable information. A reason-
able alternative approach is to keep duplicates when
the correct cleaning strategy is not certain, and uti-
lize an efficient probabilistic query answering technique
to return query results along with probabilities of each
answer being correct. In this paper, we present a flex-
ible modular framework for scalably creating a prob-
abilistic database out of a dirty relation of duplicated
data and overview the challenges raised in utilizing this
framework for large relations of string data. We study
the problem of associating probabilities with duplicates
that are detected using state-of-the-art scalable approx-
imate join methods. We argue that standard threshold-
ing techniques are not sufficiently robust for this task,
and propose new clustering algorithms suitable for in-
ferring duplicates and their associated probabilities. We
show that the inferred probabilities accurately reflect
the error in duplicate records.
1 Introduction
The presence of duplicates is a major concern for the
quality of data in large databases. To detect duplicates,
entity resolution, also known as duplicate detection or
record linkage, is used as a part of the data cleaning
process to identify records that potentially refer to the
Work supported in part by NSERC.
Department of Computer Science
University of Toronto
E-mail: {oktie,miller}@cs.toronto.edu
same entity. Numerous deduplication techniques exist
to normalize data and remove erroneous records [42].
However, in many real world applications accurately
merging duplicate records and fully eliminating erro-
neous duplicates is still a very human-labor intensive
process. Furthermore, full deduplication may result in
the loss of valuable information.
An alternative approach is to keep all the data and
introduce a notion of uncertainty for records that have
been determined to potentially refer to the same en-
tity. Such data would naturally be inconsistent, con-
taining sets of duplicate records. Various methodologies
exist with different characteristics for managing uncer-
tainty and inconsistency in data [2,3, 15,22,51]. A large
amount of previous work addresses the problem of ef-
ficient query evaluation on probabilistic databases in
which it is assumed that meaningful probability val-
ues are assigned to the data in advance. Given these
probabilities, a query can return answers together with
a probability of the answer being correct, or alterna-
tively return the top-k most likely answers. For such ap-
proaches to work over duplicate data, the record prob-
abilities must accurately reflect the error in the data.
To illustrate this problem, consider the dirty rela-
tions of Figure 1. To assign probabilities, we must first
understand which records are potential duplicates. For
large data sets, a number of scalable approximate join
algorithms exist which return pairs of similar records
and their similarity scores (e.g., [4,8, 38]). Given the
result of an approximate join, we can group records
into sets of potential duplicates using a number of tech-
niques. The simplest technique is to group all records
whose similarity is within some threshold value. We
show that using simple thresholding with such tech-
niques to determine groups (clusters) of duplicates of-
ten results in poor accuracy. This is to be expected as
Company
tid  name                 emp#  hq          cid  prob
t1   Altera Corporation   6K    NY          c1   0.267
t2   Altersa Corporation  5K    New York    c1   0.247
t3   lAtera Croporation   5K    New York    c1   0.224
t4   Altera Corporation   6K    NY, NY      c1   0.262
t5   ALTEL Corporatio     2K    Albany, NY  c2   0.214
t6   ALLTEL Corporation   3K    Albany      c2   0.208
t7   ALLTLE Corporation   3K    Albany      c2   0.192
t8   Alterel Coporation   5K    NY          c2   0.184
t9   ALTEL Corporation    2K    Albany, NY  c2   0.202

Product
pid  product       tidFk  cidFk  cid  prob
p1   MaxLink 300   t1     c1     c3   0.350
p2   MaxLink 300   t8     c2     c3   0.350
p3   MaxLnk 300    t4     c1     c3   0.300
p4   SmartConnect  t6     c2     c4   1.0

Price
rid  product      price  cid  prob
r1   MaxLink 300  $285   c5   0.8
r2   MaxLink 300  $100   c5   0.2

Fig. 1 A sample dirty database with Company, Product and Price relations.
thresholding does not take into account the character-
istics of the data or the duplicate detection task. To
overcome this, we consider existing and new scalable
clustering algorithms that are designed to produce high
quality clusterings even when the number of clusters is
unknown.
In Figure 1, the clustering is indicated by the clus-
ter identifier in the cid attribute. Records that share a
cluster identifier are potential duplicates. Once a clus-
tering is determined, we consider how to generate prob-
abilities. For our uncertainty model, we adopt the model
of Andritsos et al. [2] and Dalvi and Suciu [22] called
disjoint-independent databases. In this model, tuples
within a cluster (potential duplicates) are mutually dis-
joint. Tuples in different clusters are independent. This
reflects the intuition that errors are introduced for dif-
ferent (real-world) entities independently. So, the prob-
ability that t1 (from Cluster c1) is in the clean (dedu-
plicated) database is independent of the probability
of t8 (from Cluster c2) being in the clean database.
An important motivation for our choice of uncertainty
model is that efficient query answering techniques are
known for large classes of queries over such databases,
which is important since keeping duplicate information
is only worthwhile if it can be queried and used effec-
tively in decision making. As further motivation, the
probabilistic databases we create can be used as input
to query evaluation techniques which model clustering
uncertainty (that is, the uncertainty introduced by the
clustering process itself) [10]. We elaborate on this in
Section 2.4.
We also consider clustering techniques that produce
overlapping clusters. In this approach, records that have
been assigned to multiple clusters are no longer in-
dependent. Such probabilistic databases require more
complex query processing techniques which might be
supported by the lineage mechanisms of systems like
Trio [51] or world-set semantics of MayBMS [3].
To assign probabilities within a cluster, we follow
the common wisdom in uncertain data management
which has noted that record probabilities are mostly
internal to the system and useful primarily for ranking
answers [24,43]. Hence, in this work, we do not consider
different probability distributions within clusters, but
focus instead on assigning confidence scores that accu-
rately reflect the error in the records. That is, among a
set of duplicate records, a record with less error should
have a lower probability than a record containing more
error.
1.1 Outline and Contributions
In this paper, we propose a flexible modular frame-
work for scalably creating a probabilistic database out
of a dirty relation of duplicated data (Section 2). This
framework consists of three separate components. The
input to the first component is a base relation R and
the output of the third component is a probabilistic re-
lation. Our framework complements and extends some
existing entity resolution and approximate join algo-
rithms, permitting their results to be used in a princi-
pled way within a probabilistic database management
system. We study this framework for the case of string
data, where the input relation consists of duplicated
string records and no additional information exists or
is usable to enhance the deduplication process. This is
in fact the case in many real-world problems.
For each component of our framework, we briefly
overview the state-of-the-art (Sections 2.1-2.3) to fur-
ther describe the characteristics of our framework in
comparison with other deduplication techniques. We
also present a detailed discussion of query evaluation
over probabilistic databases focusing on how the prob-
abilistic databases we create can be used. We justify
the scalability and adaptability of our framework and
the need for thorough evaluation of the performance
of each component. We perform this evaluation using a
methodology (summarized in Section 2.5) heavily based
on existing evaluation methods.
We present an overview of several string similarity
measures used in state-of-the-art similarity join tech-
niques and benchmark the accuracy of these measures
in our framework (Section 3). Unlike previous compar-
isons, we focus on measures useful for duplicate detec-
tion [33]. Given pairs of similar records, we present sev-
eral clustering algorithms for string data suitable for
our framework (Section 4).
We address the problem of assigning probabilities
(confidence scores) to records within each cluster that
naturally reflect the relative error in the record (Section
5). We present several algorithms based on a variety
of well-performing similarity measures for strings, and
an algorithm using the information bottleneck method
[2,47] which assigns probabilities based on the relative
information content of records within a cluster.
An important characteristic of our framework is its
modularity with components that are reusable for other
cleaning tasks. Hence, we thoroughly benchmark each
component individually to evaluate the effectiveness of
different techniques in terms of both accuracy and run-
ning time. For each component, we present a summary
of the results of extensive experiments which used many
datasets with different characteristics to ensure that
our framework is robust. We used several existing and
some novel measures of accuracy in our evaluations.
We also present an end-to-end evaluation of the com-
ponents when used together for creating a probabilistic
database.
2 Framework
Figure 2 shows the components of our framework. The
input to this framework is a base relation R and the out-
put is a probabilistic relation. In this work, we focus on
creating a framework using scalable algorithms that do
not rely on a specific structure in the input relation R.
There are duplicate detection algorithms that can take
advantage of other types of input such as co-citation or
co-occurrence information [13]. Such information may
be available in bibliographic co-citation data or in social
networks. However, we do not consider these specialized
algorithms as such information is often not present in
the data.
An important characteristic of our framework is its
modularity. This makes our framework adaptable to
other data cleaning tasks. As new deduplication tech-
niques are developed, they may replace one or both of
our first two components. Moreover, if the input rela-
tion contains additional information that can be used
to enhance the accuracy of deduplication, these differ-
ent methods may be used. Furthermore, by dividing
the system into three separate modules, we are able to
evaluate the performance of each module individually.
Fig. 2 Components of the framework
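To make the modular structure concrete, the following is a minimal sketch (ours, not from the paper) of how the three components could be composed in Python; the function names similarity_join, cluster_records, and assign_probabilities are hypothetical placeholders for the modules described in Sections 2.1-2.3.

from typing import Callable, Iterable

# Hypothetical pipeline skeleton: each stage is pluggable, so a new similarity
# join or clustering technique can replace a stage without touching the others.
def create_probabilistic_relation(
    base_relation: Iterable[str],
    similarity_join: Callable,       # records -> {(i, j): similarity score}
    cluster_records: Callable,       # similar pairs -> list of clusters of records
    assign_probabilities: Callable,  # cluster -> {record: probability}
):
    similar_pairs = similarity_join(base_relation)
    clusters = cluster_records(similar_pairs)
    probabilistic_relation = []
    for cid, cluster in enumerate(clusters, start=1):
        probs = assign_probabilities(cluster)
        for record in cluster:
            probabilistic_relation.append((record, "c%d" % cid, probs[record]))
    return probabilistic_relation  # tuples of (record, cluster id, probability)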
2.1 Similarity Join
The first component of the system is the similarity
join module. The input to this module is a relation
R={ri: 1 iN}, and the output is a set of
pairs (ri, rj)R×Rwhere riand rj(i < j) are similar
and a similarity score for each pair. In existing join ap-
proaches, two records are considered similar when their
similarity score based on a similarity function sim() is
above a threshold θ. Many join methods typically model
records as strings. We denote by r the set of q-grams
(sequences of q consecutive characters of a string) in r.
For example, for t = ‘db lab’, t = {‘ d’, ‘db’, ‘b ’, ‘ l’, ‘la’,
‘ab’, ‘b ’} for tokenization using 2-grams. (Strings are first
padded with whitespace at the beginning and the end, and then every
whitespace is replaced with q − 1 occurrences of a special unused symbol,
e.g., $.) In certain cases, a weight may be associated with each token.
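For illustration, the following small sketch (ours) extracts padded q-grams as described above; the '$' padding symbol follows the convention just mentioned.

def qgrams(s, q=2, pad="$"):
    # Pad with one whitespace at each end, then replace every whitespace with
    # q-1 copies of a special unused symbol, as described above.
    s = " " + s + " "
    s = s.replace(" ", pad * (q - 1))
    # Slide a window of length q over the padded string.
    return [s[i:i + q] for i in range(len(s) - q + 1)]

With q = 2, qgrams('db lab') yields ['$d', 'db', 'b$', '$l', 'la', 'ab', 'b$'], i.e., the seven 2-grams listed above with '$' standing in for the padded whitespace.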
Similarity join methods use a variety of different
similarity measures for string data [20,31]. Recently,
there has been an increasing interest in using measures
from the information retrieval field [4,16, 28,31,45]. In
[31], several such similarity measures are introduced
and benchmarked for approximate selection where the
goal is to sort the tuples in a relation based on their
similarity with a query string. The extension of approx-
imate selection to approximate join is not considered.
Furthermore, the effect of threshold values on accuracy
for approximate joins is also not considered. To fill in
this gap, we show that the performance of the similarity
predicates in a similarity join is slightly different (than
in selection) mainly due to the effect of choosing a sin-
gle threshold for matching all the tuples as opposed to
ranking the tuples and choosing a different threshold
for each selection query.
Our work is motivated by the recent advancements
that have made similarity join algorithms highly scal-
able. Signature-based approaches (e.g., [4,16,45]) ad-
dress the efficiency and scalability of similarity joins
over large datasets. Many techniques are proposed for
set-similarity join, which can be used along with q-grams
for the purpose of (string) similarity joins, and are mostly
based on the idea of creating signatures for sets (strings)
to reduce the search space. Some signature generation
schemes are derived from dimensionality reduction. One
efficient approach uses the idea of Locality Sensitive
Hashing [36] in order to hash similar sets into the same
values with high probability and therefore provides an
approximate solution. Arasu et al. [4] proposed algo-
rithms specifically for set-similarity joins that are ex-
act and outperform previous approximation methods
in their framework, although parameters of the algo-
rithms require extensive tuning. More recent work [8]
proposes algorithms based on novel indexing and opti-
mization strategies that do not rely on approximation
or extensive parameter tuning and outperform previ-
ous state-of-the-art approaches. One advantage of our
approach is that all these techniques can be applied to
make this first component of the framework scalable.
2.2 Clustering
The clustering module outputs a set of clusters of records
c1, . . . , ck, where records in each cluster are highly sim-
ilar and records in different clusters are more dissimi-
lar. Most of the data clustering algorithms assume that
clusters are disjoint, i.e., cicj=for all i, j 1. . . k.
We will also present algorithms for a model in which
clusters are not disjoint, i.e., records may be present in
two or more clusters. This makes sense for the dupli-
cation detection problem where it may be impossible
to allocate a record with certainty to a single cluster.
Record t8 in the database of Figure 1 is an example
of such a record where there may be uncertainty as to
whether t8 belongs to cluster c2 or c1.
Given our framework, we consider clustering tech-
niques that do not require as input the number of clus-
ters. There is a large body of work on clustering, includ-
ing the use of clustering for information retrieval [6,34]
and record linkage [25,35, 39,41]. We consider existing
and new techniques that do not require input parame-
ters such as the number of clusters. In this sense, our
motivation is similar to the use of generative models
and unsupervised clustering in entity resolution [14].
Notably however, we are dealing with large datasets
and scalability is an important goal. Moreover, as noted
earlier, our evaluation is based on the assumption that
structural or co-occurrence information does not exist
or such information cannot effectively be used to en-
hance deduplication. The only input to our clustering
component is the result of a similarity join, i.e., the sim-
ilar pairs and the similarity scores between them. Our
algorithms will generally be linear in this input, with
the exception that some techniques will require sorting
of this input.
Therefore, we do not consider relational clustering
algorithms or any of the new generative clustering mod-
els. Notably algorithms like Latent Dirichlet Allocation
(LDA) [14] are not scalable at present. For example,
one recent promising application of LDA to entity reso-
lution requires hours of computation on relatively small
data sets of less than 10,000 entities [12].
The majority of existing clustering algorithms that
do not require the number of clusters as input [7,17,
27,50] do not meet the requirements of our framework.
Specifically, they may require another parameter to be
set by the user and/or they may be computationally ex-
pensive and far from practical. There are other cluster-
ing algorithms that produce non-disjoint clusters, like
Fuzzy C-Means [11], but like K-Means they require the
number of clusters. We refer the reader to [25] and ref-
erences therein for details of numerous clustering al-
gorithms used for duplicate detection. A thorough ex-
perimental comparison of diverse clustering algorithms
from the Information Retrieval, Machine Learning, and
Data Management literature can be found elsewhere
[32]. These include the disjoint algorithms presented in
this paper (Section 4), as well as algorithms not consid-
ered here, like correlation clustering and its optimiza-
tions [1,23] that were shown to not perform well (or
not better than those we consider) for the duplicate
detection task.
2.3 Creating a Probabilistic Database
The final component of our framework creates a prob-
abilistic database. Managing uncertainty and inconsis-
tency has been an active research topic for a long time.
Various methodologies exist with different characteris-
tics that handle uncertainty and inconsistency in a va-
riety of applications [2,3, 15, 22,51]. A large amount of
previous work addresses the problem of efficient query
evaluation on databases in which it is assumed that
a probability value is assigned to each record in the
database beforehand. The vast majority of approaches
do not address the problem of creating probabilistic
databases. A common assumption is that the probabil-
ities reflect the reliability of the data source, for exam-
ple, based on the reliability of the device (e.g. RFID sen-
sor) that generates the data or statistical information
about the reliability of a web data source. The Price
relation in Figure 1 is an example of such a database,
where it is assumed that there is existing knowledge
about the reliability of the data sources that provide
the prices.
Andritsos et al. [2] propose a method for creating a
probabilistic database for duplicated categorical data.
In categorical data, the similarity between two attribute
values is either 0 (if the values are different) or 1 (if
the values are the same). They first cluster the rela-
tion using a scalable algorithm based on the Agglom-
erative Information Bottleneck [47], and then assign a
probability to each record within a cluster that repre-
sents the probability of that record being in the clean
database. However, they do not evaluate the accuracy of
the probabilities assigned. The Andritsos et al. [2] work
creates a database with row-level uncertainty (probabil-
ities are associated with records). Gupta and Sarawagi
[29] present a method for creating a probabilistic database
with both row- and column-level uncertainty from sta-
tistical models of structure extraction. In structure ex-
traction, unlike duplicate detection, there is uncertainty
about not only the correctness/existence of a record,
but also the correctness of attribute values within each
record.
Dalvi and Suciu [21] propose an online approach for
generating the probabilities in which the SQL queries
are allowed to have approximate equality predicates
that are replaced at execution time by a user defined
MATCH() operator. Accurate and efficient implementa-
tion of a MATCH() operator is not a trivial task as partly
shown in this paper.
2.4 Query Evaluation
An important motivation for our work is the increased
value that can be found from effectively modeling du-
plicates and their uncertainty. To realize this value, we
must be able to query and use the database we cre-
ate. Consider again our example of Figure 1. It may
be possible to normalize (or standardize) the names of
companies and their location by, for example, choosing
one common convention for representing cities. How-
ever, in other attributes there may be true disagreement
on what the real value should be. For the first company
(Altera), we do not know how many employees (emp#)
it has. By keeping all values and using some of the query
answering techniques described in this subsection, we
can still give users meaningful answers to queries. For
example, if we want to find small companies (with less
than 1000 employees), we know not to return Altera. If
we want to know the total number of employees in New
York, we can again use our assigned probabilities to
give probabilities for the possible answers to this query.
In this subsection, we briefly discuss several query
processing techniques suitable for probabilistic databases
generated by our framework. We begin with a recent
proposal for modeling and querying possible repairs in
duplicate detection. We then discuss two other pro-
posals that have considered efficient query evaluation
on the specific probabilistic database we create (that
is disjoint-independent databases). We then consider
techniques for top-k query evaluation on probabilistic
databases along with a new proposal for using proba-
bilistic information (of the type we can create) to help
in data cleaning. Finally, we describe a simple extension
to our framework to create databases with attribute-
level uncertainty.
2.4.1 Querying Repairs of Duplicate Data
Beskales et al. [10] present an uncertainty model for
representing the possible clusterings generated by any
fixed parametrized clustering algorithm, as well as effi-
cient techniques for query evaluation over this model.
Any probabilistic database generated by our framework
can be viewed as a duplication repair in this model.
Their approach provides a way of modeling cluster-
ing uncertainty on top of our probabilistic databases.
Hence, their queries use both the probabilities we as-
sign and in addition account for possible uncertainty
in the clustering itself (e.g., uncertainty in the assign-
ment of tuples to clusters). Their model is based on the
notion of U-Clean Relations. A U-Clean relation R^c of
an unclean relation R is defined as a set of c-records.
A c-record is a representative record of a cluster along
with two additional attributes C and P. The attribute
C of a c-record is the set of record identifiers in R that
are clustered together to form the c-record, and the at-
tribute P is the set of parameter settings of the clustering
algorithm A that leads to the generation of the cluster
C. In Beskales et al. [10], possible parameter values are
represented using a continuous random variable τ, and
P is an interval for τ that results in C. Here, we con-
sider possible parameter values as a discrete random
variable θ, and P as the set of thresholds θ used for the
similarity join component that results in the cluster C.
Let θl denote the lower bound and θu denote the up-
per bound for the threshold. For many string similarity
joins, θl = 0 and θu = 1. Applying A to the unclean
relation R with parameter θ generates a possible clus-
tering of R, denoted by A(R, θ).
Consider again the Company relation in the dirty
database of Figure 1. Assume that any threshold less
than or equal to 0.3 results in one cluster {t1, t2, t3, t4,
t5, t6, t7, t8}, threshold θ = 0.4 results in two clusters
{t1, t2, t3, t4, t8} and {t5, t6, t7}, and any threshold
above 0.4 and below 0.7 results in two clusters {t1, t2,
t3, t4} and {t4, t5, t6, t7}. Figure 3 shows the C and
P attributes of the corresponding U-Clean relation.

Company^c
ID   ···  C                                  P
CP1  ···  {t1, t2, t3, t4, t5, t6, t7, t8}   {0.2, 0.3}
CP2  ···  {t1, t2, t3, t4, t8}               {0.4}
CP3  ···  {t5, t6, t7}                       {0.4}
CP4  ···  {t1, t2, t3, t4}                   {0.5, 0.6, 0.7}
CP5  ···  {t4, t5, t6, t7}                   {0.5, 0.6, 0.7}
Fig. 3 U-Clean relation created from the Company relation in the dirty database of Figure 1
The set of all clusterings χ is defined as {A(R, θ) :
θ ∈ {θl, · · · , θu}}. Let the function fθ(t) be the probability
that t is the suitable parameter setting. The probabil-
ity of a specific clustering X ∈ χ, denoted Pr(X), is
derived as follows:

Pr(X) = Σ_{t=θl}^{θu} fθ(t) · h(t, X)    (1)

where h(t, X) = 1 if A(R, t) = X, and 0 otherwise.
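As a concrete reading of Equation 1, the sketch below (ours, not from [10]) computes Pr(X) for a discrete set of thresholds; clustering_algo and the threshold weights f are assumed to be supplied by the caller.

def clustering_probability(X, R, clustering_algo, thresholds, f):
    # Pr(X) = sum over thresholds t of f(t) * h(t, X), where h(t, X) = 1
    # iff running the clustering algorithm on R with threshold t yields X.
    prob = 0.0
    for t in thresholds:                # t ranges over {theta_l, ..., theta_u}
        if clustering_algo(R, t) == X:  # h(t, X) = 1
            prob += f[t]                # f[t]: probability that t is the right setting
    return prob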
In our framework, the function fθ(t) can be de-
rived by manual inspection of a possibly small sub-
set of the clustering results, and calculating (and nor-
malizing) the quality measures presented in Section 4.3
over a subset of the data using different thresholds. Ef-
ficient algorithms are proposed in [10] for evaluation
of Selection, Projection, Join (SPJ) and aggregation
queries. Moreover, an extension of this model is pre-
sented in which the uncertainty in merging the clusters
(or choosing the representative record for each cluster)
is also considered. Our probability assignment compo-
nent (Section 5) can be used to generate such U-Clean
relations.
2.4.2 Clean Answers over Duplicated Data
This approach, presented by Andritsos et al. [2], re-
quires a probabilistic database with row-level uncer-
tainty, where probabilities are assigned in a way such
that for each i ∈ 1 . . . k, Σ_{t∈ci} prob(t) = 1. Such a
database is referred to as a dirty database. The proba-
bility values reflect the probability of the record being
the best representation of the real entity. Even if the
database does not contain a completely clean record,
such a database can be used for accurate query answer-
ing.
Consider the example dirty database in Figure 1.
This database consists of three dirty relations: Company
with original schema Company(tid, name, emp#, hq),
Product with original schema Product(pid, product,
tidFk) and Price with original schema Price(tid,
product, price). Two new attributes are introduced
in all the relations: cid for the identifier of the cluster-
ing produced by the clustering component, and prob
for the tuple probabilities. In relation Product, a new
attribute cidFk is introduced for the identifier of the
company referenced by Product.tidFk. The values of
this attribute are updated using a process called iden-
tifier propagation which runs after the clustering phase
and adds references to the cluster identifiers of the tu-
ples in all the relations that refer to those tuples.
A candidate database D^cd for the dirty database D
is defined as a subset of D such that for every cluster ci
of a relation in D, there is exactly one tuple t from
ci in D^cd. Candidate databases are re-
lated to the notion of possible worlds, which has been
used to give semantics to probabilistic databases. No-
tice, however, that the definition of candidate database
imposes specific conditions on the tuple probabilities:
the tuples within a cluster must be exclusive events, in
the sense that exactly one tuple of each cluster appears
in the clean database, and the probabilities of tuples
from different clusters are independent. For the exam-
ple database in Figure 1 without the relation Price,
the candidate databases are:
Dcd
1={t1, t5, p1, p4}Dcd
2={t2, t5, p1, p4}Dcd
3={t3, t5, p1, p4}
Dcd
4={t4, t5, p1, p4}Dcd
5={t1, t6, p1, p4}Dcd
6={t2, t6, p1, p4}
Dcd
7={t3, t6, p1, p4}Dcd
8={t4, t6, p1, p4}Dcd
9={t1, t7, p1, p4}
Dcd
10 ={t2, t7, p1, p4}Dcd
11 ={t3, t7, p1, p4}Dcd
12 ={t4, t7, p1, p4}
Dcd
13 ={t1, t8, p1, p4}Dcd
14 ={t2, t8, p1, p4}Dcd
15 ={t3, t8, p1, p4}
Dcd
16 ={t4, t8, p1, p4}Dcd
17 ={t1, t9, p1, p4}Dcd
18 ={t2, t9, p1, p4}
Dcd
19 ={t3, t9, p1, p4}Dcd
20 ={t4, t9, p1, p4}Dcd
21 ={t1, t5, p2, p4}
Dcd
22 ={t2, t5, p2, p4}Dcd
23 ={t3, t5, p2, p4}Dcd
24 ={t4, t5, p2, p4}
Dcd
25 ={t1, t6, p2, p4}Dcd
26 ={t2, t6, p2, p4}Dcd
27 ={t3, t6, p2, p4}
Dcd
28 ={t4, t6, p2, p4}Dcd
29 ={t1, t7, p2, p4}Dcd
30 ={t2, t7, p2, p4}
Dcd
31 ={t3, t7, p2, p4}Dcd
32 ={t4, t7, p2, p4}Dcd
33 ={t1, t8, p2, p4}
Dcd
34 ={t2, t8, p2, p4}Dcd
35 ={t3, t8, p2, p4}Dcd
36 ={t4, t8, p2, p4}
Dcd
37 ={t1, t9, p2, p4}Dcd
38 ={t2, t9, p2, p4}Dcd
39 ={t3, t9, p2, p4}
Dcd
40 ={t4, t9, p2, p4}Dcd
41 ={t1, t5, p3, p4}Dcd
42 ={t2, t5, p3, p4}
Dcd
43 ={t3, t5, p3, p4}Dcd
44 ={t4, t5, p3, p4}Dcd
45 ={t1, t6, p3, p4}
Dcd
46 ={t2, t6, p3, p4}Dcd
47 ={t3, t6, p3, p4}Dcd
48 ={t4, t6, p3, p4}
Dcd
49 ={t1, t7, p3, p4}Dcd
50 ={t2, t7, p3, p4}Dcd
51 ={t3, t7, p3, p4}
Dcd
52 ={t4, t7, p3, p4}Dcd
53 ={t1, t8, p3, p4}Dcd
54 ={t2, t8, p3, p4}
Dcd
55 ={t3, t8, p3, p4}Dcd
56 ={t4, t8, p3, p4}Dcd
57 ={t1, t9, p3, p4}
Dcd
58 ={t2, t9, p3, p4}Dcd
59 ={t3, t9, p3, p4}Dcd
60 ={t4, t9, p3, p4}
Clearly, not all the candidate databases are equally
likely to be clean. This is modeled with a probability
distribution, which assigns to each candidate database
a probability of being clean. Since the number of candi-
date databases may be huge (exponential in the worst
case), the distribution is not given by extension. In-
stead, probabilities of each tuple are used to calculate
the probability of a candidate database being the clean
one. Since tuples are chosen independently, the proba-
bility of each candidate database can be obtained as
the product of the probability of each of its tuples:
Pr(D^cd) = Π_{t ∈ D^cd} prob(t).
Although the clean database is not known, a query
can be evaluated by being applied to the candidate
databases. Intuitively, a result is more likely to be in
the answer if it is obtained from candidates with higher
probability of being clean. A clean answer to a query q
is therefore defined as a tuple t such that there exists
a candidate database D^cd with t ∈ q(D^cd). The
probability of t is: p = Σ_{D^cd : t ∈ q(D^cd)} Pr(D^cd).
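To make these semantics concrete, the following naive sketch (ours) materializes all candidate databases for a small example and computes clean-answer probabilities directly from the two formulas above; it is purely illustrative, since the rewriting approach described next exists precisely to avoid this enumeration.

from itertools import product

def candidate_databases(clusters):
    # clusters: list of clusters, each a list of (tuple_id, prob) pairs.
    # Yields (candidate, probability), choosing exactly one tuple per cluster;
    # Pr(D^cd) is the product of the chosen tuples' probabilities.
    for choice in product(*clusters):
        prob = 1.0
        for _, p in choice:
            prob *= p
        yield [t for t, _ in choice], prob

def clean_answers(clusters, query):
    # query maps a candidate database (list of tuple ids) to a set of answers.
    # Each answer gets p = sum of Pr(D^cd) over the candidates that produce it.
    answers = {}
    for cand, prob in candidate_databases(clusters):
        for ans in query(cand):
            answers[ans] = answers.get(ans, 0.0) + prob
    return answers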
The clean answers to a query can be obtained di-
rectly from the definition if we assume that the query
can be evaluated for each candidate database. How-
ever, this is an unrealistic assumption due to the poten-
tially huge number of candidate databases. Andritsos et
al. [2] propose a solution to this problem by rewriting
the SQL queries to queries that can be applied directly
on the dirty database in order to obtain the clean an-
swers along with their probabilities. The following two
examples illustrate this approach.
Example 1 Consider a query q1 for the dirty database
in Figure 1 that retrieves all the companies that have at
least 5K employees.
Company cluster c1 has at least 5K employees in
all the candidate databases and therefore is a clean an-
swer with probability 1. The cluster c2, however, has
at least 5K employees only in the candidate databases
that include tuple t8; the sum of the probabilities of these
candidate databases is 0.184. The following re-written query re-
turns the clean answers along with their probability val-
ues.
select cid, sum(prob)
from company
where emp# >= 5K
group by cid
The previous example focuses on a query with just one
relation. However, as shown in the next example, the
rewriting strategy can be extended to queries involving
foreign key joins.
Example 2 Consider a query q2 for the dirty database
in Figure 1 that selects the products and the companies
for those companies that have at most 5K employees.
The product cluster c4 associated with company clus-
ter c2 appears in every candidate database and the em-
ployee count of c2 is always at most 5K. Therefore
(c4, c2) has probability 1 of being a clean answer. The
query answer (c3, c2) appears only in the result of apply-
ing the query q2 to the candidate databases that include
tuples t8 and p2 (D^cd_33, D^cd_34, D^cd_35 and D^cd_36), and the sum
of their probabilities is 0.064. (c3, c1) does not appear
in any of the candidate databases and therefore is not a
clean answer (i.e., has probability zero). It is easy to see
that the clean answers can be obtained by the following
rewriting of the query.
select p.cid, p.cidFk, sum(p.prob * c.prob)
from company c, product p
where p.cidFk = c.cid
and c.emp# <= 5K
group by p.cid, c.cid
The above rewriting strategy works only for a cer-
tain class of queries. Let q be a Select-Project-Join
(SPJ) query. The identifier of a relation is defined as
the attribute containing the cluster id (which identifies
the tuples that are duplicates). The join graph G of
q is defined as a directed graph whose vertices
are the relations used in q, with an arc from
Ri to Rj if a non-identifier attribute of Ri is equated
with the identifier attribute of Rj. Andritsos et al. [2]
define an SPJ query q with join graph G as a rewritable
query if: 1) all the joins involve the identifier of at least
one relation, 2) G is a tree, 3) a relation appears in the
from clause at most once, and 4) the identifier of the
relation at the root of G appears in the select clause.
These conditions rule out, for example, joins that do
not involve an identifier attribute and queries that are
cyclic or contain self joins.
Dalvi and Suciu [22] present a theoretical study of
the problem of query evaluation over dirty databases
(also known as disjoint independent databases). They
present a dichotomy for the complexity of query eval-
uation for queries without self-joins: evaluating every
query is either PTIME or #P-hard. #P-hard queries
are called hard queries and are in one of the follow-
ing forms (the underlined attributes are the keys of the
relations):
h1 = R(x), S(x, y), T(y)
h2 = R1(x, y), · · · , Rk(x, y), S(y)
h3 = R1(x, y), · · · , Rk(x, y), S1(x, y), · · · , Sm(x, y)
The hardness of any conjunctive query without self-
joins follows from a reduction from one of these three
queries. Any query that is not hard (#P-hard) is re-
ferred to as safe and can be evaluated in PTIME.
2.4.3 Top-k Query Evaluation
The problem of evaluating top-k query results on prob-
abilistic databases has been studied in previous work
[43,49,48]. Different types of top-k queries are possible
for uncertain data. Consider the following queries over
the dirty database of Figure 1:
– Find the location of the headquarters of companies
  that have a product selling for more than $300; re-
  turn only the top k locations (ranked according to
  their probabilities).
– Find the top k most expensive products.
– Find the companies that have the k most expen-
  sive products (ranking based on the price in all the
  possible worlds).
Here again, these queries can be answered by ma-
terializing all the candidate databases, obtaining an-
swers for each candidate database and aggregating the
probabilities of identical answers, which could be pro-
hibitively expensive because of the huge number of can-
didate databases. For evaluation of the first query, the
fact that the user is interested only in the top 3 most
probable answers can be used to make the query evalu-
ation more efficient. Ré et al. [43] present an approach
for generating the top-k probable query answers using
Monte Carlo simulation. In this approach, the top k an-
swers of a SQL query (according to their probabilities)
are returned, and their probabilities are approximated
only to the extent needed to compute their ranking. Al-
though the probabilities are approximate, the answers
are guaranteed to be the correct k highest-ranked an-
swers. The queries considered in this work are of the
following form:
TOP k
SELECT B̄, agg1(A1), agg2(A2), · · ·
FROM R̄
WHERE C
GROUP BY B̄
The aggregate operators can be sum, count, min and
max; avg is not supported.
The other type of top-k query requires finding the
top k tuples according to their price values (or some
other scoring function). The second and third queries
above are examples of such queries. Soliman et al. [48,
49] present a single framework for processing both score
and uncertainty leveraging current DBMS storage and
query processing capabilities. Their work is based on
an uncertainty model that includes generation rules,
which are arbitrary logical formulas that determine the
valid worlds. Tuples that are not correlated using gen-
eration rules are independent. Such a model is partic-
ularly useful for duplicate detection. The disjointness
(mutual exclusion) of tuples within clusters can
be expressed using generation rules. In addition, two
clusters can share a single tuple with a generation rule
that states that the shared tuple cannot be present in
both clusters. Therefore, for the example database in
Figure 1 where there may be uncertainty as to whether
t8belongs to cluster c2or c1, it is possible to include
a new tuple t0
8in cluster c1which has the same values
as tuple t8, using the generation rule (t8t0
8) which
means that both tuples cannot be present in a single
candidate database. This model makes it possible to
use the non-disjoint clustering algorithms we propose
in this paper.
2.4.4 Cleaning with Quality Guarantees
Another interesting application of uncertain data man-
agement for duplicate detection is cleaning the data in
order to increase the quality of certain query results.
Cheng et al. [19] recently proposed a framework for this
purpose. In their work, they present the PWS-quality
metric, which is a universal measure that quantifies the
level of ambiguity of query answers under the possible
worlds semantics. They provide efficient methods for
evaluating this measure for two classes of queries:
– Non-rank-based queries, where a tuple’s qualifica-
  tion probability is independent of the existence of
  other tuples. For example, range queries, i.e., queries
  that return a set of tuples having an attribute value
  that is in a certain range.
– Rank-based queries, where a tuple’s qualification
  probability depends on the existence of other tu-
  ples, such as the MAX query, which is the main focus of
  the techniques in this framework.
Using the PWS-quality, a set of uncertain objects in
the database can be chosen to be cleaned by the user,
in order to achieve the best improvement in the quality
of query answers.
2.4.5 Attribute-level Uncertainty
We have limited our discussions so far to databases with
tuple-level (row-level) uncertainty. It is also possible
to use the probability assignment methods we present
in this paper to create databases with attribute-level
(column-level) uncertainty. This can be done easily by
applying our techniques to each attribute individually
(essentially applying our techniques to a column-store
version of the database). Figure 4 shows such a database
for the sample dirty relations in Figure 1. Relations with
attribute-level uncertainty can either be transformed
to several relations with tuple-level uncertainty and be
used along with one of the query evaluation techniques
described in this section, or they can be stored and
queried in more efficient frameworks designed for effi-
cient handling of attribute-level uncertainty [3,46].
Company
cid  name                          emp#      hq
c1   Altera Corporation {0.430}    5K {0.5}  New York {0.50}
     Altersa Corporation {0.297}   6K {0.5}  NY {0.25}
     lAtera Croporation {0.273}              NY, NY {0.25}
c2   ALTEL Corporatio {0.310}      2K {0.4}  Albany, NY {0.4}
     ALLTEL Corporation {0.254}    3K {0.4}  Albany {0.4}
     ALLTLE Corporation {0.234}    5K {0.2}  NY {0.2}
     Alterel Coporation {0.202}

Product
cid  product              cidFk
c3   MaxLink 300 {0.5}    c1 {0.66}
     MaxLnk 300 {0.5}     c2 {0.33}
c4   SmartConnect {1.0}   c2 {1.0}

Price
cid  product             price
c5   MaxLink 300 {1.0}   $285 {0.8}
                         $100 {0.2}

Fig. 4 The sample dirty database of Figure 1 with attribute-level uncertainty.

2.5 Evaluation Framework

To generate datasets for our experiments, we use an
enhanced version of the UIS database generator which
has been effectively used in the past to evaluate du-
plicate detection algorithms and has been made pub-
licly available [31,35]. We follow a relatively standard
methodology of using the data generator to inject differ-
ent types and percentages of errors to a clean database
of string attributes. The erroneous records made from
each clean record are put in a single cluster (which we
use as ground truth) in order to be able to measure
quality (precision and recall) of the similarity join and
clustering modules. The generator permits the creation
of data sets of varying sizes, error types and distribu-
tions, thus is a very flexible tool for our evaluation. The
kind of typographical errors injected by the data gen-
erator are based on studies on common types of errors
present in string data in real databases [37]. Therefore
the synthetic datasets resemble real dirty databases,
but allow thorough evaluation of the results based on
robust quality measures.
Our data generator provides the following parame-
ters to control the error injected in the data:
– the size of the dataset to be generated
– the fraction of clean records to be utilized to generate erroneous duplicates
– distribution of duplicates: the number of duplicates generated for a clean
  record can follow a uniform, Zipfian or Poisson distribution
– percentage of erroneous duplicates: the fraction of duplicate records in
  which errors are injected by the data generator
– extent of error in each erroneous record: the percentage of characters that
  will be selected for injecting character edit error (character insertion,
  deletion, replacement or swap) in each record selected for error injection
– token swap error: the percentage of word pairs that will be swapped in each
  record that is selected for error injection (a simplified sketch of such
  error injection is given below)
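For illustration only, the sketch below shows how character edit errors of the kind listed above could be injected into a clean record; it is a hypothetical simplification and not the UIS data generator itself.

import random

def inject_edit_errors(record, error_extent, rng=random):
    # Apply character-level edits (insert, delete, replace, swap) to roughly
    # error_extent * len(record) randomly chosen positions.
    chars = list(record)
    n_edits = max(1, int(error_extent * len(chars)))
    for _ in range(n_edits):
        pos = rng.randrange(len(chars))
        op = rng.choice(["insert", "delete", "replace", "swap"])
        if op == "insert":
            chars.insert(pos, rng.choice("abcdefghijklmnopqrstuvwxyz"))
        elif op == "delete" and len(chars) > 1:
            chars.pop(pos)
        elif op == "replace":
            chars[pos] = rng.choice("abcdefghijklmnopqrstuvwxyz")
        elif op == "swap" and pos + 1 < len(chars):
            chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    return "".join(chars)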
We use two different clean sources of data: a data
set consisting of company names and a data set con-
sisting of titles from DBLP. Statistical details for the
two datasets are shown in Table 1. Note that we can
generate reasonably large datasets out of these clean sources.

Table 1 Statistics of clean datasets
dataset          #rec.   Avg. rec. length   #words/rec.
Company Names    2139    21.03              2.92
DBLP Titles      10425   33.55              4.53

For the company names dataset, we also in-
ject domain specific abbreviation errors, e.g., replacing
Inc. with Incorporated and vice versa. We describe
the characteristics of the specific datasets generated for
evaluating each component (parameters used to create
datasets) in the related sections.
3 Similarity Join Module
There are a large number of similarity functions for
string data. The choice of the similarity function highly
depends on the characteristics of the datasets. In what
follows, we briefly describe the similarity measures that
are suitable for our framework. Since one of our main
goals in this work is scalability, we only consider those
similarity measures that could have efficient implemen-
tation. Our contribution in this section is benchmark-
ing the accuracy of these measures in order to choose the
measure with the highest performance for this framework.
(A presentation of this evaluation was given at the
International Workshop on Quality in Databases [33].)
3.1 Similarity Measures
The similarity measures that fit in our framework are
those based on q-grams created out of strings along
with a similarity measure that has been shown to be
effective in previous work. The measures discussed here
share one or both of the following properties.
– High scalability: There are various techniques pro-
  posed in the literature, as described in Section 2.1,
  for enhancing the performance of the similarity join
  operation using q-grams along with these measures.
– High accuracy: Previous work has shown that these
  measures perform better or equally well in terms
of accuracy when compared with other string sim-
ilarity measures. Specifically, these measures have
shown good accuracy in name-matching tasks [20]
or in approximate selection [31]. We include these
measures to compare their accuracy to the scalable
measures. The results of our experiments show that
some highly scalable measures outperform other highly
accurate but non-scalable measures in terms of ac-
curacy on the approximate join task.
3.1.1 Edit Similarity
Edit-distance is widely used as the measure of choice in
many similarity join techniques. Specifically, previous
work [28] has shown how to use q-grams for an effi-
cient implementation of this measure in a declarative
framework. Recent work on enhancing performance of
similarity join has also proposed techniques for scalable
implementation of this measure [4,38].
The edit distance between two string records r1 and r2 is
defined as the transformation cost of r1 to r2, tc(r1, r2),
which is equal to the minimum cost of edit operations
applied to r1 to transform it into r2. Edit operations in-
clude character insert (inserting a new character in r1
to transform it into r2), delete (deleting a character from
r1 for the transformation), and substitute (substituting a
character in r1 with a new character for the transfor-
mation) [30]. The edit similarity is defined as:

sim_edit(r1, r2) = 1 − tc(r1, r2) / max{|r1|, |r2|}    (2)
There is a cost associated with each edit operation.
There are several cost models proposed for edit opera-
tions for this measure. The most commonly used mea-
sure called Levenshtein edit distance, which we will re-
fer to as edit distance in this paper, uses unit cost for
all the operations.
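The following sketch (ours) implements Equation 2 with unit-cost Levenshtein distance computed by the standard dynamic program.

def edit_similarity(r1, r2):
    # Levenshtein distance with unit costs, computed row by row.
    prev = list(range(len(r2) + 1))
    for i, c1 in enumerate(r1, 1):
        curr = [i]
        for j, c2 in enumerate(r2, 1):
            curr.append(min(prev[j] + 1,                 # delete
                            curr[j - 1] + 1,             # insert
                            prev[j - 1] + (c1 != c2)))   # substitute
        prev = curr
    tc = prev[-1]
    longest = max(len(r1), len(r2))
    return 1.0 if longest == 0 else 1.0 - tc / longest   # Equation (2)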
3.1.2 Jaccard and Weighted Jaccard
Jaccard similarity is the fraction of tokens in r1 and r2
that are present in both. Weighted Jaccard similarity
is the weighted version of Jaccard similarity, i.e.,

sim_WJaccard(r1, r2) = Σ_{t ∈ r1∩r2} wR(t) / Σ_{t ∈ r1∪r2} wR(t)    (3)

where wR(t) is a weight function that reflects the com-
monality of the token t in the relation R. We choose
a slightly modified form of the Inverse Document Fre-
quency (IDF) weights, based on the Robertson/Sparck-
Jones (RSJ) weights for the tokens, which was shown
to be effective in our experiments and in previous work
[31]:

wR(t) = log((N − nt + 0.5) / (nt + 0.5))    (4)

where N is the number of tuples in the base relation
R and nt is the number of tuples in R containing the
token t.
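A small sketch (ours) of Equations 3 and 4 over q-gram token sets; it assumes the qgrams helper shown earlier and precomputes the RSJ-style weights from the base relation.

import math

def rsj_weights(tokenized_relation):
    # tokenized_relation: one token set per record; implements Equation (4).
    N = len(tokenized_relation)
    df = {}
    for tokens in tokenized_relation:
        for t in tokens:
            df[t] = df.get(t, 0) + 1
    return {t: math.log((N - n + 0.5) / (n + 0.5)) for t, n in df.items()}

def weighted_jaccard(tokens1, tokens2, w):
    # Equation (3): weight of the intersection divided by weight of the union.
    inter = sum(w.get(t, 0.0) for t in tokens1 & tokens2)
    union = sum(w.get(t, 0.0) for t in tokens1 | tokens2)
    return inter / union if union > 0 else 0.0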
3.1.3 Measures from IR
A well-studied problem in information retrieval is,
given a query and a collection of documents,
returning the documents most relevant to the query. In the
measures for this problem, records are treated as doc-
uments and q-grams are seen as words (tokens) of the
documents. Therefore, the same techniques for finding
relevant documents to a query can be used to return
similar records to a query string. In the rest of this
subsection, we present three measures that have been
shown to have higher performance for the approximate
selection problem [31]. Note that IR models may be
asymmetric, but we are able to still use them since we
are using self-joins for duplicate detection.
Cosine w/tf-idf The tf-idf cosine similarity is a
well-established measure in the IR community which
leverages the vector space model. This measure deter-
mines the closeness of the input strings r1 and r2 by
first transforming the strings into unit vectors and then
measuring the angle between their corresponding vec-
tors. The cosine similarity with tf-idf weights is given
by:

sim_Cosine(r1, r2) = Σ_{t ∈ r1∩r2} w_r1(t) · w_r2(t)    (5)

where w_r1(t) and w_r2(t) are the normalized tf-idf weights
for each common token in r1 and r2, respectively. The
normalized tf-idf weight of token t in a given string
record r is defined as follows:

w_r(t) = w′_r(t) / sqrt(Σ_{t′ ∈ r} w′_r(t′)²),   w′_r(t) = tf_r(t) · idf(t)

where tf_r(t) is the term frequency of token t within
string r and idf(t) is the inverse document frequency
with respect to the entire relation R.
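A sketch (ours) of Equation 5: each record is turned into an L2-normalized tf-idf vector over its tokens, and the score is the dot product over the common tokens.

import math
from collections import Counter

def tfidf_vector(tokens, idf):
    # tokens: list of tokens of one record; idf: {token: idf weight over R}.
    raw = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokens).items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()} if norm > 0 else {}

def cosine_tfidf(tokens1, tokens2, idf):
    v1, v2 = tfidf_vector(tokens1, idf), tfidf_vector(tokens2, idf)
    # Equation (5): sum over common tokens of the normalized weights.
    return sum(w * v2[t] for t, w in v1.items() if t in v2)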
BM25 The BM25 similarity score for a query r1
and a string record r2 is defined as follows:

sim_BM25(r1, r2) = Σ_{t ∈ r1∩r2} ŵ_r1(t) · w_r2(t)    (6)

where:

ŵ_r1(t) = ((k3 + 1) · tf_r1(t)) / (k3 + tf_r1(t))

w_r2(t) = w^(1)_R(t) · ((k1 + 1) · tf_r2(t)) / (K(r2) + tf_r2(t))

w^(1)_R(t) = log((N − nt + 0.5) / (nt + 0.5))

K(r) = k1 · ((1 − b) + b · |r| / avgrl)

and tf_r(t) is the frequency of the token t in string record
r, |r| is the number of tokens in r, avgrl is the average
number of tokens per record, N is the number of records
in the relation R, nt is the number of records containing
the token t, and k1, k3 and b are a set of independent
parameters.
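A sketch (ours) of Equation 6; tf_r1 and tf_r2 are token-frequency maps for the two records, the collection statistics (N, df, avgrl) are assumed to be precomputed over the relation R, and the parameter values k1, k3 and b shown here are illustrative defaults rather than the settings used in the paper.

import math

def bm25(tf_r1, tf_r2, N, df, avgrl, k1=1.2, k3=8.0, b=0.75):
    # tf_r1, tf_r2: {token: frequency}; df: {token: number of records containing it}.
    len_r2 = sum(tf_r2.values())
    K = k1 * ((1 - b) + b * len_r2 / avgrl)
    score = 0.0
    for t in set(tf_r1) & set(tf_r2):
        w_query = (k3 + 1) * tf_r1[t] / (k3 + tf_r1[t])
        w1 = math.log((N - df[t] + 0.5) / (df[t] + 0.5))        # RSJ weight
        w_record = w1 * (k1 + 1) * tf_r2[t] / (K + tf_r2[t])
        score += w_query * w_record
    return score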
Hidden Markov Model The approximate string
matching could be modeled by a discrete Hidden Markov
process which has been shown to have better perfor-
mance than Cosine w/tf-idf in the IR literature [40] and
high accuracy and low running time for approximate se-
lection [31]. This particular Markov model consists of
only two states where the first state models the tokens
that are specific to one particular “String” and the sec-
ond state models the tokens in “General English”, i.e.,
tokens that are common in many records. A complete
description of the model and possible extensions are
presented elsewhere [31,40].
The HMM similarity function accepts two string
records r1 and r2 and returns the probability of gen-
erating r1 given that r2 is a similar record:

sim_HMM(r1, r2) = Π_{t ∈ r1} (a0 · P(t|GE) + a1 · P(t|r2))    (7)

where a0 and a1 = 1 − a0 are the state transition
probabilities of the Markov model, and P(t|GE) and
P(t|r2) are given by:

P(t|r2) = (number of times t appears in r2) / |r2|

P(t|GE) = (Σ_{r∈R} number of times t appears in r) / (Σ_{r∈R} |r|)
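A sketch (ours) of Equation 7; the transition probability a0 is an assumed illustrative value, and the "General English" statistics are taken from token counts over the whole relation.

def hmm_similarity(tokens1, tokens2, collection_tf, collection_len, a0=0.2):
    # tokens1, tokens2: token lists; collection_tf: {token: total count in R};
    # collection_len: total number of tokens in R.
    a1 = 1.0 - a0
    score = 1.0
    for t in tokens1:
        p_record = tokens2.count(t) / len(tokens2)            # P(t | r2)
        p_general = collection_tf.get(t, 0) / collection_len  # P(t | GE)
        score *= a0 * p_general + a1 * p_record               # Equation (7)
    return score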
3.1.4 Hybrid Measures
The implementation of these measures involves two sim-
ilarity functions, one that compares the strings by com-
paring their word tokens and another similarity func-
tion which is more suitable for short strings and is used
for comparison of the word tokens.
GES The generalized edit similarity (GES), which is
a modified version of fuzzy match similarity [16], takes
two strings r1 and r2, tokenizes the strings into a set
of words, and assigns a weight w(t) to each token. GES
defines the similarity between the two given strings as a
minimum transformation cost required to convert string
r1 to r2 and is given by:

sim_GES(r1, r2) = 1 − min(tc(r1, r2) / wt(r1), 1.0)    (8)
where wt(r1) is the sum of the weights of all tokens in r1
and tc(r1, r2) is the minimum cost of a sequence of the
following transformation operations:
– token insertion: inserting a token t in r1 with cost
  w(t) · cins, where cins is the insertion factor constant
  and is in the range between 0 and 1. In our experi-
  ments, cins = 1.
– token deletion: deleting a token t from r1 with cost w(t).
– token replacement: replacing a token t1 by t2 in r1
  with cost (1 − sim_edit(t1, t2)) · w(t), where sim_edit is
  the edit similarity between t1 and t2 (Equation 2).
SoftTFIDF SoftTFIDF is another hybrid measure
proposed by Cohen et al. [20], which relies on the nor-
malized tf-idf weight of word tokens and can work with
any arbitrary similarity function to find the similarity
between word tokens. In this measure, the similarity
score is defined as follows:
sim_SoftTFIDF(r1, r2) = Σ_{t1 ∈ C(θ, r1, r2)} w(t1, r1) · w(argmax_{t2 ∈ r2} sim(t1, t2), r2) · max_{t2 ∈ r2} sim(t1, t2)    (9)

where w(t, r) is the normalized tf-idf weight of word to-
ken t in record r, and C(θ, r1, r2) returns the set of tokens
t1 ∈ r1 such that there is some t2 ∈ r2 with sim(t1, t2) > θ,
for a similarity function sim() suitable for compar-
ing word strings. In our experiments, sim(t1, t2) is the
Jaro-Winkler similarity, as suggested by Cohen et al.
[20].
3.2 Evaluation
We only evaluate the accuracy of the similarity mea-
sures, since there have been several studies on the scal-
ability of these measures, but little work studying the
accuracy of the join operation. The accuracy is known
to be dataset-dependent and there is no common frame-
work for evaluation and comparison of accuracy of dif-
ferent similarity measures and techniques. This makes
comparing their accuracy a difficult task. Nevertheless,
we argue that it is possible to evaluate relative perfor-
mance of different measures for approximate joins by
using datasets containing different types of well-known
quality problems such as typing errors and differences
in notations and abbreviations.

Table 2 Datasets used for the results in this paper (all values are percentages)
Group          Name   Erroneous    Errors in    Token   Abbr.
                      Duplicates   Duplicates   Swap    Error
Dirty          D1     90           30           20      50
               D2     50           30           20      50
Medium Error   M1     30           30           20      50
               M2     10           30           20      50
               M3     90           10           20      50
               M4     50           10           20      50
Low Error      L1     30           10           20      50
               L2     10           10           20      50
Single Error   AB     50           0            0       50
               TS     50           0            20      0
               EDL    50           10           0       0
               EDM    50           20           0       0
               EDH    50           30           0       0
Datasets In order to evaluate the effectiveness of dif-
ferent similarity measures described in this section, we
use the same datasets used in an evaluation of approx-
imate selection [31]. As described in Section 2.5, the
errors in these datasets include commonly occurring
typing mistakes (edit errors, character insertion, dele-
tion, replacement and swap), token swap and abbre-
viation errors (e.g., replacing Inc. with Incorporated
and vice versa). For the results presented in this section,
the datasets are generated by the data generator out of
the clean company names dataset described in Table
1. The errors in the datasets have a uniform distribu-
tion. For each dataset, on average 5000 dirty records
are created out of 500 clean records. We have also run
experiments on datasets generated using different pa-
rameters. For example, we generated data using a Zip-
fian distribution, and we also used data from the other
clean source in Table 1 (DBLP titles). We also created
larger datasets. For these other datasets, the accuracy
trends remain the same. Table 2 shows the description
of all the datasets used for the results in this paper.
We used 8 different datasets with mixed types of errors
(edit errors, token swap and abbreviation replacement).
Moreover, we used 5 datasets with only a single type of
error (3 levels of edit errors, token swap or abbreviation
replacement errors) to measure the effect of each type
of error individually.
Measures We use well-known measures from IR, namely
precision, recall, and F1, for different values of the thresh-
old to evaluate the accuracy of the similarity join oper-
ation. We perform a self-join on the input table using
a similarity measure with a fixed threshold θ. Precision
(Pr) is defined as the percentage of duplicate records
among the records that have a similarity score above
the threshold θ. In our datasets, duplicate records are
marked with the same cluster ID as described above.
Recall (Re) is the ratio of the number of duplicate records
that have similarity score above the threshold θto the
total number of duplicate records. Therefore, a join that
returns all the pairs of records in the two input tables
as output has low (near zero) precision and recall of 1.
A join that returns an empty answer has precision 1
and zero recall. The F1 measure is the harmonic mean
of precision and recall, i.e., F1 = (2 × Pr × Re) / (Pr + Re). We mea-
sure precision, recall, and F1 for different values of the
similarity threshold θ. For comparison of different sim-
ilarity measures, we use the maximum F1 score across
different thresholds.
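The evaluation procedure can be made explicit with the following sketch (ours): given the scored pairs returned by a self-join and the ground-truth cluster labels, it computes precision, recall, and F1 at a threshold and takes the maximum F1 over a grid of thresholds. For simplicity it assumes every true duplicate pair appears among the scored pairs.

def precision_recall_f1(scored_pairs, cluster_of, theta):
    # scored_pairs: {(id1, id2): similarity}; cluster_of: {id: ground-truth cluster}.
    predicted = {p for p, s in scored_pairs.items() if s >= theta}
    true_dups = {p for p in scored_pairs if cluster_of[p[0]] == cluster_of[p[1]]}
    tp = len(predicted & true_dups)
    pr = tp / len(predicted) if predicted else 1.0   # empty answer: precision 1
    re = tp / len(true_dups) if true_dups else 1.0
    f1 = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
    return pr, re, f1

def max_f1(scored_pairs, cluster_of, thetas):
    return max(precision_recall_f1(scored_pairs, cluster_of, t)[2] for t in thetas)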
Settings For the measures based on q-grams, we set
q = 2 since it yields the best accuracy in our experi-
ments for all these measures. We use the same param-
eters for the BM25 and HMM score formulas that were
suggested elsewhere [40,44,31].
Results Appendix A contains the full precision-recall
curves for all the measures described above. The results
of our experiments show that the “dirtiness” of the in-
put data greatly affects the value of the threshold that
results in the most accurate join. For all the measures,
a lower value of the threshold is needed as the degree
of error in the data increases. For example, Weighted
Jaccard achieves the best F1score over the dirty group
of datasets with threshold 0.3, while it achieves the best
F1for the low-error datasets at threshold 0.55. BM25
and HMM are less sensitive and the best value of the
threshold varies from 0.25 for dirty datasets to 0.3 for
low-error datasets. We will discuss later how the de-
gree of error in the data affects the choice of the most
accurate measure.
Effect of types of errors: Figure 5 shows the
maximum F1 score for different values of the threshold
for different measures on datasets containing only
edit errors (the EDL, EDM and EDH datasets). These figures
show that weighted Jaccard and Cosine have the
highest accuracy, followed by Jaccard and edit similarity,
on the low-error dataset EDL. As the amount of edit
error in each record increases, HMM performs as
well as weighted Jaccard, while Jaccard, edit similarity,
and GES perform much worse on high edit error
datasets. Considering the fact that edit similarity
is mainly proposed for capturing edit errors, this shows
the effectiveness of weighted Jaccard and its robustness
with varying amounts of edit errors. Figure 6 shows the
effect of token swap and abbreviation errors on the accuracy
of different measures. This experiment indicates
that edit similarity is not capable of modeling such errors.
HMM, BM25 and Jaccard are also less capable
of modeling abbreviation errors than cosine with tf-idf,
SoftTFIDF and weighted Jaccard.

Fig. 5 Maximum F1 score for different measures on datasets with only edit errors
Fig. 6 Maximum F1 score for different measures on datasets with only token swap and abbr. errors
Fig. 7 Maximum F1 score for different measures on dirty, medium and low-error groups of datasets
Comparison of measures: Figure 7 shows the
maximum F1 score for different values of the threshold
for different measures on dirty, medium and low-error
datasets. Here, we have aggregated the results for all
the dirty data sets together (respectively, the moder-
ately dirty or medium data sets and the low-error data
sets). The results show the effectiveness and robust-
ness of weighted Jaccard and cosine in comparison with
other measures. Again, HMM is among the most accu-
rate measures when the data is extremely dirty, and has
relatively low accuracy when the percentage of error in
the data is low.
3.3 Our Choice of Similarity Measure
Unless specifically mentioned, we use weighted Jaccard
similarity as the measure of choice for the rest of the
paper due to its relatively high efficiency and accuracy
compared with other measures. Note that this similarity
predicate can be implemented declaratively and used as
a join predicate in a standard RDBMS engine [31], or
used with some of the specialized, high performance,
approximate join algorithms as described in Section
2. Specifically, the Weighted Enumeration (WtEnum)
signature generation algorithm can be used to signifi-
cantly improve the running time of the join [4]. In ad-
dition, novel indexing and optimization techniques can
be utilized to make the join even faster [8].
4 Clustering Module
Here, we consider algorithms for clustering records based
on the output of the similarity join module. So the input
to this module is a set of similar pairs of records and the
output is a set of clusters of records C = {c1, . . . , ck},
where records in each cluster are highly similar. We
present two groups of algorithms, one for creating dis-
joint clusters, i.e., non-overlapping clusters that parti-
tion the base relation, and the other for non-disjoint
clustering, i.e., we allow a few records to be present in
two or more clusters.
The scalable similarity join will eliminate large por-
tions of the data (records without duplicates) from the
clustering. Specifically, the similarity graph used in the
clustering will be much smaller after using a similar-
ity join. Of course, we want to be able to handle large
amounts of error in the data, so we do also focus on
clustering techniques that can still handle large data
sets containing hundreds of thousands of potential du-
plicates. But the combination of a scalable similarity
join, with a clustering technique that can handle large
similarity graphs, greatly enhances the end-to-end scal-
ability of the overall approach and permits the gen-
eration of probability values (Section 5) on very large
databases.
There exists a variety of clustering algorithms in
the literature each with different characteristics. How-
ever, as mentioned earlier, we are dealing with a rather
different clustering problem here. First of all, we use
only the output of the similarity join module for the
clustering. Our goal of clustering is to create a proba-
bilistic database and therefore we need to seek specific
characteristics that fit this goal. For example, a few ex-
tra records in a cluster is preferable to a few missing
records, since the few extra records will get less proba-
bility in the probability assignment component. More-
over, since the similarity join module needs a threshold
for the similarity measure which is hard to choose and
dataset-dependent, we seek clustering algorithms that
are less sensitive to the choice of the threshold value. A
comprehensive study of the performance of clustering
algorithms in duplicate detection including the disjoint
algorithms presented here and several more sophisti-
cated clustering algorithms can be found elsewhere [32].
4.1 Disjoint Algorithms
In this group of algorithms, the goal is to create clusters
of similar records C = {c1, . . . , ck} where the value of
k is unknown, ∪_{ci∈C} ci = R and ci ∩ cj = ∅ for all
ci, cj ∈ C, i.e., clusters are disjoint and partition the
base relation.
We can think of the source relation as a graph G(U, V)
in which each node u ∈ U represents a record in the base
relation and each edge (u, v) ∈ V connects two nodes
u and v whose corresponding records are similar,
i.e., their similarity score based on some similarity function
sim() is above a specified threshold θ. Note that
the graph is undirected, i.e., (u, v) = (v, u). The task
of clustering the relation is then clustering the nodes in
the graph. In our implementation, we do not materialize
the graph. In fact, all the algorithms can be efficiently
implemented by a single scan of the list of similar pairs
returned by the similarity join module, although some
require the list to be sorted by similarity score. We only
use the graph Gto illustrate our techniques.
4.1.1 Algorithm1: Partitioning
In this algorithm, Partitioning (or transitive closure),
we cluster the graph of records by finding the connected
components in the graph and putting the records in
each component in a separate cluster. This can be done
by first assigning each node to a different cluster and
then scanning the list of similar pairs and merging clus-
ters of all connected nodes. Figure 8(a) shows the result
of this algorithm on a sample graph. As Figure 8(a)
shows, this algorithm may put many records that are
not similar in the same cluster. Partitioning is a com-
mon algorithm used in early entity resolution work [35,
25], and is included as a baseline.
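A minimal union-find sketch of this idea is shown below (Python, illustrative only; records that never appear in a similar pair simply remain singletons and are not included in the output).

def partitioning(similar_pairs):
    # Transitive closure: every connected component of the similarity graph is a cluster.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in similar_pairs:               # the order of the pairs does not matter
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv                  # merge the two components
    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), []).append(node)
    return list(clusters.values())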
4.1.2 Algorithm2: CENTER
This algorithm, which we call CENTER as in [34] per-
forms clustering by partitioning the graph of the records
so that each cluster has a center and all records in the
cluster are similar to the center. This can be performed
by a single scan of the sorted list of similar pairs. The
first time a node u appears in the scan, it is assigned
as the center of the cluster. All the subsequent nodes v
that appear in a pair (u, v) are assigned to the cluster
of u and are not considered again. Figure 8(b) shows
how this algorithm clusters a sample graph of records,
where node u1 is the first node in the sorted list of
similar records, node u2 appears right after all the
nodes similar to u1, and node u3 appears after all the
nodes similar to u2. This algorithm may result in more
clusters than Partitioning since it puts into one cluster
only those records that are similar to one record which
is the center of the cluster.
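A sketch of CENTER under one natural reading of this description is given below (Python, illustrative only; the pair list is assumed to be sorted by similarity score in descending order).

def center(sorted_pairs):
    assigned, centers = {}, set()            # assigned: record -> center of its cluster
    for pair in sorted_pairs:
        for a, b in (pair, pair[::-1]):      # a pair of records is unordered
            if a not in assigned:            # first appearance of a: it becomes a center
                assigned[a] = a
                centers.add(a)
            if a in centers and b not in assigned:
                assigned[b] = a              # b joins a's cluster and is never reconsidered
    clusters = {}
    for rec, ctr in assigned.items():
        clusters.setdefault(ctr, []).append(rec)
    return list(clusters.values())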
4.1.3 Algorithm3: MERGE-CENTER
MERGE-CENTER, or MC, is similar to CENTER, but
merges two clusters ci and cj whenever a record similar
to the center node of cj is already in the cluster
ci, i.e., it is similar to a node that is the center, or is
similar to the center (or one of the center nodes), of the
cluster ci. (Note that when two clusters are merged, we
do not choose a single center node in this algorithm, so
each cluster can have multiple center nodes.) As with
CENTER, this is done using a single scan of the list
of similar records, but keeping track of the records that
are already in a cluster. The first time a node u appears
in the scan, it is assigned as the center of the cluster.
All the subsequent nodes v that appear in a pair (u, v)
and are not present in any cluster are assigned to the
cluster of u, and are not selected as the center of any
other cluster. Whenever a pair (u, v′) is encountered
such that v′ is already in another cluster, all the nodes
in the cluster of u (records similar to u) are merged with
the cluster of v′. Figure 8(c) shows the clusters created
by this algorithm, assuming again that the nodes u1,
u2 and u3 are the first three nodes in the sorted list
of similar records that are selected as the center of a
cluster. As shown in the figure, this algorithm creates
fewer clusters for the sample graph than the CENTER
algorithm, but more than the partitioning algorithm.
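The sketch below extends the CENTER sketch with this merge step (Python, illustrative only; for brevity it only treats the first record of each pair as a potential center).

def merge_center(sorted_pairs):
    assigned, centers, parent = {}, set(), {}    # parent: union-find over cluster ids

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    next_id = 0
    for u, v in sorted_pairs:                    # sorted by similarity score, descending
        if u not in assigned:                    # first appearance of u: new center
            assigned[u] = parent[next_id] = next_id
            centers.add(u)
            next_id += 1
        if u in centers:
            if v not in assigned:
                assigned[v] = find(assigned[u])  # v joins u's (possibly merged) cluster
            else:
                cu, cv = find(assigned[u]), find(assigned[v])
                if cu != cv:
                    parent[cu] = cv              # merge the two clusters
    clusters = {}
    for rec, cid in assigned.items():
        clusters.setdefault(find(cid), []).append(rec)
    return list(clusters.values())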
4.2 Non-Disjoint Algorithms
In this group of algorithms, we do not require ci ∩ cj = ∅
for all i, j ∈ 1 . . . k. For this purpose, we use the results
of the similarity join module along with the similarity
scores of the similar records. The idea is to have a core
for each cluster that consists of the records that are
highly similar, and marginal records for each cluster
that are relatively less similar. The cores of the clusters
are created based on the results of the similarity join
with similarity score above a high threshold θ1. The
marginal records are added to the clusters based on the
results of the similarity join with a threshold θ2 ≤ θ1.

Fig. 8 Illustration of disjoint clustering algorithms: (a) Partitioning, (b) CENTER, (c) MERGE-CENTER (MC)
Using the terminology from probabilistic record linkage
[26], we can say that we put the records that match with
the center of the cluster in its core, and records that
probably match with the center in the marginal records
of the cluster. Each record appears in the core of only
one cluster, but may appear in the marginal records of
more than one cluster.
4.2.1 Algorithm4: Non-disjoint Clustering
Our first non-disjoint algorithm, ND, creates a set of
core clusters (in a similar way to MERGE-CENTER),
then a set of records are added to each cluster which
are less similar to the center of the cluster. The algo-
rithm performs as follows. Assume that we have the
list of record pairs with similarity score above a threshold
θ2, along with their similarity scores, from the output
of the similarity join module. The algorithm starts by
scanning the list. The first time a node u appears in
the scan, it is assigned as the center of the core of the
cluster. All the subsequent nodes v that appear in a
pair (u, v), have sim(u, v) ≥ θ1, and are not present
in the core of any other cluster, are assigned to the
core of the cluster of u and are not selected as the center
of any other cluster. Other pairs (u, v) that have
sim(u, v) < θ1 (but have sim(u, v) ≥ θ2) are added as
marginal members of the cluster. Whenever a pair
(u, v′) with sim(u, v′) ≥ θ1 is encountered such that v′
is already in the core of another cluster, all the nodes
in the cluster of u are merged with the cluster of v′.
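A compact sketch of ND is shown below (Python, illustrative only; it assumes the output of the similarity join is a list of (u, v, score) triples sorted by score in descending order).

def non_disjoint(sorted_scored_pairs, theta1, theta2):
    core_of, cores, marginals, next_id = {}, {}, {}, 0
    for u, v, s in sorted_scored_pairs:
        if s < theta2:
            break
        if u not in core_of:                 # first appearance of u: new core center
            core_of[u] = next_id
            cores[next_id], marginals[next_id] = {u}, set()
            next_id += 1
        cid = core_of[u]
        if s >= theta1:
            if v not in core_of:
                core_of[v] = cid             # v joins the core
                cores[cid].add(v)
            elif core_of[v] != cid:          # v's core is elsewhere: merge the clusters
                other = core_of[v]
                for r in cores[other]:
                    core_of[r] = cid
                cores[cid] |= cores.pop(other)
                marginals[cid] |= marginals.pop(other)
        elif v not in cores[cid]:
            marginals[cid].add(v)            # marginal member; may also appear in other clusters
    return cores, marginals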
4.2.2 Algorithm5: Improved Non-disjoint Clustering
with Information Bottleneck Method
The ND algorithm performs well when thresholds θ1
and θ2 are chosen accurately. However, the choice of
the thresholds highly depends on the similarity mea-
sure used in the similarity join module and the type of
errors in the datasets. Therefore, it is desirable to be
able to choose a low value for the lower threshold θ2 and
then enhance the accuracy of the clustering by pruning
extra records from each cluster in a uniform way regard-
less of the value of the thresholds. Here, we adopt an
approach from the information theory field called the
information bottleneck in order to enhance the results
of non-disjoint clustering. The idea is to prune those
marginal records in clusters that are less similar to the
records in the core of the clusters.
Our ND-IB algorithm is based on the Agglomera-
tive Information Bottleneck (IB) algorithm for cluster-
ing data [47] which we briefly explain here.
Assume R is the set of records, n = |R| is the number
of records, T is the set of q-grams of the strings, and
d = |T| is the total number of q-grams in all records. In
the information bottleneck method for clustering data,
the goal is to partition the records in R into k clusters
C = {c1, c2, . . . , ck} where each cluster ci ∈ C is
a non-empty subset of R such that ci ∩ cj = ∅ for all
i, j. Giving equal weight to each record r ∈ R, we define
p(r) = 1/n. We also set the probability of a q-gram t
given a record to p(t|r) = idf(t) / Σ_{t′∈r} idf(t′), where idf(t) is the
inverse document frequency of q-gram t in the relation.
For c ∈ C, the elements of R, T and C are related as
follows:

p(c) = Σ_{r∈c} p(r)    (10)

p(t|c) = (1 / p(c)) Σ_{r∈c} p(r) p(t|r)    (11)
Merging two clusters ci and cj is performed by setting
the following parameters for the new cluster c:

p(c) = p(ci) + p(cj)
p(t|c) = (p(ci) / p(c)) p(t|ci) + (p(cj) / p(c)) p(t|cj)    (12)

In the IB algorithm, clustering is performed by first
assuming that each record is a separate cluster and then
iteratively merging the clusters n − k times to reduce the
number of clusters to k. In each iteration, two clusters
are chosen to be merged so that the amount of information
loss as a result of merging the clusters is minimum.
Information loss is given by the following formula [47]:

δI(ci, cj) = [p(ci) + p(cj)] · D_JS[p(t|ci), p(t|cj)]    (13)

where D_JS[p(t|ci), p(t|cj)] is equal to

(p(ci) / p(c)) D_KL[p(t|ci), p̄] + (p(cj) / p(c)) D_KL[p(t|cj), p̄]

where:

p̄ = (p(ci) / p(c)) p(t|ci) + (p(cj) / p(c)) p(t|cj)    (14)

D_KL[p, q] = Σ_{r∈R} p(r) log (p(r) / q(r))    (15)
The pruning algorithm for our non-disjoint clustering
performs as follows. For each cluster:
1. The records in the core of the cluster are merged using the merge
operation and put in cluster c_core.
2. For each record ri in the set of marginal records M = {r1, . . . , rk}, the
amount of information loss for merging ri with the core
cluster c_core, il_i = δI(ri, c_core), is calculated.
3. Assume avg_il is the average value of il_i for i ∈ 1 . . . k
and stddev_il is the standard deviation. Those marginal
records that have il_i ≥ avg_il + stddev_il are pruned from
the cluster.
The intuition behind this algorithm is that by us-
ing the information in all the qgrams of the records
from the core of the cluster that are identified to be
duplicates (and match), we can identify which of the
marginal records (that probably match) are more prob-
ably duplicates that belong to that cluster. For this, the
records in the core of each cluster are merged using the
merge operation (equation 12). If merging a marginal
record with the core of the cluster would result in high
information loss, then the record is removed from the
marginal records of the cluster.
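The sketch below illustrates the merge operation, the information loss of Equations 13–15, and the pruning step (Python, illustrative only; each record or cluster is represented by its weight p(·) and its q-gram distribution, a small epsilon guards the logarithms, and the pruning threshold of mean plus one standard deviation follows our reading of step 3 above).

import math

def merge(p1, d1, p2, d2):
    # Equation 12: combine the weights and the q-gram distributions of two clusters.
    p = p1 + p2
    return p, {t: (p1 * d1.get(t, 0.0) + p2 * d2.get(t, 0.0)) / p
               for t in set(d1) | set(d2)}

def kl(p, q, eps=1e-12):
    # Equation 15: Kullback-Leibler divergence D_KL[p, q] (eps avoids log of zero).
    return sum(pv * math.log((pv + eps) / (q.get(t, 0.0) + eps))
               for t, pv in p.items() if pv > 0)

def info_loss(p1, d1, p2, d2):
    # Equations 13-14: weighted Jensen-Shannon divergence times the merged weight.
    p, p_bar = merge(p1, d1, p2, d2)
    return p * ((p1 / p) * kl(d1, p_bar) + (p2 / p) * kl(d2, p_bar))

def prune_marginals(core, marginals):
    # core, marginals: lists of (weight, q-gram distribution) pairs for the records.
    p_core, d_core = core[0]
    for p_r, d_r in core[1:]:
        p_core, d_core = merge(p_core, d_core, p_r, d_r)     # step 1: build c_core
    losses = [info_loss(p_r, d_r, p_core, d_core) for p_r, d_r in marginals]  # step 2
    if not losses:
        return marginals
    avg = sum(losses) / len(losses)
    std = math.sqrt(sum((l - avg) ** 2 for l in losses) / len(losses))
    return [m for m, l in zip(marginals, losses) if l < avg + std]            # step 3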
4.3 Evaluation
Datasets: The datasets used for accuracy results in
this section are the same datasets described in Table 1
of Section 3.2. Most of the results presented here are for
the medium error group of these datasets. In our eval-
uation, we note when the trends on the other groups of
datasets are different than those shown in this report.
Note again that we limited the size of the datasets only
for our experiments on accuracy. For running time ex-
periments, we used the data generator with DBLP titles
dataset of Table 1 to generate larger datasets. In order
to show that these results are not limited to the specific
datasets we used here, we have made the results of our
extensive experiments over various datasets (with dif-
ferent sizes, types and distribution of errors) publicly
available at
http://dblab.cs.toronto.edu/project/stringer/evaluation/
Accuracy Measures: We evaluate the quality of
the clustering algorithms based on several measures
from the clustering literature and also measures that
are suitable for evaluation of these clusterings in du-
plicate detection. The latter measures are taken from
Hassanzadeh et al. [32]. Suppose that we have a set of
k ground truth clusters G = {g1, . . . , gk} of the base
relation R and let C denote a clustering of records into
k′ clusters {c1, . . . , ck′} produced by a clustering algorithm.
Consider a mapping f from elements of G to elements
of C, such that each cluster gi is mapped to
a cluster cj = f(gi) that has the highest percentage
of common elements with gi. We define precision, Pr_i,
and recall, Re_i, for a cluster gi, 1 ≤ i ≤ k, as follows:

Pr_i = |f(gi) ∩ gi| / |f(gi)|   and   Re_i = |f(gi) ∩ gi| / |gi|    (17)
Intuitively, Pr_i measures the accuracy with which cluster
f(gi) reproduces cluster gi, while Re_i measures the
completeness with which f(gi) reproduces class gi. We
define the precision and recall of the clustering as the
weighted averages of the precision and recall over all
ground truth clusters. More precisely:

Pr = Σ_{i=1}^{k} (|gi| / |R|) Pr_i   and   Re = Σ_{i=1}^{k} (|gi| / |R|) Re_i    (18)
Again, we also use the F1-measure (the harmonic mean
of precision and recall).
We think of precision, recall and F1-measure as in-
dicative values of the ability of the algorithm to recon-
struct the indicated clusters in the dataset. However,
since in our framework the number of clusters created
by the clustering algorithm is not fixed and depends
on the datasets and the thresholds used in the similar-
ity join, we should also take into account this value in
our quality measure. We use two other measures more
suitable for our framework. The first, called clustering
precision, CPr_i, is the ratio of the pairs of records in
each cluster ci that are in the same ground truth cluster
gj, where ci = f(gj), i.e.,

CPr_i = |{(t, s) ∈ ci × ci | t ≠ s ∧ ∃ j ∈ 1 . . . k, (t, s) ∈ gj × gj}| / C(|ci|, 2)    (19)

where C(|ci|, 2) = |ci|(|ci| − 1)/2 is the number of pairs of records in ci.
Clustering precision, CPr, is then the average of
CPr_i for all clusters of size at least 2. CPr measures the
ability of the clustering algorithm to put the records
that must be in the same cluster in one cluster regard-
less of the number and the size of the clusters. We also
need to have a measure that penalizes those algorithms
that create more or fewer clusters than the ground truth
number of clusters. PCPr is CPr multiplied by the percentage
of the extra or missing clusters in the result of
clustering, i.e.,

PCPr = (k / k′) CPr   if k < k′,   and   PCPr = (k′ / k) CPr   if k ≥ k′    (20)
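The following sketch computes CPr and PCPr from a ground truth clustering and an output clustering (Python, illustrative only; clusters are given as sets of record identifiers).

from itertools import combinations

def clustering_precision(ground_truth, output_clusters):
    truth_of = {r: j for j, g in enumerate(ground_truth) for r in g}
    cpr_values = []
    for c in output_clusters:
        if len(c) < 2:
            continue                                      # CPr averages clusters of size >= 2
        same = sum(1 for t, s in combinations(c, 2)
                   if t in truth_of and s in truth_of and truth_of[t] == truth_of[s])
        cpr_values.append(same / (len(c) * (len(c) - 1) / 2))
    cpr = sum(cpr_values) / len(cpr_values) if cpr_values else 0.0
    k, k_out = len(ground_truth), len(output_clusters)
    pcpr = cpr * (k / k_out if k < k_out else k_out / k)  # Equation 20
    return cpr, pcpr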
Partitioning and CENTER algorithms: We mea-
sure the quality of clustering algorithms based on dif-
ferent thresholds of the similarity join. The table be-
low shows the values for our medium-error datasets
and thresholds that result in the best F1 measure and
the best PCPr measure values. We have chosen these
thresholds to show how the threshold value could affect
the accuracy of the algorithms, and also justify using
the PCPr measure. Similar trends can be observed for
other thresholds and datasets.
               Partitioning              CENTER
           Best PCPr   Best F1     Best PCPr   Best F1
PCPr         0.554       0.469       0.593       0.298
CPr          0.946       0.805       0.760       0.692
Pr           0.503       0.934       0.586       0.971
Re           0.906       0.891       0.783       0.805
F1           0.622       0.910       0.666       0.877
Cluster#     353         994         472         1305
Note that the number of clusters in the ground truth
datasets is 500. The last row in the table shows the
number of clusters generated by each algorithm. These
results show that precision, recall and F1 measures
alone cannot determine the best algorithm, since they
do not take into account the number of clusters generated.
As can be seen, the best value of the F1 measure
across different thresholds is 0.910 for Partitioning and
0.877 for CENTER, while the corresponding numbers of clusters are
994 and 1305, respectively. However, the best value of
PCPr among different thresholds is 0.554 for partition-
ing and 0.593 for CENTER, with 353 and 472 clusters
in the results respectively. This justifies using CPr and
PCPr measures. Also note that the accuracy of these
algorithms highly depends on the threshold used for the
similarity join module. The results above show that the
CENTER algorithm is more suitable than the partition-
ing algorithm for identification of the correct number of
clusters.
MERGE-CENTER (MC) algorithm: The ac-
curacy results for the MERGE-CENTER algorithm for
the medium error datasets are shown below. The results
are for the similarity threshold that produced the best
PCPr results although the trend is the same for both
algorithms with any fixed threshold. These results show
that the MC algorithm results in a significant improvement
in all accuracy measures compared with the CENTER
and Partitioning algorithms.
             Partitioning     MC        Diff.
PCPr           0.554          0.696     +25.6%
CPr            0.946          0.940     -0.1%
Pr             0.503          0.658     +30.8%
Re             0.906          0.950     +4.9%
F1             0.622          0.776     +24.8%
Cluster#       353            459
Non-disjoint algorithms (ND and ND-IB): We
compare the results of MERGE-CENTER (MC) with
our non-disjoint algorithms, ND and ND-IB, below. Adding
marginal records to the clusters increases PCPr and CPr,
with a small drop in recall but a significant drop in
precision. Note that for our goal, which is creating
probabilistic databases, recall is more important than
precision, since missing records can result in missing re-
sults for queries over the output probabilistic database,
whereas a few extra records result in extra answers with
lower probability values. For these results, we set the
threshold θ = 0.3 for MC, the lower threshold θ2 = 0.2
and the higher threshold θ1 = 0.4 for non-disjoint algo-
rithms, and we use our low error datasets. We observed
a similar trend using many different thresholds and
other datasets. In fact non-disjoint algorithms become
more effective when used on highly erroneous datasets
as partly shown in Figure 9.
           MC          ND                      ND-IB
           θ = 0.3     θ1 = 0.4, θ2 = 0.2      θ1 = 0.4, θ2 = 0.2
                                   Diff.(MC)               Diff.(ND)
PCPr       0.696       0.930       +0.234       0.924      -0.006
CPr        0.940       0.999       +0.059       0.993      -0.007
C./Rec.    1.0         3.3         +2.3         2.2        -1.07
A key benefit of using the non-disjoint algorithm
with information bottleneck (IB) is that the cluster-
ing algorithm becomes less sensitive to the value of the
threshold used for the similarity join. In the above re-
sults, changing the threshold for the MC algorithm to
θ = 0.4 results in a much higher PCPr but lower CPr
score, and setting θ = 0.2 results in a significant drop in
PCPr but higher CPr. The last row shows the average
number of clusters to which each record belongs, e.g., in
the non-disjoint algorithm with the threshold used for
the results in this table, each record is present in 3.3
clusters on average. As can be seen, with ND-IB, PCPr and CPr
are slightly decreased but in return, the average number
of clusters for each record is significantly decreased.
This results in decreasing the overhead associated with
having non-disjoint clusters as well as increasing the
precision of the clustering.
Effect of amount of error: In order to show
the effect of the amount of error in the datasets on the
accuracy of the algorithms, we measure the CPr score
of all the clustering algorithms, with threshold θ = 0.5
for disjoint algorithms and lower threshold θ2 = 0.3 and
higher threshold θ1 = 0.5 for non-disjoint algorithms.

Fig. 9 CPr score of clustering algorithms (Partitioning, CENTER, MERGE-CENTER, ND, ND-IB) for datasets with different amounts of error
Figure 9 shows the results. For all datasets, the rela-
tive performance of the algorithms remains the same.
All algorithms perform better on lower error datasets.
The MERGE-CENTER algorithm becomes more effective
on cleaner datasets compared with the Partitioning and
CENTER algorithms. Non-disjoint algorithms become
less effective on cleaner datasets mainly due to higher
accuracy of the disjoint algorithm with the threshold
used.
Performance Results: We ran our experiments
using a Dell 390 Precision desktop with 2.66 GHz Intel
Core2 Extreme Quad-Core Processor QX6700, 4GB of
RAM running 32-bit Windows Vista. Each experiment
is run multiple times to obtain statistical significance.
Figures 10 and 11 show the running time of the dis-
joint and non-disjoint algorithms. These results are ob-
tained from DBLP datasets of size 10K-100K records.
The average percentage of erroneous duplicates is 50%,
and the average percentage of errors in each duplicate
record, the average amount of token swaps, and the
average amount of abbreviation errors is 30%. For dis-
joint algorithms, a fix threshold of θ= 0.5 is chosen for
the similarity join and for non-disjoint algorithms lower
threshold of θ= 0.4 and higher threshold of θ= 0.6 is
chosen, although we observed a similar trend with many
other threshold values. As expected, the Partitioning
algorithm is the fastest in disjoint algorithms since it
does not need the output of the similarity join to be
sorted. CENTER and MERGE-CENTER both require
the output to be sorted, and MERGE-CENTER has
an extra merge operation which makes it a little slower
than CENTER. The results for non-disjoint algorithms
show that the overhead for the information bottleneck
pruning makes the algorithm 5-10 times slower, but still
reasonable for an offline process.
Fig. 10 Running time: disjoint algorithms
Fig. 11 Running time: non-disjoint algorithms
5 Probability Assignment Module
Assuming that the records in the base relation Rare
clustered using a clustering technique, the output of
the probability assignment module is a probabilistic
database in which each record has a probability value
that reflects the error in the record. We present two
classes of algorithms here. One based on the similarity
score between the records in each cluster, and the other
based on information theory concepts.
5.1 Algorithms
MaxSim Algorithm: In this algorithm, first a record
in each cluster is chosen as the representative of a clus-
ter and then the probability value is assigned to each
record that reflects the similarity between the record
and the cluster representative. This algorithm is based
on the assumption that there exists a record in the
cluster that is clean (has no errors) or has less errors,
and that this record is the most similar record to other
records in the cluster. Therefore, this record is chosen
as the cluster representative and the probability of the
other records being clean is proportional to their simi-
larity score with the cluster’s representative.
Figure 12 shows the generic procedure for finding
probabilities in this approach. For each cluster, the record
that has the maximum sum of similarity score with all
other records in the cluster (based on some similarity
function sim()) is chosen as the cluster representative.
The probability assigned to each record is basically
the similarity score between the representative and the
record, normalized for each cluster.

Algorithm MaxSim (Fig. 12)
Input: a set of records R, a clustering C = {c1, c2, . . . , ck} of R,
       and a similarity function sim()
1. Repeat for each cluster ci:
2.   let rep = arg max_{r∈ci} Σ_{s∈ci} sim(r, s)
3.   For each record t in cluster ci:
4.     p(t) = sim(t, rep) / Σ_{r∈ci} sim(r, rep)
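A direct Python rendering of this procedure is given below (illustrative only; sim is any of the similarity functions of Section 3 and is assumed to return positive scores).

def maxsim(cluster, sim):
    # Pick the record with the highest total similarity to the cluster as representative,
    # then assign each record the normalized similarity to that representative.
    rep = max(cluster, key=lambda r: sum(sim(r, s) for s in cluster))
    total = sum(sim(r, rep) for r in cluster)
    return {t: sim(t, rep) / total for t in cluster}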
Information Bottleneck Method: Here, we present
a technique for assigning probability values to records
within each cluster based on the Information Bottleneck
(IB) approach. While similar in spirit to the method of
Andritsos et al. [2], our method is designed specifically
for dirty string data. Assume again that R is the set
of all records, T is the set of q-grams of the records,
C is the set of all the clusters, and T_ci is the set of
all q-grams in the records inside cluster ci ∈ C. Giving
equal weight to each record r ∈ R, we define p(r) = 1/n.
The probability of a q-gram given a record can be set
as p(t|r) = 1/|r| (equal values, as shown in the example
below) or p(t|r) = idf(t) / Σ_{t′∈r} idf(t′) (based on the importance
of the tokens, which is our choice for the experiments).
For c ∈ C, the elements of R, T and C are related by
Equations 10 and 11 in Section 4. Merging two clusters
ci and cj is performed by the merge operation using
Equation 12 (Section 4).
Figure 13 shows the steps involved in this algorithm.
To find a cluster representative for cluster ci, we merge
the records in the cluster using the merge operation.
The result is the probability distribution p(t|ci) for all
q-grams t ∈ T_ci. We define the cluster representative to
be (T_ci, p(t|ci)), i.e., the set of all the q-grams of the
records in the cluster ci along with their probability
values p(t|ci). Note that a cluster representative does
not necessarily consist of q-grams of a single record in
that cluster. The probability value for each record in
the cluster is basically the sum of the values of the
probabilities p(t|ci) for the q-grams in the record r, divided
by the length of the record, normalized so that
the probabilities of the records inside a cluster sum to
1. The intuition behind this algorithm is that by using
the information from all the q-grams in the cluster, a
better cluster representative can be found. This is based
on the assumption that in the cluster ci, the q-grams
that belong to a “clean” record are expected to appear
more in the cluster and therefore have a higher p(t|ci)
value. As a result, the records containing q-grams that
are frequent in the cluster (and are more likely to be
clean) will have higher probability values.

Algorithm IB (Fig. 13)
Input: a set of records R and a clustering C = {c1, c2, . . . , ck} of R
1. Repeat for each cluster ci:
2.   Merge the records in the cluster (Equation 12)
     to calculate p(t|ci) for each t ∈ T_ci
3.   For each record r in cluster ci:
4.     p_c(r) = p′_c(r) / Σ_{r′∈ci} p′_c(r′),
       where p′_c(r) = (Σ_{t∈r} p(t|ci)) / |r|
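The sketch below implements this assignment with equal q-gram weights, in the spirit of Example 3 (Python, illustrative only; it assumes the records in a cluster are distinct strings of length at least q, and it does not reproduce the exact numbers of Figure 14, which include padded q-grams).

def ib_probabilities(cluster, q=2):
    def qgrams(s):
        return [s[i:i + q] for i in range(len(s) - q + 1)]

    n = len(cluster)
    rep, per_record = {}, {}                     # rep: p(t|c), the cluster representative
    for r in cluster:
        grams = qgrams(r)
        dist = {}
        for t in grams:
            dist[t] = dist.get(t, 0.0) + 1.0 / len(grams)   # p(t|r) with equal weights
        per_record[r] = grams
        for t, p in dist.items():
            rep[t] = rep.get(t, 0.0) + p / n                # Equation 11 with p(r) = 1/n
    raw = {r: sum(rep[t] for t in grams) / len(grams)       # p'_c(r)
           for r, grams in per_record.items()}
    z = sum(raw.values())
    return {r: p / z for r, p in raw.items()}               # normalize within the cluster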
Example 3 Suppose R is a set of four strings r1 =
“William Turner”, r2 = “Willaim Turner”, r3 = “Willliam
Turnet” and r4 = “Will Turner” in a cluster. Figure
14 shows the initial p(t|r) values for each record
r and q-gram t, as well as the final probability distribution
values for the cluster representative.³ The output
of the algorithm is p_c(r1) = 0.254, p_c(r2) = 0.240,
p_c(r3) = 0.233 and p_c(r4) = 0.272.
5.2 Evaluation
Measure: We evaluate the effectiveness of the prob-
ability assignment techniques by introducing a mea-
sure that shows how sorting by the assigned probabil-
ity values will preserve the correct order of the error
in the records. We call this measure Order Preserving
Ratio (OPR). OPR is calculated as follows. For each
cluster, we create an ordered list of records Loutput =
(r1, . . . , rk) sorted by the probability values assigned to
the records, i.e., p_a(ri) ≥ p_a(rj) iff i ≤ j, where p_a(r) is
the probability value assigned to the record r. Suppose
the correct order of the records is L_correct and the true
probability value of the record r being the clean one is
p_t(r). We can measure the extent to which the sorted
output list preserves the original order by counting the
percentage of pairs (ri, rj) for which ri appears before
rj in both L_output and L_correct, i.e.,

OPR_c = |{(ri, rj) | ri, rj ∈ L_output, i < j, p_t(ri) ≥ p_t(rj)}| / C(k, 2)    (21)

Note that C(k, 2) = k(k − 1)/2 is the total number of pairs in L_output.
OPR is the average of (OPR_c − 0.5) / 0.5 over all clusters. Since
³ We omit the initial and ending grams ’ w’, ’t ’, ’r ’ to fit this
on the page.
r1 = “William Turner”, r2 = “Willaim Turner”, r3 = “Willliam Turnet”, r4 = “Will Turner”
t        ’wi’ ’il’ ’ll’ ’li’ ’la’ ’l ’ ’ai’ ’im’ ’ia’ ’am’ ’m ’ ’ T’ ’Tu’ ’ur’ ’rn’ ’ne’ ’er’ ’et’
p(t|r1) 1/13 1/13 1/13 1/13 0 0 0 0 1/13 1/13 1/13 1/13 1/13 1/13 0 1/13 1/13 0
p(t|r2) 1/13 1/13 1/13 0 1/13 0 1/13 1/13 0 0 1/13 1/13 1/13 1/13 1/13 1/13 1/13 0
p(t|r3) 1/14 1/14 1/14 1/14 0 0 0 0 1/14 1/14 1/14 1/14 1/14 1/14 0 1/14 0 1/14
p(t|r4) 1/10 1/10 1/10 0 0 1/10 0 0 0 0 0 1/10 1/10 1/10 1/10 1/10 1/10 0
p(t|rep) .071 .071 .071 .033 .017 .021 .017 .017 .033 .033 .050 .071 .071 .071 .071 .071 .054 .017
Fig. 14 Example IB representative calculation
0.5 is the average value of OPR_c if the records are
sorted randomly, OPR shows the extent to which the
ordering by probabilities is better than a random or-
dering.
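The following sketch computes OPR (Python, illustrative only; every cluster is given as a list of (assigned probability, true probability) pairs, where the true probabilities are derived from the known error percentages of the generated records).

from itertools import combinations

def opr(clusters):
    scores = []
    for cluster in clusters:
        ordered = sorted(cluster, key=lambda rec: rec[0], reverse=True)  # sort by assigned prob.
        pairs = list(combinations(range(len(ordered)), 2))
        if not pairs:
            continue
        ok = sum(1 for i, j in pairs if ordered[i][1] >= ordered[j][1])  # order preserved
        scores.append(ok / len(pairs))
    # 0.5 is the expected value under a random ordering; rescale as in Equation 21
    return sum((s - 0.5) / 0.5 for s in scores) / len(scores) if scores else 0.0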
Results: We use the same data generator to cre-
ate a dataset of strings with different amounts of er-
ror within the strings, marking each string with the
percentage of error in that string which allows sort-
ing the records based on the relative amount of er-
ror and obtaining the ground truth. We ran experi-
ments on datasets with varying sizes and degree of er-
ror made out of the company names and DBLP titles
datasets (Table 1). The trends observed are similar over
all datasets. We report the results for a dataset contain-
ing 1000 clusters generated out of our clean company
names dataset. Table 3 shows the OPR values for this
dataset for IB and MaxSim algorithm. We have tried
MaxSim with different string similarity functions de-
scribed in Section 3 for similarity join module, namely
Weighted Jaccard (WJaccard), SoftTfIdf, Generalized
Edit Similarity (GES), Hidden Markov Models (HMM),
BM25 and Cosine similarity with tf-idf weights (Cosine
w/tfidf). Interestingly, MaxSim produces the best re-
sults when used with Weighted Jaccard similarity, the
measure of our choice for the similarity join module de-
scribed in Section 3. The IB algorithm performs as well
as MaxSim with the best choice of similarity function in
terms of accuracy. Table 3 also shows the running time
for these algorithms for a DBLP titles dataset of 20K
records. The trend is similar for larger datasets and the
algorithms scale linearly. The IB algorithm is also sig-
nificantly faster than MaxSim with weighted Jaccard.
Another advantage of IB over the MaxSim algorithm is
that the cluster representatives can be stored and up-
dated very efficiently, but for the MaxSim algorithm,
when a record is added to database, the algorithm must
be run again to find the new representative. This makes
the IB algorithm suitable for large dynamic databases,
and also for on-line calculation of the probabilities.
5.3 Putting It All Together
In Section 4, we showed how the quality of the clus-
ters is affected by the similarity measure and threshold
used in the similarity join module. However, the results
Table 3 OPR values/times for IB & MaxSim algs
Algorithm                   OPR     Time(ms)
IB                          0.683   749
MaxSim (WJaccard)           0.674   4324
MaxSim (SoftTfIdf)          0.653   1280
MaxSim (GES)                0.490   1249
MaxSim (HMM)                0.485   3852
MaxSim (BM25)               0.480   4009
MaxSim (Cosine w/tfidf)     0.470   5397
presented so far in this section are based on a perfect
clustering as input to the probability assignment mod-
ule. In this part, we will show the results of our ex-
perimental evaluation of the effect of the quality of the
clusters on the quality of the probabilities. Our goal is
to ensure that when creating a probabilistic database,
the errors introduced in the first two modules (the clus-
tering errors) do not compound the potential errors in
our probability assignment module in a way that makes
the final probabilities meaningless.
Measure: We need to slightly modify OPR to mea-
sure the quality of the probability values when the clus-
tering is imperfect. We call this measure OPRt. Sup-
pose that we have a set of k ground truth clusters
G = {g1, . . . , gk} of the base relation R and let C denote
a clustering of records into k′ clusters {c1, . . . , ck′}
produced by a clustering algorithm. Consider a mapping
f from elements of C to elements of G, such that each
cluster ci is mapped to a cluster f(ci) that has the highest
percentage of common elements with ci. Here again,
we create an ordered list of records L = (r1, . . . , rk) for
each cluster cl ∈ C, sorted by the probability values assigned
to the records, i.e., p_a(ri) ≥ p_a(rj) iff i ≤ j, where
p_a(r) is the probability value assigned to the record r.
Let p_t(r ∈ cl) be the probability value of the record r
being in the ground truth cluster f(cl) if r ∈ f(cl), and
zero otherwise. We can measure the extent to which
the sorted output list preserves the original order in
the matched ground truth cluster f(cl) by counting the
percentage of pairs (ri, rj) for which at least one of ri
and rj is in f(cl), p_t(ri ∈ cl) ≥ p_t(rj ∈ cl), and ri appears
before rj in L, i.e.,

( |{(ri, rj) | ri, rj ∈ L, i < j, p_t(ri ∈ cl) ≥ p_t(rj ∈ cl)}| − e ) / ( C(k, 2) − e )

where e = |{(ri, rj) | ri, rj ∉ f(cl)}|.
This is based on the assumption that we are indiffer-
ent about the order of the records that are not in the
matched ground truth cluster. OPRt is the average of
the value calculated in the formula above over all out-
put clusters in C.
Results: The table below shows OPRt values for
the same dataset used for the results in Table 3. The
values are shown for a perfect clustering as well as
clusters created by (disjoint) MERGE-CENTER algo-
rithm performed on the output of similarity join with
Weighted Jaccard similarity measure and different val-
ues of the similarity threshold, and using the IB algo-
rithm for probability assignment. Similar trends were
observed for other clustering algorithms. Moreover, the
relative performance of IB and MaxSim algorithms re-
mained the same as Table 3 and therefore we do not
report OPRt values for them.
Similarity Threshold                      F1      PCPr    Cluster#   OPRt
Perfect Clustering (No Similarity Join)   1.000   1.000   500        0.832
θ = 0.1                                   0.008   0.004   2          0.571
θ = 0.2                                   0.479   0.389   259        0.655
θ = 0.3                                   0.726   0.335   934        0.713
θ = 0.4                                   0.724   0.118   1673       0.700
θ = 0.5                                   0.614   0.042   2370       0.625
The results above show that the quality of the clus-
tering does affect the effectiveness of the probability
assignment module. This effect is not significant when
the clusters have higher accuracy. However, the qual-
ity of the probability values further decreases as the
accuracy of the clustering decreases.
6 Case Study on Real Data
In this section, we report the results of applying our
framework to a real world dirty data source. In order to
effectively evaluate our framework, we need a dirty data
source that contains several possibly dirty attributes
with duplicate clusters of various sizes and character-
istics. Many real world dirty data sources meet these
requirements. Examples include the bibliographic data
available on DBLP, CiteSeer and DBWorld, the clini-
cal trial data available on ClinicalTrials.gov, shopping
information on Yahoo! Shopping, and hotel informa-
tion from Yahoo! Travel [9]. For the experiments in this
section, we use the Cora dataset [39], which contains
computer science research papers integrated from sev-
eral sources. It has been used in several other duplicate
detection projects [2, 5,13,39] and we take advantage
of previous labelings of the tuples into clusters. To the
best of our knowledge, Cora is the only real world dirty
database freely available for which the ground truth is
Table 4 Statistics of the tables in Cora dataset
dataset #rec. #clusters Avg. len. #words/rec.
pubstr 1,878 185 118.22 17.76
pubtitles 1,878 185 50.84 6.13
pubauthors 714 240 13.76 2.78
pubvenues 615 131 47.07 8.58
known, and that meets the requirements for evaluation
of this framework.
We use a version of Cora that is available in XML
format, and transform the data into four relational ta-
bles: the pubstr table contains a single string attribute
which is obtained by concatenation of the title, venue
and author attributes, pubtitles which contains the ti-
tles of the publications, pubauthors that contains the
author names, and pubvenues that contains the venue
information including name, date and volume number.
The statistics of these tables are shown in Table 4.
6.1 Similarity Join Results
Figure 15 shows the maximum F1 score across different
thresholds for all the similarity measures over the four
tables. The relative performance of the similarity mea-
sures differs considerably for each of these tables. This
is expected since 1) the attributes have different charac-
teristics such as length, amount and type of errors, and
2) these tables are relatively small, and failure of an al-
gorithm on a small subset of the records can notably affect
the average values of the accuracy measures. However,
it can be seen that those algorithms that performed
better in our experiments in Section 3, are more robust
across the four tables. For example, the weighted Jac-
card similarity measure performs reasonably well for all
the four tables, although it is not the best measure for
any of them. Note that again due to the small size of
these tables, weighted measures do not perform as ex-
pected since the IDF weights over a small collection do
not reasonably reflect the commonality of the tokens.
We would not expect this to be the case for larger real
world dirty data.
6.2 Clustering Algorithms Results
In order to compare the performance of clustering algo-
rithms on the datasets, we again compare the maximum
value of the F1 score and PCPr that the algorithms can
achieve using different thresholds. Figure 16 shows the
results. All the clustering algorithms perform better on
the pubstr and pubtitles tables. The reason for this is
that for the pubauthors table, our framework’s dupli-
cate detection phase (i.e., a string similarity join along
with a clustering algorithm) results in many false positives
due to the existence of highly similar (or exactly
equal) author names that refer to different real world
entities. The same is true for the pubvenues table.

Fig. 15 Maximum F1 score for Cora datasets (Jaccard, weighted Jaccard, Cosine w/tf-idf, BM25, HMM, SoftTFIDF, GES and edit distance on the pubstr, pubtitles, pubauthors and pubvenues tables)

As
stated in Section 2, previous work has addressed this
problem by using more complex, iterative clustering
algorithms that can take advantage of additional co-
occurrence information existing in the data. However,
the relatively high quality of the clusters for pubstr
table shows the effectiveness of our framework in de-
tection of duplicate publications by a simple concate-
nation of all the attributes and without the use of co-
occurrence information (indeed collective resolution has
been developed precisely for highly ambiguous domains
like author name).
If the same threshold is used for all the algorithms,
CENTER produces clusters of much higher quality when
used with a low threshold, while the trend for Partition-
ing is the opposite. MERGE-CENTER is more robust
to the value of the threshold than both Partitioning and
CENTER when the same threshold is used. Figure 17
shows this fact for the pubstr table. Similar trends were
observed in all other datasets.
The following table shows the effectiveness of the
non-disjoint algorithms for the pubstr table. Again,
similar trends were observed for the other tables.
           MC          ND                      ND-IB
           θ = 0.2     θ1 = 0.1, θ2 = 0.3      θ1 = 0.1, θ2 = 0.3
                                   Diff.(MC)               Diff.(MC)
F1         0.789       0.756       -0.033       0.833      +0.043
PCPr       0.728       0.965       +0.238       0.952      +0.224
CPr        0.975       0.998       +0.022       0.984      +0.009
C./Rec.    1.0         2.7         +1.7         1.8        +0.8
6.3 Probability Assignment
The evaluation of the probability assignment algorithms
for this dataset is inherently a difficult task since the
ground truth is not known (only the cluster labels are
known). It is hard to determine the correct ordering of
the records within each cluster. However, we have per-
formed a qualitative evaluation of the results over the
output probabilistic tables using simple queries similar
to the queries in the examples of Section 2.4.2.

Fig. 17 PCPr score of Partitioning, CENTER and MC for different thresholds on the pubstr table

Over-
all results are consistent with the results shown in Sec-
tion 5. Moreover, the results clearly show the advantage
of the probabilistic approach for management of dupli-
cated data, as opposed to cleaning the data upfront. As
one example, we used a query retrieving conference ti-
tle, volume and other information for conferences held
in 1995. Over a cleaned database (where we have kept
the most probable tuple in each cluster), the query re-
sults are less informative, sometimes omitting poten-
tially valuable information about a conference that was
contained in attribute values of lower probability tu-
ples. However, using consistent query answering tech-
niques [2], queries over our probabilistic database can
report how much collective evidence there is (among
all the tuples no matter how dirty) for different values.
Our sample SQL queries, along with their rewritings
obtained using the approach discussed in Section 2.4.2
[2] and a subset of their results are available online at
our project’s web page:
http://dblab.cs.toronto.edu/project/stringer/evaluation/
Also, several probabilistic tables created from syn-
thetic and real dirty databases using different thresh-
olds and algorithms are published on the above page.
We hope that these probabilistic databases can serve as
a benchmark for evaluation of probabilistic data man-
agement techniques in the future.

Fig. 16 Accuracy of clustering algorithms on Cora dataset: (a) maximum F1 score and (b) maximum PCPr score
(a) Maximum F1      pubstr  pubtitles  pubauthors  pubvenues
    Partitioning    0.897   0.877      0.729       0.664
    CENTER          0.828   0.861      0.729       0.642
    MC              0.880   0.879      0.717       0.679
(b) Maximum PCPr    pubstr  pubtitles  pubauthors  pubvenues
    Partitioning    0.839   0.900      0.551       0.526
    CENTER          0.866   0.877      0.342       0.322
    MC              0.919   0.889      0.477       0.363

Our future plan in-
cludes extending the real datasets by, for example, hand
labeling a subset of the clinical trials data we have
gathered in our LinkedCT⁴ project. This could provide
probabilistic databases for management of duplicated
data in an important real-world domain.
7 Conclusion
We proposed a framework for managing potentially du-
plicated data that leverages existing approximate join
algorithms together with probabilistic data manage-
ment techniques. Our approach consists of three phases:
application of a (scalable) approximate join technique
to identify the similarity between pairs of records; clus-
tering of records to identify sets of records that are
potential duplicates; the assignment of a probability
value to each record in the clusters that reflects the
error in the record. We presented and benchmarked a
set of scalable algorithms for clustering records based
on their similarity scores and on their information con-
tent. We also introduced and evaluated algorithms for
probability assignment.
The modularity of our framework makes it amenable
for a variety of data cleaning tasks. For example, in do-
mains where aggregate constraints for deduplication are
known [18], these constraints can replace our unsuper-
vised clustering techniques, and our probability assign-
ment methods can still be used to create a probabilistic
database for querying and analysis.
Acknowledgments. We thank Periklis Andritsos,
Lise Getoor and Chen Li for their detailed reviews, in-
sights and support of this work. We also thank Mo-
hammad Sadoghi and George Beskales for their helpful
input.

⁴ http://linkedct.org
References
1. N. Ailon, M. Charikar, and A. Newman. Aggregating Incon-
sistent Information: Ranking and Clustering. In ACM Symp.
on Theory of Computing (STOC), pages 684–693, 2005.
2. P. Andritsos, A. Fuxman, and R. J. Miller. Clean Answers
over Dirty Databases: A Probabilistic Approach. In IEEE
Proc. of the Int’l Conf. on Data Eng., page 30, 2006.
3. L. Antova, C. Koch, and D. Olteanu. Fast and Simple Re-
lational Processing of Uncertain Data. In IEEE Proc. of the
Int’l Conf. on Data Eng., pages 983–992, 2008.
4. A. Arasu, V. Ganti, and R. Kaushik. Efficient Exact Set-
Similarity Joins. In Proc. of the Int’l Conf. on Very Large
Data Bases (VLDB), pages 918–929, 2006.
5. A. Arasu, C. Ré, and D. Suciu. Large-Scale Deduplication
with Constraints Using Dedupalog. In IEEE Proc. of the
Int’l Conf. on Data Eng., pages 952–963, 2009.
6. J. A. Aslam, E. Pelekhov, and D. Rus. The Star Cluster-
ing Algorithm For Static And Dynamic Information Orga-
nization. Journal of Graph Algorithms and Applications,
8(1):95–129, 2004.
7. N. Bansal, A. Blum, and S. Chawla. Correlation Clustering.
Machine Learning, 56(1-3):89–113, 2004.
8. R. J. Bayardo, Y. Ma, and R. Srikant. Scaling Up All Pairs
Similarity Search. In Int’l World Wide Web Conference
(WWW), pages 131–140, Banff, Canada, 2007.
9. O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E.
Whang, and J. Widom. Swoosh: A Generic Approach to
Entity Resolution. The Int’l Journal on Very Large Data
Bases, 18(1):255–276, 2009.
10. G. Beskales, M. A. Soliman, I. F. Ilyas, and S. Ben-David.
Modeling and Querying Possible Repairs in Duplicate Detec-
tion. In Proc. of the Int’l Conf. on Very Large Data Bases
(VLDB), 2009 (To Appear). Available as University of Wa-
terloo, Tech. Report CS-2009-15, 2009.
11. J. C. Bezdek. Pattern Recognition with Fuzzy Objective
Function Algorithms. Kluwer Academic Publishers, 1981.
12. I. Bhattacharya and L. Getoor. A Latent Dirichlet Model
for Unsupervised Entity Resolution. In Proc. of the SIAM
International Conference on Data Mining (SDM), pages 47–
58, Bethesda, MD, USA, 2006.
13. I. Bhattacharya and L. Getoor. Collective Entity Resolu-
tion in Relational Data. IEEE Data Engineering Bulletin,
29(2):4–12, 2006.
14. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet
Allocation. Journal of Machine Learning Research, 3:993–
1022, 2003.
15. J. Boulos, N. Dalvi, B. Mandhani, S. Mathur, C. Re, and
D. Suciu. MYSTIQ: A System For Finding More Answers
By Using Probabilities. In ACM SIGMOD Int’l Conf. on the
Mgmt. of Data, pages 891–893, 2005.
16. S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust
and Efficient Fuzzy Match for Online Data Cleaning. In ACM
SIGMOD Int’l Conf. on the Mgmt. of Data, pages 313–324,
2003.
17. S. Chaudhuri, V. Ganti, and R. Motwani. Robust Identifica-
tion of Fuzzy Duplicates. In IEEE Proc. of the Int’l Conf.
on Data Eng., pages 865–876, Washington, DC, USA, 2005.
18. S. Chaudhuri, A. Das Sarma, V. Ganti, and R. Kaushik.
Leveraging Aggregate Constraints for Deduplication. In
ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pages
437–448, 2007.
19. R. Cheng, J. Chen, and X. Xie. Cleaning Uncertain Data
with Quality Guarantees. Proceedings of the VLDB Endow-
ment (PVLDB), 1(1):722–735, 2008.
20. W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A Compar-
ison of String Distance Metrics for Name-Matching Tasks. In
Proc. of IJCAI-03 Workshop on Information Integration on
the Web (IIWeb-03), pages 73–78, Acapulco, Mexico, 2003.
21. N. Dalvi and D. Suciu. Efficient Query Evaluation on Prob-
abilistic Databases. The Int’l Journal on Very Large Data
Bases, 16(4):523–544, 2007.
22. N. Dalvi and D. Suciu. Management of Probabilistic Data:
Foundations and Challenges. In ACM SIGMOD Int’l Conf.
on the Mgmt. of Data, pages 1–12, 2007.
23. E. D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica.
Correlation Clustering In General Weighted Graphs. Theor.
Comput. Sci., 361(2):172–187, 2006.
24. X. L. Dong, A. Y. Halevy, and C. Yu. Data Integration with
Uncertainty. In Proc. of the Int’l Conf. on Very Large Data
Bases (VLDB), pages 687–698, 2007.
25. A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Du-
plicate record detection: A survey. IEEE Transactions on
Knowledge and Data Engineering, 19(1):1–16, 2007.
26. I. P. Fellegi and A. B. Sunter. A Theory for Record
Linkage. Journal of the American Statistical Association,
64(328):1183–1210, 1969.
27. G. W. Flake, R. E. Tarjan, and K. Tsioutsiouliklis. Graph
Clustering and Minimum Cut Trees. Internet Mathematics,
1(4):385–408, 2004.
28. L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas,
S. Muthukrishnan, and D. Srivastava. Approximate String
Joins in a Database (Almost) for Free. In Proc. of the Int’l
Conf. on Very Large Data Bases (VLDB), pages 491–500,
2001.
29. R. Gupta and S. Sarawagi. Creating Probabilistic Databases
from Information Extraction Models. In Proc. of the Int’l
Conf. on Very Large Data Bases (VLDB), pages 965–976,
2006.
30. D. Gusfield. Algorithms on Strings, Trees, and Sequences.
Computer Science and Computational Biology. Cambridge
University Press, 1997.
31. O. Hassanzadeh. Benchmarking Declarative Approximate
Selection Predicates. Master’s thesis, University of Toronto,
February 2007.
32. O. Hassanzadeh, F. Chiang, H. C. Lee, and R. J. Miller.
Framework for Evaluating Clustering Algorithms in Dupli-
cate Detection. In Proc. of the Int’l Conf. on Very Large
Data Bases (VLDB), 2009.
33. O. Hassanzadeh, M. Sadoghi, and R. J. Miller. Accuracy
of Approximate String Joins Using Grams. In Proc. of the
International Workshop on Quality in Databases (QDB),
pages 11–18, Vienna, Austria, 2007.
34. T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Tech-
niques for Clustering the Web. In Proc. of the Int’l Workshop
on the Web and Databases (WebDB), pages 129–134, Dallas,
Texas, USA, 2000.
35. M. A. Hernández and S. J. Stolfo. Real-world Data is Dirty:
Data Cleansing and The Merge/Purge Problem. Data Min-
ing and Knowledge Discovery, 2(1):9–37, 1998.
36. P. Indyk, R. Motwani, P. Raghavan, and S. Vempala.
Locality-Preserving Hashing in Multidimensional Spaces. In
ACM Symp. on Theory of Computing (STOC), pages 618–
625, 1997.
37. K. Kukich. Techniques for Automatically Correcting Words
in Text. ACM Computing Surveys, 24(4):377–439, 1992.
38. C. Li, B. Wang, and X. Yang. VGRAM: Improving Perfor-
mance of Approximate Queries on String Collections Using
Variable-Length Grams. In Proc. of the Int’l Conf. on Very
Large Data Bases (VLDB), pages 303–314, Vienna, Austria,
2007.
39. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clus-
tering of high-dimensional data sets with application to ref-
erence matching. In Proc. of the Int’l Conf. on Know ledge
Discovery & Data Mining, pages 169–178, 2000.
40. D. R. H. Miller, T. Leek, and R. M. Schwartz. A Hidden
Markov Model Information Retrieval System. In ACM SI-
GIR Conference on Research and Development in Informa-
tion Retrieval, pages 214–221, 1999.
41. A. E. Monge and C. Elkan. An Efficient Domain-Independent
Algorithm for Detecting Approximately Duplicate Database
Records. In Proc. of SIGMOD Workshop on Data Mining
and Knowledge Discovery (DMKD), 1997.
42. E. Rahm and H. Hai Do. Data Cleaning: Problems and Cur-
rent Approaches. IEEE Data Engineering Bulletin, 23(4):3–
13, 2000.
43. C. Re, N. Dalvi, and D. Suciu. Efficient Top-k Query Evalu-
ation on Probabilistic Data. In IEEE Proc. of the Int’l Conf.
on Data Eng., pages 886–895, 2007.
44. S. Robertson. Understanding Inverse Document Frequency:
On Theoretical Arguments for IDF. Journal of Documenta-
tion, 60(5):503–520, 2004.
45. S. Sarawagi and A. Kirpal. Efficient Set Joins On Similarity
Predicates. In ACM SIGMOD Int’l Conf. on the Mgmt. of
Data, pages 743–754, Paris, France, 2004.
46. P. Sen, A. Deshpande, and L. Getoor. Representing Tuple
and Attribute Uncertainty in Probabilistic Databases. In
ICDM Workshops, pages 507–512, 2007.
47. N. Slonim. The Information Bottleneck: Theory And Appli-
cations. PhD thesis, The Hebrew University, 2003.
48. M. A. Soliman, I. F. Ilyas, and K. C. Chang. Top-k Query
Processing in Uncertain Databases. In IEEE Proc. of the
Int’l Conf. on Data Eng., pages 896–905, 2007.
49. M. A. Soliman, I. F. Ilyas, and K. C. Chang. Probabilis-
tic top-k and ranking-aggregate queries. ACM Trans. on
Database Sys. (TODS), 33(3):1–54, 2008.
50. S. van Dongen. Graph Clustering By Flow Simulation. PhD
thesis, University of Utrecht, 2000.
51. Jennifer Widom. Trio: A System for Integrated Management
of Data, Accuracy, and Lineage. In Proc. of the Conference
on Innovative Data Systems Research (CIDR), pages 262–
276, 2005.
Appendices
A Similarity Join Evaluation: Precision/Recall Curves
Figures 18 and 19 show the precision, recall, and F1 values for all measures described in Section 2, over the datasets we have defined
with mixed types of errors. For all measures except HMM and BM25, the horizontal axis of the precision/recall graph is the value of
the threshold. For HMM and BM25, the horizontal axis is the percentage of the maximum value of the threshold, since these measures do
not return a score between 0 and 1.
Fig. 18 Accuracy of similarity join using Edit-Similarity, Jaccard and Weighted Jaccard measures relative to the value of the threshold, over (a) low-error, (b) medium-error, and (c) dirty datasets
Fig. 19 Accuracy of similarity join using measures from IR (Cosine w/tf-idf, BM25, HMM) and hybrid measures (SoftTFIDF, GES) relative to the value of the threshold, over (a) low-error, (b) medium-error, and (c) dirty datasets
... Probabilistic databases have been widely VOLUME 8, 2020 explored for storing uncertain data, instead of just replacing uncertain fields with nulls [2]. Hassanzadeh and Miller [3] have used probabilistic databases in order to remove duplicate records from the base dirty relation. They find out the similarity between the records and use clustering algorithms to identify the set of records that are probable duplicates. ...
Article
Full-text available
Current relational database systems are deterministic in nature and lack the support for approximate matching. The result of approximate matching would be the tuples annotated with the percentage of similarity but the existing relational database system can not process these similarity scores further. In this paper, we propose a system to support approximate matching in the DBMS field. We introduce a '≈' (uncertain predicate operator) for approximate matching and devise a novel formula to calculate the similarity scores. Instead of returning an empty answer set in case of no match, our system gives ranked results thereby providing a glance at existing tuples closely matching with the queried literals. Two variants of the '≈' operator are also introduced for numeric data: '≈+' for higher-the-better and '≈-' for lower-the-better cases. Efficient approximate string matching methods are proposed for matching string-type data whereas numeric closeness is used for other types of data (date, time, and number). We also provide results of our system taken over several sample queries that illustrate the significance of our system. All experiments are performed using the MySQL database, whereas the IMDb movie database and European Football database are used as sample datasets.
Article
Textual bridge inspection reports are important data sources for supporting data-driven bridge deterioration prediction and maintenance decision making. Information extraction methods are available to extract data/information from these reports to support data-driven analytics. However, directly using the extracted data/information in data analytics is still challenging because, even within the same report, there exist multiple data records that describe the same entity, which increases the dimensionality of the data and adversely affects the performance of the analytics. The first step to address this problem is to link the multiple records that describe the same entity and same type of instances (e.g., all cracks on a specific bridge deck), so that they can be subsequently fused into a single unified representation for dimensionality reduction without information loss. To address this need, this paper proposes a spectral clustering-based method for unsupervised data linking. The method includes: (1) a concept similarity assessment method, which allows for assessing concept similarity even when corpus or semantic information is not available for the application at hand; (2) a record similarity assessment method, which captures and uses similarity assessment dependencies to reduce the number of falsely-linked records; and (3) an improved spectral clustering method, which uses iterative bi-partitioning to better link records in an unsupervised way and to address the transitive closure problem. The proposed data linking method was evaluated in linking records extracted from ten bridge inspection reports. It achieved an average precision, recall, and F-1 measure of 96.2%, 88.3%, and 92.1%, respectively.
Research
Nowadays Quora has become a widely used platform where we can share, gain knowledge. The primary product policy of quora is to maintain a single page for each and every distinct question and is to avoid confusion for the reader; this has become a challenging task and was frequently asked in many kaggle competitions for an approach to this problem. In this paper we are going to use Tf-Idf, Random forests and Bayes aNaïve classifier to classify between the duplicates and non-duplicates. We are going to use hadoop multimode cluster for storing large data to hadoop environment ie HDFS and the dataset which consists of over 25000,000+ lines of potential question duplicate pairs from quora and with this we train the classifier to identify the duplicates and non-duplicates. This model can be used like a search engine to process the search queries and return the similar documents based on the similarity between the documents and the query. In this project we are using two classifiers Bayes Naïve classifier and Random forests classifier in order to find the model with best accuracy. In this work, dataset is extracted from the Quora which contains question pairs to find the similarity. Instead of using the traditional techniques like Bag of words or word counter, a new technique which uses Tf-Idf is built to find the similarity. The text is transformed into the vectors using Tf-Idf and this is used to train the model using supervised learning technique along with the labels from the dataset. This model resulted in some results along with improved accuracy.
Research
Duplicate detection is the process of identifying multiple representations of same real world entities. Today, duplicate detection methods need to process ever larger datasets in ever shorter time: maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive duplicate detection algorithms that significantly increase the efficiency of finding duplicates if the execution time is limited. They maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that our progressive algorithms can double the efficiency over time of traditional duplicate detection and significantly improve upon related work. Two approaches that we follow are progressive duplicate detection algorithms namely progressive sorted neighbourhood method (PSNM), which performs best on small and almost clean datasets, and progressive blocking (PB), which performs best on large and very dirty datasets. Both enhance the efficiency of duplicate detection even on very large datasets. In comparison to traditional duplicate detection, progressive duplicate detection satisfies two conditions viz. Improved early quality and same eventual quality. We introduce a concurrent progressive approach for the multi-pass method and adapt an incremental transitive closure algorithm that together forms the first complete progressive duplicate detection workflow. We make these findings through the paper. A user has only limited, maybe unknown time for data cleansing and wants to make best possible use of it. Algorithm can be starts and terminate when needed. The result size will be maximized. A user has little knowledge about the given data but still needs to configure the cleansing process. Then, the progressive algorithm choose window/block sizes and keys automatically to detect duplicates.
Chapter
This chapter focuses on the basic data operators for entity resolution, which include similarity search, similarity join, and clustering on sets or strings. These three problems are of increasing complexity, and the solution of simpler problems is the building blocks for the harder problem. The authors first introduce the solution of similarity search, covering gram-based algorithms and sketch-based algorithms. Then the chapter turns to the solution of similarity join, covering both exact and approximate algorithms. At last, the authors deal with the problem of clustering similar strings in a set, which can be applied to duplicate detection in databases.
Chapter
Large quantities of records need to be read and analyzed in cloud computing; many records referring to the same entity bring challenges for data processing and analysis. Entity resolution has become one of the hot issues in database research. Clustering based on records similarity is one of most commonly used methods, but the existing methods of computing records similarity often cost much time and are not suitable for cloud computing. This chapter shows that it is necessary to use wave of strings to compute records similarity in cloud computing and provides a method based on wave of strings of entity resolution. Theoretical analysis and experimental results show that the method proposed in this chapter is correct and effective.
Article
Full-text available
The term-weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon's Information Theory) seeking to establish some theoretical basis for it. Some of these attempts are reviewed, and it is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.
Chapter
In this final chapter we consider several fuzzy algorithms that effect partitions of feature space ℝp , enabling classification of unlabeled (future) observations, based on the decision functions which characterize the classifier. S25 describes the general problem in terms of a canonical classifier, and briefly discusses Bayesian statistical decision theory. In S26 estimation of the parameters of a mixed multivariate normal distribution via statistical (maximum likelihood) and fuzzy (c-means) methods is illustrated. Both methods generate very similar estimates of the optimal Bayesian classifier. S27 considers the utilization of the prototypical means generated by (A11.1) for characterization of a (single) nearest prototype classifier, and compares its empirical performance to the well-known k-nearest-neighbor family of deterministic classifiers. In S28, an implicit classifier design based on Ruspini’s algorithm is discussed and exemplified.
Conference Paper
We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the number of disagreements with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc set problem on tournaments, and correlation and consensus clustering. We show that for all these problems (and various weighted versions of them), we can obtain improved approximation factors using essentially the same remarkably simple algorithm. Additionally, we almost settle a long-standing conjecture of Bang-Jensen and Thomassen and show that unless NP⊆BPP, there is no polynomial time algorithm for the problem of minimum feedback arc set in tournaments.
Article
We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge (u, v) is labeled either + or − depending on whether u and v have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of − edges between clusters (equivalently, minimizes the number of disagreements: the number of − edges inside clusters plus the number of + edges between clusters). This formulation is motivated from a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of “agnostic learning” problem.An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k-median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n, depending on the edge labels. We look at approximation algorithms for both minimizing disagreements and for maximizing agreements. For minimizing disagreements, we give a constant factor approximation. For maximizing agreements we give a PTAS, building on ideas of Goldreich, Goldwasser, and Ron (1998) and de la Veg (1996). We also show how to extend some of these results to graphs with edge labels in [−1, +1], and give some results for the case of random noise.
Article
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.