Accuracy of Approximate String Joins Using Grams
Oktie Hassanzadeh
University of Toronto
10 King’s College Rd.
Toronto, ON M5S3G4, Canada
oktie@cs.toronto.edu
Mohammad Sadoghi
University of Toronto
10 King’s College Rd.
Toronto, ON M5S3G4, Canada
mo@cs.toronto.edu
Renée J. Miller
University of Toronto
10 King’s College Rd.
Toronto, ON M5S3G4, Canada
miller@cs.toronto.edu
ABSTRACT
Approximate join is an important part of many data cleaning and integration methodologies. Various similarity measures have been proposed for accurate and efficient matching of string attributes. The accuracy of these similarity measures depends heavily on the characteristics of the data, such as the amount and type of errors and the length of the strings. Recently, there has been increasing interest in methods based on q-grams (substrings of length q) made out of the strings, mainly due to their high efficiency. In this work, we evaluate the accuracy of the similarity measures used in these methodologies. We present an overview of several similarity measures based on q-grams and thoroughly compare their accuracy on several datasets with different characteristics. Since the efficiency of approximate joins depends on the similarity threshold they use, we study how the value of the threshold (including values used in recent performance studies) affects the accuracy of the join. We also compare the measures based on the highest accuracy they can achieve on different datasets.
1. INTRODUCTION
Data quality is a major concern in operational databases
and data warehouses. Errors may be present in the data due
to a multitude of reasons including data entry errors, lack of
common standards and missing integrity constraints. String
data is by nature more prone to such errors. Approximate
join is an important part of many data cleaning methodolo-
gies and is well-studied: given two large relations, identify
all pairs of records that approximately match. A variety of
similarity measures have been proposed for string data in
order to match records. Each measure has certain characteristics that make it suitable for capturing certain types
of errors. By using a string similarity function sim() for the
approximate join algorithm, all pairs of records that have a similarity score above a threshold θ are considered to approximately match and are returned as the output.
Performing approximate join on a large relation is a noto-
riously time-consuming task. Recently, there has been an in-
creasing interest in using approximate join techniques based
on q-grams (substrings of length q) made out of strings.
Most of the efficient approximate join algorithms (which we
describe in Section 2) are based on using a specific similarity
measure, along with a fixed threshold value to return pairs of
records whose similarity is greater than the threshold. The
effectiveness of the majority of these algorithms depends on the
value of the threshold used. However, there has been little
work studying the accuracy of the join operation. The accu-
racy is known to be dataset-dependent and there is no com-
mon framework for evaluation and comparison of accuracy
of different similarity measures and techniques. This makes
comparing their accuracy a difficult task. Nevertheless, we
argue that it is possible to evaluate relative performance of
different measures for approximate joins by using datasets
containing different types of known quality problems such as
typing errors and differences in notation and abbreviations.
In this paper, we present an overview of several similarity
measures for approximate string joins using q-grams and
thoroughly evaluate their accuracy for different values of
thresholds and on datasets with different amount and types
of errors. Our results include:
• We show that for all similarity measures, the value of
the threshold that results in the most accurate join
highly depends on the type and amount of errors in
the data.
• We compare different similarity measures by compar-
ing the maximum accuracy they can achieve on dif-
ferent datasets using different thresholds. Although
choosing a proper threshold for the similarity measures
without prior knowledge of the data characteristics
is known to be a difficult task, our results show which
measures can potentially be more accurate assuming
that there is a way to determine the best threshold.
Therefore, an interesting direction for future work is
to find an algorithm for determining the value of the
threshold for the most accurate measures.
• We show how the amount and type of errors affect the
best value of the threshold. An interesting result of
this is that many previously proposed algorithms for
enhancing the performance of the join operation and
making it scalable for large datasets are not effective
enough in many scenarios, since the performance of
these algorithms highly depends on choosing a high
value of threshold which could result in a very low
accuracy. This highlights the value of algorithms that are less sensitive to the value of the threshold, and opens another interesting direction for future work: finding algorithms that are both efficient and accurate at the same threshold.
The paper is organized as follows. In Section 2, we overview
related work on approximate joins. We present our frame-
work for approximate join and a description of the similarity measures used in Section 3. Section 4 presents a thorough evaluation of these measures, and finally, Section 5 concludes the paper and outlines future directions.
2. RELATED WORK
Approximate join, also known as similarity join or record linkage, has been extensively studied in the literature. Several similarity measures for string data have been proposed [14, 4, 5]. A recent survey [9] presents an excellent overview of different types of string similarity measures. Recently, there has been increasing interest in using measures from the Information Retrieval (IR) field along with q-grams made out of strings [10, 6, 2, 18, 5]. In this approach, strings are treated as documents and q-grams are treated as tokens in the documents. This makes it possible to take advantage of several indexing techniques as well as various algorithms that have been proposed for efficient set-similarity joins. Furthermore, these measures can be implemented declaratively over a DBMS with vanilla SQL statements [5].
Various recent works address the problem of efficiency and
scalability of the similarity join operations for large datasets
[6, 2, 18]. Many techniques are proposed for set-similarity
join, which can be used along with q-grams for the purpose
of (string) similarity joins. Most of the techniques are based
on the idea of creating signatures for sets (strings) to re-
duce the search space. Some signature generation schemes are derived from dimensionality reduction for the similarity search problem in high-dimensional space. One efficient
approach uses the idea of Locality Sensitive Hashing (LSH)
[13] in order to hash similar sets into the same values with
high probability and therefore is an approximate solution to
the problem. Arasu et al. [2] propose algorithms specifically
for set-similarity joins that are exact and outperform pre-
vious approximation methods in their framework, although
parameters of the algorithms require extensive tuning. An-
other class of work is based on using indexing algorithms,
primarily derived from IR optimization techniques. A recent
proposal in this area [3] presents algorithms based on novel
indexing and optimization strategies that do not rely on ap-
proximation or extensive parameter tuning and outperform
previous state-of-the-art approaches. More recently, Li et
al. [15] propose VGRAM, a technique based on the idea of
using variable-length grams instead of q-grams. At a high
level, it can be viewed as an efficient index structure over the
collection of strings. VGRAM can be used along with pre-
viously proposed signature-based algorithms to significantly
improve their efficiency.
Most of the techniques described above mainly address
the scalability of the join operation and not the accuracy.
The choice of the similarity measure is often limited in these
algorithms. The signature-based algorithm of [6] also con-
siders accuracy by introducing a novel similarity measure
called fuzzy match similarity and creating signatures for this
measure. However, the accuracy of this measure is not compared with that of other measures. In [5], several such similarity measures are benchmarked for approximate selection, which is a special case of similarity join: given a relation R and a query string t_q, the approximate selection operation using similarity predicate sim() reports all tuples t ∈ R such that sim(t_q, t) ≥ θ, where θ is a specified numerical similarity threshold. While several predicates are introduced and benchmarked in [5], the extension of approximate selection to approximate joins is not considered. Furthermore, the effect of threshold values on the accuracy of approximate joins is also not considered.
3. FRAMEWORK
In this section, we explain our framework for similarity
join. The similarity join of two relations R = {r_i : 1 ≤ i ≤ N_1} and S = {s_j : 1 ≤ j ≤ N_2} outputs a set of pairs (r_i, s_j) ∈ R × S where r_i and s_j are similar. Two records are considered similar when their similarity score based on a similarity function sim() is above a threshold θ. For the definitions and experiments in this paper, we assume we are performing a self-join on relation R. Therefore the output is a set of pairs (r_i, r_j) ∈ R × R where sim(r_i, r_j) ≥ θ for some similarity function sim() and a threshold θ. This is a common operation in many applications such as entity resolution and clustering. In keeping with many approximate join methods, we model records as strings. We denote by r the set of q-grams (sequences of q consecutive characters of a string) in r. For example, for t = 'db lab', tokenization using 3-grams gives t = {'db ', 'b l', ' la', 'lab'}. In certain cases, a weight may be associated with each token.
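For concreteness, the following minimal Python sketch (our own illustration; the paper does not prescribe an implementation, and the function name qgrams is ours) extracts the q-gram set of a string by sliding a window of length q over it:

def qgrams(s, q=3):
    # Slide a window of length q over s and collect the distinct
    # substrings; for t = 'db lab' and q = 3 this yields the
    # paper's example {'db ', 'b l', ' la', 'lab'}.
    return {s[i:i + q] for i in range(len(s) - q + 1)}

print(qgrams('db lab'))  # {'db ', 'b l', ' la', 'lab'}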
The similarity measures discussed here are those based on q-grams created out of strings, combined with similarity functions that have been shown to be effective in previous work [5]. These measures share one or both of the following properties:
• High scalability: There are various techniques pro-
posed in the literature as described in Section 2 for
enhancing the performance of the similarity join oper-
ation using q-grams along with these measures.
• High accuracy: Previous work has shown that in most scenarios these measures perform as well as or better than other string similarity measures in terms of accuracy. Specifically, they have shown good accuracy in name-matching tasks [8] and in approximate selection [5].
3.1 Edit Similarity
Edit-distance is widely used as the measure of choice in
many similarity join techniques. Specifically, previous work
[10] has shown how to use q-grams for efficient implemen-
tation of this measure in a declarative framework. Recent work on enhancing the performance of similarity joins has also proposed techniques for scalable implementation of this measure [2, 15].
The edit distance between two string records r_1 and r_2 is defined as the transformation cost of r_1 to r_2, tc(r_1, r_2), which is equal to the minimum cost of edit operations applied to r_1 to transform it into r_2. Edit operations include character copy, insert, delete, and substitute [11]. The edit similarity is defined as:

sim_{edit}(r_1, r_2) = 1 - \frac{tc(r_1, r_2)}{\max\{|r_1|, |r_2|\}}    (1)
There is a cost associated with each edit operation, and several cost models have been proposed for this measure. The most commonly used model, Levenshtein edit distance, which we will refer to as edit distance in this paper, uses unit cost for all operations except copy, which has cost zero.
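As a minimal sketch of Equation (1), the following Python code computes the Levenshtein distance with the standard dynamic program (unit cost for insert, delete and substitute; copy is free) and normalizes it by the length of the longer string. It illustrates the definition only; it is not the q-gram-based implementation of [10].

def levenshtein(r1, r2):
    # Minimum number of unit-cost edit operations transforming r1 into r2.
    prev = list(range(len(r2) + 1))
    for i, c1 in enumerate(r1, 1):
        cur = [i]
        for j, c2 in enumerate(r2, 1):
            cur.append(min(prev[j] + 1,                    # delete c1
                           cur[j - 1] + 1,                 # insert c2
                           prev[j - 1] + (c1 != c2)))      # substitute or free copy
        prev = cur
    return prev[-1]

def sim_edit(r1, r2):
    # Edit similarity of Equation (1).
    if not r1 and not r2:
        return 1.0
    return 1.0 - levenshtein(r1, r2) / max(len(r1), len(r2))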
3.2 Jaccard and Weighted Jaccard
Jaccard similarity is the fraction of tokens in r_1 and r_2 that are present in both. Weighted Jaccard similarity is the weighted version of Jaccard similarity, i.e.,

sim_{WJaccard}(r_1, r_2) = \frac{\sum_{t \in r_1 \cap r_2} w_R(t)}{\sum_{t \in r_1 \cup r_2} w_R(t)}    (2)

where w_R(t) is a weight function that reflects the commonality of the token t in the relation R. We choose the RSJ (Robertson-Sparck Jones) weight for the tokens, which was shown to be more effective than the commonly-used Inverse Document Frequency (IDF) weight [5]:

w_R(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}    (3)

where N is the number of tuples in the base relation R and n_t is the number of tuples in R containing the token t.
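A sketch of Equations (2) and (3) in Python, reusing the qgrams() tokenizer above (the helper names are ours; for simplicity each record contributes its distinct q-grams once to the document-frequency counts):

import math

def rsj_weights(records, q=3):
    # w_R(t) = log((N - n_t + 0.5) / (n_t + 0.5)) over relation R.
    N = len(records)
    df = {}
    for r in records:
        for t in qgrams(r, q):
            df[t] = df.get(t, 0) + 1
    return {t: math.log((N - n + 0.5) / (n + 0.5)) for t, n in df.items()}

def sim_wjaccard(r1, r2, w, q=3):
    # Weighted Jaccard (Equation 2): total weight of the shared q-grams
    # over the total weight of the q-grams in either string.
    g1, g2 = qgrams(r1, q), qgrams(r2, q)
    num = sum(w.get(t, 0.0) for t in g1 & g2)
    den = sum(w.get(t, 0.0) for t in g1 | g2)
    return num / den if den else 0.0

Note that the RSJ weight is negative for tokens that appear in more than half of the records; a practical implementation might clamp such weights at zero, a detail the definition above leaves open.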
3.3 Measures from IR
A well-studied problem in information retrieval is, given a query and a collection of documents, to return the documents most relevant to the query. In the measures in this part, records are treated as documents and q-grams are seen as the words (tokens) of the documents. Therefore, the same techniques used for finding documents relevant to a query can be used to return records similar to a query string. In the rest of this section, we present three measures that previous work has shown to perform well for the approximate selection problem [5].
3.3.1 Cosine w/tf-idf
The tf-idf cosine similarity is a well-established measure in the IR community which leverages the vector space model. This measure determines the closeness of the input strings r_1 and r_2 by first transforming the strings into unit vectors and then measuring the angle between the corresponding vectors. The cosine similarity with tf-idf weights is given by:

sim_{Cosine}(r_1, r_2) = \sum_{t \in r_1 \cap r_2} w_{r_1}(t) \cdot w_{r_2}(t)    (4)

where w_{r_1}(t) and w_{r_2}(t) are the normalized tf-idf weights of each common token in r_1 and r_2 respectively. The normalized tf-idf weight of token t in a given string record r is defined as follows:

w_r(t) = \frac{w'_r(t)}{\sqrt{\sum_{t' \in r} w'_r(t')^2}}, \quad w'_r(t) = tf_r(t) \cdot idf(t)

where tf_r(t) is the term frequency of token t within string r and idf(t) is the inverse document frequency with respect to the entire relation R.
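A sketch of Equation (4) in Python; tf_r(t) is taken as the raw count of each q-gram in the record, and we assume the common idf(t) = log(N / n_t) formulation, since the paper does not spell out its idf variant:

import math
from collections import Counter

def tfidf_unit_vector(r, idf, q=3):
    # w'_r(t) = tf_r(t) * idf(t), then L2-normalize to a unit vector.
    tf = Counter(r[i:i + q] for i in range(len(r) - q + 1))
    raw = {t: tf[t] * idf.get(t, 0.0) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()} if norm else {}

def sim_cosine(r1, r2, idf, q=3):
    # Dot product of the two unit vectors over their common q-grams.
    v1 = tfidf_unit_vector(r1, idf, q)
    v2 = tfidf_unit_vector(r2, idf, q)
    return sum(w * v2[t] for t, w in v1.items() if t in v2)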
3.3.2 BM25
The BM25 similarity score for a query r_1 and a string record r_2 is defined as follows:

sim_{BM25}(r_1, r_2) = \sum_{t \in r_1 \cap r_2} \hat{w}_{r_1}(t) \cdot w_{r_2}(t)    (5)

where

\hat{w}_{r_1}(t) = \frac{(k_3 + 1) \cdot tf_{r_1}(t)}{k_3 + tf_{r_1}(t)}

w_{r_2}(t) = w^{(1)}_R(t) \cdot \frac{(k_1 + 1) \cdot tf_{r_2}(t)}{K(r_2) + tf_{r_2}(t)}

and w^{(1)}_R is the RSJ weight:

w^{(1)}_R(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}

K(r) = k_1 \left( (1 - b) + b \cdot \frac{|r|}{avgrl} \right)

where tf_r(t) is the frequency of the token t in string record r, |r| is the number of tokens in r, avgrl is the average number of tokens per record, N is the number of records in the relation R, n_t is the number of records containing the token t, and k_1, k_3 and b are independent parameters. We set these parameters based on the TREC-4 experiments [17]: k_1 ∈ [1, 2], k_3 = 8 and b ∈ [0.6, 0.75].
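A sketch of Equation (5) in Python; the corpus statistics (number of records N, document frequencies df, and average record length avgrl in tokens) are passed in, and the parameter defaults are our own picks from the TREC-4 ranges quoted above:

import math
from collections import Counter

def sim_bm25(r1, r2, N, df, avgrl, k1=1.2, k3=8.0, b=0.7, q=3):
    # BM25 score of r1 (query) against r2 (record) over their q-grams.
    tf1 = Counter(r1[i:i + q] for i in range(len(r1) - q + 1))
    tf2 = Counter(r2[i:i + q] for i in range(len(r2) - q + 1))
    K = k1 * ((1 - b) + b * sum(tf2.values()) / avgrl)
    score = 0.0
    for t in tf1.keys() & tf2.keys():
        w_rsj = math.log((N - df.get(t, 0) + 0.5) / (df.get(t, 0) + 0.5))
        w_hat = (k3 + 1) * tf1[t] / (k3 + tf1[t])          # query-side weight
        w2 = w_rsj * (k1 + 1) * tf2[t] / (K + tf2[t])      # record-side weight
        score += w_hat * w2
    return score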
3.3.3 Hidden Markov Model
Approximate string matching can be modeled by a discrete Hidden Markov process, which has shown better performance than Cosine w/tf-idf in the IR literature [16] and high accuracy and good running time for approximate selection [5]. This particular Markov model consists of only two states: the first state models the tokens that are specific to one particular "String", and the second state models the tokens in "General English", i.e., tokens that are common in many records. Refer to [5] and [16] for a complete description of the model and possible extensions.

The HMM similarity function accepts two string records r_1 and r_2 and returns the probability of generating r_1 given that r_2 is a similar record:

sim_{HMM}(r_1, r_2) = \prod_{t \in r_1} \left( a_0 P(t|GE) + a_1 P(t|r_2) \right)    (6)

where a_0 and a_1 = 1 - a_0 are the state transition probabilities of the Markov model, and P(t|GE) and P(t|r_2) are given by:

P(t|r_2) = \frac{\text{number of times } t \text{ appears in } r_2}{|r_2|}

P(t|GE) = \frac{\sum_{r \in R} \text{number of times } t \text{ appears in } r}{\sum_{r \in R} |r|}
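A sketch of Equation (6) in Python; ge_counts and ge_total hold the corpus-wide q-gram counts and their sum for the General English state, and the mixing weight a1 is our own assumption, since the paper does not report the value it uses:

from collections import Counter

def sim_hmm(r1, r2, ge_counts, ge_total, a1=0.8, q=3):
    # Probability of generating r1 given that r2 is a similar record,
    # mixing the record model P(t|r2) with the corpus model P(t|GE).
    tf2 = Counter(r2[i:i + q] for i in range(len(r2) - q + 1))
    len2 = sum(tf2.values())
    a0 = 1.0 - a1  # a1 = 0.8 is an assumed value, not from the paper
    p = 1.0
    for i in range(len(r1) - q + 1):
        t = r1[i:i + q]
        p_ge = ge_counts.get(t, 0) / ge_total
        p_r2 = tf2[t] / len2 if len2 else 0.0
        p *= a0 * p_ge + a1 * p_r2
    return p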
3.4 Hybrid Measures
The implementation of these measures involves two similarity functions: one that compares the strings by comparing their word tokens, and another that is more suitable for short strings and is used for comparing the word tokens themselves.
3.4.1 GES
The generalized edit similarity (GES) [7], a modified version of the fuzzy match similarity presented in [6], takes two strings r_1 and r_2, tokenizes the strings into sets of words, and assigns a weight w(t) to each token. GES defines the similarity between the two given strings based on the minimum transformation cost required to convert string r_1 to r_2:

sim_{GES}(r_1, r_2) = 1 - \min \left( \frac{tc(r_1, r_2)}{wt(r_1)}, 1.0 \right)    (7)

where wt(r_1) is the sum of the weights of all tokens in r_1 and tc(r_1, r_2) is the minimum cost of a sequence of the following transformation operations:

• token insertion: inserting a token t into r_1 with cost w(t) · c_ins, where the insertion factor c_ins is a constant between 0 and 1. In our experiments, c_ins = 1.

• token deletion: deleting a token t from r_1 with cost w(t).

• token replacement: replacing a token t_1 by t_2 in r_1 with cost (1 - sim_edit(t_1, t_2)) · w(t_1), where sim_edit is the edit similarity between t_1 and t_2.
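The definition above fixes the operation costs but not the search for the minimum-cost transformation; one natural reading treats the records as word-token sequences and runs a weighted edit-distance dynamic program over them, as in the following hedged Python sketch (it reuses the sim_edit sketch from Section 3.1 and, as in the text above, takes w(t) in the replacement cost to be the weight of the replaced token):

def ges_cost(tokens1, tokens2, w, c_ins=1.0):
    # Minimum-cost transformation of tokens1 into tokens2 using the
    # three token operations (a sketch; the paper does not spell out
    # the search procedure).
    n, m = len(tokens1), len(tokens2)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + w(tokens1[i - 1])            # delete
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + w(tokens2[j - 1]) * c_ins    # insert
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t1, t2 = tokens1[i - 1], tokens2[j - 1]
            rep = (1.0 - sim_edit(t1, t2)) * w(t1)           # replace
            D[i][j] = min(D[i - 1][j] + w(t1),
                          D[i][j - 1] + w(t2) * c_ins,
                          D[i - 1][j - 1] + rep)
    return D[n][m]

def sim_ges(r1, r2, w, c_ins=1.0):
    # GES of Equation (7); wt(r1) is the total weight of r1's tokens.
    t1, t2 = r1.split(), r2.split()
    wt1 = sum(w(t) for t in t1)
    if wt1 == 0:
        return 0.0
    return 1.0 - min(ges_cost(t1, t2, w, c_ins) / wt1, 1.0)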
3.4.2 SoftTFIDF
SoftTFIDF is another hybrid measure, proposed by Cohen et al. [8], which relies on the normalized tf-idf weights of word tokens and can work with an arbitrary similarity function for comparing word tokens. In this measure, the similarity score sim_{SoftTFIDF}(r_1, r_2) is defined as follows:

\sum_{t_1 \in C(\theta, r_1, r_2)} w(t_1, r_1) \cdot w\left( \arg\max_{t_2 \in r_2} sim(t_1, t_2), r_2 \right) \cdot \max_{t_2 \in r_2} sim(t_1, t_2)    (8)

where w(t, r) is the normalized tf-idf weight of word token t in record r and C(θ, r_1, r_2) is the set of tokens t_1 ∈ r_1 such that there exists t_2 ∈ r_2 with sim(t_1, t_2) > θ, for some similarity function sim() suitable for comparing word strings. In our experiments, sim(t_1, t_2) is the Jaro-Winkler similarity, as suggested in [8].
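A sketch of Equation (8) in Python; w must be the normalized tf-idf weight function over word tokens, and while the paper pairs SoftTFIDF with Jaro-Winkler, this sketch falls back on the sim_edit function from Section 3.1 to stay self-contained (the threshold default is likewise our own):

def sim_softtfidf(r1, r2, w, theta=0.9, sim=None):
    # Sum over tokens t1 of r1 whose closest token in r2 is more
    # similar than theta, weighting each term as in Equation (8).
    sim = sim or sim_edit        # word-level similarity (paper: Jaro-Winkler)
    tokens2 = r2.split()
    score = 0.0
    for t1 in r1.split():
        # Closest word token of r2 and its similarity to t1.
        t2, s = max(((t2, sim(t1, t2)) for t2 in tokens2),
                    key=lambda p: p[1])
        if s > theta:            # t1 is in C(theta, r1, r2)
            score += w(t1, r1) * w(t2, r2) * s
    return score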
4. EVALUATION
4.1 Datasets
In order to evaluate the effectiveness of the similarity measures described in the previous section, we use the same datasets used in [5]. These datasets were created using a modified version of the UIS data generator, which has previously been used for the evaluation of data cleaning and record linkage techniques [12, 1]. The data generator injects several types of errors into a clean database of string attributes. These errors include commonly occurring typing mistakes (edit errors: character insertion, deletion, replacement and swap), token swap, and abbreviation errors (e.g., replacing Inc. with Incorporated and vice versa).
The data generator has several parameters to control the
injected error in the data such as the size of the dataset to
be generated, the distribution of duplicates (uniform, Zip-
fian or Poisson), the percentage of erroneous duplicates, the
extent of error injected in each string, and the percentage
of different types of errors. The data generator keeps track
                      Percentage of
Group    Name   Erroneous    Errors in    Token    Abbr.
                Duplicates   Duplicates   Swap     Error
Dirty    D1     90           30           20       50
         D2     50           30           20       50
Medium   M1     30           30           20       50
         M2     10           30           20       50
         M3     90           10           20       50
         M4     50           10           20       50
Low      L1     30           10           20       50
         L2     10           10           20       50
Single   AB     50           0            0        50
Error    TS     50           0            20       0
         EDL    50           10           0        0
         EDM    50           20           0        0
         EDH    50           30           0        0

Table 1: Datasets Used in the Experiments
of the duplicate records by assigning a cluster ID to each
clean record and to all duplicates generated from that clean
record.
For the results presented in this paper, the datasets are
generated by the data generator out of a clean dataset of
2139 company names with average record length of 21.03
and an average of 2.9 words per record. The errors in
the datasets have a uniform distribution. For each dataset,
on average 5000 dirty records are created out of 500 clean
records. We have also run experiments on datasets gener-
ated using different parameters. For example, we generated
data using a Zipfian distribution, and we also used data from
another clean source (DBLP titles) as in [5]. We also cre-
ated larger datasets. For these other datasets, the accuracy
trends remain the same. Table 1 shows the description of
all the datasets used for the results in this paper. We use 8 different datasets with mixed types of errors (edit errors, token swap and abbreviation replacement). Moreover, we use 5 datasets with only a single type of error (edit errors, token swap or abbreviation replacement errors) to measure
the effect of each type of error individually. Following [5],
we believe the errors in these datasets are highly represen-
tative of common types of errors in databases with string
attributes.
4.2 Measures
We use well-known measures from IR, namely precision,
recall, and F1, for different values of threshold to evaluate
the accuracy of the similarity join operation. We perform a
self-join on the input table using a similarity measure with a
fixed threshold θ. Precision (Pr) is defined as the percentage of similar records among the records that have a similarity score above threshold θ. In our datasets, similar records are marked with the same cluster ID as described above. Recall (Re) is the ratio of the number of similar records that have a similarity score above threshold θ to the total number of similar records. A join that returns all the pairs of records in the two input tables as output has low (near zero) precision and recall of 1. A join that returns an empty answer has precision 1 and zero recall. The F1 measure is the harmonic mean of precision and recall, i.e.,

F_1 = \frac{2 \times Pr \times Re}{Pr + Re}    (9)
[Figure 3: Maximum F1 score for different measures on datasets with only edit errors]

We measure precision, recall, and F1 for different values of the similarity threshold θ. For comparing different similarity measures, we use the maximum F1 score across different thresholds.
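For concreteness, the following Python sketch (function and variable names are ours) performs this sweep for a self-join: it scores every record pair, labels a pair as truly similar when the generator assigned both records the same cluster ID, and reports the maximum F1 across the given thresholds, taking the precision of an empty answer to be 1 as above:

from itertools import combinations

def max_f1(records, clusters, sim, thresholds):
    # Score all pairs once; a pair is truly similar iff its two
    # cluster IDs match (clusters[i] is the ID of records[i]).
    pairs = [(sim(r1, r2), clusters[i] == clusters[j])
             for (i, r1), (j, r2) in combinations(list(enumerate(records)), 2)]
    total_similar = sum(1 for _, same in pairs if same)
    best = 0.0
    for theta in thresholds:
        returned = [same for score, same in pairs if score >= theta]
        tp = sum(returned)
        pr = tp / len(returned) if returned else 1.0
        re = tp / total_similar if total_similar else 0.0
        if pr + re > 0:
            best = max(best, 2 * pr * re / (pr + re))
    return best

This quadratic sweep is only for measuring accuracy; the scalable join algorithms of Section 2 exist precisely to avoid enumerating all pairs.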
4.3 Results
Figures 1 and 2 show the precision, recall, and F1 values for all measures described in Section 3, over the datasets we have defined with mixed types of errors. For all measures except HMM and BM25, the horizontal axis of the precision/recall graph is the value of the threshold. For HMM and BM25, the horizontal axis is the percentage of the maximum value of the threshold, since these measures do not return a score between 0 and 1.
Effect of amount of errors As shown in the precision/recall
curves in Figures 1 and 2, the “dirtiness” of the input data
greatly affects the value of the threshold that results in the
most accurate join. For all the measures, a lower value of
the threshold is needed as the degree of error in the data
increases. For example, Weighted Jaccard achieves the best
F1 score over the dirtiest datasets with threshold 0.3, while it achieves the best F1 for the cleanest datasets at threshold
0.55. BM25 and HMM are less sensitive and work well on
both dirty and clean group of datasets with the same value
of threshold. We will discuss later how the degree of error
in the data affects the choice of the most accurate measure.
Effect of types of errors Figure 3 shows the maximum F1 score across different values of the threshold for different measures on datasets containing only edit errors (the EDL, EDM and EDH datasets). The figure shows that weighted Jaccard and Cosine have the highest accuracy, followed by
Jaccard, and edit similarity on the low-error dataset EDL.
As the amount of edit error in each record increases, HMM performs as well as weighted Jaccard, while Jaccard, edit similarity, and GES perform much worse at high edit-error rates. Considering the fact that edit similarity is mainly
proposed for capturing edit errors, this shows the effective-
ness of weighted Jaccard and its robustness with varying
amount of edit errors. Figure 4 shows the effect of token
swap and abbreviation errors on the accuracy of different
measures. This experiment indicates that edit similarity is
not capable of modeling such types of errors. HMM, BM25 and Jaccard are also not capable of properly modeling abbreviation errors.
Comparison of measures Figure 5 shows the maximum F1 score across different values of the threshold for different measures on the dirty, medium and clean groups of datasets. (Here we have aggregated the results for all the dirty datasets together, and likewise for the medium and the clean datasets.)

[Figure 4: Maximum F1 score for different measures on datasets with only token swap and abbreviation errors]

[Figure 5: Maximum F1 score for different measures on the clean, medium and dirty groups of datasets]

The results show the effectiveness and robustness of weighted Jaccard and cosine
in comparison with other measures. Again, HMM is among
the most accurate measures when the data is extremely dirty
and has relatively low accuracy when the percentage of error
in the data is low.
Remark As stated in Section 2, the performance of many algorithms proposed for improving the scalability of the join operation highly depends on the value of the similarity threshold used for the join. Here we show the accuracy numbers on our datasets using the values of the threshold that make these algorithms effective. Specifically, we address the results in [2], although similar observations can be made for the results of other similar works in this area. Table 2 shows the F1 values for the thresholds that result in the best accuracy on our datasets and the best performance in the experimental results of [2]. The PartEnum and WtEnum algorithms presented in [2] significantly outperform previous algorithms at threshold 0.9, but have roughly the same performance as previously proposed algorithms such as LSH when a threshold of 0.8 or less is used. The results in Table 2 show that there is a big gap between the value of the threshold that results in the most accurate join on our datasets and the threshold that makes PartEnum and WtEnum effective in the studies in [2].
5. CONCLUSION
We have presented an overview of several similarity mea-
sures for efficient approximate string joins and thoroughly
evaluated their accuracy on several datasets with different
characteristics and common quality problems. Our results
show the effect of the amount and type of errors in the
                 Jaccard Join                       Weighted Jaccard Join
        Threshold                   F1      Threshold                   F1
Dirty   0.5  (Best Acc.)         0.293      0.3  (Best Acc.)         0.528
        0.8                      0.249      0.8                      0.249
        0.85                     0.248      0.85                     0.246
        0.9  (Best Performance)  0.247      0.9  (Best Performance)  0.244
Medium  0.65 (Best Acc.)         0.719      0.55 (Best Acc.)         0.776
        0.8                      0.611      0.8                      0.581
        0.85                     0.571      0.85                     0.581
        0.9  (Best Performance)  0.548      0.9  (Best Performance)  0.560
Clean   0.7  (Best Acc.)         0.887      0.55 (Best Acc.)         0.929
        0.8                      0.854      0.8                      0.831
        0.85                     0.831      0.85                     0.819
        0.9  (Best Performance)  0.812      0.9  (Best Performance)  0.807

Table 2: F1 score for thresholds that result in the best running time in previous performance studies and the highest accuracy on our datasets, for two selected similarity measures
datasets and the similarity threshold used for the similarity
measures on the accuracy of the join operation. Considering the fact that the effectiveness of many algorithms proposed for enhancing the scalability of approximate joins relies on the value chosen for the similarity threshold, our results highlight the value of those algorithms that are less sensitive to the value of the threshold, and open an interesting direction for future work: finding algorithms that are both efficient and accurate using the same threshold. Finding an
algorithm that determines the best value of the threshold
regardless of the type and amount of errors for the simi-
larity measures that showed higher accuracy in our work is
another interesting subject for future work.
6. REFERENCES
[1] P. Andritsos, A. Fuxman, and R. J. Miller. Clean
answers over dirty databases: A probabilistic
approach. In ICDE’06.
[2] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact
set-similarity joins. In VLDB’06.
[3] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all
pairs similarity search. In WWW’07.
[4] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and
S. Fienberg. Adaptive name matching in information
integration. IEEE Intelligent Systems, 18(5), 2003.
[5] A. Chandel, O. Hassanzadeh, N. Koudas, M. Sadoghi,
and D. Srivastava. Benchmarking declarative
approximate selection predicates. In SIGMOD’07.
[6] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani.
Robust and efficient fuzzy match for online data
cleaning. In SIGMOD’03.
[7] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive
operator for similarity joins in data cleaning. In ICDE
’06.
[8] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A
comparison of string distance metrics for
name-matching tasks. In IIWeb’03.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios.
Duplicate record detection: A survey. IEEE TKDE,
19(1), 2007.
[10] L. Gravano, P. G. Ipeirotis, H. V. Jagadish,
N. Koudas, S. Muthukrishnan, and D. Srivastava.
Approximate string joins in a database (almost) for
free. In VLDB’01.
[11] D. Gusfield. Algorithms on strings, trees, and
sequences: computer science and computational
biology. Cambridge University Press, New York, NY,
USA, 1997.
[12] M. A. Hernández and S. J. Stolfo. Real-world data is
dirty: Data cleansing and the merge/purge problem.
Data Mining and Knowledge Discovery, 2(1):9–37,
1998.
[13] P. Indyk, R. Motwani, P. Raghavan, and S. Vempala.
Locality-preserving hashing in multidimensional
spaces. In STOC’97.
[14] N. Koudas and D. Srivastava. Approximate joins:
Concepts and techniques. In VLDB’05 Tutorial.
[15] C. Li, B. Wang, and X. Yang. Vgram: Improving
performance of approximate queries on string
collections using variable-length grams. In VLDB’07.
[16] D. R. H. Miller, T. Leek, and R. M. Schwartz. A
hidden markov model information retrieval system. In
SIGIR’99.
[17] S. E. Robertson, S. Walker, M. Hancock-Beaulieu,
M. Gatford, and A. Payne. Okapi at trec-4. In
TREC’95.
[18] S. Sarawagi and A. Kirpal. Efficient set joins on
similarity predicates. In SIGMOD’04.
[Figure 1: Accuracy of Edit-Similarity, Jaccard and Weighted Jaccard measures relative to the value of the threshold on different datasets; for each measure, panels (a)-(c) show the low-error, medium-error and dirty datasets]

[Figure 2: Accuracy of measures from IR (Cosine w/tf-idf, BM25, HMM) and hybrid measures (SoftTFIDF, GES) relative to the value of the threshold on different datasets; for each measure, panels (a)-(c) show the low-error, medium-error and dirty datasets]