
Accuracy of Approximate String Joins Using Grams

Oktie Hassanzadeh
University of Toronto
10 King’s College Rd.
Toronto, ON M5S3G4, Canada
oktie@cs.toronto.edu

Mohammad Sadoghi
University of Toronto
10 King’s College Rd.
Toronto, ON M5S3G4, Canada
mo@cs.toronto.edu

Renée J. Miller
University of Toronto
10 King’s College Rd.
Toronto, ON M5S3G4, Canada
miller@cs.toronto.edu

ABSTRACT

Approximate join is an important part of many data cleaning and integration methodologies. Various similarity measures have been proposed for accurate and efficient matching of string attributes. The accuracy of these similarity measures depends strongly on the characteristics of the data, such as the amount and type of errors and the length of the strings. Recently, there has been increasing interest in methods based on q-grams (substrings of length q) made out of the strings, mainly due to their high efficiency. In this work, we evaluate the accuracy of the similarity measures used in these methodologies. We present an overview of several similarity measures based on q-grams, and we thoroughly compare their accuracy on several datasets with different characteristics. Since the efficiency of approximate joins depends on the similarity threshold they use, we study how the value of the threshold (including values used in recent performance studies) affects the accuracy of the join. We also compare different measures based on the highest accuracy they can achieve on different datasets.

1. INTRODUCTION

Data quality is a major concern in operational databases and data warehouses. Errors may be present in the data for a multitude of reasons, including data entry errors, lack of common standards, and missing integrity constraints. String data is by nature more prone to such errors. Approximate join is an important part of many data cleaning methodologies and is well-studied: given two large relations, identify all pairs of records that approximately match. A variety of similarity measures have been proposed for string data in order to match records. Each measure has certain characteristics that make it suitable for capturing certain types of errors. Using a string similarity function sim() in the approximate join algorithm, all pairs of records with a similarity score above a threshold θ are considered to approximately match and are returned as the output.

Performing approximate join on a large relation is a notoriously time-consuming task.


Recently, there has been increasing interest in approximate join techniques based on q-grams (substrings of length q) made out of strings. Most of the efficient approximate join algorithms (which we describe in Section 2) are based on a specific similarity measure, along with a fixed threshold value, and return the pairs of records whose similarity is greater than the threshold. The effectiveness of the majority of these algorithms depends on the value of the threshold used. However, there has been little work studying the accuracy of the join operation. The accuracy is known to be dataset-dependent, and there is no common framework for evaluating and comparing the accuracy of different similarity measures and techniques. This makes comparing their accuracy a difficult task. Nevertheless, we argue that it is possible to evaluate the relative performance of different measures for approximate joins by using datasets containing different types of known quality problems, such as typing errors and differences in notations and abbreviations.

In this paper, we present an overview of several similarity measures for approximate string joins using q-grams and thoroughly evaluate their accuracy for different threshold values and on datasets with different amounts and types of errors. Our results include:

• We show that for all similarity measures, the value of the threshold that results in the most accurate join depends heavily on the type and amount of errors in the data.

• We compare different similarity measures by comparing the maximum accuracy they can achieve on different datasets using different thresholds. Although choosing a proper threshold for the similarity measures without prior knowledge of the data characteristics is known to be a difficult task, our results show which measures can potentially be more accurate assuming that there is a way to determine the best threshold. Therefore, an interesting direction for future work is to find an algorithm for determining the value of the threshold for the most accurate measures.

• We show how the amount and type of errors affect the best value of the threshold. An interesting consequence is that many previously proposed algorithms for enhancing the performance of the join operation and making it scalable to large datasets are not effective enough in many scenarios, since the performance of these algorithms depends heavily on choosing a high threshold value, which can result in very low accuracy. This shows the value of those algorithms that are less sensitive to the value of the threshold and opens another interesting direction for future work: finding algorithms that are both efficient and accurate using the same threshold.

The paper is organized as follows. In Section 2, we give an overview of related work on approximate joins. We present our framework for approximate joins and a description of the similarity measures used in Section 3. Section 4 presents a thorough evaluation of these measures, and finally, Section 5 concludes the paper and outlines future directions.

2. RELATED WORK

Approximate join, also known as similarity join or record linkage, has been extensively studied in the literature. Several similarity measures for string data have been proposed [14, 4, 5]. A recent survey [9] presents an excellent overview of different types of string similarity measures. Recently, there has been increasing interest in using measures from the Information Retrieval (IR) field along with q-grams made out of strings [10, 6, 2, 18, 5]. In this approach, strings are treated as documents and q-grams are treated as tokens in the documents. This makes it possible to take advantage of several indexing techniques as well as various algorithms that have been proposed for efficient set-similarity joins. Furthermore, these measures can be implemented declaratively over a DBMS with vanilla SQL statements [5].

Various recent works address the problem of efficiency and scalability of the similarity join operation for large datasets [6, 2, 18]. Many techniques have been proposed for set-similarity join, which can be used along with q-grams for the purpose of (string) similarity joins. Most of these techniques are based on the idea of creating signatures for sets (strings) to reduce the search space. Some signature generation schemes are derived from dimensionality reduction for the similarity search problem in high-dimensional spaces. One efficient approach uses the idea of Locality Sensitive Hashing (LSH) [13] to hash similar sets to the same values with high probability, and is therefore an approximate solution to the problem. Arasu et al. [2] propose algorithms specifically for set-similarity joins that are exact and outperform previous approximation methods in their framework, although the parameters of the algorithms require extensive tuning. Another class of work is based on indexing algorithms, primarily derived from IR optimization techniques. A recent proposal in this area [3] presents algorithms based on novel indexing and optimization strategies that do not rely on approximation or extensive parameter tuning and outperform previous state-of-the-art approaches. More recently, Li et al. [15] propose VGRAM, a technique based on the idea of using variable-length grams instead of q-grams. At a high level, it can be viewed as an efficient index structure over the collection of strings. VGRAM can be used along with previously proposed signature-based algorithms to significantly improve their efficiency.

Most of the techniques described above mainly address the scalability of the join operation and not its accuracy. The choice of similarity measure is often limited in these algorithms. The signature-based algorithm of [6] also considers accuracy by introducing a novel similarity measure called fuzzy match similarity and creating signatures for this measure. However, the accuracy of this measure is not compared with other measures. In [5], several such similarity measures are benchmarked for approximate selection, which is a special case of similarity join. Given a relation R, the approximate selection operation using similarity predicate sim() reports all tuples t ∈ R such that sim(t_q, t) ≥ θ, where θ is a specified numerical similarity threshold and t_q is a query string. While several predicates are introduced and benchmarked in [5], the extension of approximate selection to approximate joins is not considered. Furthermore, the effect of threshold values on the accuracy of approximate joins is also not considered.

3. FRAMEWORK

In this section, we explain our framework for similarity join. The similarity join of two relations R = {r_i : 1 ≤ i ≤ N_1} and S = {s_j : 1 ≤ j ≤ N_2} outputs a set of pairs (r_i, s_j) ∈ R × S where r_i and s_j are similar. Two records are considered similar when their similarity score based on a similarity function sim() is above a threshold θ. For the definitions and experiments in this paper, we assume we are performing a self-join on relation R. Therefore the output is a set of pairs (r_i, r_j) ∈ R × R where sim(r_i, r_j) ≥ θ for some similarity function sim() and a threshold θ. This is a common operation in many applications such as entity resolution and clustering. In keeping with many approximate join methods, we model records as strings. We denote by r the set of q-grams (sequences of q consecutive characters of a string) in r. For example, for t = ‘db lab’, tokenization using 3-grams gives t = {‘db ’, ‘b l’, ‘ la’, ‘lab’}. In certain cases, a weight may be associated with each token.
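To make the tokenization concrete, the following is a minimal Python sketch of q-gram extraction; the function name qgrams and the lack of padding at the string ends are our own simplifications (real implementations often pad with q−1 special characters), not part of the paper's framework.

def qgrams(s, q=3):
    """Return the set of q-grams (substrings of length q) of string s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

# Example from the text: 3-grams of 'db lab'
print(qgrams('db lab'))  # {'db ', 'b l', ' la', 'lab'}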

The similarity measures discussed here are based on q-grams created from strings and have been shown to be effective in previous work [5]. These measures share one or both of the following properties:

• High scalability: various techniques have been proposed in the literature, as described in Section 2, for enhancing the performance of the similarity join operation using q-grams along with these measures.

• High accuracy: previous work has shown that in most scenarios these measures perform as well as or better than other string similarity measures in terms of accuracy. Specifically, these measures have shown good accuracy in name-matching tasks [8] and in approximate selection [5].

3.1 Edit Similarity

Edit distance is widely used as the measure of choice in many similarity join techniques. Specifically, previous work [10] has shown how to use q-grams for an efficient implementation of this measure in a declarative framework. Recent work on enhancing the performance of similarity joins has also proposed techniques for scalable implementation of this measure [2, 15].

The edit distance between two string records r1 and r2 is defined as the transformation cost of r1 to r2, tc(r1, r2), which is equal to the minimum cost of edit operations applied to r1 to transform it to r2. Edit operations include character copy, insert, delete and substitute [11]. The edit similarity is defined as:

\[ sim_{edit}(r_1, r_2) = 1 - \frac{tc(r_1, r_2)}{\max\{|r_1|, |r_2|\}} \tag{1} \]

There is a cost associated with each edit operation, and several cost models have been proposed for this measure. The most commonly used variant, Levenshtein edit distance, which we refer to as edit distance in this paper, uses unit cost for all operations except copy, which has cost zero.
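As a hedged illustration, the following Python sketch computes the Levenshtein distance under the unit-cost model above and the edit similarity of Equation (1); it is a straightforward dynamic program, not the q-gram-based implementation of [10].

def edit_distance(r1, r2):
    """Levenshtein distance: unit cost for insert, delete and substitute; copy is free."""
    m, n = len(r1), len(r2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if r1[i - 1] == r2[j - 1] else 1  # copy vs. substitute
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost)
    return dp[m][n]

def sim_edit(r1, r2):
    """Edit similarity of Equation (1)."""
    if not r1 and not r2:
        return 1.0
    return 1.0 - edit_distance(r1, r2) / max(len(r1), len(r2))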

3.2 Jaccard and Weighted Jaccard

Jaccard similarity is the fraction of tokens in r1 and r2 that are present in both. Weighted Jaccard similarity is the weighted version of Jaccard similarity, i.e.,

\[ sim_{WJaccard}(r_1, r_2) = \frac{\sum_{t \in r_1 \cap r_2} w_R(t)}{\sum_{t \in r_1 \cup r_2} w_R(t)} \tag{2} \]

where w_R(t) is a weight function that reflects the commonality of the token t in the relation R. We choose the RSJ (Robertson-Sparck Jones) weight for the tokens, which was shown to be more effective than the commonly used Inverse Document Frequency (IDF) weight [5]:

\[ w_R(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5} \tag{3} \]

where N is the number of tuples in the base relation R and n_t is the number of tuples in R containing the token t.
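A minimal Python sketch of Equations (2) and (3), assuming each record has already been tokenized into a set of q-grams; note that the RSJ weight can become negative for tokens that occur in more than half of the records, which this sketch does not handle specially.

import math

def rsj_weights(relation_tokens):
    """RSJ weight (Equation 3) for every token appearing in the relation.
    relation_tokens: list of token sets, one per record."""
    N = len(relation_tokens)
    n_t = {}
    for tokens in relation_tokens:
        for t in tokens:
            n_t[t] = n_t.get(t, 0) + 1
    return {t: math.log((N - n + 0.5) / (n + 0.5)) for t, n in n_t.items()}

def sim_wjaccard(tokens1, tokens2, w):
    """Weighted Jaccard similarity (Equation 2); w maps tokens to weights."""
    inter = sum(w.get(t, 0.0) for t in tokens1 & tokens2)
    union = sum(w.get(t, 0.0) for t in tokens1 | tokens2)
    return inter / union if union > 0 else 0.0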

3.3 Measures from IR

A well-studied problem in information retrieval is the following: given a query and a collection of documents, return the documents most relevant to the query. In the measures in this section, records are treated as documents and q-grams are seen as words (tokens) of the documents. Therefore, the same techniques used for finding the documents relevant to a query can be used to return the records similar to a query string. In the rest of this section, we present three measures that previous work has shown to perform well for the approximate selection problem [5].

3.3.1 Cosine w/tf-idf

The tf-idf cosine similarity is a well-established measure in the IR community which leverages the vector space model. This measure determines the closeness of the input strings r1 and r2 by first transforming the strings into unit vectors and then measuring the angle between the corresponding vectors. The cosine similarity with tf-idf weights is given by:

\[ sim_{Cosine}(r_1, r_2) = \sum_{t \in r_1 \cap r_2} w_{r_1}(t) \cdot w_{r_2}(t) \tag{4} \]

where w_{r_1}(t) and w_{r_2}(t) are the normalized tf-idf weights of each common token in r1 and r2 respectively. The normalized tf-idf weight of token t in a given string record r is defined as follows:

\[ w_r(t) = \frac{w'_r(t)}{\sqrt{\sum_{t' \in r} w'_r(t')^2}}, \qquad w'_r(t) = tf_r(t) \cdot idf(t) \]

where tf_r(t) is the term frequency of token t within string r and idf(t) is the inverse document frequency with respect to the entire relation R.
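The following Python sketch builds normalized tf-idf vectors over q-gram tokens and evaluates Equation (4); the particular idf formula log(N/df) is a common variant we assume here, since the paper does not spell one out.

import math
from collections import Counter

def tfidf_vectors(records_tokens):
    """Normalized tf-idf vectors for a list of records, each a list of tokens."""
    N = len(records_tokens)
    df = Counter()
    for tokens in records_tokens:
        df.update(set(tokens))
    idf = {t: math.log(N / d) for t, d in df.items()}  # assumed idf variant
    vectors = []
    for tokens in records_tokens:
        tf = Counter(tokens)
        raw = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: v / norm for t, v in raw.items()} if norm > 0 else {})
    return vectors

def sim_cosine(v1, v2):
    """Cosine similarity of Equation (4): dot product of the two unit vectors."""
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())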

3.3.2 BM25

The BM25 similarity score for a query r1 and a string record r2 is defined as follows:

\[ sim_{BM25}(r_1, r_2) = \sum_{t \in r_1 \cap r_2} \hat{w}_{r_1}(t) \cdot w_{r_2}(t) \tag{5} \]

where

\[ \hat{w}_{r_1}(t) = \frac{(k_3 + 1) \cdot tf_{r_1}(t)}{k_3 + tf_{r_1}(t)}, \qquad w_{r_2}(t) = w^{(1)}_R(t) \cdot \frac{(k_1 + 1) \cdot tf_{r_2}(t)}{K(r_2) + tf_{r_2}(t)} \]

and w^{(1)}_R is the RSJ weight:

\[ w^{(1)}_R(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}, \qquad K(r) = k_1 \left( (1 - b) + b \, \frac{|r|}{avgrl} \right) \]

where tf_r(t) is the frequency of the token t in string record r, |r| is the number of tokens in r, avgrl is the average number of tokens per record, N is the number of records in the relation R, n_t is the number of records containing the token t, and k_1, k_3 and b are independent parameters. We set these parameters based on the TREC-4 experiments [17], where k_1 ∈ [1, 2], k_3 = 8 and b ∈ [0.6, 0.75].
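A hedged Python sketch of Equation (5); the default values k1 = 1.2 and b = 0.7 are illustrative picks from the ranges quoted above, not values fixed by the paper.

import math
from collections import Counter

def sim_bm25(r1_tokens, r2_tokens, relation_tokens, k1=1.2, k3=8.0, b=0.7):
    """BM25 score (Equation 5) of query r1 against record r2.
    relation_tokens: list of token lists, one for every record in the relation."""
    N = len(relation_tokens)
    avgrl = sum(len(r) for r in relation_tokens) / N
    n_t = Counter()
    for r in relation_tokens:
        n_t.update(set(r))          # number of records containing each token

    tf1, tf2 = Counter(r1_tokens), Counter(r2_tokens)
    K = k1 * ((1 - b) + b * len(r2_tokens) / avgrl)
    score = 0.0
    for t in set(tf1) & set(tf2):
        w_rsj = math.log((N - n_t[t] + 0.5) / (n_t[t] + 0.5))
        w_hat = (k3 + 1) * tf1[t] / (k3 + tf1[t])
        w_r2 = w_rsj * (k1 + 1) * tf2[t] / (K + tf2[t])
        score += w_hat * w_r2
    return score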

3.3.3 Hidden Markov Model

Approximate string matching can be modeled by a discrete Hidden Markov process, which has shown better performance than Cosine w/tf-idf in the IR literature [16] and high accuracy and fast running time for approximate selection [5]. This particular Markov model consists of only two states: the first state models the tokens that are specific to one particular “String”, and the second state models the tokens of “General English”, i.e., tokens that are common in many records. Refer to [5] and [16] for a complete description of the model and possible extensions.

The HMM similarity function accepts two string records r1 and r2 and returns the probability of generating r1 given that r2 is a similar record:

\[ sim_{HMM}(r_1, r_2) = \prod_{t \in r_1} \bigl( a_0 P(t \mid GE) + a_1 P(t \mid r_2) \bigr) \tag{6} \]

where a_0 and a_1 = 1 − a_0 are the state transition probabilities of the Markov model, and P(t | GE) and P(t | r_2) are given by:

\[ P(t \mid r_2) = \frac{\text{number of times } t \text{ appears in } r_2}{|r_2|}, \qquad P(t \mid GE) = \frac{\sum_{r \in R} \text{number of times } t \text{ appears in } r}{\sum_{r \in R} |r|} \]
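A minimal Python sketch of Equation (6); the state probability a1 = 0.2 is an illustrative assumption, since the paper does not report the value it uses.

from collections import Counter

def sim_hmm(r1_tokens, r2_tokens, relation_tokens, a1=0.2):
    """HMM score (Equation 6): probability of generating r1 from r2's model."""
    a0 = 1.0 - a1
    # 'General English' model: token frequencies pooled over the entire relation
    ge = Counter()
    total = 0
    for r in relation_tokens:
        ge.update(r)
        total += len(r)
    tf2 = Counter(r2_tokens)
    score = 1.0
    for t in r1_tokens:
        p_ge = ge[t] / total if total else 0.0
        p_r2 = tf2[t] / len(r2_tokens) if r2_tokens else 0.0
        score *= a0 * p_ge + a1 * p_r2
    return score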

3.4 Hybrid Measures

The implementation of these measures involves two similarity functions: one that compares the strings by comparing their word tokens, and another that is more suitable for short strings and is used to compare the word tokens themselves.

3.4.1 GES

The generalized edit similarity (GES) [7], a modified version of the fuzzy match similarity presented in [6], takes two strings r1 and r2, tokenizes the strings into sets of words, and assigns a weight w(t) to each token. GES defines the similarity between the two given strings in terms of the minimum transformation cost required to convert string r1 to r2 and is given by:

\[ sim_{GES}(r_1, r_2) = 1 - \min\left\{ \frac{tc(r_1, r_2)}{wt(r_1)},\ 1.0 \right\} \tag{7} \]

where wt(r1) is the sum of the weights of all tokens in r1 and tc(r1, r2) is the cost of a minimum-cost sequence of the following transformation operations (a sketch of this computation follows the list below):

• token insertion: inserting a token t into r1 with cost w(t) · c_ins, where c_ins is the insertion factor constant and is in the range between 0 and 1. In our experiments, c_ins = 1.

• token deletion: deleting a token t from r1 with cost w(t).

• token replacement: replacing a token t1 by t2 in r1 with cost (1 − sim_edit(t1, t2)) · w(t1), where sim_edit is the edit similarity between t1 and t2.
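One way to compute tc(r1, r2) is a token-level dynamic program, sketched below in Python under our own formulation of the cost recurrence (it reuses sim_edit from Section 3.1 and takes the token-weight function w as an argument); this is a hedged sketch, not the implementation of [7].

def sim_ges(r1_words, r2_words, w, c_ins=1.0):
    """GES (Equation 7), computed as a weighted edit distance over word tokens."""
    m, n = len(r1_words), len(r2_words)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):                        # delete every token of r1
        dp[i][0] = dp[i - 1][0] + w(r1_words[i - 1])
    for j in range(1, n + 1):                        # insert every token of r2
        dp[0][j] = dp[0][j - 1] + w(r2_words[j - 1]) * c_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t1, t2 = r1_words[i - 1], r2_words[j - 1]
            repl = (1.0 - sim_edit(t1, t2)) * w(t1)            # replacement
            dp[i][j] = min(dp[i - 1][j] + w(t1),               # deletion
                           dp[i][j - 1] + w(t2) * c_ins,       # insertion
                           dp[i - 1][j - 1] + repl)
    wt_r1 = sum(w(t) for t in r1_words)
    return 1.0 - min(dp[m][n] / wt_r1, 1.0) if wt_r1 > 0 else 0.0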

3.4.2 SoftTFIDF

SoftTFIDF is another hybrid measure, proposed by Cohen et al. [8], which relies on the normalized tf-idf weights of word tokens and can work with an arbitrary similarity function to find the similarity between word tokens. In this measure, the similarity score sim_SoftTFIDF is defined as follows:

\[ sim_{SoftTFIDF}(r_1, r_2) = \sum_{t_1 \in C(\theta, r_1, r_2)} w(t_1, r_1) \cdot w\Bigl(\arg\max_{t_2 \in r_2} sim(t_1, t_2),\ r_2\Bigr) \cdot \max_{t_2 \in r_2} sim(t_1, t_2) \tag{8} \]

where w(t, r) is the normalized tf-idf weight of word token t in record r and C(θ, r1, r2) returns the set of tokens t1 ∈ r1 such that there exists t2 ∈ r2 with sim(t1, t2) > θ, for some similarity function sim() suitable for comparing word strings. In our experiments, sim(t1, t2) is the Jaro-Winkler similarity, as suggested in [8].
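A hedged Python sketch of Equation (8); w1 and w2 stand for the normalized tf-idf weight dictionaries of the two records' word tokens, word_sim stands in for the secondary similarity (Jaro-Winkler in the paper, though any function returning values in [0, 1], such as sim_edit above, can be plugged in for experimentation), and the inner threshold theta = 0.9 is an illustrative default.

def sim_soft_tfidf(r1_words, r2_words, w1, w2, word_sim, theta=0.9):
    """SoftTFIDF score of Equation (8)."""
    score = 0.0
    for t1 in r1_words:
        # closest token of r2 to t1 under the secondary similarity
        best_t2, best_sim = None, 0.0
        for t2 in r2_words:
            s = word_sim(t1, t2)
            if s > best_sim:
                best_t2, best_sim = t2, s
        if best_t2 is not None and best_sim > theta:   # t1 is in C(theta, r1, r2)
            score += w1.get(t1, 0.0) * w2.get(best_t2, 0.0) * best_sim
    return score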

4. EVALUATION

4.1 Datasets

In order to evaluate the effectiveness of the different similarity measures described in the previous section, we use the same datasets used in [5]. These datasets were created using a modified version of the UIS data generator, which has previously been used for the evaluation of data cleaning and record linkage techniques [12, 1]. The data generator has the ability to inject several types of errors into a clean database of string attributes. These errors include commonly occurring typing mistakes (edit errors: character insertion, deletion, replacement and swap), token swaps, and abbreviation errors (e.g., replacing Inc. with Incorporated and vice versa).

The data generator has several parameters to control the errors injected into the data, such as the size of the dataset to be generated, the distribution of duplicates (uniform, Zipfian or Poisson), the percentage of erroneous duplicates, the extent of error injected in each string, and the percentage of different types of errors. The data generator keeps track of the duplicate records by assigning a cluster ID to each clean record and to all duplicates generated from that clean record.

                     Percentage of
Group         Name   Erroneous    Errors in    Token   Abbr.
                     Duplicates   Duplicates   Swap    Error
Dirty         D1     90           30           20      50
              D2     50           30           20      50
Medium        M1     30           30           20      50
              M2     10           30           20      50
              M3     90           10           20      50
              M4     50           10           20      50
Low           L1     30           10           20      50
              L2     10           10           20      50
Single Error  AB     50            0            0      50
              TS     50            0           20       0
              EDL    50           10            0       0
              EDM    50           20            0       0
              EDH    50           30            0       0

Table 1: Datasets Used in the Experiments

For the results presented in this paper, the datasets are generated by the data generator from a clean dataset of 2139 company names with an average record length of 21.03 characters and an average of 2.9 words per record. The errors in the datasets have a uniform distribution. For each dataset, on average 5000 dirty records are created out of 500 clean records. We have also run experiments on datasets generated using different parameters. For example, we generated data using a Zipfian distribution, and we also used data from another clean source (DBLP titles) as in [5]. We also created larger datasets. For these other datasets, the accuracy trends remain the same. Table 1 describes all the datasets used for the results in this paper. We used 8 different datasets with mixed types of errors (edit errors, token swap and abbreviation replacement). Moreover, we used 5 datasets with only a single type of error (edit errors, token swap or abbreviation replacement errors) to measure the effect of each type of error individually. Following [5], we believe the errors in these datasets are highly representative of common types of errors in databases with string attributes.

4.2 Measures

We use well-known measures from IR, namely precision, recall, and F1, for different values of the threshold to evaluate the accuracy of the similarity join operation. We perform a self-join on the input table using a similarity measure with a fixed threshold θ. Precision (Pr) is defined as the percentage of similar records among the records that have similarity score above threshold θ. In our datasets, similar records are marked with the same cluster ID, as described above. Recall (Re) is the ratio of the number of similar records that have similarity score above threshold θ to the total number of similar records. A join that returns all pairs of records in the two input tables as output has low (near zero) precision and recall of 1. A join that returns an empty answer has precision 1 and zero recall. The F1 measure is the harmonic mean of precision and recall, i.e.,

\[ F_1 = \frac{2 \times Pr \times Re}{Pr + Re} \tag{9} \]

We measure precision, recall, and F1 for different values of the similarity threshold θ. For comparison of different similarity measures, we use the maximum F1 score across the different thresholds.

Figure 3: Maximum F1 score for different measures on datasets with only edit errors
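The following Python sketch computes these measures for one threshold, assuming the join output is a set of unordered record-id pairs and that each record id maps to the cluster ID assigned by the data generator; it is a minimal illustration, not the evaluation harness used in the experiments.

from itertools import combinations

def precision_recall_f1(reported_pairs, cluster_id):
    """Precision, recall and F1 of a join result against generator cluster IDs.
    reported_pairs: set of frozenset({i, j}) pairs with similarity >= theta.
    cluster_id: dict mapping record id -> cluster ID."""
    by_cluster = {}
    for rid, cid in cluster_id.items():
        by_cluster.setdefault(cid, []).append(rid)
    # ground truth: every pair of records sharing a cluster ID
    true_pairs = {frozenset(p) for members in by_cluster.values()
                  for p in combinations(members, 2)}
    tp = len(reported_pairs & true_pairs)
    pr = tp / len(reported_pairs) if reported_pairs else 1.0
    re = tp / len(true_pairs) if true_pairs else 1.0
    f1 = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
    return pr, re, f1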

4.3 Results

Figures 1 and 2 show the precision, recall, and F1 values for all measures described in Section 3, over the datasets we have defined with mixed types of errors. For all measures except HMM and BM25, the horizontal axis of the precision/recall graph is the value of the threshold. For HMM and BM25, the horizontal axis is the percentage of the maximum value of the threshold, since these measures do not return a score between 0 and 1.

Effect of the amount of errors. As shown in the precision/recall curves in Figures 1 and 2, the “dirtiness” of the input data greatly affects the value of the threshold that results in the most accurate join. For all the measures, a lower value of the threshold is needed as the degree of error in the data increases. For example, Weighted Jaccard achieves the best F1 score over the dirtiest datasets with threshold 0.3, while it achieves the best F1 for the cleanest datasets at threshold 0.55. BM25 and HMM are less sensitive and work well on both the dirty and the clean groups of datasets with the same value of the threshold. We will discuss later how the degree of error in the data affects the choice of the most accurate measure.

Effect of the types of errors. Figure 3 shows the maximum F1 score over different values of the threshold for the different measures on the datasets containing only edit errors (the EDL, EDM and EDH datasets). The figure shows that Weighted Jaccard and Cosine have the highest accuracy, followed by Jaccard and edit similarity, on the low-error dataset EDL. As the amount of edit error in each record increases, HMM performs as well as Weighted Jaccard, while Jaccard, edit similarity and GES perform much worse at high edit error rates. Considering the fact that edit similarity is mainly intended to capture edit errors, this shows the effectiveness of Weighted Jaccard and its robustness to varying amounts of edit errors. Figure 4 shows the effect of token swap and abbreviation errors on the accuracy of the different measures. This experiment indicates that edit similarity is not capable of modeling such types of errors. HMM, BM25 and Jaccard are also not capable of modeling abbreviation errors properly.

Figure 4: Maximum F1 score for different measures on datasets with only token swap and abbreviation errors

Figure 5: Maximum F1 score for different measures on the clean, medium and dirty groups of datasets

Comparison of measures. Figure 5 shows the maximum F1 score over different values of the threshold for the different measures on the dirty, medium and clean groups of datasets. (Here we have aggregated the results for all the dirty datasets together, and respectively for the moderately dirty (medium) datasets and the clean datasets.) The results show the effectiveness and robustness of Weighted Jaccard and Cosine in comparison with the other measures. Again, HMM is among the most accurate measures when the data is extremely dirty, and has relatively low accuracy when the percentage of error in the data is low.

Remark. As stated in Section 2, the performance of many algorithms proposed for improving the scalability of the join operation depends heavily on the value of the similarity threshold used for the join. Here we show the accuracy numbers on our datasets using the values of the threshold that make these algorithms effective. Specifically, we address the results in [2], although similar observations can be made for the results of other similar works in this area. Table 2 shows the F1 value for the thresholds that result in the best accuracy on our datasets and in the best performance in the experimental results of [2]. The PartEnum and WtEnum algorithms presented in [2] significantly outperform previous algorithms for threshold 0.9, but have roughly the same performance as previously proposed algorithms such as LSH when a threshold of 0.8 or less is used. The results in Table 2 show that there is a big gap between the value of the threshold that results in the most accurate join on our datasets and the threshold that makes PartEnum and WtEnum effective in the studies of [2].

                Jaccard Join                      Weighted Jaccard Join
        Threshold                  F1     Threshold                  F1
Dirty   0.5  (Best Acc.)        0.293     0.3  (Best Acc.)        0.528
        0.8                     0.249     0.8                     0.249
        0.85                    0.248     0.85                    0.246
        0.9  (Best Performance) 0.247     0.9  (Best Performance) 0.244
Medium  0.65 (Best Acc.)        0.719     0.55 (Best Acc.)        0.776
        0.8                     0.611     0.8                     0.581
        0.85                    0.571     0.85                    0.581
        0.9  (Best Performance) 0.548     0.9  (Best Performance) 0.560
Clean   0.7  (Best Acc.)        0.887     0.55 (Best Acc.)        0.929
        0.8                     0.854     0.8                     0.831
        0.85                    0.831     0.85                    0.819
        0.9  (Best Performance) 0.812     0.9  (Best Performance) 0.807

Table 2: F1 scores for the thresholds that result in the best running time in previous performance studies and for the thresholds with the highest accuracy on our datasets, for two selected similarity measures

5. CONCLUSION

We have presented an overview of several similarity measures for efficient approximate string joins and thoroughly evaluated their accuracy on several datasets with different characteristics and common quality problems. Our results show the effect of the amount and type of errors in the datasets, and of the similarity threshold used with the similarity measures, on the accuracy of the join operation. Considering the fact that the effectiveness of many algorithms proposed for enhancing the scalability of approximate joins relies on the value chosen for the similarity threshold, our results show the value of those algorithms that are less sensitive to the threshold, and open an interesting direction for future work: finding algorithms that are both efficient and accurate using the same threshold. Finding an algorithm that determines the best value of the threshold, regardless of the type and amount of errors, for the similarity measures that showed higher accuracy in our work is another interesting subject for future work.

6. REFERENCES

[1] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE’06.
[2] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB’06.
[3] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW’07.
[4] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5), 2003.
[5] A. Chandel, O. Hassanzadeh, N. Koudas, M. Sadoghi, and D. Srivastava. Benchmarking declarative approximate selection predicates. In SIGMOD’07.
[6] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD’03.
[7] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE’06.
[8] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWeb’03.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE TKDE, 19(1), 2007.
[10] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB’01.
[11] D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York, NY, USA, 1997.
[12] M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9–37, 1998.
[13] P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-preserving hashing in multidimensional spaces. In STOC’97.
[14] N. Koudas and D. Srivastava. Approximate joins: Concepts and techniques. In VLDB’05 Tutorial.
[15] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB’07.
[16] D. R. H. Miller, T. Leek, and R. M. Schwartz. A hidden Markov model information retrieval system. In SIGIR’99.
[17] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, M. Gatford, and A. Payne. Okapi at TREC-4. In TREC’95.
[18] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD’04.

Figure 1: Accuracy of the Edit Similarity, Jaccard and Weighted Jaccard measures relative to the value of the threshold on different datasets; for each measure, panels show (a) low error, (b) medium error and (c) dirty datasets.

Figure 2: Accuracy of the IR-based and hybrid measures (Cosine w/tf-idf, BM25, HMM, SoftTFIDF, GES) relative to the value of the threshold on different datasets; for each measure, panels show (a) low error, (b) medium error and (c) dirty datasets.