Accuracy of Approximate String Joins Using Grams
Oktie Hassanzadeh
University of Toronto
10 King’s College Rd.
Toronto, ON M5S3G4, Canada
Mohammad Sadoghi
University of Toronto
10 King’s College Rd.
Toronto, ON M5S3G4, Canada
Renée J. Miller
University of Toronto
10 King’s College Rd.
Toronto, ON M5S3G4, Canada
ABSTRACT
Approximate join is an important part of many data cleaning and integration methodologies. Various similarity measures have been proposed for accurate and efficient matching of string attributes. The accuracy of these similarity measures depends highly on the characteristics of the data, such as the amount and type of errors and the length of the strings. Recently, there has been increasing interest in methods based on q-grams (substrings of length q) extracted from the strings, mainly due to their high efficiency. In this work, we evaluate the accuracy of the similarity measures used in these methodologies. We present an overview of several similarity measures based on q-grams and thoroughly compare their accuracy on several datasets with different characteristics. Since the efficiency of approximate joins depends on the similarity threshold they use, we study how the value of the threshold (including values used in recent performance studies) affects the accuracy of the join. We also compare the measures based on the highest accuracy they can achieve on different datasets.
1. INTRODUCTION
Data quality is a major concern in operational databases and data warehouses. Errors may be present in the data for a multitude of reasons, including data entry errors, lack of common standards, and missing integrity constraints. String data is by nature more prone to such errors. Approximate join is an important part of many data cleaning methodologies and is well studied: given two large relations, identify all pairs of records that approximately match. A variety of similarity measures have been proposed for matching string data. Each measure has characteristics that make it suitable for capturing certain types of errors. Using a string similarity function sim(), the approximate join algorithm returns as output all pairs of records whose similarity score is above a threshold θ.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '07, September 23-28, 2007, Vienna, Austria.
Copyright 2007 VLDB Endowment, ACM 978-1-59593-649-3/07/09.

Performing an approximate join on a large relation is a notoriously time-consuming task. Recently, there has been increasing interest in approximate join techniques based on q-grams (substrings of length q) extracted from strings. Most of the efficient approximate join algorithms (which we describe in Section 2) are based on a specific similarity measure along with a fixed threshold value, and return pairs of records whose similarity is greater than the threshold. The effectiveness of the majority of these algorithms depends on the value of the threshold used. However, there has been little work studying the accuracy of the join operation. The accuracy is known to be dataset-dependent, and there is no common framework for evaluating and comparing the accuracy of different similarity measures and techniques. This makes comparing their accuracy a difficult task. Nevertheless, we argue that it is possible to evaluate the relative performance of different measures for approximate joins by using datasets containing different types of known quality problems, such as typing errors and differences in notation and abbreviations.
In this paper, we present an overview of several similarity measures for approximate string joins using q-grams and thoroughly evaluate their accuracy for different threshold values on datasets with different amounts and types of errors. Our results include:
- We show that for all similarity measures, the value of the threshold that results in the most accurate join depends highly on the type and amount of errors in the data.
- We compare different similarity measures by comparing the maximum accuracy they can achieve on different datasets using different thresholds. Although choosing a proper threshold for a similarity measure without prior knowledge of the data characteristics is known to be a difficult task, our results show which measures can potentially be more accurate, assuming there is a way to determine the best threshold. An interesting direction for future work is therefore to find an algorithm for determining the threshold value for the most accurate measures.
- We show how the amount and type of errors affect the best value of the threshold. An interesting consequence is that many previously proposed algorithms for enhancing the performance of the join operation and making it scalable to large datasets are not effective enough in many scenarios, since their performance depends heavily on choosing a high threshold value, which can result in very low accuracy. This highlights the value of algorithms that are less sensitive to the threshold value and opens another interesting direction for future work: finding algorithms that are both efficient and accurate using the same threshold.
The paper is organized as follows. In Section 2, we review related work on approximate joins. In Section 3, we present our framework for approximate joins and describe the similarity measures used. Section 4 presents a thorough evaluation of these measures, and Section 5 concludes the paper and outlines future directions.
2. RELATED WORK
Approximate join, also known as similarity join or record linkage, has been extensively studied in the literature. Several similarity measures for string data have been proposed [14, 4, 5]. A recent survey [9] presents an excellent overview of different types of string similarity measures. Recently, there has been increasing interest in using measures from the Information Retrieval (IR) field along with q-grams extracted from strings [10, 6, 2, 18, 5]. In this approach, strings are treated as documents and q-grams as tokens within the documents. This makes it possible to take advantage of several indexing techniques, as well as various algorithms proposed for efficient set-similarity joins. Furthermore, these measures can be implemented declaratively over a DBMS with vanilla SQL statements [5].
Various recent works address the efficiency and scalability of similarity join operations for large datasets [6, 2, 18]. Many techniques have been proposed for set-similarity join, which can be used along with q-grams for (string) similarity joins. Most of these techniques are based on creating signatures for sets (strings) to reduce the search space. Some signature generation schemes are derived from dimensionality reduction for the similarity search problem in high-dimensional spaces. One efficient approach uses Locality Sensitive Hashing (LSH) [13] to hash similar sets to the same values with high probability, and is therefore an approximate solution to the problem. Arasu et al. [2] propose algorithms specifically for set-similarity joins that are exact and outperform previous approximation methods in their framework, although the parameters of the algorithms require extensive tuning. Another class of work is based on indexing algorithms, primarily derived from IR optimization techniques. A recent proposal in this area [3] presents algorithms based on novel indexing and optimization strategies that do not rely on approximation or extensive parameter tuning and outperform previous state-of-the-art approaches. More recently, Li et al. [15] propose VGRAM, a technique based on using variable-length grams instead of fixed-length q-grams. At a high level, it can be viewed as an efficient index structure over the collection of strings. VGRAM can be used along with previously proposed signature-based algorithms to significantly improve their efficiency.
Most of the techniques described above mainly address the scalability of the join operation, not its accuracy. The choice of similarity measure is often limited in these algorithms. The signature-based algorithm of [6] also considers accuracy, introducing a novel similarity measure called fuzzy match similarity and creating signatures for it. However, the accuracy of this measure is not compared with that of other measures. In [5], several such similarity measures are benchmarked for approximate selection, a special case of similarity join. Given a relation R, a query string tq, and a similarity predicate sim(), the approximate selection operation reports all tuples t ∈ R such that sim(tq, t) ≥ θ, where θ is a specified numerical similarity threshold. While several predicates are introduced and benchmarked in [5], the extension from approximate selection to approximate joins is not considered, nor is the effect of threshold values on the accuracy of approximate joins.
3. FRAMEWORK AND SIMILARITY MEASURES
In this section, we explain our framework for similarity join. The similarity join of two relations R = {ri : 1 ≤ i ≤ N1} and S = {sj : 1 ≤ j ≤ N2} outputs a set of pairs (ri, sj) ∈ R × S where ri and sj are similar. Two records are considered similar when their similarity score based on a similarity function sim() is above a threshold θ. For the definitions and experiments in this paper, we assume we are performing a self-join on relation R. The output is therefore a set of pairs (ri, rj) ∈ R × R where sim(ri, rj) ≥ θ for some similarity function sim() and threshold θ. This is a common operation in many applications, such as entity resolution and clustering. In keeping with many approximate join methods, we model records as strings. We denote by r the set of q-grams (sequences of q consecutive characters) of a string r. For example, for t = 'db lab', tokenization using 3-grams gives t = {'db ', 'b l', ' la', 'lab'}. In certain cases, a weight may be associated with each token.
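The q-gram tokenization described above can be sketched as a simple sliding window over the string; this is a minimal illustration, not the implementation used in the paper.

```python
def qgrams(s, q=3):
    # All substrings of length q, taken with a sliding window.
    # A string shorter than q yields no grams.
    return [s[i:i + q] for i in range(len(s) - q + 1)]

# The paper's example: 3-grams of 'db lab'
print(qgrams("db lab"))  # → ['db ', 'b l', ' la', 'lab']
```

Note that a list (rather than a set) preserves duplicate grams, which matters for the frequency-based measures discussed later.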
The similarity measures discussed here are those based on q-grams created from strings, combined with a similarity measure that has been shown to be effective in previous work [5]. These measures share one or both of the following properties:
- High scalability: various techniques have been proposed in the literature, as described in Section 2, for enhancing the performance of the similarity join operation using q-grams along with these measures.
- High accuracy: previous work has shown that in most scenarios these measures perform as well as or better than other string similarity measures in terms of accuracy. Specifically, these measures have shown good accuracy in name-matching tasks [8] and in approximate selection [5].
3.1 Edit Similarity
Edit distance is widely used as the measure of choice in many similarity join techniques. In particular, previous work [10] has shown how to use q-grams for an efficient implementation of this measure in a declarative framework. Recent work on enhancing the performance of similarity joins has also proposed techniques for scalable implementation of this measure [2, 15].
The edit distance between two string records r1 and r2 is defined as the transformation cost of r1 to r2, tc(r1, r2), which is the minimum cost of edit operations applied to r1 to transform it into r2. Edit operations include character copy, insert, delete, and substitute [11]. The edit similarity is defined as:

simedit(r1, r2) = 1 − tc(r1, r2) / max{|r1|, |r2|}    (1)

There is a cost associated with each edit operation, and several cost models have been proposed for this measure. The most commonly used model, Levenshtein edit distance, which we refer to simply as edit distance in this paper, uses unit cost for all operations except copy, which has cost zero.
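Under the unit-cost Levenshtein model, equation (1) can be computed with the standard dynamic program; the sketch below is an illustration of the definition, not the paper's implementation.

```python
def edit_distance(r1, r2):
    # Levenshtein distance: unit cost for insert, delete, and
    # substitute; zero cost for copy. Single-row DP.
    m, n = len(r1), len(r2)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                    # delete
                       d[j - 1] + 1,                # insert
                       prev + (r1[i - 1] != r2[j - 1]))  # substitute/copy
            prev = cur
    return d[n]

def edit_similarity(r1, r2):
    # Equation (1): 1 - tc(r1, r2) / max(|r1|, |r2|)
    if not r1 and not r2:
        return 1.0
    return 1.0 - edit_distance(r1, r2) / max(len(r1), len(r2))
```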
3.2 Jaccard and Weighted Jaccard
Jaccard similarity is the fraction of tokens in r1 and r2 that are present in both. Weighted Jaccard similarity is the weighted version of Jaccard similarity, i.e.,

simWJaccard(r1, r2) = Σ_{t∈r1∩r2} wR(t) / Σ_{t∈r1∪r2} wR(t)    (2)

where wR(t) is a weight function that reflects the commonality of token t in the relation R. We choose the RSJ (Robertson-Sparck Jones) weight for the tokens, which was shown to be more effective than the commonly used Inverse Document Frequency (IDF) weight [5]:

wR(t) = log((N − nt + 0.5) / (nt + 0.5))    (3)

where N is the number of tuples in the base relation R and nt is the number of tuples in R containing the token t.
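Equations (2) and (3) can be sketched directly over q-gram sets; this is an illustrative reimplementation of the stated definitions, not the paper's code.

```python
import math
from collections import Counter

def jaccard(g1, g2):
    # Fraction of tokens present in both records.
    return len(g1 & g2) / len(g1 | g2) if (g1 | g2) else 0.0

def rsj_weights(relation_grams):
    # relation_grams: one q-gram set per tuple in relation R.
    # Equation (3): wR(t) = log((N - nt + 0.5) / (nt + 0.5)).
    # Note the weight is negative for tokens in more than half the tuples.
    N = len(relation_grams)
    df = Counter(t for grams in relation_grams for t in set(grams))
    return {t: math.log((N - n + 0.5) / (n + 0.5)) for t, n in df.items()}

def weighted_jaccard(g1, g2, w):
    # Equation (2): weight of shared grams over weight of the union.
    inter = sum(w.get(t, 0.0) for t in g1 & g2)
    union = sum(w.get(t, 0.0) for t in g1 | g2)
    return inter / union if union else 0.0
```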
3.3 Measures from IR
A well-studied problem in information retrieval is, given a query and a collection of documents, to return the documents most relevant to the query. In the measures in this section, records are treated as documents and q-grams as words (tokens) of the documents. The same techniques used for finding documents relevant to a query can therefore be used to return records similar to a query string. In the rest of this section, we present three measures that previous work has shown to perform well for the approximate selection problem [5].
3.3.1 Cosine w/tf-idf
The tf-idf cosine similarity is a well-established measure in the IR community which leverages the vector space model. This measure determines the closeness of the input strings r1 and r2 by first transforming the strings into unit vectors and then measuring the angle between the corresponding vectors. The cosine similarity with tf-idf weights is given by:

simCosine(r1, r2) = Σ_{t∈r1∩r2} wr1(t) · wr2(t)    (4)

where wr1(t) and wr2(t) are the normalized tf-idf weights of each common token in r1 and r2, respectively. The normalized tf-idf weight of token t in a string record r is defined as follows:

wr(t) = w′r(t) / √(Σ_{t′∈r} w′r(t′)²),    w′r(t) = tfr(t) · idf(t)

where tfr(t) is the term frequency of token t within string r and idf(t) is the inverse document frequency with respect to the entire relation R.
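A minimal sketch of equation (4) follows. The choice idf(t) = log(N / nt) is an assumption for illustration (the extracted text does not spell out the exact idf formula); the vectors are L2-normalized so that the dot product over shared tokens gives the cosine.

```python
import math
from collections import Counter

def tfidf_vectors(records):
    # records: list of q-gram lists, one per tuple in R.
    # Builds L2-normalized tf-idf vectors; idf(t) = log(N / nt)
    # is one standard choice, assumed here for illustration.
    N = len(records)
    df = Counter(t for r in records for t in set(r))
    idf = {t: math.log(N / n) for t, n in df.items()}
    vecs = []
    for r in records:
        tf = Counter(r)
        v = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(x * x for x in v.values()))
        vecs.append({t: x / norm for t, x in v.items()} if norm else v)
    return vecs

def cosine(v1, v2):
    # Equation (4): dot product over the shared tokens of unit vectors.
    return sum(w * v2[t] for t, w in v1.items() if t in v2)
```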
3.3.2 BM25
The BM25 similarity score for a query r1 and a string record r2 is defined as follows:

simBM25(r1, r2) = Σ_{t∈r1∩r2} ŵr1(t) · wr2(t)    (5)

where

ŵr1(t) = ((k3 + 1) · tfr1(t)) / (k3 + tfr1(t))
wr2(t) = w(1)R(t) · ((k1 + 1) · tfr2(t)) / (K(r2) + tfr2(t))

and w(1)R(t) is the RSJ weight:

w(1)R(t) = log((N − nt + 0.5) / (nt + 0.5))
K(r) = k1 · ((1 − b) + b · |r| / avgrl)

where tfr(t) is the frequency of token t in string record r, |r| is the number of tokens in r, avgrl is the average number of tokens per record, N is the number of records in the relation R, nt is the number of records containing the token t, and k1, k3, and b are independent parameters. We set these parameters based on the TREC-4 experiments [17], with k1 ∈ [1, 2], k3 = 8, and b ∈ [0.6, 0.75].
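The BM25 score can be sketched as below, following the standard Okapi form of Robertson et al.; the parameter values k1 = 1.2 and b = 0.75 are illustrative choices within the stated ranges, and the whole block is an assumption-laden reimplementation, not the paper's code.

```python
import math
from collections import Counter

def bm25_score(q_grams, r_grams, df, N, avgrl, k1=1.2, k3=8.0, b=0.75):
    # Standard Okapi BM25:
    #   w-hat_q(t) = (k3 + 1) * tf_q(t) / (k3 + tf_q(t))
    #   w_r(t)     = w1_R(t) * (k1 + 1) * tf_r(t) / (K(r) + tf_r(t))
    #   K(r)       = k1 * ((1 - b) + b * |r| / avgrl)
    # df: document frequency of each gram; N: number of records.
    tf_q, tf_r = Counter(q_grams), Counter(r_grams)
    K = k1 * ((1 - b) + b * len(r_grams) / avgrl)
    score = 0.0
    for t in tf_q.keys() & tf_r.keys():
        n_t = df.get(t, 0)
        w1 = math.log((N - n_t + 0.5) / (n_t + 0.5))  # RSJ weight
        w_hat = (k3 + 1) * tf_q[t] / (k3 + tf_q[t])
        w_r = w1 * (k1 + 1) * tf_r[t] / (K + tf_r[t])
        score += w_hat * w_r
    return score
```

Because of the RSJ weight, tokens occurring in more than half the relation contribute negatively, which is why BM25 scores are not confined to [0, 1].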
3.3.3 Hidden Markov Model
Approximate string matching can be modeled by a discrete Hidden Markov process, which has shown better performance than Cosine w/tf-idf in the IR literature [16] and both high accuracy and good running time for approximate selection [5]. This particular Markov model consists of only two states: the first state models the tokens that are specific to one particular "String", and the second state models the tokens of "General English", i.e., tokens that are common to many records. Refer to [5] and [16] for a complete description of the model and possible extensions.
The HMM similarity function takes two string records r1 and r2 and returns the probability of generating r1 given that r2 is a similar record:

simHMM(r1, r2) = Π_{t∈r1} (a0 · P(t|GE) + a1 · P(t|r2))    (6)

where a0 and a1 = 1 − a0 are the transition probabilities of the Markov model, and P(t|GE) and P(t|r2) are given by:

P(t|r2) = (number of times t appears in r2) / |r2|
P(t|GE) = (Σ_{r∈R} number of times t appears in r) / (Σ_{r∈R} |r|)
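Equation (6) can be sketched as the following two-state mixture; the value a0 = 0.2 is an illustrative choice, not a parameter reported in the paper.

```python
from collections import Counter

def hmm_similarity(q_grams, r_grams, corpus_counts, corpus_len, a0=0.2):
    # Equation (6): product over query tokens of
    #   a0 * P(t | GE) + a1 * P(t | r2),  with a1 = 1 - a0.
    # P(t | r2): relative frequency of t in the record;
    # P(t | GE): relative frequency of t over the whole relation,
    # where corpus_counts maps gram -> total count and corpus_len
    # is the total number of grams in the relation.
    a1 = 1.0 - a0
    tf_r = Counter(r_grams)
    score = 1.0
    for t in q_grams:
        p_rec = tf_r[t] / len(r_grams) if r_grams else 0.0
        p_ge = corpus_counts.get(t, 0) / corpus_len
        score *= a0 * p_ge + a1 * p_rec
    return score
```

In practice the product is often computed in log space to avoid underflow on long records; the plain product is kept here for readability.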
3.4 Hybrid Measures
The implementation of these measures involves two similarity functions: one that compares the strings by comparing their word tokens, and another, more suitable for short strings, that is used to compare the word tokens themselves.
3.4.1 GES
The generalized edit similarity (GES) [7], a modified version of the fuzzy match similarity presented in [6], takes two strings r1 and r2, tokenizes them into sets of words, and assigns a weight w(t) to each token. GES defines the similarity between the two strings in terms of the minimum transformation cost required to convert r1 to r2:

simGES(r1, r2) = 1 − min{tc(r1, r2) / wt(r1), 1.0}    (7)

where wt(r1) is the sum of the weights of all tokens in r1 and tc(r1, r2) is the cost of a sequence of the following transformation operations:
- token insertion: inserting a token t into r1 with cost w(t) · cins, where cins is an insertion factor constant in the range 0 to 1 (in our experiments, cins = 1);
- token deletion: deleting a token t from r1 with cost w(t);
- token replacement: replacing a token t1 by t2 in r1 with cost (1 − simedit(t1, t2)) · w(t1), where simedit is the edit similarity of t1 and t2.
3.4.2 SoftTFIDF
SoftTFIDF is another hybrid measure, proposed by Cohen et al. [8], which relies on the normalized tf-idf weights of word tokens and can work with an arbitrary similarity function to compare word tokens. In this measure, the similarity score simSoftTFIDF is defined as follows:

simSoftTFIDF(r1, r2) = Σ_{t1∈C(θ,r1,r2)} w(t1, r1) · w(argmax_{t2∈r2} sim(t1, t2), r2) · max_{t2∈r2} sim(t1, t2)    (8)

where w(t, r) is the normalized tf-idf weight of word token t in record r and C(θ, r1, r2) returns the set of tokens t1 ∈ r1 such that there exists t2 ∈ r2 with sim(t1, t2) > θ, for some similarity function sim() suitable for comparing word strings. In our experiments, sim(t1, t2) is the Jaro-Winkler similarity, as suggested in [8].
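A sketch of the SoftTFIDF definition follows. As a stand-in for the Jaro-Winkler similarity used in the paper, it defaults to Python's difflib ratio; the weight dictionaries w1 and w2 are assumed to hold the normalized tf-idf weights of each record's word tokens.

```python
from difflib import SequenceMatcher

def soft_tfidf(r1_words, r2_words, w1, w2, theta=0.9,
               sim=lambda a, b: SequenceMatcher(None, a, b).ratio()):
    # For each token t1 of r1 whose best match in r2 exceeds theta
    # (i.e. t1 is in C(theta, r1, r2)), add
    #   w(t1, r1) * w(best_match, r2) * sim(t1, best_match).
    # difflib's ratio() is a stand-in for Jaro-Winkler.
    score = 0.0
    for t1 in set(r1_words):
        best, s_best = None, 0.0
        for t2 in set(r2_words):
            s = sim(t1, t2)
            if s > s_best:
                best, s_best = t2, s
        if best is not None and s_best > theta:
            score += w1.get(t1, 0.0) * w2.get(best, 0.0) * s_best
    return score
```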
4. EVALUATION
4.1 Datasets
In order to evaluate the effectiveness of the similarity measures described in the previous section, we use the same datasets as [5]. These datasets were created using a modified version of the UIS data generator, which has previously been used for the evaluation of data cleaning and record linkage techniques [12, 1]. The data generator can inject several types of errors into a clean database of string attributes. These errors include commonly occurring typing mistakes (edit errors: character insertion, deletion, replacement, and swap), token swaps, and abbreviation errors (e.g., replacing Inc. with Incorporated and vice versa).
The data generator has several parameters to control the injected errors, such as the size of the dataset to be generated, the distribution of duplicates (uniform, Zipfian, or Poisson), the percentage of erroneous duplicates, the extent of error injected in each string, and the percentage of each type of error.

                      Percentage of
Group    Name  Erroneous    Errors in    Token  Abbr.
               Duplicates   Duplicates   Swap   Error
Dirty    D1    90           30           20     50
         D2    50           30           20     50
Medium   M1    30           30           20     50
         M2    10           30           20     50
         M3    90           10           20     50
         M4    50           10           20     50
Low      L1    30           10           20     50
         L2    10           10           20     50
Single   AB    50           0            0      50
Error    TS    50           0            20     0
         EDL   50           10           0      0
         EDM   50           20           0      0
         EDH   50           30           0      0

Table 1: Datasets Used in the Experiments

The data generator keeps track of the duplicate records by assigning a cluster ID to each clean record and to all duplicates generated from that clean record.
For the results presented in this paper, the datasets were generated from a clean dataset of 2139 company names with an average record length of 21.03 characters and an average of 2.9 words per record. The errors in the datasets have a uniform distribution. For each dataset, on average 5000 dirty records were created out of 500 clean records. We have also run experiments on datasets generated with different parameters. For example, we generated data using a Zipfian distribution, used data from another clean source (DBLP titles) as in [5], and created larger datasets. For these other datasets, the accuracy trends remain the same. Table 1 describes all the datasets used for the results in this paper. We use 8 datasets with mixed types of errors (edit errors, token swaps, and abbreviation replacements). Moreover, we use 5 datasets with only a single type of error (edit, token swap, or abbreviation errors) to measure the effect of each type of error individually. Following [5], we believe the errors in these datasets are highly representative of the common types of errors in databases of string attributes.
4.2 Measures
We use well-known measures from IR, namely precision, recall, and F1, for different values of the threshold to evaluate the accuracy of the similarity join operation. We perform a self-join on the input table using a similarity measure with a fixed threshold θ. Precision (Pr) is the percentage of truly similar records among the records that have similarity score above the threshold θ. In our datasets, similar records are marked with the same cluster ID, as described above. Recall (Re) is the ratio of the number of truly similar records with similarity score above the threshold θ to the total number of similar records. A join that returns all pairs of records of the two input tables has low (near zero) precision and recall 1. A join that returns an empty answer has precision 1 and recall 0. The F1 measure is the harmonic mean of precision and recall, i.e.,

F1 = (2 × Pr × Re) / (Pr + Re)    (9)
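The evaluation described above reduces to set comparisons between the reported pairs and the ground-truth pairs (those sharing a cluster ID). A minimal sketch, assuming pairs are represented as order-normalized tuples:

```python
def pr_re_f1(reported_pairs, true_pairs):
    # reported_pairs: record pairs with similarity above theta.
    # true_pairs: pairs of records sharing a cluster ID (ground truth).
    # Both are sets of tuples normalized to a canonical order.
    tp = len(reported_pairs & true_pairs)
    pr = tp / len(reported_pairs) if reported_pairs else 1.0
    re = tp / len(true_pairs) if true_pairs else 1.0
    # Equation (9): harmonic mean of precision and recall.
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1
```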
We measure precision, recall, and F1 for different values of the similarity threshold θ. For the comparison of different similarity measures, we use the maximum F1 score across different thresholds.

Figure 3: Maximum F1 score for different measures on datasets with only edit errors
4.3 Results
Figures 1 and 2 show the precision, recall, and F1 values for all measures described in Section 3 over the datasets with mixed types of errors. For all measures except HMM and BM25, the horizontal axis of the precision/recall graph is the value of the threshold. For HMM and BM25, the horizontal axis is the percentage of the maximum value of the threshold, since these measures do not return a score between 0 and 1.
Effect of amount of errors. As shown in the precision/recall curves in Figures 1 and 2, the "dirtiness" of the input data greatly affects the value of the threshold that results in the most accurate join. For all measures, a lower threshold is needed as the degree of error in the data increases. For example, Weighted Jaccard achieves its best F1 score over the dirtiest datasets at threshold 0.3, while it achieves its best F1 for the cleanest datasets at threshold 0.55. BM25 and HMM are less sensitive and work well on both the dirty and clean groups of datasets with the same threshold value. We discuss below how the degree of error in the data affects the choice of the most accurate measure.
Effect of types of errors. Figure 3 shows the maximum F1 score across threshold values for the different measures on datasets containing only edit errors (the EDL, EDM, and EDH datasets). The figure shows that Weighted Jaccard and Cosine have the highest accuracy, followed by Jaccard and edit similarity, on the low-error dataset EDL. As the amount of edit error per record increases, HMM performs as well as Weighted Jaccard, while Jaccard, edit similarity, and GES perform much worse at high edit error rates. Considering that edit similarity is specifically designed to capture edit errors, this shows the effectiveness of Weighted Jaccard and its robustness to varying amounts of edit error. Figure 4 shows the effect of token swap and abbreviation errors on the accuracy of the different measures. This experiment indicates that edit similarity is not capable of modeling these types of errors. HMM, BM25, and Jaccard also fail to model abbreviation errors properly.
Comparison of measures. Figure 5 shows the maximum F1 score across threshold values for the different measures on the dirty, medium, and clean groups of datasets. (Here we have aggregated the results for all the dirty datasets together, and likewise for the moderately dirty, or medium, datasets and for the clean datasets.) The results show the effectiveness and robustness of Weighted Jaccard and Cosine in comparison with the other measures. Again, HMM is among the most accurate measures when the data is extremely dirty, but has relatively low accuracy when the percentage of error in the data is low.

Figure 4: Maximum F1 score for different measures on datasets with only token swap and abbreviation errors
Figure 5: Maximum F1 score for different measures on the clean, medium, and dirty groups of datasets
Remark. As stated in Section 2, the performance of many algorithms proposed for improving the scalability of the join operation depends heavily on the value of the similarity threshold used for the join. Here we show the accuracy numbers on our datasets using the threshold values that make these algorithms effective. Specifically, we address the results in [2], although similar observations can be made for the results of other similar work in this area. Table 2 shows the F1 values for the thresholds that result in the best accuracy on our datasets and the best performance in the experimental results of [2]. The PartEnum and WtEnum algorithms presented in [2] significantly outperform previous algorithms at threshold 0.9, but have roughly the same performance as previously proposed algorithms such as LSH when a threshold of 0.8 or less is used. The results in Table 2 show that there is a big gap between the threshold value that results in the most accurate join on our datasets and the threshold that makes PartEnum and WtEnum effective in the performance studies of [2].
5. CONCLUSION
We have presented an overview of several similarity measures for efficient approximate string joins and thoroughly evaluated their accuracy on several datasets with different characteristics and common quality problems. Our results show the effect of the amount and type of errors in the datasets, and of the similarity threshold used, on the accuracy of the join operation. Considering that the effectiveness of many algorithms proposed for enhancing the scalability of approximate joins relies on the value chosen for the similarity threshold, our results highlight the value of algorithms that are less sensitive to the threshold value, and open an interesting direction for future work: finding algorithms that are both efficient and accurate using the same threshold. Finding an algorithm that determines the best threshold value, regardless of the type and amount of errors, for the similarity measures that showed the highest accuracy in our work is another interesting subject for future work.

                 Jaccard Join                      Weighted Jaccard Join
        Threshold                 F1       Threshold                 F1
Dirty   0.5  (Best Acc.)          0.293    0.3  (Best Acc.)          0.528
        0.8                       0.249    0.8                       0.249
        0.85                      0.248    0.85                      0.246
        0.9  (Best Performance)   0.247    0.9  (Best Performance)   0.244
Medium  0.65 (Best Acc.)          0.719    0.55 (Best Acc.)          0.776
        0.8                       0.611    0.8                       0.581
        0.85                      0.571    0.85                      0.581
        0.9  (Best Performance)   0.548    0.9  (Best Performance)   0.560
Clean   0.7  (Best Acc.)          0.887    0.55 (Best Acc.)          0.929
        0.8                       0.854    0.8                       0.831
        0.85                      0.831    0.85                      0.819
        0.9  (Best Performance)   0.812    0.9  (Best Performance)   0.807

Table 2: F1 scores for the thresholds that result in the best running time in previous performance studies and the highest accuracy on our datasets, for two selected similarity measures
REFERENCES
[1] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE'06.
[2] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact
set-similarity joins. In VLDB’06.
[3] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all
pairs similarity search. In WWW’07.
[4] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and
S. Fienberg. Adaptive name matching in information
integration. IEEE Intelligent Systems, 18(5), 2003.
[5] A. Chandel, O. Hassanzadeh, N. Koudas, M. Sadoghi,
and D. Srivastava. Benchmarking declarative
approximate selection predicates. In SIGMOD’07.
[6] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani.
Robust and efficient fuzzy match for online data
cleaning. In SIGMOD’03.
[7] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE'06.
[8] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A
comparison of string distance metrics for
name-matching tasks. In IIWeb’03.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios.
Duplicate record detection: A survey. IEEE TKDE,
19(1), 2007.
[10] L. Gravano, P. G. Ipeirotis, H. V. Jagadish,
N. Koudas, S. Muthukrishnan, and D. Srivastava.
Approximate string joins in a database (almost) for
free. In VLDB’01.
[11] D. Gusfield. Algorithms on strings, trees, and
sequences: computer science and computational
biology. Cambridge University Press, New York, NY,
USA, 1997.
[12] M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9–37, 1998.
[13] P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-preserving hashing in multidimensional spaces. In STOC'97.
[14] N. Koudas and D. Srivastava. Approximate joins:
Concepts and techniques. In VLDB’05 Tutorial.
[15] C. Li, B. Wang, and X. Yang. Vgram: Improving
performance of approximate queries on string
collections using variable-length grams. In VLDB’07.
[16] D. R. H. Miller, T. Leek, and R. M. Schwartz. A hidden markov model information retrieval system. In SIGIR'99.
[17] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, M. Gatford, and A. Payne. Okapi at TREC-4. In TREC-4, 1995.
[18] S. Sarawagi and A. Kirpal. Efficient set joins on
similarity predicates. In SIGMOD’04.
Figure 1: Accuracy of Edit Similarity, Jaccard, and Weighted Jaccard measures relative to the value of the threshold on different datasets. Panels for each measure: (a) low-error, (b) medium-error, (c) dirty datasets.
Figure 2: Accuracy of the measures from IR (including Cosine w/tf-idf) and the hybrid measures relative to the value of the threshold on different datasets. Panels for each measure: (a) low-error, (b) medium-error, (c) dirty datasets.
... An algorithm has been proposed for each phase and explained properly through flow charts and diagrams. [17,18] using Fuzzy Match Similarity (FMS) [10,11] to find duplicity in the name fields. ...
... For if they can be transformed to another string that is under consideration within some reasonable cost then they could be considered as potential duplicates. A variety of string distance metrics can be found in literature [8,18,19]. These step calculations the cost of transforming a string (in this context, names) to another which is referred as the transformation cost. ...
... The algorithm for token/Q-gram formation is given in Fig. 8 (Token/Q-Gram Formation). Q-grams, also called n-grams [8,10,18], are substrings of length q of longer strings, where q can be any value smaller than the length of the string. ...
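The q-gram formation this excerpt describes can be sketched in a few lines of Python. This is a minimal illustration, not the cited algorithm; the `#` padding character is an assumed (though common) convention used so that characters at the string boundaries also appear in q-grams:

```python
def qgrams(s, q=2, pad="#"):
    """Return the list of q-grams (substrings of length q) of s.

    The string is padded on both sides so that characters at its
    boundaries appear in as many q-grams as interior characters;
    the pad character '#' is an arbitrary, commonly used choice.
    """
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

print(qgrams("john", q=2))  # → ['#j', 'jo', 'oh', 'hn', 'n#']
```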
Data cleansing is a process that deals with identification of corrupt and duplicate data inherent in the data sets of a data warehouse to enhance the quality of data. This paper aims to facilitate the data cleaning process by addressing the problem of duplicate record detection pertaining to the "name" attributes of the data sets. It provides a sequence of algorithms through a novel framework for identifying duplicates in the "name" attribute of the data sets of an already existing data warehouse. The key features of the research include its proposal of a novel framework through a well-defined sequence of algorithms and refining the application of alliance rules [1] by incorporating the use of previously existing and well-defined similarity computation measures. The results depicted show the feasibility and validity of the suggested method.
... Despite the large, and growing, number of duplicate detection techniques, the research literature comparing their quality is surprisingly sparse. There are studies and surveys comparing the similarity measures used within these techniques [16, 29, 31]. However, to the best of our knowledge there are no comprehensive empirical studies that evaluate the quality of the grouping or clustering employed by these techniques. ...
... For a thorough evaluation of the clustering algorithms, it is essential to have datasets of varying sizes, error types and distributions, and for which the ground truth is known. For the experiments in this paper, we use datasets generated by a publicly available version of the widely used UIS database generator which has been effectively used in the past to evaluate different approximate selection and join predicates used within duplicate detection [29, 31]. We follow the best practice guidelines from information retrieval and data management to generate realistic errors in string data. ...
... Similarity Function There are a large number of similarity measures for string data that can be used in the similarity join. Based on the comparison of several such measures in [31], we use weighted Jaccard similarity along with q-gram tokens (substrings of length q of the strings) as the measure of choice due to its relatively high efficiency and accuracy compared with other measures. Jaccard similarity is the fraction of tokens in r1 and r2 that are present in both. ...
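The Jaccard and weighted Jaccard predicates mentioned in this excerpt can be sketched as follows. This is a simplified illustration: the `#` padding and the default token weight of 1.0 are our assumptions, and in practice the token weights are typically IDF-style, as the excerpt's source describes:

```python
def qgram_set(s, q=2):
    """Set of q-gram tokens of s, with '#' padding at the boundaries."""
    s = "#" * (q - 1) + s + "#" * (q - 1)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(r1, r2, q=2):
    """Fraction of q-gram tokens of r1 and r2 that are present in both."""
    t1, t2 = qgram_set(r1, q), qgram_set(r2, q)
    return len(t1 & t2) / len(t1 | t2)

def weighted_jaccard(r1, r2, weights, q=2):
    """Like jaccard, but each token contributes weights[token]
    (e.g. an IDF-style weight) instead of 1; unseen tokens get 1.0."""
    t1, t2 = qgram_set(r1, q), qgram_set(r2, q)
    total = lambda toks: sum(weights.get(t, 1.0) for t in toks)
    return total(t1 & t2) / total(t1 | t2)

print(jaccard("smith", "smyth"))  # → 0.5
```

Rare q-grams given a high weight dominate the weighted score, which is why weighted Jaccard tolerates errors in common substrings better than the unweighted version.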
The presence of duplicate records is a major data quality concern in large databases. To detect duplicates, entity resolution, also known as duplication detection or record linkage, is used as a part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system that provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general-purpose duplication detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques. Our work is motivated by the recent significant advancements that have made approximate join algorithms highly scalable. Our extensive evaluation reveals that some clustering algorithms that have never been considered for duplicate detection perform extremely well in terms of both accuracy and scalability.
... Several string-join algorithms have been proposed, as recently surveyed in [73], and their performance was compared in [9], [13], [26]. In our implementation, we used MassJoin [19] for finding similar tokens. ...
This work tackles the problem of fuzzy joining of strings that naturally tokenize into meaningful substrings, e.g., full names. Tokenized-string joins have several established applications in the context of data integration and cleaning. This work is primarily motivated by fraud detection, where attackers slightly modify tokenized strings, e.g., names on accounts, to create numerous identities that they can use to defraud service providers, e.g., Google and LinkedIn. To detect such attacks, all the accounts are pair-wise compared, and the resulting similar accounts are considered suspicious and are further investigated. Comparing the tokenized-string features of a large number of accounts requires an intuitive tokenized-string distance that can detect subtle edits introduced by an adversary, and a very scalable algorithm. This is not achievable by existing distance measures, which are unintuitive, hard to tune, and whose join algorithms are serial and hence unscalable. We define a novel intuitive distance measure between tokenized strings, Normalized Setwise Levenshtein Distance (NSLD). To the best of our knowledge, NSLD is the first metric proposed for comparing tokenized strings. We propose a scalable distributed framework, Tokenized-String Joiner (TSJ), that adopts existing scalable string-join algorithms as building blocks to perform NSLD-joins. We carefully engineer optimizations and approximations that dramatically improve the efficiency of TSJ. The effectiveness of the TSJ framework is evident from the evaluation conducted on tens of millions of tokenized-string names from Google accounts. The superiority of the tokenized-string-specific TSJ framework over general-purpose metric-space join algorithms has been established.
... In addition to classical TFIDF, there are many other lexical measures, for instance the q-gram based measures [35] and the semantic similarity measures [36]. The former often heavily rely on the threshold used, including variations of TFIDF such as the tf-idf cosine measure. ...
Background: The goal of ontology matching is to identify correspondences between entities from different yet overlapping ontologies so as to facilitate semantic integration, reuse and interoperability. As a well-developed mathematical model for analyzing individuals and structuring concepts, Formal Concept Analysis (FCA) has been applied to ontology matching (OM) tasks since the beginning of OM research, although the ontological knowledge exploited in FCA-based methods has been limited. This motivates the study in this paper, i.e., to empower FCA with as much ontological knowledge as possible for identifying mappings across ontologies. Methods: We propose a method based on Formal Concept Analysis to identify and validate mappings across ontologies, including one-to-one mappings, complex mappings and correspondences between object properties. Our method, called FCA-Map, incrementally generates a total of five types of formal contexts and extracts mappings from the lattices derived. First, the token-based formal context describes how class names, labels and synonyms share lexical tokens, leading to lexical mappings (anchors) across ontologies. Second, the relation-based formal context describes how classes are in taxonomic, partonomic and disjoint relationships with the anchors, leading to positive and negative structural evidence for validating the lexical matching. Third, the positive relation-based context can be used to discover structural mappings. Afterwards, the property-based formal context describes how object properties are used in axioms to connect anchor classes across ontologies, leading to property mappings. Last, the restriction-based formal context describes co-occurrence of classes across ontologies in anonymous ancestors of anchors, from which extended structural mappings and complex mappings can be identified.
Results: Evaluation on the Anatomy, the Large Biomedical Ontologies, and the Disease and Phenotype track of the 2016 Ontology Alignment Evaluation Initiative campaign demonstrates the effectiveness of FCA-Map and its competitiveness with the top-ranked systems. FCA-Map can achieve a better balance between precision and recall for large-scale domain ontologies through constructing multiple FCA structures, whereas it performs unsatisfactorily for smaller-sized ontologies with less lexical and semantic expressions. Conclusions: Compared with other FCA-based OM systems, the study in this paper is more comprehensive as an attempt to push the envelope of the Formal Concept Analysis formalism in ontology matching tasks. Five types of formal contexts are constructed incrementally, and their derived concept lattices are used to cluster the commonalities among classes at lexical and structural level, respectively. Experiments on large, real-world domain ontologies show promising results and reveal the power of FCA.
The large number of potential applications from bridging web data with knowledge bases have led to an increase in the entity linking research. Entity linking is the task to link entity mentions in text with their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge base population. However, this task is challenging due to name variations and entity ambiguity. In this survey, we present a thorough overview and analysis of the main approaches to entity linking, and discuss various applications, the evaluation of entity linking systems, and future directions.
Entity resolution (ER) is to find the data objects referring to the same real-world entity. When ER is performed on relations, the crucial operator is record matching, which is to judge whether two tuples refer to the same real-world entity. Record matching is a longstanding issue. However, with massive and complex data in applications, current methods cannot satisfy the requirements. A Sequence-rule-based record matching (SeReMatching) is presented with the consideration of both which attributes should be used and their importance in record matching. We have changed the Bloom filter and therefore the checking speed is greatly increased. The best performance of the algorithm makes the complexity of entity resolution O(n). And extensive experiments were performed to evaluate our methods.
Entity resolution (ER) is to find the data objects referring to the same real-world entity. When ER is performed on relations, the crucial operator is record matching, which is to judge whether two tuples refer to the same real-world entity. Record matching is a longstanding issue. However, with massive and complex data in applications, current methods cannot satisfy the requirements. A Sequence-rule-based record matching (SeReMatching) is presented with the consideration of both the values of the attributes and their importance in record matching. With the help of the Bloom filter we modified, the algorithm greatly increases the checking speed and makes the complexity of entity resolution almost O(n). Extensive experiments are performed to evaluate our methods.
Many data sets of interest today are best described as networks or graphs of interlinked entities. Examples include Web and text collections, social networks and social media sites, information, transaction and communication networks, and all manner of scientific networks, including biological networks. Unfortunately, often the data collection and extraction process for gathering these network data sets is imprecise, noisy, and/or incomplete. In this chapter, we review a collection of link mining algorithms that are well suited to analyzing and making inferences about networks, especially in the case where the data is noisy or missing.
The Linking Open Data community project is extending the Web by encouraging the creation of interlinks (RDF links between data items from different datasets identified using dereferenceable URIs). This emerging Web of linked data is closely intertwined with the existing Web, since structured data items can be embedded into Web documents (e.g., RDFa or microformat encoded data), and RDF links can reference classic Web pages. Abundant linked data justifies extending the capabilities of Web browsers and search engines, and enables new usage scenarios, novel applications and sophisticated mashups. This promising direction for publishing data on the Web brings forward a number of challenges. While existing data management techniques can be leveraged to address the challenges, there are unique aspects to managing Web scale interlinking. In this presentation, we describe two specific challenges: achieving and managing dense interlinking, and describing data, metadata, and interlinking within and among linked Web datasets. Approaches to solve these challenges are showcased in the context of LinkedMDB, the first open linked dataset for movies. LinkedMDB has a large number of interlinks (over a quarter of a million) to other open linked datasets, as well as RDF links to movie-related Webpages.
Many applications need to solve the following problem of approximate string matching: from a collection of strings, how to find those similar to a given string, or the strings in another (possibly the same) collection of strings? Many algorithms are developed using fixed-length grams, which are substrings of a string used as signatures to identify similar strings. In this paper we develop a novel technique, called VGRAM, to improve the performance of these algorithms. Its main idea is to judiciously choose high-quality grams of variable lengths from a collection of strings to support queries on the collection. We give a full specification of this technique, including how to select high-quality grams from the collection, how to generate variable-length grams for a string based on the preselected grams, and what is the relationship between the similarity of the gram sets of two strings and their edit distance. A primary advantage of the technique is that it can be adopted by a plethora of approximate string algorithms without the need to modify them substantially. We present our extensive experiments on real data sets to evaluate the technique, and show the significant performance improvements on three existing algorithms.
The quality of the data residing in information repositories and databases gets degraded due to a multitude of reasons. Such reasons include typing mistakes during insertion (e.g., character transpositions), lack of standards for recording database fields (e.g., addresses), and various errors introduced by poor database design (e.g., missing integrity constraints). Data of poor quality can result in significant impediments to popular business practices: sending products or bills to incorrect addresses, inability to locate customer records during service calls, inability to correlate customers across multiple services, etc.
Given two input collections of sets, a set-similarity join (SSJoin) identifies all pairs of sets, one from each collection, that have high similarity. Recent work has identified SSJoin as a useful primitive operator in data cleaning. In this paper, we propose new algorithms for SSJoin. Our algorithms have two important features: they are exact, i.e., they always produce the correct answer, and they carry precise performance guarantees. We believe our algorithms are the first to have both features; previous algorithms with performance guarantees are only probabilistically approximate. We demonstrate the effectiveness of our algorithms using a thorough experimental evaluation over real-life and synthetic data sets.
Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications. Several similarity predicates have been proposed in the past for common quality primitives (approximate selections, joins, etc.) and have been fully expressed using declarative SQL statements. In this thesis, new similarity predicates are proposed along with their declarative realization, based on notions of probabilistic information retrieval. Then, full declarative specifications of previously proposed similarity predicates in the literature are presented, grouped into classes according to their primary characteristics. Finally, a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations is performed.
The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent “equational theory” that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining results of individual passes using transitive closure over the independent results produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates, especially in an environment with massive amounts of data. This paper details improvements in our system, and reports on the successful implementation for a real-world database that conclusively validates our results previously achieved for statistically generated data.
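The pass-combination step described above, taking the matches found by independent sorted passes and closing them transitively, can be illustrated with a standard union-find sketch. This is our own illustration of the transitive-closure idea, not the paper's implementation; the match-pair lists are hypothetical inputs:

```python
def transitive_closure(n, match_pairs_per_pass):
    """Merge duplicate groups found by independent passes over n
    records using union-find, so matches are combined transitively."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for pairs in match_pairs_per_pass:
        for a, b in pairs:
            parent[find(a)] = find(b)       # union the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Pass 1 matched (0, 1) and pass 2 matched (1, 2); transitive
# closure puts records 0, 1 and 2 in one group.
print(transitive_closure(4, [[(0, 1)], [(1, 2)]]))  # → [[0, 1, 2], [3]]
```

Because the closure is transitive, two records never compared directly in any pass can still end up in the same group, which is how the multi-pass approach recovers matches that a single sort key would miss.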
The detection of duplicate tuples, corresponding to the same real-world entity, is an important task in data integration and cleaning. While many techniques exist to identify such tuples, the merging or elimination of duplicates can be a difficult task that relies on ad-hoc and often manual solutions. We propose a complementary approach that permits declarative query answering over duplicated data, where each duplicate is associated with a probability of being in the clean database. We rewrite queries over a database containing duplicates to return each answer with the probability that the answer is in the clean database. Our rewritten queries are sensitive to the semantics of duplication and help a user understand which query answers are most likely to be present in the clean database. The semantics that we adopt is independent of the way the probabilities are produced, but is able to effectively exploit them during query answering. In the absence of external knowledge that associates each database tuple with a probability, we offer a technique, based on tuple summaries, that automates this task. We experimentally study the performance of our rewritten queries. Our studies show that the rewriting does not introduce a significant overhead in query execution time. This work is done in the context of the ConQuer project at the University of Toronto, which focuses on the efficient management of inconsistent and dirty databases.
In this paper we present an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance. This expands the existing suite of algorithms for set joins on simpler predicates such as set containment, equality and non-zero overlap. We start with a basic inverted index based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.
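The basic inverted-index probing method that this paper takes as its starting point can be sketched as follows for a plain overlap (intersect-size) predicate. This is an unoptimized illustration with names of our choosing, not the paper's optimized algorithm:

```python
from collections import defaultdict

def overlap_set_join(sets_a, sets_b, min_overlap):
    """Return pairs (i, j) with |sets_a[i] & sets_b[j]| >= min_overlap,
    found by probing an inverted index instead of comparing all pairs."""
    # Inverted index: token -> ids of the sets in sets_b containing it.
    index = defaultdict(list)
    for j, s in enumerate(sets_b):
        for tok in s:
            index[tok].append(j)

    results = []
    for i, s in enumerate(sets_a):
        counts = defaultdict(int)            # candidate id -> overlap so far
        for tok in s:
            for j in index.get(tok, ()):
                counts[j] += 1
        results.extend((i, j) for j, c in counts.items() if c >= min_overlap)
    return results
```

Only the sets sharing at least one token with the probe set are ever touched, which is where the savings over a nested-loop join come from; the paper's optimizations then prune this candidate generation much further.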
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.
Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine distance) is above a given threshold. We propose a simple algorithm based on novel indexing and optimization strategies that solves this problem without relying on approximation methods or extensive parameter tuning. We show the approach efficiently handles a variety of datasets across a wide setting of similarity thresholds, with large speedups over previous state-of-the-art approaches.
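For reference, the problem this paper addresses can be stated as the quadratic baseline below, a brute-force sketch of our own over sparse vectors stored as dicts; the paper's contribution is an indexed algorithm that avoids this all-pairs scan:

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts
    mapping dimension -> weight."""
    dot = sum(w * v.get(d, 0.0) for d, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

def all_pairs_baseline(vectors, threshold):
    """All pairs (i, j), i < j, whose cosine similarity meets the
    threshold -- the O(n^2) scan the cited algorithm improves on."""
    return [(i, j)
            for (i, u), (j, v) in combinations(enumerate(vectors), 2)
            if cosine(u, v) >= threshold]
```

The baseline touches every pair regardless of the threshold; exploiting a high threshold to skip most pairs is exactly the indexing opportunity the paper develops.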
Data cleaning based on similarities involves identification of "close" tuples, where closeness is evaluated using a variety of similarity functions chosen to suit the domain and application. Current approaches for efficiently implementing such similarity joins are tightly tied to the chosen similarity function. In this paper, we propose a new primitive operator which can be used as a foundation to implement similarity joins according to a variety of popular string similarity functions, and notions of similarity which go beyond textual similarity. We then propose efficient implementations for this operator. In an experimental evaluation using real datasets, we show that the implementation of similarity joins using our operator is comparable to, and often substantially better than, previous customized implementations for particular similarity functions.