# A Graph Theoretic Approach to Key Equivalence.

**ABSTRACT** This paper is concerned with redundancy detection and elimination in databases via the solution of a key equivalence problem. The approach is related to the hardening of soft databases method due to Cohen et al. [4]. Here, the problem is described in graph theoretic terms. An appropriate optimization model is drawn and solved indirectly. This approach is shown to be effective. Computational results on test databases are included.

J. Horacio Camacho, Abdellah Salhi, and Qingfu Zhang

University of Essex, Colchester CO4 3SQ, U.K.

{jhcama, as, qzhang}@essex.ac.uk

1 Introduction

Textual databases often contain a lot of redundant information, as well as information that is corrupt as a result of errors such as key punch errors, scanning errors and spelling errors, to name a few.

Because of these errors, it is often hard to decide whether two entries (records) in the database refer to the same real world object. The problem is considered here in the context of structured databases.

This problem has been tackled under different guises, such as probabilistic record linkage [1], the merge/purge problem [2], duplicate detection [3], and others. Here we focus on an approach related to the Hardening Soft Information Sources approach due to Cohen et al. [4]. It is related, but simpler since, as will be seen, it is formulated as a simpler optimization problem than the global optimization one suggested in [4]. This simplification follows from the fact that in [4], a record has potentially many fields each pointing to a real world object, i.e. each field forms a reference. Here, we consider that the whole record, however many fields it may have, points to a single object. This is an important distinction since the initial complete graph we work from is much simpler than what would be considered if the model in [4] were exactly adhered to.

In the next subsection, formal models of the problem are presented. A solution approach is given in Section 2. Examples and experimental results on test datasets are shown in Section 3. Section 4 introduces an important improvement to the basic approach. Section 5 is a comparison of the enhanced approach with a standard approach on many small databases. Section 6 is the conclusion.

1.1 Formulation of the key equivalence problem

Let object identifier $O_i$ be any record in a database with fields $o_1, o_2, \ldots, o_m$. Each one of these fields describes a specific characteristic of $O_i$, such as name, address or telephone number. Let also object be the real target which $O_i$ is referring to and key be the unique identification of the record in a database. Then, key equivalence occurs when two or more $O_i$'s in a database refer to the same object [5]. As said earlier, the main difference between our formulation and that of the hardening approach [4] is that here we consider a database as a set of $O_i$'s, while in Cohen et al.'s work, a database consists of a set of tuples, each of which consists of a set of references, or fields. Each reference points to a real world object.

Since, given a database, it is not easy to tell which records point to the same object, we initially assume that all of them point to the same object. This means that all records can potentially be represented by the same object identifier. Therefore, initially at least, we in fact assume that when all redundancy is removed, we will possibly be left with no database. This assumption may sound unreasonable, since only a small percentage of records in a database might be corrupted, but it is needed only to motivate our method. It does not limit the application of the method suggested.

Let now each object identifier be represented by a node. Then, the potential redundancy of an identifier (node) may be represented by a directed arc between this identifier and another one. An incoming arc means the source node is potentially redundant. Since, as was assumed, initially they all point to each other, no direction is required, leading to a complete graph.

Let $G(V,E)$ be this graph with $V = \{v_1, v_2, \ldots, v_n\}$ its set of nodes, each corresponding to an object identifier (record) in a given database, and $E = \{(i,j) \mid i,j = 1,2,\ldots,n,\ i \neq j\}$ its set of arcs.

By some string similarity metric, such as SoftTF-IDF [6], it is possible to find weights for all edges of graph $G$ specifying how likely it is that two object identifiers point to the same real world object, i.e. that one of them is redundant. A large weight between two $O_i$'s says they are unlikely to point at the same object, and a small weight says otherwise, i.e. there is redundancy. Since SoftTF-IDF scales the similarity to values between zero and one, where one is the maximum similarity, we take as a weight its complement, $w_{ij} = 1 - \text{SoftTF-IDF}(O_i, O_j)$. We are left with the question of how close to zero a weight has to be in order to say that one of the records is redundant. It will become clear later that this question is at the heart of the problem.
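As a rough illustration of how such weights might be assembled, the sketch below builds the edge weights of the complete graph as one minus a similarity score. It is only a sketch under stated assumptions: the standard-library `difflib.SequenceMatcher` ratio is used as a stand-in similarity and is not SoftTF-IDF, and the function names and sample records are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in similarity in [0, 1]; this is NOT SoftTF-IDF (assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def complete_graph_weights(records, sim=similarity):
    """Return {(i, j): 1 - sim(records[i], records[j])} for every pair i < j."""
    return {
        (i, j): 1.0 - sim(records[i], records[j])
        for i, j in combinations(range(len(records)), 2)
    }


# Hypothetical records, for illustration only.
records = ["Ann Lee 12 High St", "Anne Lee 12 High St", "Bob Ray 9 Low Rd"]
weights = complete_graph_weights(records)
for (i, j), w in sorted(weights.items()):
    print(f"w[{i},{j}] = {w:.3f}")
```

Any similarity that returns values in [0, 1] can be passed as `sim`, so a SoftTF-IDF implementation could be substituted without changing the rest of the sketch.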

Clearly, a subgraph of $G$ with minimum total weight will catch redundancy. Moreover, this subgraph must have all the nodes of $G$.

2 Solution approach

A further formalization is necessary to model this situation. In particular, we consider that a subgraph of $G$ that captures all or part of the redundancy in the database is generated by a function from $V$ to $V$. As such, it has the properties of totality and unicity. Given $G$, we want to find $G'(V,E')$ such that $E' \subseteq E$, and

$$
z = \sum_{(i,j)\in E,\ i\neq j} e_{ij} w_{ij} + \Big(n - \sum_{(i,j)\in E,\ i\neq j} e_{ij}\Big)\lambda_1 + \Big(\sum_{(i,j)\in E,\ i\neq j} e_{ij}\Big)\lambda_2 \tag{1}
$$

is minimized, where $e_{ij} = 1$ if $(i,j) \in E'$ and 0 otherwise, $n$ is the size of the database, $w_{ij}$, $i,j = 1,2,\ldots,n$, are the weights, and $\lambda_1$ and $\lambda_2$ are constants which control the size of the resulting database for the amount of redundancy detected. Equivalently, they are constants which, when known exactly, will give a value $z$ which is smallest for the database that has been cleaned of all its redundancy and nothing else, i.e. the perfect solution. Of course, the choice of these constants will influence the effectiveness of the approach advocated here.

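A simple manipulation of the expression of $z$ results in

$$
\begin{aligned}
z &= \sum_{(i,j)\in E,\ i\neq j} e_{ij} w_{ij} + \lambda_1 n - \lambda_1 \sum_{(i,j)\in E,\ i\neq j} e_{ij} + \lambda_2 \sum_{(i,j)\in E,\ i\neq j} e_{ij} \\
  &= \sum_{(i,j)\in E,\ i\neq j} e_{ij} w_{ij} + \lambda_1 n - (\lambda_1 - \lambda_2) \sum_{(i,j)\in E,\ i\neq j} e_{ij}.
\end{aligned}
$$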
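The manipulation can be checked numerically. The following is a minimal sketch that evaluates the original and the rearranged expression of $z$ on a small arc selection; the weights, the selection and the values of $\lambda_1$ and $\lambda_2$ are hypothetical and chosen only for illustration.

```python
import math


def z_original(weights, selected, n, lam1, lam2):
    """Expression (1): weights maps arc (i, j) -> w_ij, selected is E'."""
    picked = len(selected)                        # sum of e_ij
    cost = sum(weights[arc] for arc in selected)  # sum of e_ij * w_ij
    return cost + (n - picked) * lam1 + picked * lam2


def z_rearranged(weights, selected, n, lam1, lam2):
    """Rearranged form: cost + lam1 * n - (lam1 - lam2) * sum of e_ij."""
    picked = len(selected)
    cost = sum(weights[arc] for arc in selected)
    return cost + lam1 * n - (lam1 - lam2) * picked


# Hypothetical data: three candidate arcs on a four-record database.
weights = {(1, 2): 0.329, (1, 3): 0.400, (2, 4): 0.250}
selected = {(1, 2), (2, 4)}
a = z_original(weights, selected, n=4, lam1=0.5, lam2=0.2)
b = z_rearranged(weights, selected, n=4, lam1=0.5, lam2=0.2)
print(a, b)
```

Both functions return the same value up to floating-point rounding, which is all the rearrangement claims.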

By constraining $z$ with the requirements of the relation (function) between the nodes, dropping the constant term $\lambda_1 n$, which does not affect the minimizer, and replacing $\lambda_1 - \lambda_2$ with a single parameter $k$, we obtain the following optimization problem.

$$
\min\ z = \sum_{(i,j)\in E,\ i\neq j} e_{ij} w_{ij} - k \sum_{(i,j)\in E,\ i\neq j} e_{ij} \tag{2}
$$

subject to

$$
\sum_{j:\,(i,j)\in E} e_{ij} \le 1, \quad \forall i,\ i \neq j, \tag{3}
$$

$$
\sum_{(i,j)\in E} e_{ij} \le n - 1, \tag{4}
$$

$$
u_i - u_j \le n - 1 - n e_{ij}, \quad i,j = 1,\ldots,n,\ i \neq j,\ u_i \in \mathbb{R}^+. \tag{5}
$$

Note that restrictions (3) imply that there is at most one edge $(i,j)$ from each node $i$. Restrictions (4) and (5) eliminate cycles [8]. From the above model, it is clear that if $k \le 0$, the second term of $z$ is zero or positive and so the minimum corresponds to all $e_{ij} = 0$, i.e. no edge is worth including in the solution, giving $E' = \emptyset$.

If $k > 0$, the minimum of $z$ must be non-positive, i.e.

$$
\sum_{(i,j)\in E,\ i\neq j} e_{ij} w_{ij} \le k \sum_{(i,j)\in E,\ i\neq j} e_{ij},
$$

in which case the solution to the above model will be those arcs with small weights. Moreover, because we are minimizing, all these weights will be less than $k$.
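For a very small instance, the behaviour of model (2)-(5) can be explored by exhaustive enumeration. The sketch below is not the solution method of this paper; it simply tries every way of giving each node at most one outgoing arc, discards choices containing a cycle, and keeps the choice with the smallest $z$. The four-node weights are hypothetical.

```python
from itertools import product


def has_cycle(succ):
    """succ maps every node to its chosen successor, or None."""
    for start in succ:
        seen, node = set(), start
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = succ[node]
    return False


def brute_force(nodes, w, k):
    """Minimise objective (2) over acyclic choices of at most one
    outgoing arc per node; w maps frozenset({i, j}) -> weight."""
    best_z, best_arcs = 0.0, frozenset()  # all e_ij = 0 gives z = 0
    options = [[None] + [j for j in nodes if j != i] for i in nodes]
    for choice in product(*options):
        succ = dict(zip(nodes, choice))
        if has_cycle(succ):
            continue
        arcs = frozenset((i, j) for i, j in succ.items() if j is not None)
        z = sum(w[frozenset(a)] - k for a in arcs)
        if z < best_z - 1e-12:
            best_z, best_arcs = z, arcs
    return best_z, best_arcs


# Hypothetical symmetric weights on four nodes.
w = {frozenset(p): v for p, v in {
    (1, 2): 0.3, (1, 3): 0.6, (1, 4): 0.5,
    (2, 3): 0.7, (2, 4): 0.1, (3, 4): 0.8}.items()}
nodes = [1, 2, 3, 4]
z_low, arcs_low = brute_force(nodes, w, k=0.2)
z_high, arcs_high = brute_force(nodes, w, k=0.4)
z_zero, arcs_zero = brute_force(nodes, w, k=0.0)
print(z_low, sorted(arcs_low))
print(z_high, sorted(arcs_high))
```

With these weights, only arcs whose weight is below $k$ lower the objective, so the selected arcs are exactly the light ones, and nothing is selected when $k = 0$. The enumeration grows exponentially with the number of nodes and is meant for checking intuition only.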

2.1 Formalization

Parameter $k$ is essential for trapping redundancy, and its proper setting decides how successful the detection of redundancy will be. Set too large, a connected subgraph of $G$ will be the solution, thus including all nodes (object identifiers). Set too low, very few nodes, if any, will be included in the solution, thus leaving out genuine redundancy. It must be clear already that trees satisfy the constraints of the above optimization model. A solution to the problem is likely to be a collection of subtrees of the minimum spanning tree of $G$. In other words, it is likely to be a forest.

Definition 1. A spanning forest of a connected graph $G$ is a forest whose components are subtrees of a spanning tree of $G$.

Definition 2. A minimum spanning forest of a connected graph $G$ is a forest whose components are minimum spanning trees of the corresponding components in $G$.

Proposition 1. The solution to the suggested optimisation model is a spanning forest. Moreover, it is a minimum spanning forest.

Proposition 2. For a given $k$, the optimal solution to model (2)-(5) can be obtained in polynomial time.

Proof. Trim the graph of all its edges with weights greater than $k$. Apply a greedy algorithm to the remaining subgraph of $G$ to find the minimum spanning tree (forest if disconnected).
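The two steps of the proof can be sketched as follows. The paper only says "a greedy algorithm"; Kruskal's algorithm with a union-find structure is used here as one such algorithm, and the function name and example graph are hypothetical.

```python
def minimum_spanning_forest(n, weighted_edges, k):
    """Trim edges heavier than k, then run Kruskal's greedy algorithm.

    n: number of nodes, labelled 0..n-1.
    weighted_edges: iterable of (weight, i, j) for an undirected graph.
    Returns the kept edges as a list of (weight, i, j).
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    # Sorting dominates the cost, so the whole procedure is polynomial.
    for w, i, j in sorted(e for e in weighted_edges if e[0] <= k):
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so no cycle is closed
            parent[ri] = rj
            forest.append((w, i, j))
    return forest


# Hypothetical five-node graph.
edges = [(0.1, 0, 1), (0.2, 1, 2), (0.15, 0, 2), (0.9, 2, 3), (0.3, 3, 4)]
forest = minimum_spanning_forest(5, edges, k=0.5)
print(forest)
```

Here the edge of weight 0.9 is trimmed first, so the result is a forest with two trees, on nodes {0, 1, 2} and {3, 4}.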

Remark 1. The problem of finding the optimum $k$ may not be solvable in polynomial time. A practical estimation of $k$ is given below.

Remark 2. Parameter $k$ varies from database to database.

2.2 Estimating k

The threshold constant $k$ can be chosen arbitrarily, below 0.5 for instance. That may well work in some cases. However, in general, it is better to find an estimate directly related to the given database (Remark 2). This can be done as follows.

Algorithm 1:

1. Find the weighted complete graph $G$ corresponding to the given database;

2. Find the minimum spanning tree of $G$;

3. Assign to $k$ the largest weight for which a good match between records is found.
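Step 3 relies on a judgement of what counts as a good match, which the algorithm does not specify further. One possible reading is sketched below: among the edges of the minimum spanning tree, take the largest weight whose two records are judged to match. The callback `is_good_match` is a hypothetical placeholder for that judgement (for example a manual review), and the sample data are illustrative.

```python
def estimate_k(mst_edges, is_good_match):
    """Return the largest MST edge weight whose records are judged a match.

    mst_edges: iterable of (weight, record_a, record_b).
    is_good_match: hypothetical callback deciding whether two records match.
    Returns None when no edge is judged a good match.
    """
    good = [w for w, a, b in mst_edges if is_good_match(a, b)]
    return max(good) if good else None


# Hypothetical tree edges and a hypothetical set of reviewed matches.
mst_edges = [(0.10, "r1", "r2"), (0.22, "r2", "r3"), (0.55, "r3", "r4")]
reviewed_matches = {frozenset({"r1", "r2"}), frozenset({"r2", "r3"})}
k = estimate_k(mst_edges, lambda a, b: frozenset({a, b}) in reviewed_matches)
print(k)
```

With this reading, edges heavier than the returned value are exactly those that were not judged to be good matches or that lie above the heaviest good one.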

2.3 Detecting redundancy

Redundant records are detected according to the following algorithm.

Algorithm 2:

1. Apply Algorithm 1 to the given database;

2. Remove all edges with weights > $k$ from the minimum spanning tree of $G$ output by Algorithm 1.

The output is a tree (or forest) that represents the detected redundant records. Each tree of the spanning forest can be reduced to one node. The remaining nodes of the forest constitute the records of the resulting database after removing redundancy.
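The reduction of each tree to a single node can be sketched as below. The sketch assumes the minimum spanning tree is already available as a list of weighted edges, and it keeps the lowest-index record of each tree as the representative, which is an arbitrary choice made here and not prescribed by the algorithm. Names and data are hypothetical.

```python
def deduplicate(records, mst_edges, k):
    """Drop MST edges heavier than k and keep one record per remaining tree.

    records: list of records.
    mst_edges: iterable of (weight, i, j), with i and j indices into records.
    """
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for w, i, j in mst_edges:
        if w <= k:
            ri, rj = find(i), find(j)
            if ri != rj:
                # The smaller index stays the root, so every root is the
                # lowest-index record of its tree.
                parent[max(ri, rj)] = min(ri, rj)
    return [rec for idx, rec in enumerate(records) if find(idx) == idx]


# Hypothetical records and spanning tree.
records = ["a", "a'", "b", "c", "c'"]
mst_edges = [(0.1, 0, 1), (0.7, 1, 2), (0.6, 2, 3), (0.2, 3, 4)]
cleaned = deduplicate(records, mst_edges, k=0.3)
print(cleaned)
```

In this example the edges of weight 0.7 and 0.6 are removed, leaving three trees and therefore three records.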

2.4 Illustration

Consider a database with only four records, i.e. four object identifiers. As shown in Table 1, each record is associated with a node $v_i$ of a graph $G$. Now consider $G$ as a complete graph; its set $E$ of edges is given in Table 2. Moreover, we also compute the weight of each edge $(i,j)$ using the SoftTF-IDF method [6]. These weights are also included in Table 2.

| Node | Database record |
|---|---|
| $v_1$ | Coby Lashiwn Y. 303 Main |
| $v_2$ | Coby Angela 303 Main |
| $v_3$ | Coby Wiliams A. 303 Main |
| $v_4$ | Coby Agel 303 Main |

Table 1. A database example.

| Edge $(i,j)$ | Weight $w_{ij}$ |
|---|---|
| (1,2) | $w_{12} = w_{21} = 0.329$ |
| (1,3) | $w_{13} = w_{31} = 0.400$ |
| (1,4) | $w_{14} = w_{41} = 0.329$ |
| (2,3) | $w_{23} = w_{32} = 0.329$ |
| (2,4) | $w_{24} = w_{42} = 0.250$ |
| (3,4) | $w_{34} = w_{43} = 0.329$ |

Table 2. Set of edges and set of weights.

Now, applying Algorithm 1 to the graph of Fig. 1 results in the graph of Fig. 2. A greedy algorithm has been used to solve the minimum spanning tree problem [8]. By consulting Fig. 2, it is clear that $k$ must take the value 0.25, the weight of edge (2,4). Removing all edges of the minimum spanning tree with weights > $k$ leads to record $v_2$ (or $v_4$) being redundant. Thus, $E' = \{(2,4)\}$.
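The illustration can be replayed in a few lines using the weights of Table 2. The sketch uses Kruskal's algorithm as the greedy method; since several edges share the weight 0.329, the minimum spanning tree is not unique, and the tie-breaking order used here (by weight, then by edge label) is an assumption that happens to produce the tree described in the text.

```python
# Weights of Table 2, keyed by edge (i, j) with i < j.
weights = {(1, 2): 0.329, (1, 3): 0.400, (1, 4): 0.329,
           (2, 3): 0.329, (2, 4): 0.250, (3, 4): 0.329}


def kruskal(nodes, weights):
    """Greedy minimum spanning tree; ties are broken by edge label."""
    parent = {v: v for v in nodes}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    tree = []
    ordered = sorted(weights.items(), key=lambda item: (item[1], item[0]))
    for (i, j), _w in ordered:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree


mst = kruskal([1, 2, 3, 4], weights)
k = 0.25
redundant_links = [e for e in mst if weights[e] <= k]
print(mst)              # [(2, 4), (1, 2), (2, 3)]
print(redundant_links)  # [(2, 4)]
```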

3 Experimental results

The approach presented in this paper is tested on the datasets of Table 3. Its performance is measured via the quality of the solution it returns for each dataset. This quality is calculated as

$$
\text{quality} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad \text{precision} = \frac{c}{|E'|}, \qquad \text{recall} = \frac{c}{|E^{*\prime}|},
$$

where $c$ is the number of correctly linked rows, i.e. the number of records pointing to the same real world object, $|E'|$ is the number of edges $(i,j)$ in the solution set $E'$ and $|E^{*\prime}|$ is the number of redundant records in the database [6].
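The quality measure is the harmonic mean of precision and recall. A minimal sketch of its computation follows; the counts in the example are hypothetical, and returning 0.0 when a count is zero is a convention of this sketch, not something stated in the text.

```python
import math


def quality(correct, found, actual):
    """Harmonic mean of precision and recall.

    correct: c, the number of correctly linked rows.
    found:   |E'|, the number of edges in the solution set.
    actual:  |E*'|, the number of redundant records in the database.
    """
    if correct == 0 or found == 0 or actual == 0:
        return 0.0  # convention of this sketch for degenerate cases
    precision = correct / found
    recall = correct / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision 0.75, recall 0.6.
q = quality(correct=30, found=40, actual=50)
print(round(q, 4))
```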

Fig. 1. The complete graph $G$ of the database. (Nodes $v_1, \ldots, v_4$; edge weights $w_{12} = 0.329$, $w_{13} = 0.4$, $w_{14} = 0.329$, $w_{23} = 0.329$, $w_{24} = 0.25$, $w_{34} = 0.329$.)

Fig. 2. The minimum spanning tree $G'$ output by Algorithm 1. (Edges $w_{12} = 0.329$, $w_{23} = 0.329$, $w_{24} = 0.25$.)

| Dataset | Records | Redundancies |
|---|---|---|
| BirdKunkel | 337 | 38 |
| BirdScott2 | 719 | 310 |
| Census | 841 | 671 |
| Cora | 923 | 902 |
| Parks | 654 | 505 |
| Restaurant | 863 | 228 |

Table 3. Experimental data, source [6].

In each case, the quality is computed for 31 values of $k$ between zero and one. The results are displayed in Fig. 3.

Note that, from Fig. 3, the quality of solution is not generally sensitive to the value of $k$. However, for dataset BirdKunkel, it is: indeed, the quality of solution drops from about 0.72 to 0.45 for values of $k$ not near 0.3. This sensitivity to the choice of $k$ is a serious issue, particularly when new databases are considered. In the following, a heuristic approach is suggested for reducing it.

4 Improvement: Reducing sensitivity to parameter k

The sensitivity to the choice of $k$ can be reduced by the following pre-processing steps. Before choosing $k$, the forest (tree) representing potential redundancy is further trimmed as follows.

4.1 Reduce the degree of each node

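For each node $v_i$ such that $\deg(v_i) \ge 2$ do:

  Compare the weights of its incident edges $(i,j), (i,k), (i,l), \ldots$ pairwise.

  if $|w_{ij} - w_{ik}| > \delta_1$ then

    if $w_{ij} > w_{ik}$ then remove $(i,j)$; else remove $(i,k)$; endif

  endif

Here, $j, k, l, \ldots$ index the neighbours of $v_i$, so in this step $k$ is a node index and not the threshold parameter. We remove edges whose weight differences are bigger than $\delta_1 = 0.05$. The value of $\delta_1$ is empirically determined.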
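The degree-reduction step might be sketched as follows. The sketch makes one assumption that the pseudocode leaves open: all pairwise comparisons are made against the original forest, and the marked edges are removed together at the end. The function name and the example forest are hypothetical.

```python
from itertools import combinations


def reduce_degree(edges, delta1=0.05):
    """Remove, at every node of degree >= 2, the heavier edge of each pair
    of incident edges whose weights differ by more than delta1.

    edges: dict mapping frozenset({i, j}) -> weight for the current forest.
    Returns a new dict without the removed edges.
    """
    incident = {}
    for edge in edges:
        for node in edge:
            incident.setdefault(node, []).append(edge)

    removed = set()
    for inc in incident.values():
        if len(inc) < 2:
            continue
        for a, b in combinations(inc, 2):
            if abs(edges[a] - edges[b]) > delta1:
                removed.add(a if edges[a] > edges[b] else b)
    return {e: w for e, w in edges.items() if e not in removed}


# Hypothetical star centred on node 0.
forest = {frozenset({0, 1}): 0.10,
          frozenset({0, 2}): 0.12,
          frozenset({0, 3}): 0.30}
trimmed = reduce_degree(forest)
print(sorted(tuple(sorted(e)) for e in trimmed))
```

In the example, the edge to node 3 is much heavier than its two siblings and is removed, while the two light edges, which differ by only 0.02, are both kept.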

Fig. 3. Results obtained for each dataset for 31 different values of $k$ between 0 and 1. (Plot of quality against $k$, both on a scale from 0 to 1, with one curve per dataset: BirdKunkel, BirdScott, Census, Cora, Parks, Restaurant.)

4.2 Reduce the length of branches

Often, the minimum spanning tree representing potential redundancy shows many nodes (records) which do not seem to be similar. For instance, $w_{ij}$ and $w_{jk}$ are of the same order, but $w_{ik}$ is significantly different from both. This occurs when branches have more than two nodes. So, reducing the size of the long branches according to some criterion on the difference of weights between adjacent edges enhances the accuracy of estimating parameter $k$. We proceed as follows.

Link each vertex in the solution to the root of its tree and obtain the weights between the actual vertex and the intermediary vertices on the path to the root node. If a weight is bigger than some value $\delta_2$, then remove from the branch the edge with the largest weight. Here, again, the value of $\delta_2$ is empirically settled, and found to be 0.6.
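The description above leaves some details open, so the following is only one possible reading, given as a sketch. It assumes that each tree is stored as a parent map, that the weight between any two records is available from the complete graph $G$, that the root counts among the vertices compared against, and that removals are decided on the original tree and collected rather than applied one by one. All names and data are hypothetical.

```python
def trim_branches(parent, weight, delta2=0.6):
    """Collect the tree edges to remove under one reading of the rule.

    parent: maps each vertex to its parent in a rooted tree (root -> None).
    weight: function giving the weight between any two records in G.
    Returns a set of tree edges written as (child, parent).
    """
    to_remove = set()
    for v in parent:
        path_edges, ancestors = [], []
        node = v
        while parent[node] is not None:
            path_edges.append((node, parent[node]))
            node = parent[node]
            ancestors.append(node)
        # If v is too far from some vertex on its path to the root,
        # cut the heaviest edge of that path.
        if any(weight(v, a) > delta2 for a in ancestors):
            to_remove.add(max(path_edges, key=lambda e: weight(*e)))
    return to_remove


# Hypothetical chain 2 -> 1 -> 0 (root) with symmetric weights.
w = {frozenset({0, 1}): 0.2, frozenset({1, 2}): 0.3, frozenset({0, 2}): 0.7}
parent = {0: None, 1: 0, 2: 1}
cut = trim_branches(parent, lambda a, b: w[frozenset({a, b})])
print(cut)
```

In the example, records 0 and 1 and records 1 and 2 are close, but records 0 and 2 are not (weight 0.7 > 0.6), so the heavier of the two branch edges, (2, 1), is cut.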

Having incorporated the pre-processing steps suggested in this section, the results of the enhanced approach show in all cases, and in particular in the case of BirdKunkel, a wider interval in which $k$ can be chosen without affecting too much the quality of redundancy detection. The results are represented in Fig. 4. Note, especially, that in the cases of the Census, Cora and Parks datasets, the intervals from which $k$ can be chosen are very wide, indicating little sensitivity indeed. The results are strikingly better than the earlier ones (Fig. 3) obtained before the improvements to the method.

5 Comparison with a standard approach

In [7], manual matching of records in record linkage is treated via the solution of an assignment problem. A comparison of our approach to this mathematical programming based method is reasonable. The same weights provided by the SoftTF-IDF method are used for both approaches. Results are given in Fig. 5. Note that the proposed method performs better on all datasets.

Fig. 4. Heuristic results to reduce the sensitivity of $k$ obtained for each dataset for 31 different values of $k$ between 0 and 1. (Six panels, one per dataset: BirdKunkel, BirdScott, Census, Cora, Parks, Restaurant. Each panel plots quality against $k$, both on a scale from 0 to 1, with two curves: MST, and MST & heuristic.)

Fig. 5. Baseline results for each dataset for 31 different values of $k$ between 0 and 1. (Six panels, one per dataset: BirdKunkel, BirdScott, Census, Cora, Parks, Restaurant. Each panel plots quality against $k$, both on a scale from 0 to 1, with two curves: Proposed method, and Baseline.)

6 Conclusion

We have looked at the problem of detecting and removing redundancy in structured databases. An optimization model of the integer programming type has been devised for it. Although this model is difficult to solve directly, it turns out that a tree graph of a certain kind (a forest) provides an optimum solution, given $k$. This solution can be obtained in polynomial time. The approach suggested has been tried on many small structured databases. The results are encouraging. Further efforts are currently being expended in order to improve the estimation of the threshold parameter $k$ in the case of large databases.

Acknowledgment This work is supported by Conacyt grant number 168588.

References

1. H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954–959, 1959.

2. M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD-95), pages 127–138, San Jose, CA, May 1995.

3. A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD 1997 Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 23–29, Tucson, AZ, May 1997.

4. W. W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000), Boston, MA, Aug. 2000.

5. C. Pu. Key Equivalence in Heterogeneous Databases. Department of Computer Science, Columbia University, New York, NY, 1991.

6. W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A Comparison of String Distance Metrics for Name-Matching Tasks, 2003.

7. M. A. Jaro. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 89, 414–420, 1989.

8. Nemhauser and Wolsey. Integer and Combinatorial Optimization. Wiley-Interscience, 1988.