
Access Structures for Advanced Similarity Search in Metric Spaces

Vlastislav Dohnal1, Claudio Gennaro2, Pasquale Savino2, and Pavel Zezula1

1 Masaryk University, Brno, Czech Republic
{xdohnal, zezula}@fi.muni.cz

2 ISTI-CNR, Pisa, Italy
{gennaro, savino}@isti.pi.cnr.it

Abstract. Similarity retrieval is an important paradigm for searching in environments where exact match has little meaning. Moreover, in order to enlarge the set of data types for which similarity search can be performed efficiently, the notion of a mathematical metric space provides a useful abstraction of similarity. In this paper we consider the problem of organizing and searching large data sets from arbitrary metric spaces, and we discuss a novel access structure for similarity search in metric data, called D-Index. The D-Index combines a novel clustering technique with a pivot-based distance searching strategy to speed up the execution of similarity range and nearest neighbor queries for large files of objects stored on disk. Moreover, we propose an extension of this access structure, the eD-Index, which is able to deal with the problem of similarity self join. Though this approach cannot eliminate the intrinsic quadratic complexity of similarity joins, significant performance improvements are confirmed by experiments.

1 Introduction

Similarity searching has become a fundamental computational task in a variety of application areas, including multimedia information retrieval, data mining, pattern recognition, machine learning, computer vision, genome databases, data compression, and statistical data analysis. This problem, originally studied mostly within the theoretical area of computational geometry, has recently attracted more and more attention in the database community, because of the growing need to deal with large, often distributed, volumes of data. Consequently, high performance has become an important feature of cutting-edge designs.

In this article, we consider the problem from a broad perspective and assume the data to be objects from a metric space where only pair-wise distances between objects are known. We present the D-Index, a multi-level access structure with separable buckets on each level, which supports easy insertion, range search, and nearest neighbor search. Moreover, in order to complete the set of similarity search operations, similarity joins are needed; for this purpose, an extension of the D-Index is introduced. Applications of the similarity join operation include data cleaning and copy detection, to name a few. The former is useful, for instance, in the development of Internet services, which often require an integration of heterogeneous sources of data. Such sources are typically unstructured, whereas the intended services often require structured data. In this case, the main challenge is to provide consistent and error-free data, which implies data cleaning, typically implemented by a sort of similarity join. This work focuses on the similarity self join, useful for copy detection and data cleaning in large databases.

The paper is organized as follows. In Section 2, we define the general principles of our access structures. The structure and algorithms of the D-Index are presented in Section 2.3, while the eD-Index is illustrated in Section 2.4. Section 3 briefly summarizes the experimental evaluation results. Finally, Appendix A formally presents the proposed algorithms. The article concludes in Section 4.

2 Access structures and search algorithms

2.1 Search Strategies

A convenient way to assess the similarity of two objects is to apply a metric function, which quantifies the closeness of objects as a distance; this can be seen as a measure of the objects' dissimilarity. A metric space M = (D,d) is defined by a domain of objects (elements, points) D and a total (distance) function d: a non-negative function (d(x,y) ≥ 0, with d(x,y) = 0 iff x = y) that is symmetric (d(x,y) = d(y,x)) and satisfies the triangle inequality (d(x,y) ≤ d(x,z) + d(z,y), ∀x,y,z ∈ D).

Without any loss of generality, we assume that the maximum distance never exceeds d+. For a query object q ∈ D, two fundamental similarity queries can be defined. A range query retrieves all elements within distance r of q, that is, the set {x ∈ X | d(q,x) ≤ r}. A k-nearest neighbor query retrieves the k closest elements to q, that is, a set R ⊆ X such that |R| = k and ∀x ∈ R, y ∈ X − R: d(q,x) ≤ d(q,y).
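For concreteness, the two query types can be sketched with a naive linear scan. This is only a hypothetical illustration of the definitions above (the toy metric and data are stand-ins, not part of the paper):

```python
import heapq

def range_query(X, d, q, r):
    """Range query R(q, r): all x in X with d(q, x) <= r."""
    return [x for x in X if d(q, x) <= r]

def knn_query(X, d, q, k):
    """k-nearest neighbor query: the k elements of X closest to q."""
    return heapq.nsmallest(k, X, key=lambda x: d(q, x))

# Toy metric space: integers under absolute difference.
X = [1, 4, 7, 10, 15]
d = lambda x, y: abs(x - y)
print(range_query(X, d, 8, 2))  # [7, 10]
print(knn_query(X, d, 8, 2))    # [7, 10]
```

Any index structure for metric data, including the D-Index, must return exactly the same answers while computing far fewer distances than this scan.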

The similarity join is a search primitive which combines objects of two subsets of D into one set such that a similarity condition is satisfied; the similarity condition between two objects is defined according to the metric distance d. Formally, the similarity join X ⋈_sim Y between two finite sets X = {x_1,...,x_N} and Y = {y_1,...,y_M} (X ⊆ D and Y ⊆ D) is defined as the set of pairs

X ⋈_sim Y = {(x_i, y_j) | d(x_i, y_j) ≤ µ},

where the threshold µ is a real number such that 0 ≤ µ ≤ d+. If the sets X and Y coincide, we talk about the similarity self join.
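The definition translates directly into a naive self-join sketch (illustrative only; this is the quadratic baseline that the eD-Index later improves upon):

```python
def similarity_self_join(X, d, mu):
    """All unordered pairs (x_i, x_j), i < j, with d(x_i, x_j) <= mu."""
    return [(X[i], X[j])
            for i in range(len(X))
            for j in range(i + 1, len(X))
            if d(X[i], X[j]) <= mu]

d = lambda x, y: abs(x - y)
print(similarity_self_join([1, 2, 5, 6], d, 1))  # [(1, 2), (5, 6)]
```

Enumerating i < j evaluates each of the N·(N−1)/2 pairs exactly once, which is the cost that the later sections take as the reference point for speedup.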

2.2 General approach: Clustering Through Separable Partitioning

To achieve our objectives, we base our partitioning principles on a mapping function, which we call the ρ-split function, where ρ is a real number constrained as 0 ≤ ρ < d+. In order to gradually explain the concept of ρ-split functions, we first define a first order ρ-split function and its properties. More details about the mathematical specification of ρ-split functions are available in [4].

Definition 1. Given a metric space (D,d), a first order ρ-split function s^{1,ρ} is the mapping s^{1,ρ}: D → {0,1,−} such that, for arbitrary different objects x,y ∈ D:

s^{1,ρ}(x) = 0 ∧ s^{1,ρ}(y) = 1 ⇒ d(x,y) > 2ρ (separable property), and
ρ_2 ≥ ρ_1 ∧ s^{1,ρ_2}(x) ≠ − ∧ s^{1,ρ_1}(y) = − ⇒ d(x,y) > ρ_2 − ρ_1 (symmetry property).

In other words, the ρ-split function assigns to each object of the space D one of the symbols 0, 1, or −.

We can generalize the ρ-split function by concatenating n first order ρ-split functions with the purpose of obtaining a split function of order n.

Definition 2. Given n first order ρ-split functions s_1^{1,ρ},...,s_n^{1,ρ} in the metric space (D,d), a ρ-split function of order n, s^{n,ρ} = (s_1^{1,ρ}, s_2^{1,ρ},...,s_n^{1,ρ}): D → {0,1,−}^n, is the mapping such that, for arbitrary different objects x,y ∈ D:

∀i s_i^{1,ρ}(x) ≠ − ∧ ∀j s_j^{1,ρ}(y) ≠ − ∧ s^{n,ρ}(x) ≠ s^{n,ρ}(y) ⇒ d(x,y) > 2ρ (separable property), and
ρ_2 ≥ ρ_1 ∧ ∀i s_i^{1,ρ_2}(x) ≠ − ∧ ∃j s_j^{1,ρ_1}(y) = − ⇒ d(x,y) > ρ_2 − ρ_1 (symmetry property).

An obvious consequence of the ρ-split function definitions, useful for our purposes, is that by combining n ρ-split functions of order 1, s_1^{1,ρ},...,s_n^{1,ρ}, we obtain a ρ-split function of order n, s^{n,ρ}. We often refer to the number of bits generated by s^{n,ρ}, that is, the parameter n, as the order of the ρ-split function. In order to obtain an addressing scheme, we need another function that transforms the ρ-split strings into integers, which we define as follows.

Definition 3. Given a string b = (b_1,...,b_n) of n elements 0, 1, or −, the function ⟨·⟩: {0,1,−}^n → [0..2^n] is specified as:

⟨b⟩ = [b_1, b_2, ..., b_n]_2 = Σ_{j=1}^{n} 2^{j−1} b_j, if ∀j b_j ≠ −
⟨b⟩ = 2^n, otherwise

When all the elements are different from '−', the function ⟨b⟩ simply translates the string b into an integer by interpreting it as a binary number (which is always < 2^n); otherwise the function returns 2^n.

By means of the ρ-split function and the ⟨·⟩ operator, we can assign an integer i (0 ≤ i ≤ 2^n) to each object x ∈ D, i.e., the function can group objects from X ⊂ D into 2^n + 1 disjoint subsets.

Though several different types of first order ρ-split functions are proposed, analyzed, and evaluated in [3], the ball partitioning split (bps), originally proposed in [6] under the name excluded middle partitioning, provided the smallest exclusion set. For this reason, we also apply this approach, which can be characterized as follows. The ball partitioning ρ-split function bps^ρ(x, x_v) uses one object x_v ∈ D and the medium distance d_m to partition the data file into three subsets (see Figure 1a):

bps^{1,ρ}(x) = 0 if d(x, x_v) ≤ d_m − ρ
bps^{1,ρ}(x) = 1 if d(x, x_v) > d_m + ρ
bps^{1,ρ}(x) = − otherwise (1)

The objects corresponding to the values 1 or 0 of the bps function form the separable partitions, while the objects corresponding to '−' form the exclusion set.

Once we have defined a set of first order ρ-split functions, it is possible to combine them in order to obtain a function which generates more partitions. This idea is depicted in Figure 1b, where two bps split functions on a two-dimensional space are used. The domain D, represented by the grey square, is divided into four regions corresponding to the separable partitions. The exclusion set is represented by the brighter region and is formed by the union of the exclusion sets resulting from the two splits.
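The bps function of Equation (1) and the ⟨·⟩ operator of Definition 3 can be sketched as follows (an illustrative rendering; the toy metric is a stand-in):

```python
def bps(x, pivot, dm, rho, d):
    """First order ball-partitioning rho-split function (Equation 1)."""
    dist = d(x, pivot)
    if dist <= dm - rho:
        return 0            # inside the shrunken ball
    if dist > dm + rho:
        return 1            # outside the enlarged ball
    return '-'              # exclusion zone of width 2*rho

def bucket_index(bits):
    """Definition 3: <b> = sum 2^(j-1) * b_j if no '-', else 2^n."""
    n = len(bits)
    if '-' in bits:
        return 2 ** n       # the exclusion set gets the extra index
    return sum(b << j for j, b in enumerate(bits))  # b_1 is least significant

# Order-2 split of x = 3 using two pivots (toy metric: absolute difference).
d = lambda x, y: abs(x - y)
s = [bps(3, 0, 5, 1, d), bps(3, 10, 5, 1, d)]
print(s, bucket_index(s))  # [0, 1] 2
```

An object mapped to '−' by any of the n first order functions receives index 2^n and falls into the exclusion set, exactly as the brighter region of Figure 1b.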

2.3 D-Index

The basic idea of the D-Index [4] is to create a multilevel storage and retrieval structure that uses several ρ-split functions, one per level, to create an array of buckets for storing objects. On the first level, we use a ρ-split function to separate objects of the whole data set. For any other level, objects mapped to the exclusion bucket of the previous level are the candidates for storage in separable buckets of this level. Finally, the exclusion bucket of the last level forms the exclusion bucket of the whole D-Index structure. It is worth noting that the ρ-split functions of individual levels use the same ρ. Moreover, the split functions can have different orders, typically decreasing with the level, allowing the D-Index structure to have levels with different numbers of buckets. More precisely, the D-Index structure can be defined as follows.

From the structure point of view, the buckets are organized in the following two-dimensional array consisting of 1 + Σ_{i=1}^{h} 2^{m_i} elements:

B_{1,0}, B_{1,1}, ..., B_{1,2^{m_1}−1}
B_{2,0}, B_{2,1}, ..., B_{2,2^{m_2}−1}
...
B_{h,0}, B_{h,1}, ..., B_{h,2^{m_h}−1}, E_h

All separable buckets are included, but only the exclusion bucket E_h is present; exclusion buckets E_i for i < h are recursively re-partitioned on level i + 1. Then, for each row (i.e., D-Index level) i, the 2^{m_i} buckets are separable up to 2ρ, so we are sure that no two buckets at the same level i can both contain objects relevant to a similarity range query with radius r_q ≤ ρ.

D-Index Insertion. The insertion of an object x ∈ X into DI^ρ(X, m_1, m_2, ..., m_h) starts from the first level and tries to accommodate x in a separable bucket. If a suitable bucket exists, the object is stored in this bucket. If this fails for all levels, the object x is placed in the exclusion bucket E_h. In any case, the insertion algorithm determines exactly one bucket to store the object. The pseudo-code of this procedure can be found in Algorithm A1 of Appendix A.
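The multilevel insertion principle can be sketched as follows; this is a minimal illustration assuming order-1 bps splits, and the helper names (`make_split`, the index 2 for the exclusion zone) are ours, not the paper's:

```python
def d_index_insert(levels, exclusion, x):
    """Insert x into the first level whose split function maps it to a
    separable bucket; otherwise x ends up in the global exclusion set."""
    for split, buckets in levels:
        i = split(x)
        if i < len(buckets):      # <s(x)> < 2^m: a separable bucket exists
            buckets[i].append(x)
            return
    exclusion.append(x)           # fell through all h levels: store in E_h

def make_split(pivot, dm, rho, d):
    """Order-1 bps split returning a bucket index; 2 (= 2^1) marks exclusion."""
    def split(x):
        dist = d(x, pivot)
        if dist <= dm - rho:
            return 0
        if dist > dm + rho:
            return 1
        return 2
    return split

d = lambda x, y: abs(x - y)
levels = [(make_split(0, 5, 1, d), [[], []]),
          (make_split(10, 5, 1, d), [[], []])]
exclusion = []
for x in [2, 8, 5, 15]:
    d_index_insert(levels, exclusion, x)
print(levels[0][1], exclusion)  # [[2], [8, 15]] [5]
```

Object 5 lies in the exclusion zone of both levels' pivots and therefore falls through to the global exclusion set, mirroring Algorithm A1.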

D-Index Naive Search. Given a query region Q = R(q, r_q) with q ∈ D and r_q ≤ ρ, a simple algorithm can execute the query as follows.

Algorithm 2.1 Search
for i = 1 to h
    return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,0}(q)⟩};
end for
return all objects x such that x ∈ Q ∩ E_h;

The function ⟨s_i^{m_i,0}(q)⟩ always gives a value smaller than 2^{m_i}, because ρ = 0. Consequently, one separable bucket on each level i is determined. Note that in Algorithm 2.1 the exclusion bucket is always accessed. The execution of Algorithm 2.1 requires h + 1 bucket accesses, which forms the upper bound of the more sophisticated algorithm described in the next section. Moreover, the pivots can save many distance computations; for further details the reader is referred to [1].

D-Index Advanced Searches. The previous search algorithm is useful for understanding the basic idea of the D-Index structure; however, it requires accessing one bucket at each level of the D-Index, plus the exclusion bucket. In order to speed up the search, the following two situations can be exploited: if the query region is contained in the exclusion partition of level i, then the query cannot have objects in the separable buckets of this level, and only the next level, if it exists, must be considered; if the query region is contained in a separable partition of level i, the following levels, as well as the exclusion bucket (for i = h), need not be accessed, and the search terminates on this level.
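These two pruning rules can be illustrated, for order-1 bps splits and r_q ≤ ρ, by the following sketch. It is our own derivation from the bps definition (the level layout and variable names are illustrative), not the paper's Algorithm A2:

```python
def advanced_range_search(levels, exclusion, d, q, rq, rho):
    """Range search with the two pruning rules, for order-1 bps splits.
    levels: list of ((pivot, dm), [bucket0, bucket1]); assumes rq <= rho."""
    result = []
    for (pivot, dm), buckets in levels:
        dist = d(q, pivot)
        if dist <= dm - (rho + rq):      # ball inside separable partition 0:
            return result + [x for x in buckets[0] if d(q, x) <= rq]  # stop
        if dist > dm + (rho + rq):       # ball inside separable partition 1:
            return result + [x for x in buckets[1] if d(q, x) <= rq]  # stop
        if dm - (rho - rq) < dist <= dm + (rho - rq):
            continue                     # ball inside exclusion zone: skip level
        side = 0 if dist <= dm else 1    # ball overlaps exactly one bucket
        result += [x for x in buckets[side] if d(q, x) <= rq]
    return result + [x for x in exclusion if d(q, x) <= rq]

# One level: pivot 0, dm = 5, rho = 1; buckets pre-filled accordingly.
d = lambda x, y: abs(x - y)
levels = [((0, 5), [[1, 3, 4], [8, 10]])]
exclusion = [5, 6]
print(advanced_range_search(levels, exclusion, d, 2, 1, 1))  # [1, 3]
print(advanced_range_search(levels, exclusion, d, 5, 1, 1))  # [4, 5, 6]
```

The first query ball lies entirely inside separable partition 0, so the search stops without touching the exclusion set; the second straddles the exclusion zone and must scan it too.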

Fig. 1. The excluded middle partitioning (a). Clustering through partitioning in the two-dimensional space (b).

Another drawback of the simple algorithm is that it works only for search radii up to ρ. However, with additional computational effort, queries with r_q > ρ can also be executed. Indeed, such queries can be executed by evaluating the split function s^{r_q−ρ}. In case s^{r_q−ρ} returns a string without any '−', the result is contained in a single bucket (namely B_{⟨s^{r_q−ρ}⟩}) plus, possibly, the exclusion bucket.

Let us now consider the case where the returned string contains at least one '−'. We indicate this string as (b_1,...,b_n) with b_i ∈ {0,1,−}. In case there is exactly one b_i = '−', we must access all buckets B whose index is obtained by substituting the '−' in s^{r_q−ρ} with 0 and 1. In the most general case, we must substitute all the '−' symbols in (b_1,...,b_n) with zeros and ones and generate all possible combinations.

In order to define an algorithm for this process, we need some additional terms and notation.

Definition 4. We define an extended exclusive OR bit operator, ⊗, which is based on the following truth table:

bit1: 0 0 0 1 1 1 − − −
bit2: 0 1 − 0 1 − 0 1 −
⊗:    0 1 0 1 0 0 0 0 0

Note that the operator ⊗ can be used bit-wise on two strings of the same length and that it always returns a standard binary number (i.e., one that does not contain any '−'). Consequently, ⟨s1 ⊗ s2⟩ < 2^n always holds for strings s1 and s2 of length n (see Definition 3).

Definition 5. Given a string s of length n, G(s) denotes the subset of Ω_n = {0,1,...,2^n − 1} obtained by eliminating all elements e_i ∈ Ω_n for which ⟨s ⊗ e_i⟩ ≠ 0 (interpreting e_i as a binary string of length n).

Observe that G(−−...−) = Ω_n, and that the cardinality of G(s) is 2 raised to the number of '−' elements: G(s) generates all partition identifications in which the symbol '−' is alternatively substituted by zeros and ones.
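Definitions 4 and 5 can be rendered directly in code (an illustrative sketch; symbols 0/1 are ints and '−' is the string '-'):

```python
def xor_ext(b1, b2):
    """Extended XOR (Definition 4): 1 iff both symbols are binary and differ."""
    if b1 == '-' or b2 == '-':
        return 0
    return b1 ^ b2

def G(s):
    """Definition 5: all e in {0..2^n - 1} with <s XOR e> = 0, i.e. every
    bucket index obtained by substituting each '-' in s with 0 and 1."""
    n = len(s)
    keep = []
    for e in range(2 ** n):
        bits = [(e >> j) & 1 for j in range(n)]   # b_1 is least significant
        if all(xor_ext(si, bi) == 0 for si, bi in zip(s, bits)):
            keep.append(e)
    return keep

print(G([0, '-']))         # [0, 2]: first symbol fixed to 0, '-' expanded
print(len(G(['-', '-'])))  # 4 = 2^(number of '-')
```

This brute-force enumeration over 2^n candidates suffices for the small split orders used in practice; only the dashed positions actually branch.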

Given a query region Q = R(q, r_q) with q ∈ D and r_q ≤ d+, the advanced similarity range query can be executed following Algorithm A2.

The task of the nearest neighbor search is to retrieve the k closest elements to q ∈ D from X, with respect to the metric M. For convenience, we designate the distance to the k-th nearest neighbor as d_k (d_k ≤ d+). Due to lack of space, we do not comment on this algorithm in detail; the general strategy of Algorithm A3 is as follows. The algorithm starts with an optimistic strategy, assuming that the k-th nearest neighbor is at distance at most ρ. If this fails, additional search steps are performed to find the correct result.

2.4 eD-Index

The idea behind the eD-Index is to modify the insertion algorithm of the D-Index in such a way that the exclusion set overlaps with the separable partitions by a margin ε (see Figure 2). The objects which fall in the region where the exclusion set intersects the separable partitions are replicated in both sets. This principle, called the overloading exclusion set, ensures that a similarity self join query with threshold µ ≤ ε never loses a qualifying pair of objects (x,y) belonging to different sets, i.e., with bps(x) ≠ bps(y), because all objects of a separable set which can make a qualifying pair with an object of the exclusion set are copied to the exclusion set. In this way, the eD-Index speeds up the execution of similarity self join queries up to a predefined value ε of the threshold.

Fig. 2. The difference between D-Index (a) and eD-Index (b).

eD-Index Insertion. It proceeds according to Algorithm A4. The algorithm is similar to the one for the insertion in the D-Index (A1). The difference is that in the eD-Index we use both the split functions s_i^{m_i,ρ} and s_i^{m_i,ρ+ε}, allowing us to replicate all the objects which fall in the overlapping region of two consecutive levels.

eD-Index Similarity Self Join. The outline of the similarity self join algorithm is as follows: execute the join query independently on every separable bucket of every level of the eD-Index, and additionally on the exclusion bucket of the whole structure. This behaviour is correct due to the overloading exclusion set principle, which copies every object of a separable set that can make a qualifying pair with an object of the exclusion set into the exclusion set. In this respect, the join query elaboration can be done independently in every bucket, and the results of the individual join subqueries form the result of the given join query. Algorithm A5 describes this idea.

Fig. 3. The Sliding Window algorithm.

The procedure SimJoin(B_{i,j}, µ) executes the similarity self join in one bucket. It is possible to improve the algorithm by exploiting the pivots during the SimJoin elaboration. This idea is based on the sliding window algorithm (see Algorithm A6): objects of a bucket are ordered with respect to a pivot p, which is the reference object of a ρ-split function used by the eD-Index, and we define a sliding window of objects [o_lo, o_up]. This window is moved through the whole ordered set of the bucket in order to find all qualifying pairs (see Figure 3).
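A minimal sketch of the sliding window idea follows. It relies only on the triangle inequality (|d(x,p) − d(y,p)| > µ implies d(x,y) > µ) and omits the additional PivotCheck() filter of Algorithm A6; names and the toy metric are illustrative:

```python
def sliding_window_join(bucket, d, pivot, mu):
    """Self join within one bucket via a window over pivot-ordered objects."""
    objs = sorted(bucket, key=lambda o: d(o, pivot))
    dist = [d(o, pivot) for o in objs]
    result, lo = [], 0
    for up in range(1, len(objs)):
        # Objects whose pivot distance differs from o_up's by more than mu
        # can never qualify, by the triangle inequality: slide lo forward.
        while dist[up] - dist[lo] > mu:
            lo += 1
        for j in range(lo, up):
            if d(objs[j], objs[up]) <= mu:
                result.append((objs[j], objs[up]))
    return result

d = lambda x, y: abs(x - y)
print(sliding_window_join([6, 1, 5, 2], d, 0, 1))  # [(1, 2), (5, 6)]
```

Since the pivot distances are precomputed during insertion in the actual structure, advancing the window costs no extra distance evaluations; only pairs inside the window are verified with the real metric.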

3 Performance Evaluation and Comparison

We have implemented the D-Index and the eD-Index and conducted numerous experiments to verify their properties on two metric data sets. The first data set (STR) consisted of sentences of a Czech language corpus compared by the edit distance measure, the so-called Levenshtein distance [5]. The most frequent distance was around 100 and the longest distance was 500, equal to the length of the longest sentence. The second data set (VEC) was composed of 45-dimensional vectors of color features extracted from images. Vectors were compared by the quadratic form distance measure. The distance distribution of this data set was practically normal, with the most frequent distance equal to 4,100 and the maximum distance equal to 8,100.

3.1 D-Index Performance Evaluation

In all our experiments, the query objects are not chosen from the indexed data sets, but they follow the same distance distribution. The search costs are measured in terms of distance computations. All presented cost values are averages obtained by executing queries for 50 different query objects and constant search selectivity, that is, queries using the same search radius or the same number of nearest neighbors. We have considered about 11,000 objects for each of our data sets VEC and STR.

We have compared the performance of the D-Index with other index structures under the same workload. In particular, we considered the M-tree [2] and the sequential organization, SEQ.

Efficiency. The main objective of these experiments was to compare the similarity search efficiency of the D-Index with the other organizations. The results for the range search are shown in Figures 4a and 4b, while the performance for the nearest neighbor search is presented in Figures 4c and 4d.

For all tested queries, i.e., retrieving subsets of up to 20% of the database, the M-tree and the D-Index always needed fewer distance computations than the sequential scan. The D-Index performed much better for the sentences and also for the range search over vectors. However, the nearest neighbor search on vectors needed practically the same number of distance computations.

(The M-tree software is available at http://www-db.deis.unibo.it/research/Mtree/.)


3.2 eD-Index Performance Evaluation

In order to demonstrate the suitability of the eD-Index for the problem of similarity self join, we have compared several different approaches to the join operation. The first, nested loops, uses the symmetry of metric distance functions to prune some pairs; its time complexity is O(N·(N−1)/2). A more sophisticated method, called range query join, uses the D-Index structure. Specifically, we assume a data set X ⊆ D organized by the D-Index and apply the following search strategy: for each o ∈ X, perform range_query(o, µ). Finally, the last compared method is the overloading join algorithm, which is described in Section 2.4.

In all experiments, we have compared these three techniques for the similarity self join problem: the nested loops (NL) algorithm, the range query join (RJ), and the overloading join (OJ) algorithm applied on the eD-Index. As performance index, we have used the speedup with respect to the naive approach, i.e., the number of distance computations of the examined algorithm divided by N·(N−1)/2, where N is the number of objects stored.

Efficiency. The objective of this group of tests was to study the relationship between the query size and the efficiency measured in terms of distance computations. Figures 4e and 4f show the results of these experiments. They show that OJ is more than twice as fast as RJ for the small values of µ used in the data cleaning area. On the vector data set, the OJ algorithm performed even better, especially for small query radii.

Scalability. This is probably the most important issue to investigate, considering the web-scale dimension of data. In the elementary case, it is necessary to study what happens to the performance of the algorithms when the size of a data set grows. We have experimentally investigated the behaviour of the eD-Index on the text data set with sizes from 50,000 to 300,000 objects (sentences). We have mainly concentrated on small queries, which are typical for the data cleaning area. Figures 4g and 4h report the speedups of the RJ and OJ algorithms, respectively.

In summary, the figures demonstrate that the speedup is very high and constant for different values of µ with respect to the data set size. This implies that the similarity self join with the eD-Index, specifically the overloading join algorithm, is also suitable for large and growing data sets.

4 Conclusions

In this paper we have presented a new access structure able to cope with range queries, nearest neighbor queries, and similarity joins. In the performance evaluation, we have concentrated on distance computations. However, the presented structure is also suitable for working on disk storage and is not limited to operating in main memory only. Our experiments, not reported in this paper, exhibit very good performance in terms of disk block accesses. Compared to the M-tree, it typically needs fewer distance computations and far fewer disk reads to

Fig. 4. The experiments of the performance evaluation.

execute a query. The D-Index is also economical in space requirements. It needs slightly more space than the sequential organization, but at least two times less disk space than the M-tree.

We have extended the D-Index to implement two similarity join algorithms, and we have performed numerous experiments to analyze their search properties and suitability for the similarity join implementation. The main advantage of these structures is that they can also perform similarity operations on other metric data. The challenge is to apply our access structure to problems of similarity on XML structures, where metric indexes could be applied for approximate matching of tree structures.

References

1. E. Chávez, J. L. Marroquín, and G. Navarro. Fixed queries array: A fast and economical data structure for proximity searching. Multimedia Tools and Applications (MTAP), 14(2):113–135, 2001.
2. P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proc. of VLDB'97, pages 426–435. Morgan Kaufmann, August 1997.
3. V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. Separable splits of metric data sets. In 9th Italian Conf. on Database Systems (SEBD), pages 45–62. LCM Selecta Group - Milano, 2001.
4. V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. D-Index: Distance searching index for metric data sets. Multimedia Tools and Applications, 21(1), 2003. To appear.
5. G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001.
6. P. N. Yianilos. Excluded middle vantage point forests for nearest neighbor search. In Sixth DIMACS Implementation Challenge: Nearest Neighbor Searches workshop, January 1999.

A The algorithms

Algorithm A1 Insertion
for i = 1 to h
    if ⟨s_i^{m_i,ρ}(x)⟩ < 2^{m_i} then
        x → B_{i,⟨s_i^{m_i,ρ}(x)⟩}; exit;
    end if
end for
x → E_h;

Algorithm A2 Range Search
for i = 1 to h
    if ⟨s_i^{m_i,ρ+r_q}(q)⟩ < 2^{m_i} then
        return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,ρ+r_q}(q)⟩}; exit;
    end if
    if r_q ≤ ρ then (search radius up to ρ)
        if ⟨s_i^{m_i,ρ−r_q}(q)⟩ < 2^{m_i} then
            return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,ρ−r_q}(q)⟩};
        end if
    else
        let {l_1, l_2, ..., l_k} = G(s_i^{m_i,r_q−ρ}(q))
        return all objects x such that x ∈ Q ∩ B_{i,l_1} or ... or x ∈ Q ∩ B_{i,l_k};
    end if
end for
return all objects x such that x ∈ Q ∩ E_h;

Algorithm A3 Nearest Neighbor Search
A = ∅, d_k = d+; (initialization)
for i = 1 to h (first – optimistic – phase)
    r = min{d_k, ρ};
    if ⟨s_i^{m_i,ρ+r}(q)⟩ < 2^{m_i} then
        access bucket B_{i,⟨s_i^{m_i,ρ+r}(q)⟩}; update A and d_k;
        if d_k ≤ ρ then exit;
    else
        if ⟨s_i^{m_i,ρ−r}(q)⟩ < 2^{m_i} then
            access bucket B_{i,⟨s_i^{m_i,0}(q)⟩}; update A and d_k;
        end if
    end if
end for
access bucket E_h; update A and d_k;
if d_k > ρ then (second phase – if needed)
    for i = 1 to h
        if ⟨s_i^{m_i,ρ+d_k}(q)⟩ < 2^{m_i} then
            access bucket B_{i,⟨s_i^{m_i,ρ+d_k}(q)⟩} if not already accessed; update A and d_k; exit;
        else
            let {b_1, b_2, ..., b_k} = G(s_i^{m_i,d_k−ρ}(q))
            access buckets B_{i,b_1}, B_{i,b_2}, ..., B_{i,b_k} if not accessed; update A and d_k;
        end if
    end for
end if

Algorithm A4 eD-Index Insertion
for i = 1 to h
    if ⟨s_i^{m_i,ρ+ε}(x)⟩ < 2^{m_i} then
        x → B_{i,⟨s_i^{m_i,ρ}(x)⟩}; exit;
    end if
    if ⟨s_i^{m_i,ρ}(x)⟩ < 2^{m_i} then
        x → B_{i,⟨s_i^{m_i,ρ}(x)⟩};
    end if
end for
x → E_h;

Algorithm A5 eD-Index Similarity Self Join
for i = 1 to h
    for j = 0 to 2^{m_i} − 1
        SimJoin(B_{i,j}, µ);
    end for
end for
SimJoin(E_h, µ);

Algorithm A6 Sliding Window
lo = 1
for up = 2 to n
    increment lo while d(o_up, p) − d(o_lo, p) > µ
    for j = lo to up − 1
        if PivotCheck() = FALSE then
            if d(o_j, o_up) ≤ µ then add pair (o_j, o_up) to result
        end if
    end for
end for