Access Structures for Advanced Similarity Search in Metric Spaces.
Vlastislav Dohnal¹, Claudio Gennaro², Pasquale Savino², and Pavel Zezula¹
¹ Masaryk University
Brno, Czech Republic
{xdohnal, zezula}@fi.muni.cz
² ISTI-CNR
Pisa, Italy
{gennaro, savino}@isti.pi.cnr.it
Abstract. Similarity retrieval is an important paradigm for searching
in environments where exact match has little meaning. Moreover, in or-
der to enlarge the set of data types for which the similarity search can
efficiently be performed, the notion of mathematical metric space pro-
vides a useful abstraction for similarity. In this paper we consider the
problem of organizing and searching large data-sets from arbitrary met-
ric spaces, and a novel access structure for similarity search in metric
data, called D-Index, is discussed. D-Index combines a novel clustering
technique and the pivot-based distance searching strategy to speed up
execution of similarity range and nearest neighbor queries for large files
with objects stored in disk memories. Moreover, we propose an extension
of this access structure (eD-Index) which is able to deal with the problem
of similarity self join. Though this approach is not able to eliminate the
intrinsic quadratic complexity of similarity joins, significant performance
improvements are confirmed by experiments.
1 Introduction
Similarity searching has become a fundamental computational task in a variety
of application areas, including multimedia information retrieval, data mining,
pattern recognition, machine learning, computer vision, genome databases, data
compression, and statistical data analysis. This problem, originally studied
mostly within the theoretical area of computational geometry, has recently
attracted more and more attention in the database community because of the
growing need to deal with large, often distributed, volumes of data.
Consequently, high performance has become an important feature of cutting-edge
designs.
In this article, we consider the problem from a broad perspective and assume
the data to be objects from a metric space where only pair-wise distances be-
tween objects are known. We present the D-Index, a multi-level access structure
of separable buckets on each level which supports easy insertion, range search,
and nearest neighbor search. Moreover, in order to complete the set of similarity
search operations, similarity joins are needed. For this purpose, an extension
of the D-Index is introduced. Applications of the similarity join operation
include data cleaning and copy detection, to name a few. The former is useful,
for instance, in the development of Internet services, which often require the
integration of heterogeneous data sources. Such sources are typically
unstructured, whereas the intended services often require structured data. In
this case, the main challenge is to provide consistent and error-free data,
which calls for data cleaning, typically implemented by some form of similarity
join. This work focuses on the similarity self join, which is useful for copy
detection and data cleaning of large databases.
The paper is organized as follows. In Section 2, we define the general
principles of our access structures. The structure and algorithms of the
D-Index are presented in Section 2.3, while the eD-Index is described in
Section 2.4. Section 3 briefly summarizes the results of the experimental
evaluation, and Section 4 concludes the article. Finally, Appendix A formally
presents the proposed algorithms.
2 Access Structures and Search Algorithms
2.1 Search Strategies
A convenient way to assess similarity between two objects is to apply a metric
function that expresses the closeness of objects as a distance, which can be
seen as a measure of the objects' dissimilarity. A metric space M = (D,d) is
defined by a domain of objects (elements, points) D and a total (distance)
function d: a non-negative (d(x,y) ≥ 0 with d(x,y) = 0 iff x = y) and symmetric
(d(x,y) = d(y,x)) function which satisfies the triangle inequality
(d(x,y) ≤ d(x,z) + d(z,y), ∀x,y,z ∈ D).
Without any loss of generality, we assume that the maximum distance never
exceeds the distance d+. For a query object q ∈ D and a finite set X ⊆ D of
stored objects, two fundamental similarity queries can be defined. A range
query retrieves all elements within distance r of q, that is, the set
{x ∈ X | d(q,x) ≤ r}. A k-nearest neighbor query retrieves the k closest
elements to q, that is, a set R ⊆ X such that |R| = k and
∀x ∈ R, y ∈ X − R: d(q,x) ≤ d(q,y).
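The two query types can be illustrated with a sequential scan, which is the baseline that any index has to beat. This is a minimal sketch, not the paper's method: the function names and the toy metric `abs_diff` are assumptions introduced here for illustration only.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def range_query(data: Sequence[T], d: Callable[[T, T], float], q: T, r: float) -> List[T]:
    """Return {x in X | d(q, x) <= r} by scanning every object."""
    return [x for x in data if d(q, x) <= r]


def knn_query(data: Sequence[T], d: Callable[[T, T], float], q: T, k: int) -> List[T]:
    """Return k objects of X with the smallest distance to q."""
    return heapq.nsmallest(k, data, key=lambda x: d(q, x))


def abs_diff(a: float, b: float) -> float:
    """Illustrative metric on numbers (an assumption of this sketch): |a - b|."""
    return abs(a - b)


X = [1, 4, 6, 10, 15]
in_range = range_query(X, abs_diff, 5, 1)   # objects within distance 1 of q = 5
nearest = knn_query(X, abs_diff, 5, 3)      # three closest objects to q = 5
```

Both functions compute one distance per stored object, so their cost is |X| distance computations regardless of the query.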
The similarity join is a search primitive which combines objects of two subsets
of D into one set such that a similarity condition is satisfied. The similarity
condition between two objects is defined according to the metric distance d.
Formally, the similarity join $X \overset{sim}{\bowtie} Y$ between two finite
sets $X = \{x_1, \ldots, x_N\}$ and $Y = \{y_1, \ldots, y_M\}$ ($X \subseteq D$
and $Y \subseteq D$) is defined as the set of pairs
$$X \overset{sim}{\bowtie} Y = \{(x_i, y_j) \mid d(x_i, y_j) \le \mu\},$$
where the threshold $\mu$ is a real number such that $0 \le \mu \le d^+$. If
the sets X and Y coincide, we talk about the similarity self join.
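The join definition translates directly into nested loops. The sketch below is illustrative only: the function names are assumptions, and the self-join variant reports each unordered pair of distinct objects once, relying on the symmetry of d, rather than the full set of ordered pairs from the formal definition.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Metric = Callable[[T, T], float]


def similarity_join(xs: Sequence[T], ys: Sequence[T], d: Metric, mu: float) -> List[Tuple[T, T]]:
    """All pairs (x, y) with x from xs, y from ys and d(x, y) <= mu."""
    return [(x, y) for x in xs for y in ys if d(x, y) <= mu]


def similarity_self_join(xs: Sequence[T], d: Metric, mu: float) -> List[Tuple[T, T]]:
    """Unordered pairs of distinct objects with d(x, y) <= mu.

    combinations() yields every unordered pair once, so the scan
    costs N * (N - 1) / 2 distance computations.
    """
    return [(x, y) for x, y in combinations(xs, 2) if d(x, y) <= mu]


def abs_diff(a: float, b: float) -> float:
    """Illustrative metric (an assumption of this sketch)."""
    return abs(a - b)


joined = similarity_join([1, 5], [2, 9], abs_diff, 1)
self_joined = similarity_self_join([1, 2, 4, 8], abs_diff, 2)
```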
2.2 General Approach: Clustering Through Separable Partitioning
To achieve our objectives, we base our partitioning principles on a mapping
function, which we call the ρ-split function, where ρ is a real number constrained
as 0 ≤ ρ < d+. In order to gradually explain the concept of ρ-split functions, we
first define a first order ρ-split function and its properties. More details about
the mathematical specification of the ρ-split function are available in [4].
Definition 1. Given a metric space (D,d), a first order ρ-split function
$s^{1,\rho}$ is the mapping $s^{1,\rho}: D \to \{0,1,-\}$, such that for
arbitrary different objects x,y ∈ D,
$$s^{1,\rho}(x) = 0 \;\wedge\; s^{1,\rho}(y) = 1 \Rightarrow d(x,y) > 2\rho$$
(separable property) and
$$\rho_2 \ge \rho_1 \;\wedge\; s^{1,\rho_2}(x) \ne - \;\wedge\; s^{1,\rho_1}(y) = - \Rightarrow d(x,y) > \rho_2 - \rho_1$$
(symmetry property).
In other words, the ρ-split function assigns to each object of the space D one of
the symbols 0, 1, or −.
We can generalize the ρ-split function by concatenating n first order ρ-split
functions with the purpose of obtaining a split function of order n.
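Definition 2. Given n first order ρ-split functions
$s_1^{1,\rho}, \ldots, s_n^{1,\rho}$ in the metric space (D,d), a ρ-split
function of order n, $s^{n,\rho} = (s_1^{1,\rho}, s_2^{1,\rho}, \ldots, s_n^{1,\rho}) : D \to \{0,1,-\}^n$,
is the mapping such that for arbitrary different objects x,y ∈ D,
$$\forall i\; s_i^{1,\rho}(x) \ne - \;\wedge\; \forall j\; s_j^{1,\rho}(y) \ne - \;\wedge\; s^{n,\rho}(x) \ne s^{n,\rho}(y) \Rightarrow d(x,y) > 2\rho$$
(separable property) and
$$\rho_2 \ge \rho_1 \;\wedge\; \forall i\; s_i^{1,\rho_2}(x) \ne - \;\wedge\; \exists j\; s_j^{1,\rho_1}(y) = - \Rightarrow d(x,y) > \rho_2 - \rho_1$$
(symmetry property).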
An obvious consequence of the ρ-split function definitions, useful for our
purposes, is that by combining n ρ-split functions of order 1,
$s_1^{1,\rho}, \ldots, s_n^{1,\rho}$, we obtain a ρ-split function of order n,
$s^{n,\rho}$. We often refer to the number of bits generated by $s^{n,\rho}$,
that is the parameter n, as the order of the ρ-split function. In order to
obtain an addressing scheme, we need another function that transforms the
ρ-split strings into integers, which we define as follows.
Definition 3. Given a string $b = (b_1, \ldots, b_n)$ of n elements 0, 1, or −,
the function $\langle\cdot\rangle : \{0,1,-\}^n \to [0..2^n]$ is specified as:
$$\langle b \rangle = \begin{cases} [b_1, b_2, \ldots, b_n]_2 = \sum_{j=1}^{n} 2^{j-1} b_j, & \text{if } \forall j\; b_j \ne - \\ 2^n, & \text{otherwise} \end{cases}$$
When all the elements are different from '−', the function $\langle b \rangle$
simply translates the string b into an integer by interpreting it as a binary
number (which is always $< 2^n$); otherwise the function returns $2^n$.
By means of the ρ-split function and the $\langle b \rangle$ operator we can
assign an integer number i ($0 \le i \le 2^n$) to each object x ∈ D, i.e., the
function can group objects from X ⊂ D in $2^n + 1$ disjoint subsets.
Though several different types of first order ρ-split functions are proposed,
analyzed, and evaluated in [3], the ball partitioning split (bps), originally
proposed in [6] under the name excluded middle partitioning, provided the
smallest exclusion set. For this reason, we also apply this approach, which can
be characterized as follows.
The ball partitioning ρ-split function $bps^{\rho}(x, x_v)$ uses one object
$x_v \in D$ and the medium distance $d_m$ to partition the data file into three
subsets (see Figure 1a):
$$bps^{1,\rho}(x) = \begin{cases} 0 & \text{if } d(x, x_v) \le d_m - \rho \\ 1 & \text{if } d(x, x_v) > d_m + \rho \\ - & \text{otherwise} \end{cases} \tag{1}$$
The objects mapped to the values 1 or 0 of the bps function form the separable
partitions, and the objects mapped to '−' form the exclusion set.
Once we have defined a set of first order ρ-split functions, it is possible to
combine them in order to obtain a function which generates more partitions.
This idea is depicted in Figure 1b, where two bps-split functions, on the
two-dimensional space, are used. The domain D, represented by the grey square,
is divided into four regions, corresponding to the separable partitions. The
exclusion set is represented by the brighter region and it is formed by the
union of the exclusion sets resulting from the two splits.
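The ball partitioning split of Equation (1), its combination into a split of order n, and the addressing function of Definition 3 can be sketched as follows. The names `bps`, `split`, and `address`, the representation of a split as a list of (pivot, d_m) pairs, and the numeric toy data are assumptions made for this illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar, Union

T = TypeVar("T")
EXCL = "-"                      # the exclusion symbol
Symbol = Union[int, str]        # 0, 1 or EXCL


def bps(x: T, pivot: T, dm: float, rho: float, d: Callable[[T, T], float]) -> Symbol:
    """First order ball partitioning rho-split, following Equation (1)."""
    dist = d(x, pivot)
    if dist <= dm - rho:
        return 0
    if dist > dm + rho:
        return 1
    return EXCL


def split(x: T, pivots: Sequence[Tuple[T, float]], rho: float,
          d: Callable[[T, T], float]) -> List[Symbol]:
    """Order-n split: one bps symbol per (pivot, dm) pair."""
    return [bps(x, p, dm, rho, d) for p, dm in pivots]


def address(b: Sequence[Symbol]) -> int:
    """Definition 3: sum of 2^(j-1) * b_j, or 2^n if any symbol is '-'."""
    if EXCL in b:
        return 2 ** len(b)
    # enumerate() starts at 0, so position j (1-based) gets weight 2^(j-1)
    return sum(bit << j for j, bit in enumerate(b))


def abs_diff(a: float, b: float) -> float:
    """Illustrative metric (an assumption of this sketch)."""
    return abs(a - b)


pivots = [(0, 10), (20, 10)]                 # two first order splits, so n = 2
a3 = address(split(3, pivots, 2, abs_diff))   # symbols [0, 1]
a9 = address(split(9, pivots, 2, abs_diff))   # first symbol is '-'
a15 = address(split(15, pivots, 2, abs_diff)) # symbols [1, 0]
```

With n = 2 the addresses 0 to 3 identify separable partitions and the address 4 identifies the exclusion set.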
2.3 D-Index
The basic idea of the D-Index [4] is to create a multilevel storage and retrieval
structure that uses several ρ-split functions, one for each level, to create an
array of buckets for storing objects. On the first level, we use a ρ-split function
for separating objects of the whole data set. For any other level, objects mapped
to the exclusion bucket of the previous level are the candidates for storage in
separable buckets of this level. Finally, the exclusion bucket of the last level
forms the exclusion bucket of the whole D-Index structure. It is worth noting
that the ρ-split functions of individual levels use the same ρ. Moreover, split
functions can have different order, typically decreasing with the level, allowing
the D-Index structure to have levels with a different number of buckets. More
precisely, the D-Index structure can be defined as follows.
From the structure point of view, the buckets can be seen as organized in the
following two-dimensional array consisting of $1 + \sum_{i=1}^{h} 2^{m_i}$
elements.
$B_{1,0}, B_{1,1}, \ldots, B_{1,2^{m_1}-1}$
$B_{2,0}, B_{2,1}, \ldots, B_{2,2^{m_2}-1}$
...
$B_{h,0}, B_{h,1}, \ldots, B_{h,2^{m_h}-1}, E_h$
All separable buckets are included, but only the exclusion bucket $E_h$ is
present, because the exclusion buckets $E_{i<h}$ are recursively re-partitioned
on level i + 1. Then, for each row (i.e., D-Index level) i, the $2^{m_i}$
buckets are separable up to 2ρ; thus we are sure that no two buckets at the
same level i both contain relevant objects for any similarity range query with
radius $r_x \le \rho$.
D-Index Insertion. The insertion of an object x ∈ X in
$DI^{\rho}(X, m_1, m_2, \ldots, m_h)$ starts from the first level and tries to
accommodate x into a separable bucket. If a suitable bucket exists, the object
is stored in this bucket. If it fails for all levels, the object x is placed in
the exclusion bucket $E_h$. In any case, the insertion algorithm determines
exactly one bucket to store the object. The pseudo-code of this procedure can
be found in Algorithm A1 of Appendix A.
D-Index Naive Search. Given a query region $Q = R(q, r_q)$ with q ∈ D and
$r_q \le \rho$, a simple algorithm can execute the query as follows.
Algorithm 21 Search
for i = 1 to h
  return all objects x such that $x \in Q \cap B_{i,\langle s_i^{m_i,0}(q)\rangle}$;
end for
return all objects x such that $x \in Q \cap E_h$;
The function $\langle s_i^{m_i,0}(q)\rangle$ always gives a value smaller than
$2^{m_i}$, because ρ = 0. Consequently, one separable bucket on each level i is
determined. Note that in Algorithm 21 the exclusion bucket is always accessed.
The execution of Algorithm 21 requires h+1 bucket accesses, which forms the
upper bound of a more sophisticated algorithm described in the next section.
Moreover, the pivots can save many distance computations; for further details
the reader is referred to [1].
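Insertion and the naive search can be put together in a small in-memory sketch. The class name `DIndex`, the choice of ball partitioning splits given as (pivot, d_m) pairs, and the numeric toy data are assumptions made here for illustration; the sketch ignores disk storage and the pivot-based filtering of [1].

```python
from typing import Callable, List, Sequence, Tuple


class DIndex:
    """Minimal in-memory sketch: one order-m_i bps split per level."""

    def __init__(self, levels: Sequence[Sequence[Tuple[float, float]]],
                 rho: float, d: Callable[[float, float], float]) -> None:
        self.levels, self.rho, self.d = levels, rho, d
        # 2^(m_i) separable buckets on level i, plus one global exclusion bucket
        self.buckets: List[List[list]] = [[[] for _ in range(2 ** len(lv))] for lv in levels]
        self.exclusion: list = []

    def _address(self, x: float, pivots, rho: float) -> int:
        """Bucket address of x; 2^n signals the exclusion set."""
        addr = 0
        for j, (p, dm) in enumerate(pivots):
            dist = self.d(x, p)
            if dist > dm + rho:
                addr |= 1 << j          # symbol 1 at position j + 1
            elif dist > dm - rho:
                return 2 ** len(pivots)  # symbol '-'
        return addr

    def insert(self, x: float) -> None:
        """Store x in the first level where it falls into a separable bucket."""
        for i, pivots in enumerate(self.levels):
            a = self._address(x, pivots, self.rho)
            if a < 2 ** len(pivots):
                self.buckets[i][a].append(x)
                return
        self.exclusion.append(x)

    def naive_range_search(self, q: float, r: float) -> list:
        """One bucket per level (addressed with rho = 0) plus the exclusion bucket."""
        if r > self.rho:
            raise ValueError("the naive search only supports r <= rho")
        hits = []
        for i, pivots in enumerate(self.levels):
            a = self._address(q, pivots, 0)   # with rho = 0 never excluded
            hits += [x for x in self.buckets[i][a] if self.d(q, x) <= r]
        hits += [x for x in self.exclusion if self.d(q, x) <= r]
        return hits


index = DIndex(levels=[[(0, 50)], [(100, 50)]], rho=3, d=lambda a, b: abs(a - b))
for obj in range(100):
    index.insert(obj)
found = sorted(index.naive_range_search(50, 3))
```

In this toy setting the query accesses h + 1 = 3 buckets, as stated above, and still computes a distance to every object in them.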
D-Index Advanced Searches. The naive search algorithm is useful for
understanding the basic idea of the D-Index structure; however, it requires
access to one bucket at each level of the D-Index, plus the exclusion bucket.
In order to speed up the search, the following two situations can be exploited:
if the query region is contained in the exclusion partition of level i, then
the query cannot have objects in the separable buckets of this level and only
the next level, if it exists, must be considered; if the query region is
contained in a separable partition of level i, the following levels, as well as
the exclusion bucket for i = h, need not be accessed, thus the search
terminates on this level.
Fig. 1. The excluded middle partitioning (a). Clustering through partitioning
in the two-dimensional space (b).
Another drawback of the simple algorithm is that it works only for search
radii up to ρ. However, with additional computational effort, queries with
$r_q > \rho$ can also be executed by evaluating the split function
$s^{r_q-\rho}$. In case $s^{r_q-\rho}$ returns a string without any '−', the
result is contained in a single bucket (namely
$B_{\langle s^{r_q-\rho}\rangle}$) plus, possibly, the exclusion bucket.
Let us now consider that the string returned contains at least one '−'. We
indicate this string as $(b_1, \ldots, b_n)$ with $b_i \in \{0,1,-\}$. In case
there is only one $b_i$ = '−', we must access all buckets B whose index is
obtained by substituting in $s^{r_q-\rho}$ the '−' with 0 and 1. In the most
general case we must substitute in $(b_1, \ldots, b_n)$ all the '−' with zeros
and ones and generate all possible combinations.
In order to define an algorithm for this process, we need some additional
terms and notation.
Definition 4. We define an extended exclusive OR bit operator, ⊗, which is
based on the following truth table:
bit1   0  0  0  1  1  1  −  −  −
bit2   0  1  −  0  1  −  0  1  −
⊗      0  1  0  1  0  0  0  0  0
Note that the operator ⊗ can be used bit-wise on two strings of the same length
and that it always returns a standard binary number (i.e., it does not contain
any '−'). Consequently, $\langle s_1 \otimes s_2 \rangle < 2^n$ is always true
for strings $s_1$ and $s_2$ of length n (see Definition 3).
Definition 5. Given a string s of length n, G(s) denotes a subset of
$\Omega_n = \{0, 1, \ldots, 2^n - 1\}$, such that all elements
$e_i \in \Omega_n$ for which $\langle s \otimes e_i \rangle \ne 0$ are
eliminated (interpreting $e_i$ as a binary string of length n).
Observe that $G(- - \ldots -) = \Omega_n$, and that the cardinality of the set
is two raised to the power of the number of '−' elements; that is, G(s)
generates all partition identifications in which the symbol '−' is
alternatively substituted by zeros and ones.
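Definitions 4 and 5 can be checked with a few lines of code. The sketch below assumes that a split string is a list of the symbols 0, 1, and "-", with position j carrying the weight 2^(j-1) as in Definition 3; the names `ext_xor` and `G` mirror the notation, but the implementation is illustrative only.

```python
from typing import List, Sequence, Union

EXCL = "-"
Symbol = Union[int, str]


def ext_xor(b1: Symbol, b2: Symbol) -> int:
    """Extended exclusive OR of Definition 4: 1 only for (0, 1) and (1, 0)."""
    return 1 if (b1, b2) in ((0, 1), (1, 0)) else 0


def G(s: Sequence[Symbol]) -> List[int]:
    """Definition 5: addresses e in 0 .. 2^n - 1 whose bit-wise ext_xor with s is all zero."""
    n = len(s)
    result = []
    for e in range(2 ** n):
        bits = [(e >> j) & 1 for j in range(n)]   # position j + 1 has weight 2^j
        if all(ext_xor(sb, eb) == 0 for sb, eb in zip(s, bits)):
            result.append(e)
    return result


one_wildcard = G([0, EXCL])        # the '-' is replaced by 0 and by 1
all_wildcards = G([EXCL] * 3)      # every address of Omega_3
no_wildcard = G([1, 0])            # exactly one address
```

The enumeration over all 2^n addresses is chosen for closeness to the definition; for large n one would generate the combinations directly from the positions holding '−'.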
Given a query region $Q = R(q, r_q)$ with q ∈ D and $r_q \le d^+$, the advanced
similarity range query can be executed following Algorithm A2.
The task of the nearest neighbor search is to retrieve the k closest elements
from X to q ∈ D, respecting the metric M. For convenience, we designate the
distance to the k-th nearest neighbor as $d_k$ ($d_k \le d^+$). Due to the lack
of space, we do not comment on this algorithm in detail; however, the general
strategy of Algorithm A3 works as follows. The algorithm starts with an
optimistic strategy, assuming that the k-th nearest neighbor is at distance at
most ρ. If this fails, additional search steps are performed to find the
correct result.
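The range search strategy of Algorithm A2 can be sketched in a self-contained way as follows. This is an illustration under assumptions, not the authors' implementation: splits are ball partitioning splits given as (pivot, d_m) pairs, buckets are kept in dictionaries in main memory, and all function names are introduced here.

```python
from typing import Callable, Dict, List, Sequence, Tuple

EXCL = "-"


def split(x, pivots, rho, d):
    """Order-n bps split of x with exclusion half-width rho."""
    out = []
    for p, dm in pivots:
        dist = d(x, p)
        out.append(0 if dist <= dm - rho else 1 if dist > dm + rho else EXCL)
    return out


def address(bits) -> int:
    """Definition 3: binary value of the string, or 2^n if it contains '-'."""
    if EXCL in bits:
        return 2 ** len(bits)
    return sum(b << j for j, b in enumerate(bits))


def G(bits) -> List[int]:
    """Definition 5: every address obtained by replacing each '-' with 0 and 1."""
    addrs = [0]
    for j, b in enumerate(bits):
        choices = (0, 1) if b == EXCL else (b,)
        addrs = [a | (c << j) for a in addrs for c in choices]
    return addrs


def build(data, levels, rho, d) -> Tuple[List[Dict[int, list]], list]:
    """Algorithm A1 applied to every object of data."""
    buckets: List[Dict[int, list]] = [{} for _ in levels]
    exclusion: list = []
    for x in data:
        for i, pivots in enumerate(levels):
            a = address(split(x, pivots, rho, d))
            if a < 2 ** len(pivots):
                buckets[i].setdefault(a, []).append(x)
                break
        else:                       # no level accepted x
            exclusion.append(x)
    return buckets, exclusion


def range_search(buckets, exclusion, levels, rho, q, r, d) -> list:
    """Range query R(q, r) following the structure of Algorithm A2."""
    hits: list = []

    def scan(objs) -> None:
        hits.extend(x for x in objs if d(q, x) <= r)

    for i, pivots in enumerate(levels):
        n_sep = 2 ** len(pivots)
        a = address(split(q, pivots, rho + r, d))
        if a < n_sep:               # query region lies inside one separable partition
            scan(buckets[i].get(a, []))
            return hits
        if r <= rho:
            a = address(split(q, pivots, rho - r, d))
            if a < n_sep:           # otherwise the region lies inside the exclusion zone
                scan(buckets[i].get(a, []))
        else:
            for a in G(split(q, pivots, r - rho, d)):
                scan(buckets[i].get(a, []))
    scan(exclusion)
    return hits


def abs_diff(a, b):
    """Illustrative metric (an assumption of this sketch)."""
    return abs(a - b)


data = list(range(200))
levels = [[(0, 100), (200, 60)], [(100, 30)]]    # orders m_1 = 2, m_2 = 1
buckets, exclusion = build(data, levels, 5, abs_diff)
wide = sorted(range_search(buckets, exclusion, levels, 5, 100, 20, abs_diff))
```

The query with radius 20 exceeds ρ = 5, so on the first level it visits the buckets generated by G and then terminates on the second level, where the query region falls into a single separable partition.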
2.4 eD-Index
The idea behind the eD-Index is to modify the insertion algorithm of the
D-Index in such a way that the exclusion set overlaps with the separable
partitions by ε (see Figure 2). The objects which fall in the region where the
exclusion set intersects the separable partitions are replicated in both those
sets. This principle, called the overloading exclusion set, ensures that a
similarity self join query with threshold µ ≤ ε cannot find a qualifying pair
of objects (x,y) from different sets, specifically bps(x) ≠ bps(y), that is not
also found within a single set, because all objects of a separable set which
can make a qualifying pair with an object of the exclusion set are copied to
the exclusion set. In this way, the eD-Index speeds up the execution of
similarity self join queries with thresholds up to the predefined value ε,
because the join can be evaluated separately in each bucket.
Fig. 2. The difference between D-Index (a) and eD-Index (b).
eD-Index Insertion. It proceeds according to Algorithm A4. The algorithm is
similar to the one for the insertion into the D-Index (A1). The difference is
that in the eD-Index we use both the split functions $s_i^{m_i,\rho}$ and
$s_i^{m_i,\rho+\varepsilon}$, allowing us to replicate all the objects which
fall in the overlapping region of two consecutive levels.
eD-Index Similarity Self Join. The outline of the similarity self join
algorithm is as follows: execute the join query independently on every
separable bucket of every level of the eD-Index and additionally on the
exclusion bucket of the whole structure. This behaviour is correct due to the
overloading exclusion set principle, which copies every object of a separable
set which can make a qualifying pair with an object of the exclusion set to the
exclusion set. In this respect, the join query can be evaluated independently
in every bucket, and the results of the individual join subqueries form the
result of the given join query. Algorithm A5 describes this idea.
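The overloading principle and the bucket-wise join can be sketched together. The code below is a simplified illustration, not the authors' implementation: it assumes ball partitioning splits given as (pivot, d_m) pairs, distinct and orderable objects, a threshold µ no larger than both ε and 2ρ, and a Python set to suppress pairs that are found in more than one bucket because of replication.

```python
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple


def address(x, pivots, rho, d) -> int:
    """Bucket address of x under an order-n bps split; 2^n means excluded."""
    addr = 0
    for j, (p, dm) in enumerate(pivots):
        dist = d(x, p)
        if dist > dm + rho:
            addr |= 1 << j
        elif dist > dm - rho:
            return 2 ** len(pivots)
    return addr


def build_ed_index(data, levels, rho, eps, d) -> Tuple[List[Dict[int, list]], list]:
    """Insertion with overloading, following the structure of Algorithm A4."""
    buckets: List[Dict[int, list]] = [{} for _ in levels]
    exclusion: list = []
    for x in data:
        for i, pivots in enumerate(levels):
            n_sep = 2 ** len(pivots)
            a = address(x, pivots, rho, d)
            if a < n_sep:
                buckets[i].setdefault(a, []).append(x)
            if address(x, pivots, rho + eps, d) < n_sep:
                break               # far from the exclusion zone: no replication
        else:                       # x (or its copy) fell through all levels
            exclusion.append(x)
    return buckets, exclusion


def overloading_join(buckets, exclusion, mu, d) -> Set[Tuple]:
    """Join every bucket independently, as in Algorithm A5 (nested loops per bucket)."""
    groups = [b for level in buckets for b in level.values()] + [exclusion]
    pairs: Set[Tuple] = set()
    for group in groups:
        for x, y in combinations(group, 2):
            if d(x, y) <= mu:
                pairs.add((min(x, y), max(x, y)))
    return pairs


def abs_diff(a, b):
    """Illustrative metric (an assumption of this sketch)."""
    return abs(a - b)


data = list(range(200))
levels = [[(0, 100)], [(200, 100)]]
buckets, exclusion = build_ed_index(data, levels, 4, 2, abs_diff)
result = overloading_join(buckets, exclusion, 2, abs_diff)
```

In this toy example the object 95 lies within ε of the first level's exclusion zone, so it is stored in a first-level bucket and also passed on to the next level, which is the replication that makes the independent bucket joins complete.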
Fig. 3. The Sliding Window algorithm.
The procedure SimJoin($B_{i,j}$, µ) executes the similarity self join in one
bucket. It is possible to improve the algorithm by exploiting the pivots during
the SimJoin evaluation. This idea is based on the sliding window algorithm (see
Algorithm A6): objects of a bucket are ordered with respect to their distance
to a pivot p, which is the reference object of a ρ-split function used by the
eD-Index, and we define a sliding window of objects as $[o_{lo}, o_{up}]$. This
window is moved through the whole ordered set of the bucket in order to find
all qualifying pairs (see Figure 3).
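The sliding window idea can be sketched as follows. The function name and the 2-D toy data are assumptions; the sketch uses a single pivot and leaves out the additional PivotCheck filter of Algorithm A6.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sliding_window_join(bucket: Sequence[T], pivot: T,
                        d: Callable[[T, T], float], mu: float) -> List[Tuple[T, T]]:
    """Self join of one bucket using the ordering by distance to the pivot.

    By the triangle inequality, d(x, y) >= |d(x, p) - d(y, p)|, so objects
    whose pivot distances differ by more than mu can never form a pair.
    """
    keyed = sorted(((d(o, pivot), o) for o in bucket), key=lambda t: t[0])
    result = []
    lo = 0
    for up in range(1, len(keyed)):
        # shrink the window from the left until it satisfies the pivot bound
        while keyed[up][0] - keyed[lo][0] > mu:
            lo += 1
        for j in range(lo, up):
            if d(keyed[j][1], keyed[up][1]) <= mu:
                result.append((keyed[j][1], keyed[up][1]))
    return result


points = [(0, 0), (1, 0), (0, 1), (5, 5), (5, 6), (9, 9)]
pairs = sliding_window_join(points, (0, 0), math.dist, 1.0)
```

Because the objects are sorted by pivot distance, the lower window bound only ever moves forward, so the window maintenance itself costs no extra distance computations beyond the one per object needed for the ordering.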
3 Performance Evaluation and Comparison
We have implemented the D-Index and the eD-Index and conducted numerous
experiments to verify their properties on two metric data sets. The first data
set (STR) consisted of sentences from a Czech language corpus compared by the
edit distance measure, the so-called Levenshtein distance [5]. The most
frequent distance was around 100 and the longest distance was 500, equal to the
length of the longest sentence. The second data set (VEC) was composed of
45-dimensional vectors of color features extracted from images. Vectors were
compared by the quadratic form distance measure. The distance distribution of
this data set was practically a normal distribution, with the most frequent
distance equal to 4,100 and the maximum distance equal to 8,100.
3.1 D-Index Performance Evaluation
In all our experiments, the query objects are not chosen from the indexed data
sets, but they follow the same distance distribution. The search costs are mea-
sured in terms of distance computations. All presented cost values are averages
obtained by executing queries for 50 different query objects and constant search
selectivity, that is queries using the same search radius or the same number of
nearest neighbors. We have considered about 11,000 objects for each of our data
sets VEC and STR.
We have compared the performance of the D-Index with other index structures
under the same workload. In particular, we considered the M-tree [2] (see
footnote 3) and the sequential organization, SEQ.
Efficiency. The main objective of these experiments was to compare the simi-
larity search efficiency of the D-Index with the other organizations. The results
for the range search are shown in Figures 4a and 4b, while the performance for
the nearest neighbor search is presented in Figures 4c and 4d.
For all tested queries, i.e. retrieving subsets up to 20% of the database, the
M-tree and the D-Index always needed fewer distance computations than the
sequential scan. The D-Index certainly performed much better for the sentences
and also for the range search over vectors. However, the nearest neighbor
search on vectors needed practically the same number of distance computations.
Footnote 3: The software is available at http://www-db.deis.unibo.it/research/Mtree/
3.2 eD-Index Performance Evaluation
In order to demonstrate the suitability of the eD-Index to the problem of
similarity self join, we have compared several different approaches to the join
operation. The first one, the nested loops algorithm, uses the symmetric
property of metric distance functions for pruning some pairs. Its time
complexity is $O(\frac{N \cdot (N-1)}{2})$. A more sophisticated method, called
range query join, uses the D-Index structure. Specifically, we assume a data
set X ⊆ D organized by the D-Index and apply the search strategy as follows:
for ∀o ∈ X, perform range query(o,µ). Finally, the last compared method is the
overloading join algorithm, which is described in Section 2.4.
In all experiments, we have compared three different techniques for the problem
of the similarity self join: the nested loops (NL) algorithm, the range query
join (RJ), and the overloading join (OJ) algorithm applied on the eD-Index. As
the performance index we have used the speedup with respect to the naive
approach, i.e., $\frac{N \cdot (N-1)}{2}$ divided by the number of distance
computations of the examined algorithm, where N is the number of objects
stored.
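As a worked illustration of this performance index, read here as the ratio of the naive cost to the cost of the examined algorithm (so that values above 1 mean fewer distance computations than nested loops), the relation can be written as below. The symbol $n_{dc}$ for the number of distance computations and the speedup value of 400 are hypothetical and chosen only to show the arithmetic; N = 50,000 is the smallest data set size used in the scalability tests.

```latex
\[
  \text{speedup} = \frac{N(N-1)/2}{n_{dc}},
  \qquad
  N = 50{,}000 \;\Rightarrow\; \frac{N(N-1)}{2} = 1{,}249{,}975{,}000 .
\]
\[
  \text{A hypothetical speedup of } 400 \text{ would then correspond to }
  n_{dc} = \frac{1{,}249{,}975{,}000}{400} \approx 3.1 \times 10^{6}
  \text{ distance computations.}
\]
```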
Efficiency. The objective of this group of tests was to study the relationship
between the query size and the efficiency measured in terms of distance
computations. Figures 4e and 4f show the results of the experiments. They show
that OJ is more than twice as fast as RJ for small values of µ, which are
typical in the data cleaning area. On the vector data set, the OJ algorithm
performed even better, especially for small query radii.
Scalability. Scalability is probably the most important issue to investigate
considering the web-based dimension of data. In the elementary case, it is
necessary to study what happens with the performance of algorithms when the
size of a data set grows. We have experimentally investigated the behaviour of
the eD-Index on the text data set with sizes from 50,000 to 300,000 objects
(sentences). We have mainly concentrated on small queries, which are typical
for the data cleaning area. Figures 4g and 4h report the speedups of the RJ and
OJ algorithms, respectively.
In summary, the figures demonstrate that the speedup is very high and, for each
value of µ, remains constant as the data set size grows. This implies that the
similarity self join with the eD-Index, specifically the overloading join
algorithm, is also suitable for large and growing data sets.
4 Conclusions
In this paper we have presented a new access structure able to cope with range
queries, nearest neighbor queries, and similarity joins. In the performance
evaluation, we have concentrated on distance computations. However, the
presented structure is also suitable for disk storage and is not limited to
operating in main memory only. Our experiments, not reported in this paper,
exhibit very good performance in terms of disk block accesses. Compared to the
M-tree, it typically needs fewer distance computations and far fewer disk reads
to execute a query. The D-Index is also economical in space requirements. It
needs slightly more space than the sequential organization, but at least two
times less disk space compared to the M-tree.
Fig. 4. The experiments of the performance evaluation. Panels (a) to (h):
distance computations of D-Index, M-tree, and SEQ for range search and nearest
neighbor search on VEC and STR; speedup of RJ and OJ versus search radius on
STR and VEC; speedup scalability of RJ and OJ versus data set size (x 50,000)
for µ = 1, 2, 3.
We have extended D-Index to implement two similarity join algorithms and
we have performed numerous experiments to analyze their search properties and
suitability for the similarity join implementation.
The main advantage of these structures is that they can also perform similar
operations on other metric data. The challenge is to apply our access structure
for problems of similarity on XML structures, where metric indexes could be
applied for approximate matching of tree structures.
References
1. Edgar Chávez, José Luis Marroquín, and Gonzalo Navarro. Fixed queries array: A
fast and economical data structure for proximity searching. Multimedia Tools and
Applications (MTAP), 14(2):113–135, 2001.
2. Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An efficient access method
for similarity search in metric spaces. In Proc. of VLDB'97, pages 426–435. Morgan
Kaufmann, August 1997.
3. V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. Separable splits of metric data
sets. In 9th Italian Conf. on Database Systems (SEBD), pages 45–62. LCM Selecta
Group - Milano, 2001.
4. V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. D-index: Distance searching index
for metric data sets. Multimedia Tools and Applications, 21(1), 2003. To appear.
5. G. Navarro. A guided tour to approximate string matching. ACM Computing
Surveys, 33(1):31–88, 2001.
6. P.N. Yianilos. Excluded middle vantage point forests for nearest neighbor search.
In Sixth DIMACS Implementation Challenge: Nearest Neighbor Searches workshop,
January 1999.
A The algorithms
Algorithm A1 Insertion
for i = 1 to h
  if $\langle s_i^{m_i,\rho}(x) \rangle < 2^{m_i}$ then
    $x \to B_{i,\langle s_i^{m_i,\rho}(x) \rangle}$; exit;
  end if
end for
$x \to E_h$;
Algorithm A2 Range Search
for i = 1 to h
  if $\langle s_i^{m_i,\rho+r_q}(q) \rangle < 2^{m_i}$ then
    return all objects x such that $x \in Q \cap B_{i,\langle s_i^{m_i,\rho+r_q}(q) \rangle}$; exit;
  end if
  if $r_q \le \rho$ then (search radius up to ρ)
    if $\langle s_i^{m_i,\rho-r_q}(q) \rangle < 2^{m_i}$ then
      return all objects x such that $x \in Q \cap B_{i,\langle s_i^{m_i,\rho-r_q}(q) \rangle}$;
    end if
  else
    let $\{l_1, l_2, \ldots, l_k\} = G(s_i^{m_i,r_q-\rho}(q))$
    return all objects x such that $x \in Q \cap B_{i,l_1}$ or ... or $x \in Q \cap B_{i,l_k}$;
  end if
end for
return all objects x such that $x \in Q \cap E_h$;
Algorithm A3 Nearest Neighbor Search
$A = \emptyset$, $d_k = d^+$; (initialization)
for i = 1 to h (first, optimistic phase)
  $r = \min\{d_k, \rho\}$;
  if $\langle s_i^{m_i,\rho+r}(q) \rangle < 2^{m_i}$ then
    access bucket $B_{i,\langle s_i^{m_i,\rho+r}(q) \rangle}$; update A and $d_k$;
    if $d_k \le \rho$ then exit;
  else
    if $\langle s_i^{m_i,\rho-r}(q) \rangle < 2^{m_i}$ then
      access bucket $B_{i,\langle s_i^{m_i,0}(q) \rangle}$; update A and $d_k$;
    end if
  end if
end for
access bucket $E_h$; update A and $d_k$;
if $d_k > \rho$ (second phase, if needed)
  for i = 1 to h
    if $\langle s_i^{m_i,\rho+d_k}(q) \rangle < 2^{m_i}$ then
      access bucket $B_{i,\langle s_i^{m_i,\rho+d_k}(q) \rangle}$ if not already accessed; update A and $d_k$; exit;
    else
      let $\{b_1, b_2, \ldots, b_k\} = G(s_i^{m_i,d_k-\rho}(q))$
      access buckets $B_{i,b_1}, B_{i,b_2}, \ldots, B_{i,b_k}$ if not accessed;
      update A and $d_k$;
    end if
  end for
end if
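Algorithm A4 eD-Index Insertion
for i = 1 to h
  if $\langle s_i^{m_i,\rho+\varepsilon}(x) \rangle < 2^{m_i}$ then $x \to B_{i,\langle s_i^{m_i,\rho}(x) \rangle}$; exit;
  if $\langle s_i^{m_i,\rho}(x) \rangle < 2^{m_i}$ then $x \to B_{i,\langle s_i^{m_i,\rho}(x) \rangle}$;
end for
$x \to E_h$;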
Algorithm A5 eD-Index Similarity Self Join
for i = 1 to h
  for j = 0 to $2^{m_i} - 1$
    SimJoin($B_{i,j}$, µ);
  end for
end for
SimJoin($E_h$, µ);
Algorithm A6 Sliding Window
lo = 1
for up = 2 to n
  increment lo while $d(o_{up}, p) - d(o_{lo}, p) > \mu$
  for j = lo to up − 1
    if PivotCheck() = FALSE then
      if $d(o_j, o_{up}) \le \mu$ then add pair $(o_j, o_{up})$ to result
    end if
  end for
end for