Similarity Grid for Searching in Metric Spaces
Michal Batko¹, Claudio Gennaro², and Pavel Zezula¹
¹ Masaryk University, Brno, Czech Republic
² ISTI-CNR, Pisa, Italy
Abstract. Similarity search in metric spaces represents an important paradigm
for content-based retrieval in many applications. Existing centralized search
structures can speed up retrieval, but they do not scale up to large volumes of data
because the response time increases linearly with the size of the searched file.
The proposed GHT* index is a scalable and distributed structure. By exploiting
parallelism in a dynamic network of computers, the GHT* achieves practically
constant search time for similarity range queries in data-sets of arbitrary size. The
structure also scales well with respect to the growing volume of retrieved data.
Moreover, the small amount of replicated routing information on each server
increases only logarithmically. At the same time, the potential for interquery parallelism
increases with growing data-sets because the relative number of servers
utilized by individual queries decreases. All these properties are verified by
experiments on a prototype system using real-life data-sets.
1 Introduction

Search operations have traditionally been applied to structured (attribute-type) data.
Therefore, when a query is given, records exactly matching the query are returned.
Complex data types, such as images, videos, time series, text documents, and DNA
sequences, are becoming increasingly important in modern data processing applications.
A common type of searching in such applications is based on gradual rather
than exact relevance, so it is called similarity or content-based retrieval. Given a
query object q, this process involves finding objects in the database D that are similar
to q. It has become customary to model the similarity measure as a distance metric d,
so that d(o1, o2) becomes smaller as o1 and o2 become more similar. Formally, (D, d) can
be seen as a mathematical metric space. From an implementation point of view, the
distance function d is typically expensive to compute. The primary challenge in performing
similarity search for such data is to structure the database D in such a way that
the search can be performed fast, that is, with a small number of distance computations
needed to execute a query.
Although many metric index structures have been proposed, most of them are only
main-memory structures and thus not suitable for a large volume of data. The scalability
of two disk-oriented metric indexes (the M-tree and the D-index) has recently been
studied. The results demonstrate significant speed-up (both in terms of distance
computations and disk-page reads) in comparison with the sequential search.
Unfortunately, the search costs also increase linearly with the size of the data-set.
This means that when the data file grows, sooner or later the response time becomes
intolerable. Though the approximate versions of similarity search structures, e.g. the
approximate M-tree, improve the query execution a lot, they are also not scalable for
large volumes of data. Classical parallel implementations of such indexes have not been
very successful either.

C. Türker et al. (Eds.): P2P, Grid, and Service Orientation ..., LNCS 3664, pp. 25–44, 2005.
© Springer-Verlag Berlin Heidelberg 2005
On the other hand, it is estimated that 93% of data now produced is in a digital
form. The amount of data added each year exceeds one exabyte (i.e. 10^18 bytes) and it is
estimated to grow exponentially. In order to manage similarity search in multimedia
data types such as plain text, music, images, and video, this trend calls for putting
equally scalable infrastructures in motion. In this respect, the Grid infrastructures and
the Peer-to-Peer (P2P) communication paradigm are quickly gaining in popularity due
to their scalability and self-organizing nature, forming bases for building large-scale
similarity search indexes at low costs.
Most of the numerous P2P search techniques proposed in recent years have focused
on single-key retrieval. Since the retrieved records either exist (and then they
are retrieved) or they do not, there are no problems with query relevance or degree
of matching. The Content Addressable Network (CAN) provides a distributed hash
table abstraction over the Cartesian space. CAN allows efficient storage and retrieval
of (key, object) pairs, but the key is seen as a point in n-dimensional Cartesian space.
Such an architecture has recently been applied to develop a P2P structure to support
text retrieval in the Information Retrieval style. Since data partitioning coincides with
the multidimensional space partitioning, such a strategy cannot be applied to the generic
problem of metric data such as strings compared by the edit distance. We are not aware
of the existence of any distributed storage structure for similarity searching in metric
spaces.
Our objective is to develop a distributed storage structure for similarity search in
metric spaces that would scale up with (nearly) constant search time. In this respect,
our proposal can be seen as a Scalable and Distributed Data Structure (SDDS), which
uses the P2P paradigm for communication between computer nodes and assumes a
Grid-like infrastructure for dynamically adjusting computational resources. We achieve
the desired effects in a given (arbitrary) metric by linearly increasing the number of
network nodes (computers), where each of them can act as a client and some of them can
also be servers. Clients insert metric objects and issue queries, but there is no specific
(centralized) node to be accessed for all (insertion or search) operations. At the same
time, insertion of an object, even one causing a node split, does not require immediate
update propagation to all network nodes. A certain data replication is tolerated.
Each server provides some storage space for objects and servers also have the capacity
to compute distances between pairs of objects. A server can send objects to other peer
servers and can also allocate a new server.
The rest of the paper is organized as follows. In Sec. 2, we summarize the necessary
background information. Section 3 presents the GHT* distributed structure, its
functionality, and basic analytic properties. Section 4 reports the results of performance
evaluation experiments. Section 5 concludes the paper and outlines directions for future
work. Appendix A contains the key algorithms of the GHT*.
Probably the most famous scalable and distributed data structure is the LH*, which
is an extension of linear hashing (LH) for a dynamic network of computers enabling
exact-match queries. The paper introducing it also clearly defines the necessary properties
of SDDSs in terms of scalability, no hot-spots, and update independence. The advantage
of the Distributed Dynamic Hashing (DDH) over the LH* is that it can immediately
split overflowing buckets, whereas LH* always splits buckets in a predefined order.
The first tree-oriented structure is the Distributed Random Tree. Contrary to the
hash-based techniques, this tree structure keeps the search keys ordered. Therefore, it
is able to perform not only exact-match queries, but also range and nearest
neighbor queries on keys from a sortable domain, but not on objects from the generic
metric space.
2.1 Metric Space Searching Methods
The effectiveness of metric search structures [3,8] consists in their considerable
extensibility, i.e. the degree of ability to support execution of diverse types of queries. Metric
structures can support not only the exact-match and range queries on sortable domains,
but they are also able to perform similarity queries in the generic metric space. In this
respect, the important Euclidean vector spaces can be considered a special case.
The mathematical metric space is a pair (D, d), where D is the domain of objects
and d is the distance function able to compute distances between any pair of objects
from D. It is typically assumed that the smaller the distance, the closer or more similar
the objects are. For any distinct objects x, y, z ∈ D, the distance must satisfy the
following properties:
d(x,x) = 0                     reflexivity
d(x,y) > 0                     strict positiveness
d(x,y) = d(y,x)                symmetry
d(x,y) ≤ d(x,z) + d(z,y)       triangle inequality
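
The four properties can be checked mechanically on a small sample. The following Python sketch is illustrative only: the function names `edit_distance` and `satisfies_metric_postulates`, as well as the sample strings, are hypothetical and not part of the GHT* prototype. It uses the edit distance on strings, one of the metrics considered in this paper, and tests every triple of sample objects by brute force.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def satisfies_metric_postulates(objects, d) -> bool:
    """Brute-force check of the four postulates on a finite sample (not a proof)."""
    for x, y, z in product(objects, repeat=3):
        if d(x, x) != 0:                       # reflexivity
            return False
        if x != y and d(x, y) <= 0:            # strict positiveness
            return False
        if d(x, y) != d(y, x):                 # symmetry
            return False
        if d(x, y) > d(x, z) + d(z, y):        # triangle inequality
            return False
    return True


# Hypothetical sample data, chosen only for illustration.
sample = ["grid", "grind", "metric", "matrix", ""]
is_metric_on_sample = satisfies_metric_postulates(sample, edit_distance)
```

A check of this kind only shows that no violation occurs on the sample; it is useful mainly for catching a distance function that is not a metric before it is used for indexing.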
Though several sophisticated metric search structures have been proposed in the literature,
we build on the Generalized Hyperplane Tree (GHT), because it can easily be combined
with the locally autonomous splitting policy of the DDH. For the sake of clarity of the
further discussion, we describe the main features of the GHT in the following.
Generalized Hyperplane Tree – GHT. The GHT is a binary tree with metric objects
kept in leaf nodes (buckets) of fixed capacity. The internal nodes contain two pointers
to descendant nodes (sub-trees) represented by a pair of objects called the pivots. Pivots
represent routing information, which is used in bucket location algorithms. As a rule, all
objects closer to the first pivot are in the leaf nodes of the left sub-tree and the objects
closer to the second pivot are in the leaf nodes of the right sub-tree.
The creation of the GHT structure starts with the bucket B0. When the bucket B0
is full, we create a new empty bucket, say B1, and move some objects from B0 (one
half of them if possible) to B1. In this way, we gain some space in B0 (see Fig. 1 for
illustration). This idea of split is implemented by choosing a pair of pivots P1 and P2
(P1 ≠ P2) from B0 and by moving all the objects O that are closer to P2 than to
P1 into the bucket B1. The pivots P1 and P2 are then placed into a new root node
and the tree grows by one more level. This split algorithm is an autonomous operation,
which can be applied to any leaf node; no other tree nodes need to be modified. In
general, given an internal node i of the GHT structure with the pivots P1(i) and P2(i),
the objects that meet Condition (1) are stored in the right sub-tree. Otherwise, they are
found in the left sub-tree.

$$ d(P_1(i), O) > d(P_2(i), O) \qquad (1) $$

Fig. 1. Split of the bucket
To Insert a new object O, we first traverse the GHT to find the correct storage bucket.
In each inner node i, we test Condition (1): if it is true, we follow the right branch.
Otherwise, we follow the left one. This is repeated until a leaf node is found. Finally,
we insert O into the leaf bucket and, if necessary, the split is applied.
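
The insertion and split rules above can be sketched in a few lines of Python. This is a minimal single-machine illustration under stated assumptions, not the authors' implementation: the class names `Leaf`, `Inner`, and `GHT` are hypothetical, the split is triggered right after the insertion that overflows the bucket, and the pivots are simply taken to be the two most distant objects of the overflowing bucket (a brute-force stand-in for a real pivot selection method).

```python
class Leaf:
    """A bucket holding objects."""
    def __init__(self):
        self.objects = []


class Inner:
    """An inner node with two pivots and two sub-trees."""
    def __init__(self, p1, p2, left, right):
        self.p1, self.p2, self.left, self.right = p1, p2, left, right


class GHT:
    def __init__(self, distance, capacity):
        self.d = distance
        self.capacity = capacity
        self.root = Leaf()

    def insert(self, obj):
        parent, side, node = None, None, self.root
        while isinstance(node, Inner):
            # Condition (1): d(P1, O) > d(P2, O) means "follow the right branch".
            go_right = self.d(node.p1, obj) > self.d(node.p2, obj)
            parent, side = node, ("right" if go_right else "left")
            node = node.right if go_right else node.left
        node.objects.append(obj)
        if len(node.objects) > self.capacity:
            new_inner = self._split(node)
            if parent is None:
                self.root = new_inner
            else:
                setattr(parent, side, new_inner)

    def _split(self, leaf):
        objs = leaf.objects
        # Assumption: pick the two most distant objects as pivots (brute force).
        p1, p2 = max(((a, b) for i, a in enumerate(objs) for b in objs[i + 1:]),
                     key=lambda pair: self.d(*pair))
        right = Leaf()
        right.objects = [o for o in objs if self.d(p1, o) > self.d(p2, o)]
        leaf.objects = [o for o in objs if not self.d(p1, o) > self.d(p2, o)]
        # Only the split leaf is replaced; no other tree node is modified.
        return Inner(p1, p2, leaf, right)


# Usage with numbers and the absolute difference as a toy metric.
tree = GHT(distance=lambda a, b: abs(a - b), capacity=4)
for value in [1, 2, 3, 10, 11, 9]:
    tree.insert(value)
```

In this toy run the fifth insertion overflows the root bucket, the objects closer to the second pivot move to the new bucket, and the sixth object is routed by Condition (1) without touching the other bucket.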
In order to perform a similarity Range Search for the query object Q and the search
radius r, we recursively traverse the GHT, following the
left child of each inner node if Condition (2) is satisfied and the right child if Condition
(3) is true.

$$ d(P_1(i), Q) - r \le d(P_2(i), Q) + r \qquad (2) $$
$$ d(P_1(i), Q) + r > d(P_2(i), Q) - r \qquad (3) $$

Depending on the size of the radius r, Conditions (2) and (3) can be met simultaneously.
This means that both of the sub-trees can contain qualifying objects and therefore both
of them must be searched.
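
The pruning rule can be illustrated with a short recursive Python sketch. The tuple-based tree encoding, the function name `range_search`, and the sample data are assumptions made for this example only; a qualifying object is taken to be one with d(Q, O) ≤ r.

```python
def range_search(node, q, r, d):
    """Return all objects within distance r of q, visiting only qualifying sub-trees."""
    if node[0] == "leaf":
        return [o for o in node[1] if d(q, o) <= r]
    _, p1, p2, left, right = node
    d1, d2 = d(p1, q), d(p2, q)
    found = []
    if d1 - r <= d2 + r:      # Condition (2): the left sub-tree may qualify
        found += range_search(left, q, r, d)
    if d1 + r > d2 - r:       # Condition (3): the right sub-tree may qualify
        found += range_search(right, q, r, d)
    return found


# Toy tree over numbers with pivots 1 and 11 (absolute difference as metric).
dist = lambda a, b: abs(a - b)
tree = ("inner", 1, 11, ("leaf", [1, 2, 3]), ("leaf", [9, 10, 11]))

near_four = range_search(tree, 4, 1, dist)   # only the left bucket is visited
near_six = range_search(tree, 6, 3, dist)    # both buckets are visited
```

For a small radius around 4 only Condition (2) holds, so the right bucket is never read; for the query at 6 with radius 3 both conditions hold and both buckets are searched.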
In general, the scalable and distributed data structure GHT* consists of network nodes
that can insert, store, and retrieve objects using similarity queries. The nodes with all
these functions are called Servers and the nodes with only the insertion and query for-
mulation functions are called Clients. The GHT* architecture assumes that:
– Network nodes communicate through the message passing paradigm. For consistency
reasons, each request message expects a confirmation by a proper reply message.
– Each node of the network has a unique Network Node IDentifier (NNID).
– Each server maintains data objects in a set of buckets. Within a server, the Bucket
IDentifier (BID) is used to address a bucket.
– Each object is stored in exactly one bucket.
An essential part of the GHT* structure is the Address Search Tree (AST). In prin-
ciple, it is a structure similar to the GHT. In the GHT*, the AST is used to actually de-
termine the necessary (distributed) buckets when data objects are stored and retrieved.
3.1 The Address Search Tree
Contrary to the GHT, which contains data objects in leaves, every leaf of the AST
includes exactly one pointer to either a bucket (using BID) or a server (using NNID)
holding the data. Specifically, NNIDs are used if the data are on a remote server. BIDs
are used if the data are in a bucket on the local server. Since the clients do not maintain
data buckets, their ASTs contain only the NNID pointers in leaf nodes.
In order to avoid hot-spots caused by the existence of a centralized node accessed
by every request, a form of the AST structure is present in every network node. The AST
structures in individual network nodes may not be identical: with respect to the complete
tree view, some sub-trees may be missing. As we shall see in the next section, the GHT*
brings such incomplete ASTs up to date lazily, while insertion and search operations
are executed.

Fig. 2. Address Search Tree and the GHT* network
Figure 2 illustrates the AST structure in a network of one client and two servers.
The dashed arrows indicate the NNID pointers while the solid arrows represent the
BID pointers. In the following part, we describe the basic operations of the GHT*.
Specification of the most important algorithms can be found in Appendix A.
Insert. Insertion of an object starts in the node asking for insertion by traversing its
AST from the root to a leaf using Condition (1). If a BID pointer is found, the inserted
object is stored in this bucket. Otherwise, the found NNID pointer is applied to forward
the request to the proper server where the insertion continues recursively until an AST
leaf with the BID pointer is reached. If a client starts the insertion, the AST traversal
always terminates in a leaf node with an NNID pointer since clients do not maintain
buckets.
In order to avoid repeated distance computations when searching the AST on the
new server, a once-determined path specification in the original AST is also forwarded.
The path sent to the server is encoded as a bit-string called BPATH, where each node is
represented by one bit: “0” represents the left branch, “1” represents the right branch.
Due to the construction of the GHT*, it is guaranteed that the forwarded path always
exists on the target server.
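
The role of the BPATH can be sketched as follows. The dictionary-based AST encoding, the helper names (`leaf`, `inner`, `locate`, `follow`), and the server names are hypothetical and serve only to illustrate the idea: the sender records one bit per traversed inner node, and the receiver uses the bits to skip the part of its own AST that has already been decided.

```python
def leaf(pointer):
    return {"kind": "leaf", "pointer": pointer}


def inner(p1, p2, left, right):
    return {"kind": "inner", "p1": p1, "p2": p2, "left": left, "right": right}


def locate(ast, obj, d):
    """Traverse the AST with Condition (1); return (BPATH, pointer of the reached leaf)."""
    bpath, node = "", ast
    while node["kind"] == "inner":
        go_right = d(node["p1"], obj) > d(node["p2"], obj)
        bpath += "1" if go_right else "0"
        node = node["right"] if go_right else node["left"]
    return bpath, node["pointer"]


def follow(ast, bpath):
    """Descend along a received BPATH without computing any distance."""
    node = ast
    for bit in bpath:
        node = node["right"] if bit == "1" else node["left"]
    return node


dist = lambda a, b: abs(a - b)

# The client only knows NNID pointers; server2 has already split its bucket.
client_ast = inner(1, 11, leaf(("NNID", "server1")), leaf(("NNID", "server2")))
server2_ast = inner(1, 11,
                    leaf(("NNID", "server1")),
                    inner(9, 20, leaf(("BID", 0)), leaf(("BID", 1))))

# The client routes object 18 and forwards the request together with the BPATH.
sent_bpath, target = locate(client_ast, 18, dist)

# The server resumes below the forwarded path instead of starting at the root.
resume_node = follow(server2_ast, sent_bpath)
rest_bpath, bucket = locate(resume_node, 18, dist)
full_bpath = sent_bpath + rest_bpath
```

In this example the distances to the root pivots are computed once, on the client; the server only evaluates the pivots of the sub-tree that the client did not know about.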
Range Search. By analogy to insertion, the range search also starts by traversing the
local AST, but as Sec. 2.1 explains, multiple paths can qualify. For all qualifying paths
having an NNID pointer in their leaves, the request is recursively forwarded (including
the known BPATH) to the identified servers until a BID pointer occurs in every leaf. If multiple
paths point to the same server, the request is sent only once but with multiple BPATH
attachments. The range search condition is evaluated by the servers in every bucket
determined by the BID pointers.
An important advantage of the GHT* structure is the update independence. During
object insertion, a server can split an overflowing bucket without informing the other
nodes of the network. Consequently, the network nodes need not have their ASTs up to
date with respect to the data, but the advantage is that the network is not flooded with
multiple messages at every update. The updates of the ASTs are thus postponed and
actually done when respective insertion or range search operations are executed.
The inconsistency in the ASTs is recognized on a server that receives an operation
request with corresponding BPATH from another client or server. In fact, if the BPATH
derived from the AST of the current server is longer than the received BPATH, this
indicates that the sending server (client) has an out-of-date version of the AST and
must be updated. The current server easily determines a sub-tree that is missing on
the sending server (client) because the root of this sub-tree is the last element of the
received BPATH. Such a sub-tree is sent back to the server (client) through the Image
Adjustment Message, IAM.
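
The detection rule can be sketched in Python as follows. All names (`build_iam`, `apply_iam`, `as_seen_remotely`, the dictionary-based AST encoding, and the message layout) are assumptions made for illustration; in particular, rewriting local BID pointers into NNID pointers of the responding server is our reading of how a sub-tree must look from the outside, not a detail stated in the text.

```python
def leaf(pointer):
    return {"kind": "leaf", "pointer": pointer}


def inner(p1, p2, left, right):
    return {"kind": "inner", "p1": p1, "p2": p2, "left": left, "right": right}


def walk(ast, bpath):
    node = ast
    for bit in bpath:
        node = node["right"] if bit == "1" else node["left"]
    return node


def as_seen_remotely(node, nnid):
    """Copy a sub-tree, replacing local BID pointers by the server's NNID (assumption)."""
    if node["kind"] == "leaf":
        kind, _ = node["pointer"]
        return leaf(("NNID", nnid)) if kind == "BID" else leaf(node["pointer"])
    return inner(node["p1"], node["p2"],
                 as_seen_remotely(node["left"], nnid),
                 as_seen_remotely(node["right"], nnid))


def build_iam(server_ast, received_bpath, server_nnid):
    """Return an IAM if the local path is longer than the received BPATH, else None."""
    node = walk(server_ast, received_bpath)
    if node["kind"] == "leaf":
        return None                      # the sender's view is up to date
    # The missing sub-tree is rooted at the last element of the received BPATH.
    return {"bpath": received_bpath,
            "subtree": as_seen_remotely(node, server_nnid)}


def apply_iam(sender_ast, iam):
    """Graft the received sub-tree into the sender's AST at the BPATH position."""
    if iam["bpath"] == "":
        return iam["subtree"]
    parent = walk(sender_ast, iam["bpath"][:-1])
    side = "right" if iam["bpath"][-1] == "1" else "left"
    parent[side] = iam["subtree"]
    return sender_ast


client_ast = inner(1, 11, leaf(("NNID", "server1")), leaf(("NNID", "server2")))
server2_ast = inner(1, 11,
                    leaf(("NNID", "server1")),
                    inner(9, 20, leaf(("BID", 0)), leaf(("BID", 1))))

iam = build_iam(server2_ast, "1", "server2")   # the client sent BPATH "1"
client_ast = apply_iam(client_ast, iam)
```

After the adjustment the client knows the pivots of the split performed on the server, so its next request carries a longer BPATH and triggers no further IAM for this path.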
If multiple BPATHs are received by the current server (which can occur in case of
range queries), more sub-trees are sent back through one IAM (provided inconsistencies
are found). If a server finds an NNID in its AST leaf during the path expansion, the
request must be forwarded to the found server. This server can also detect an
inconsistency and respond with an IAM. This image adjustment message updates the ASTs
of all the previous servers, including the first server (client) starting the operation.
This is a recursive procedure which guarantees that, for an insert or a search operation,
every involved server (client) is correctly updated.
3.3 Logarithmic Replication Strategy
Using the described IAM mechanism, the GHT* structure maintains the ASTs practi-
cally equal on all servers. However, every inner node of the AST contains two pivots.
The number of replicated pivots increases linearly with the number of servers used. In
order to reduce the replication, we have also implemented a much more economical
strategy which achieves logarithmic replication on servers at the cost of moderately in-
creased number of forwarded requests. The image adjustment for clients is performed
as described in Sec. 3.2.
Fig. 3. Example of the logarithmic AST

Inspired by a previously proposed lazy updates strategy, our logarithmic replication scheme
uses a slightly modified AST containing only the necessary number of inner nodes.
More precisely, the AST on a specific server stores only the nodes containing pointers
to local buckets (i.e. leaf nodes with BID pointers) and all their ancestors. However, the
resulting AST is still a binary tree where all the sub-trees leading exclusively to leaf
nodes with the NNID pointers are substituted by the leftmost leaf node of this sub-tree.
The leftmost leaf is chosen because the split strategy
always keeps the left node and adds the right one. Figure 3 illustrates this principle. In
a way, the logarithmic AST can be seen as the minimum sub-tree of the fully updated
AST. The search operation with the logarithmic replication scheme may require more
forwarding (compared to the full replication scheme), but the replication is significantly
reduced.
3.4 Storage Management
As we have already explained, the atomic storage unit of the GHT* is a bucket. The
number of buckets and their capacity on a server are bounded by specified constants,
which can be different for different servers. Since the bucket identifiers are
only unique within a server, a bucket is addressed by a pair (NNID, BID). To achieve
scalability, the GHT* must be able to split buckets and to allocate new servers.
Bucket Splitting. The bucket splitting operation is triggered by an insertion of an
object into an already-full bucket. The procedure is performed in the following three
steps:
1. A new bucket is allocated. If there is capacity on the current server, the bucket is
activated there. Otherwise, the bucket is allocated either on another existing server
with free capacity or a new server is used (see Sec. 3.4).
2. A pair of pivots is chosen from the objects of the overflowing bucket (see Sec. 3.4).
3. Objects from the overflowing bucket that are closer to the second pivot than to the
first one are moved to the new bucket.
Choosing Pivots. A specific choice of pivots directly affects the performance of the
GHT* structure. However, the selection can be a time-consuming operation typically
requiring many distance computations. To make this process smooth, we use a previously
proposed incremental pivot selection algorithm. It is based on the hypothesis that
the GHT structure performs better if the distance between the pivots is high.
At the beginning, the first two objects inserted into an empty bucket become the
candidates for pivots. Then, for every additionally inserted object, we compute its
distances to the current candidates. If at least one of these distances is greater than the
distance between the current candidates, the new object replaces one of the candidates so
that the distance between the new pair of candidates grows. After a sufficient number of
insertions is reached, it is guaranteed that the distance between the candidates, with
respect to the bucket data-set, is large. When the bucket overflows, the candidates become
pivots and the split is executed.
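
A minimal Python sketch of this incremental selection follows. The class name `PivotCandidates` and its method `observe` are hypothetical, and one detail is an assumption: when both distances exceed the current candidate distance, the sketch keeps the candidate that is farther from the new object, which is one way of making the distance between the pair grow.

```python
class PivotCandidates:
    """Keeps a pair of far-apart pivot candidates at two distance computations per insert."""

    def __init__(self, distance):
        self.d = distance
        self.p1 = None
        self.p2 = None
        self.span = 0.0   # distance between the current candidates

    def observe(self, obj):
        # The first two inserted objects become the initial candidates.
        if self.p1 is None:
            self.p1 = obj
            return
        if self.p2 is None:
            self.p2 = obj
            self.span = self.d(self.p1, obj)
            return
        d1 = self.d(self.p1, obj)
        d2 = self.d(self.p2, obj)
        if max(d1, d2) > self.span:
            if d1 >= d2:
                # obj is farther from p1: keep p1 and replace p2 (assumed tie-break).
                self.p2, self.span = obj, d1
            else:
                # obj is farther from p2: keep p2 and replace p1.
                self.p1, self.span = obj, d2


candidates = PivotCandidates(lambda a, b: abs(a - b))
for value in [5, 6, 2, 4, 10]:
    candidates.observe(value)
```

The appeal of this scheme is its cost: each insertion adds only two distance computations, and no extra pass over the bucket is needed when the split is finally triggered.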
New Server Allocation. The GHT* scales up to processing a large volume of data
by utilizing more and more servers. In principle, such an extension can be solved in
several ways. In the Grid infrastructure, for example, new servers are added by standard
commands. In our prototype implementation, we use a pool of available servers which
is known to every active server. We do not use a centralized registering service. Instead,
we exploit broadcast messaging to notify the active servers. When a new network
node becomes available, the following actions occur:
1. The new node with its NNID sends a broadcast message saying “I am here”. This
message is received by each active server in the network.
2. The receiving servers add the announced NNID to their local pool of available
servers.
Additional data and computational resources required by an active server are extended
in the following way:
1. The active server picks one item from the pool of available servers. An activation
message is sent to the chosen server.
2. With another broadcast message, the chosen server announces “I am being used
now” so that other active servers can remove its NNID from their pools of available
servers.
3. The chosen server initializes its own pool of available servers, creates a copy of the
AST, and sends the “Ready to serve” reply message to the caller.
In this section, we provide analytic formulas concerning the bucket load and the level
of replication. The validity of the proposed formulas is verified by experiments in Sec. 4.
The necessary symbols are given in the following table:

Symbol  Description
S       total number of active servers
N       size of the data-set
nb      maximal number of buckets per server
bs      maximal bucket capacity in number of objects
TL      length of the average path in the AST in number of nodes involved

The load factor L represents a relative utilization of active buckets. We define the
load factor as the total number of objects stored in the GHT* structure divided by the
capacity of storage available on all servers, thus:

$$ L = \frac{N}{b_s \cdot n_b \cdot S} \qquad (4) $$
A peculiar feature of the GHT* structure is the replication factor R: the objects used
as routing keys (pivots) are replicated in the copies of the ASTs. We define the factor
R as the ratio of the number of pivots that are replicated among the servers to the total
number of stored objects. Using Expression (4), the total number of stored objects can
be determined as N = L · bs · nb · S.
The worst replication factor occurs when all the servers have fully updated
ASTs. If at most nb buckets are maintained on each server, the AST must have
S · nb leaf nodes and S · nb − 1 inner nodes with 2(S · nb − 1) pivots. Consequently,
(S − 1) · 2(S · nb − 1) objects are replicated (S − 1 because a version of each pivot is
considered to be original on one server). Thus, the replication factor is given by

$$ R = \frac{2(S - 1)(S \cdot n_b - 1)}{S \cdot n_b \cdot b_s \cdot L} \cong \frac{2 \cdot S}{b_s \cdot L} \qquad (5) $$

Obviously, such a replication factor grows linearly with the number of servers S.
With the logarithmic replication scheme (see Sec. 3.3), we do not replicate all the
inner nodes of the full AST, but only those nodes that are on the paths ending in leaf
nodes with the BID pointers. Corresponding to a balanced AST, the average length of
such a path is TL = log2(S · nb), thus the number of pivots on that path is 2(TL − 1),
since the leaf nodes do not contain pivots. Since a server can store up to nb buckets,
a server AST can have at most nb paths from the root to a leaf node with the BID pointer.
In such a situation, the number of pivots for all buckets of a server is bounded by
nb · 2(TL − 1), because we neglect the common beginnings of the paths (such as the root
that is present in all the paths). Then we can express the average replication factor as

$$ R = \frac{(S - 1) \cdot n_b \cdot 2(T_L - 1)}{S \cdot n_b \cdot b_s \cdot L} \cong \frac{2 \cdot \log_2(S \cdot n_b)}{b_s \cdot L} \qquad (6) $$

Naturally, the factor is logarithmic with the number of servers S. Though our GHT*
does not guarantee a balanced AST, our experiments in Sec. 4 reveal that the replication
factor also grows in a logarithmic way when organizing real-life data-sets.
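
The formulas for the load factor and for the two replication factors can be evaluated directly. The Python sketch below is a plain transcription of the exact (non-approximated) expressions; the function names and the numeric parameters are hypothetical illustrations, not measured values from the experiments.

```python
import math


def load_factor(N, S, nb, bs):
    """Load factor: stored objects divided by the total bucket capacity."""
    return N / (bs * nb * S)


def full_replication(S, nb, bs, L):
    """Replication factor when every server holds the fully updated AST."""
    return 2 * (S - 1) * (S * nb - 1) / (S * nb * bs * L)


def log_replication(S, nb, bs, L):
    """Replication factor of the logarithmic scheme, assuming a balanced AST."""
    TL = math.log2(S * nb)               # average path length in the AST
    return (S - 1) * nb * 2 * (TL - 1) / (S * nb * bs * L)


# Hypothetical setting: 100,000 objects on 29 servers, 5 buckets of 2,000 objects each.
N, S, nb, bs = 100_000, 29, 5, 2_000
L = load_factor(N, S, nb, bs)
r_full = full_replication(S, nb, bs, L)
r_log = log_replication(S, nb, bs, L)

# Doubling the number of servers at the same load factor.
growth_full = full_replication(2 * S, nb, bs, L) / r_full
growth_log = log_replication(2 * S, nb, bs, L) / r_log
```

Because S · nb · bs · L equals N, both denominators are simply the number of stored objects. Doubling the number of servers at a fixed load factor roughly doubles the full replication factor, whereas the logarithmic one grows only by a small fraction, in line with the linear and logarithmic trends stated above.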
In this section, we present results of performance experiments that assess different as-
pects of our GHT* prototype implemented in Java. To isolate the GHT* algorithms
from the network and storage bucket implementation details, we use a layered architec-
ture – two lowest layers implement the buckets and the network, the GHT* algorithms
represent the middle layer, and the top layer forms the servers (clients) available to end
users as an application.
Specifically, the bucket layer implements a particular storage strategy for buckets. It
is responsible for allocation of buckets and it also maintains chosen limits (the maximal
number of buckets per server and the maximal capacity of each bucket on a server).
Moreover, the bucket layer implements the necessary strategy for choosing pivots.
The network layer is responsible for message delivery. Our prototype uses the stan-
dard TCP/IP network with multicast capability to report new servers. In general, we
use the UDP protocol for sending messages. For transferring large volumes of data (for
instance after a bucket split), the TCP connection is used.
The GHT* algorithms exploit the two lower layers and provide methods for the in-
sertion of objects and the similarity search retrieval used in server (client) applications.
We conducted our experiments on two real-life data-sets. First, we have used a data-set
of 45-dimensional vectors of color image features with the L2 (Euclidean) metric
distance function (VEC). Practically, the data in this set have a normal distribution of
distances and every object has the same size. The second data-set is formed by sentences
of the Czech national corpus with the edit distance function as the metric (TXT).
The distribution of distances in this data-set is rather skewed: most of the distances are
within a very small range of values. Moreover, the size of sentences varies significantly.
There are sentences with only a few words, but also quite long sentences with tens of
words.
We have focused on three different properties of the GHT*: insertion of objects
and its consequences on the load and replication factors (Sec. 4.1), global range search
performance (Sec. 4.2), and parallel aspects of query execution (Sec. 4.3).
4.1 Load and Replication Factors
In order to assess the properties of the GHT* storage, we have inserted 100,000 objects
using the following four different allocation configurations:
bs 4,000 1,000 2,000 1,000
We used a network of 100 PCs connected by a high-speed local network (100 Mbps).
These were gradually activated on demand by the split functions. In Fig. 4, we report
load factors for the increasing size of the TXT data-set only, because the experiments
for the VEC data-set exhibit similar behavior. Observe that the load factor increases
until a bucket split occurs, then it goes sharply down. Naturally, the effects of a new
bucket activation become less significant as the size of the data-set and the number of
buckets grow.
In sufficiently large data-sets, the load factor was around 35% for the TXT and
53% for the VEC data-set. The lower load for the TXT data is mainly attributed to the
variable size of the TXT sentences – each vector in the VEC data-set was of the same
length. In general, the overall load factor depends on the pivot selection method and
on the distribution of objects in the metric space. It can also be observed from Fig. 4
that the server capacity settings, i.e. the number of buckets and the bucket size, do not
significantly affect the overall bucket load.
Fig. 4. Load factor for the TXT data-set
The second set of experiments evaluates the replication factor of the GHT* separately
for the logarithmic and the full replication strategies. Figures 5a and 5b show
the replication factors for increasing data-set sizes using different allocation configurations.
We again report only results for the TXT data-set because the experiments with
the VEC data-set did not reveal any different dependencies.
Figure 5a concerns the full replication strategy. Though the observed replication
is quite low, it grows in principle linearly with the increasing data-set size, because
the complete AST is replicated on each computer node. In particular, the replication factor
rises whenever a new server is activated (after splitting) and then goes down slowly
as new objects are inserted until another server is activated. The increasing number of
used (active) servers can be seen in Fig. 6.
M. Batko, C. Gennaro, and P. Zezula
Fig. 5. Replication factor for the TXT data-set: (a) full replication scheme, (b) logarithmic replication scheme
Figure 5b reflects the same experiments, but using the logarithmic replication
scheme. The replication factor is more than 10 times lower than it is for the full replication
strategy. Moreover, we can see the logarithmic tendency of the graphs for all the
allocation configurations. Using the same allocation configuration (for example,
nb = 5, bs = 2,000), the replication factor after 100,000 inserted objects is 0.118 for the
full replication scheme and only 0.005 for the logarithmic replication scheme. In this
case, the replication using the logarithmic scheme is more than twenty times lower than
for the full replication scheme.
Fig. 6. Number of servers used to store the TXT data-set
Though the trends of the graphs are very similar, the specific instances depend on
the applied allocation configuration. In particular, the more objects stored on one server,
the lower the replication factor. For example, the configuration nb = 5, bs = 1,000 (i.e.
maximally 5,000 objects per server) yields a higher replication factor than the configuration
nb = 5, bs = 2,000 with maximally 10,000 objects per server. Moreover, we can see
that the configuration with one bucket per server is significantly worse than the similar
setting with 5 buckets per server and 1,000 objects in one bucket. On the other hand,
the other two settings with 10,000 objects per server (either as 10 buckets with 1,000
objects or 5 buckets with 2,000 objects) achieve almost the same replication.
Using the graph in Fig. 5a, it is possible to verify that the replication factor for the
full scheme is actually bounded by Equation (5). For example, if nb = 10, bs = 1,000,
and the TXT data-set of N = 100,000 objects is used, we can read from Fig. 6
that S = 36 and from Fig. 4 that L = 0.37. Therefore, the replication factor calculated
using Equation (5) is
\[ \frac{2 \cdot S}{bs \cdot L} = \frac{2 \cdot 36}{1{,}000 \cdot 0.37} = 0.19 \]
and the graph in Fig. 5a shows the value of 0.184. All the other cases can be easily
verified by analogy.
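The arithmetic of this check can be reproduced with a short Python sketch. The function name and the expression 2·S/(bs·L) are assumptions read off from the worked numbers above; they are not a verbatim copy of Equation (5).

```python
def full_replication_bound(servers: int, bucket_size: int, load: float) -> float:
    """Assumed form of the bound for the full scheme: 2*S / (bs * L)."""
    return 2 * servers / (bucket_size * load)


# Values read from Fig. 6 (S = 36) and Fig. 4 (L = 0.37) for bs = 1,000.
bound = full_replication_bound(servers=36, bucket_size=1000, load=0.37)
observed = 0.184  # value read from Fig. 5a for the same configuration

print(f"bound = {bound:.2f}, observed = {observed}")  # bound = 0.19, observed = 0.184
```

Other configurations can be checked the same way by substituting the values of S and L read from the corresponding curves.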
The graph for the logarithmic scheme in Fig. 5b shows the value of 0.0055 for the
same configuration. If we use Equation (6), we get the following upper-bound of the
R∼=2 · log2(S · nb)
=2 · log2(36 · 10)
Due to the very pessimistic assumptions of the formula, the real replication is well
below the analytic boundary.
4.2 Range Search Performance
In these experiments, we have analyzed the performance of the range searches with
respect to the different sizes of query radii. We have measured the costs of the range
search in terms of: (1) the number of servers involved in execution of a query, (2) the
number of buckets accessed, (3) the number of distance computations in the AST and in
all the buckets accessed. We have not used the query execution time as the performance
criterion, because we could not ensure the same environment for all participating workstations,
on which other processes might be running at the same time. Consequently, we
were unable to maintain the deterministic behavior of the computer network.
Experiments in this section were computed using the GHT* structure with configuration
nb = 10, bs = 1,000, which was filled with 100,000 objects either from the VEC
or the TXT data-sets. We have used the logarithmic replication strategy. Each point of
every graph in this section was obtained by averaging results of 50 range queries with
the same radius and a different (randomly chosen) query object.
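The averaging procedure can be illustrated with a small Python sketch. It is only a stand-in under stated assumptions: the data-set is a toy list of integers under absolute difference rather than VEC or TXT, queries are answered by a linear scan instead of the GHT*, query objects are drawn from the data-set itself, and the names `range_query` and `average_result_size` are hypothetical.

```python
import random
from statistics import mean


def range_query(data, dist, q, radius):
    """Linear-scan range query: all objects within `radius` of `q`."""
    return [o for o in data if dist(q, o) <= radius]


def average_result_size(data, dist, radius, n_queries=50, seed=0):
    """Average result-set size over `n_queries` randomly chosen query objects."""
    rng = random.Random(seed)  # fixed seed so every radius sees the same queries
    queries = [rng.choice(data) for _ in range(n_queries)]
    return mean(len(range_query(data, dist, q, radius)) for q in queries)


# Toy metric space (assumption): integers with absolute difference as distance.
data = list(range(1000))
dist = lambda a, b: abs(a - b)

for radius in (5, 10, 20):
    print(radius, average_result_size(data, dist, radius))
```

The same loop could record the number of servers or buckets accessed instead of the result-set size, given an index that exposes those counters.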
In the first experiment, we have focused on relationships between query radius sizes
iments together with the number of objects retrieved. If the radius increases, the number
a bit faster. However, the number of retrieved objects satisfying the query grows even
exponentially. This is in accordance with the behavior of centralized metric indexes
such as the M-tree or the D-index on the global (not distributed) scale. The main advantages
of the GHT* structure are demonstrated in Sec. 4.3 when the parallel execution
costs are considered.
Another important aspect of a distributed structure is the number of messages exchanged
during search operations. Figure 8 presents the average number of messages
Fig. 7. Average number of buckets, servers, and retrieved objects (divided by 100) as a function
of the radius
sent by a client and servers involved in a range search as a function of the query radii.
We have measured the number of messages sent by servers (i.e. forwarded messages)
and the number of messages sent by a client (these values are divided by 10) separately. In
fact, the total number of messages is strictly related to the number of servers accessed.
We have observed that even with the logarithmic replication strategy the number of
forwardings is below 10% of the total number of messages sent during the query execution.
Fig. 8. Average number of messages sent by a client (divided by 10) and server as a function of
the radius
In Fig. 9, we show the average number of distance computations performed by a
client and the necessary servers during a query execution. We only focus on distance
computations that are needed during the traversal of the AST (two distance computations
must be evaluated per inner node traversed). We do not report the computations
inside the accessed buckets, because they are specific to the bucket implementation
strategy. In our current implementation, the buckets are organized in a dynamic list
Fig. 9. Average number of distance computations in the AST for clients (divided by 10) and
servers as a function of the radius
so the number of distance computations per bucket is simply given by the number of
objects stored in the bucket.
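The cost accounting described above can be sketched in Python. This is a minimal illustration of a generalized-hyperplane tree traversal under stated assumptions, not the GHT* implementation: the class names `Inner` and `Bucket`, the `stats` counter, and the one-dimensional toy data are hypothetical, and the pruning rule is the standard generalized-hyperplane condition for range queries.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Bucket:
    objects: List[float] = field(default_factory=list)  # plain list bucket


@dataclass
class Inner:
    p1: float      # first pivot
    p2: float      # second pivot
    left: "Node"   # subtree holding objects closer to p1
    right: "Node"  # subtree holding objects closer to p2


Node = Union[Bucket, Inner]


def range_search(node, q, r, dist, stats):
    """Return objects within r of q, counting distance computations in stats."""
    if isinstance(node, Bucket):
        stats["bucket"] += len(node.objects)  # one distance per stored object
        return [o for o in node.objects if dist(q, o) <= r]
    d1, d2 = dist(q, node.p1), dist(q, node.p2)
    stats["tree"] += 2  # two distance computations per inner node traversed
    result = []
    if d1 - r <= d2 + r:  # query ball may reach the p1 side
        result += range_search(node.left, q, r, dist, stats)
    if d1 + r >= d2 - r:  # query ball may reach the p2 side
        result += range_search(node.right, q, r, dist, stats)
    return result


dist = lambda a, b: abs(a - b)
tree = Inner(p1=0.0, p2=10.0,
             left=Bucket([1.0, 2.0, 4.0]),
             right=Bucket([6.0, 8.0, 9.0]))

stats = {"tree": 0, "bucket": 0}
print(range_search(tree, 2.0, 1.0, dist, stats))  # [1.0, 2.0]
print(stats)                                      # {'tree': 2, 'bucket': 3}
```

In this toy run only the left bucket is visited, so the bucket cost equals the three objects stored there, while the tree cost is the two pivot distances of the single inner node.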
4.3 Parallel Performance Scalability
The most important advantage of the GHT* with respect to the single-site access structures
concerns its scalability through parallelism. As the size of a data-set grows, new
server nodes are plugged in and their storage as well as the computing capacities are
exploited. In the following, we experimentally study different forms of the intraquery
and interquery parallelism to show how they actually contribute to the GHT* scalability.
First, we consider the scalability from the perspective of a growing data-set and a
fixed set of queries, i.e. the data volume scalability. Then, we consider a constant data-set
and study how the structure scales up with the growing search radii, i.e. the query
It is determined as the maximum of the costs incurred on servers involved in the query
computations (both in the AST and in all the accessed buckets) as the computational
costs of a query execution. In our experiments, we have neglected the communication
cost and we have assumed that all active servers start evaluating the query at the same
time. This is reasonable, because the distance computations are much more time-consuming
than sending messages throughout the network.
The interquery parallelism is more difficult to quantify. For simplicity, we characterize
the interquery parallelism as the ratio of the number of servers involved in a
range query to the total number of servers. In this way, we assume that the lower the
ratio, the higher the chances for other queries to be executed in parallel. Naturally, such
an assumption is only valid if each server is used with equal probability.
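Both measures are simple to compute once the per-server costs of a query are known. The following Python sketch assumes a hypothetical mapping from server identifiers to distance-computation counts; the function names and the numbers are illustrative only.

```python
def intraquery_cost(costs_per_server):
    """Parallel cost of one query: the maximum cost over the servers involved."""
    return max(costs_per_server.values())


def interquery_ratio(costs_per_server, total_servers):
    """Fraction of all servers that one query occupies (lower is better)."""
    return len(costs_per_server) / total_servers


# Hypothetical distance-computation counts on the servers touched by one query.
costs = {"s3": 420, "s7": 910, "s12": 655}

print(intraquery_cost(costs))                 # 910
print(f"{interquery_ratio(costs, 36):.3f}")   # 0.083
```

Taking the maximum reflects the assumption stated above that all involved servers start at the same time and that communication cost is neglected.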
In summary, the intraquery parallelism is proportional to the response time of a
query and the interquery parallelism represents the relative utilization of computing