Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small
World graphs
Yu. A. Malkov1, D. A. Yashunin
1. Federal state budgetary institution of science Institute of Applied Physics of the Russian
Academy of Sciences, 46 Ul'yanov Street, 603950 Nizhny Novgorod, Russia
Abstract
We present a new algorithm for the approximate K-nearest neighbor search based on navigable
small world graphs with controllable hierarchy (Hierarchical NSW). The proposed approach is
fully graph-based, without any need for the additional search structures typically used at the
coarse search stage of most proximity graph techniques. Hierarchical NSW incrementally
builds a multi-layer structure consisting of a hierarchical set of proximity graphs (layers) for
nested subsets of the stored elements. The maximum layer in which an element is present is
selected randomly with an exponentially decaying probability distribution. This allows producing
graphs similar to the previously studied Navigable Small World (NSW) structures while
additionally having the links separated by their characteristic distance scales. Starting the search
from the upper layer, together with utilizing the scale separation, boosts the performance
compared to the NSW and allows a logarithmic complexity scaling. Additional employment of a
heuristic for selecting proximity graph neighbors significantly increases performance at high
recall and in the case of highly clustered data. Performance evaluation has demonstrated that the
proposed general metric space method is able to strongly outperform many previous state-of-the-art
vector-only approaches. The similarity of the algorithm to the skip list structure allows a
straightforward balanced distributed implementation.
Introduction
The constantly growing amount of available information resources has led to a high demand for
scalable and efficient similarity search data structures. One of the generally used approaches
to information search is the K-Nearest Neighbor Search (K-NNS). The K-NNS assumes that a
distance between the data elements can be measured, and aims at finding the K elements from the
dataset which minimize the distance to a given query. Such algorithms are used in many
applications, such as non-parametric machine learning algorithms [1], image feature matching
in large-scale databases [2] and semantic document retrieval [3]. A naïve approach to solving the
problem is to compute the distances between the query and every element in the dataset and
select the elements with minimal distance. However, the complexity of the naïve approach scales
linearly with the size of the dataset, making it infeasible for big data. This has led to a high
interest in the development of fast and scalable K-NNS algorithms.
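For concreteness, the naïve baseline can be written in a few lines of Python (a minimal sketch of the exhaustive approach just described; the function names and the toy L2 metric are ours):

import heapq

def l2(a, b):
    # Euclidean distance between two equal-length numeric vectors.
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def knn_bruteforce(query, dataset, k, dist=l2):
    # Exact K-NNS: one distance computation per stored element, O(N) per query.
    return heapq.nsmallest(k, dataset, key=lambda x: dist(query, x))

Each query costs exactly N distance computations, which is the linear scaling that the algorithms discussed below are designed to avoid.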
Exact solutions for the K-NNS [4-6] may offer a substantial search speedup only in the case of
relatively low-dimensional data, due to the "curse of dimensionality". To overcome this problem the
concept of Approximate Nearest Neighbor Search (K-ANNS) was proposed, which relaxes the
condition of exact search by allowing a small number of errors. The quality of an inexact search
(the recall) is usually defined as the ratio between the number of found true nearest neighbors
and K. The most popular K-ANNS solutions are based on approximated versions of tree algorithms
[7-9] and hashing techniques [10, 11].
Proximity graph K-ANNS algorithms [12-19] have recently gained popularity, offering better
performance than tree techniques in some cases. In the vast majority of the studied graph
algorithms, searching for the nearest neighbors takes the form of greedy routing in k-Nearest Neighbor
(k-NN) graphs. The main drawbacks of this approach are the power-law scaling of the
number of hops during the routing process [20, 21], as well as a possible loss of global
connectivity in such graphs. To overcome these problems, many hybrid approaches have been
proposed that use auxiliary algorithms applicable only to vector data (such as kd-trees [12, 14]
and Cartesian concatenation [13]) to find candidate seeds by performing a coarse search.
The first works to consider networks with polylogarithmic scaling of the number of greedy routing
hops (such networks are called navigable) were done by J. Kleinberg [22, 23] as social network models for
the famous Milgram experiment [24]. Kleinberg studied a variant of random Watts-Strogatz
networks [25] on a regular lattice with a specific long-link length distribution r^(-α). For α=d (where
d is the dimensionality of the lattice) the number of hops needed to reach the target by greedy routing
scales polylogarithmically (instead of following a power law for any other value of α). The idea has led
to the development of many K-NNS and K-ANNS algorithms based on the navigation effect [26-29].
Kleinberg's navigability criterion can in principle be extended to more general spaces;
unfortunately, however, in order to build a Kleinberg navigable network one has to know the
data distribution in advance, which strongly limits the approach.
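For intuition, sampling a single Kleinberg-style long-range link can be sketched as follows (our own illustration, not part of the proposed algorithm; the function name and arguments are hypothetical):

import random

def kleinberg_long_link(u, nodes, alpha, dist):
    # Draw one long-range contact for node u with probability proportional
    # to dist(u, v) ** (-alpha); alpha equal to the lattice dimensionality d
    # gives polylogarithmic greedy routing [22, 23].
    weights = [0.0 if v == u else dist(u, v) ** (-alpha) for v in nodes]
    return random.choices(nodes, weights=weights, k=1)[0]

Note that the sampling weights depend on the pairwise distances over the whole node set, i.e., on the data distribution, which is precisely the limitation noted above.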
In [30-32] the authors proposed a new proximity graph K-ANNS algorithm called Navigable Small
World (NSW, also known as Metricized Small World, MSW), which utilizes navigable graphs
with long-range links constructed by a much simpler model. The model is based on growth and
connection to the approximate nearest neighbors, and was studied as Growing Homophilic (GH)
networks in [33]. The GH networks are constructed by consecutive insertion of elements in
random order, connecting each to its M closest neighbors among the previously inserted
elements. Links to the closest neighbors of the elements inserted at the beginning of the
construction later become bridges connecting different parts of the network and allow
logarithmic scaling of the number of greedy algorithm hops. It was suggested in [33] that the
mentioned network formation mechanism based on growth and homophily may be responsible
for the navigability of large-scale biological neural networks (the presence of which is disputable):
similar models were able to describe the growth of small brain networks, while the GH mechanism
predicts several high-level features observed in large-scale neural networks.
The NSW algorithm uses a variant of greedy search with overall polylogarithmic time
complexity and can outperform rival algorithms on many real-world datasets [34, 35]. However,
the polylogarithmic complexity scaling of the algorithm causes notable performance
degradation on large datasets, especially in the case of low-dimensional data [35].
In this paper we propose a new algorithm based on ideas close to the NSW, which offers a much
better, logarithmic complexity scaling. The main contributions are a smart selection of the graph's
enter-point node, the separation of links by their distance scales, and the use of a slightly more
complicated heuristic to select the neighbors. Alternatively, the Hierarchical NSW algorithm can be
seen as an extension of the probabilistic skip list structure [36] with proximity graphs instead of
linked lists.
Core idea
The base NSW algorithm builds a graph by consecutive insertion of elements, connecting each
to its M closest previously inserted neighbors. Links to the closest neighbors of
elements inserted at the beginning of the NSW graph construction later become long-range
links connecting distant network parts, allowing logarithmic scaling of the number of
hops [33].
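In code, this growth model amounts to the following sketch (our own minimal illustration; for clarity the M closest predecessors are found here by exhaustive search, whereas the NSW itself finds them approximately using its own graph search):

import heapq

def build_gh_graph(elements, M, dist):
    # Growing Homophilic construction: insert elements one by one and
    # connect each to its M closest previously inserted neighbors [33].
    graph = {}  # node index -> set of neighbor indices
    for i in range(len(elements)):
        closest = heapq.nsmallest(M, range(i),
                                  key=lambda j: dist(elements[i], elements[j]))
        graph[i] = set(closest)
        for j in closest:
            graph[j].add(i)  # links are kept bidirectional
    return graph

The nodes inserted first accumulate high degrees over time and act as the long-range bridges described above.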
The NSW algorithm performs reasonably fast; however, it still has several drawbacks, such as
low performance on low-dimensional data and, at best, polylogarithmic scalability of the total number of
distance calculations. There are well-known alternative construction models for
navigable small-world networks. They, however, do not provide better scaling: Kleinberg's
model as well as its derivatives also offer at best polylogarithmic scalability, while the scale-free
models [37-39] have an even worse, power-law scaling [33]. It is not clear whether a single-layer
graph can have logarithmic scalability in principle.
The process of routing in navigable small-world networks with a strong correlation between the
degree and the characteristic connection distance was studied in detail in [33, 37] and can be
divided into two phases: "zoom-out" and "zoom-in" [37]. The algorithm starts the zoom-out
phase from a low-degree node and traverses the graph, increasing the node degree, until
the characteristic radius of the node's links reaches the scale of the distance to the query.
Before the latter happens, the average degree of a node can stay relatively small, which leads
to an increased probability of getting stuck in a distant false local minimum. Obviously, one can
avoid this problem in the NSW by starting the search from a node with the maximum degree
(good candidates are the first nodes inserted into the NSW structure [33]), going directly to the
"zoom-in" phase of the search. Simulations show that setting hubs as starting points
substantially increases the probability of successful routing in the structure and offers significantly
better performance on low-dimensional data. However, it still has only polylogarithmic
complexity scaling for a single greedy search and performs worse on high-dimensional data
compared to the unmodified NSW.
The reason for the polylogarithmic complexity scaling of a single greedy search in the NSW is
that the overall number of distance computations is roughly proportional to the product of the
average number of greedy algorithm hops and the average degree of the nodes on the greedy
path. The average number of hops scales logarithmically [32, 33], while the average degree of
the nodes on the greedy path also grows logarithmically, due to two facts: 1) the greedy
search tends to go through the hubs [33, 37]; 2) the number of hub connections grows
logarithmically with an increase of the network size. Thus we get an overall polylogarithmic
dependence of the resulting complexity.
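Spelled out: with the number of hops scaling as c1∙log(N) and the average degree along the greedy path scaling as c2∙log(N), the expected number of distance computations per query behaves as c1∙c2∙log^2(N), i.e., polylogarithmically rather than logarithmically.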
4
The idea of the Hierarchical NSW algorithm is to separate the links according to their length
scale, producing a multilayer graph. In this case we can evaluate only the needed portion of
connections for each element, independently of the network's size, thus getting logarithmic
scalability in the "zoom-in" phase (see fig. 1 for illustration). The search starts from the upper
layer, greedily selecting elements only from that layer until a local minimum is reached. After
that the search switches to the lower layer and restarts from the element which was the local
minimum in the previous layer. The average number of connections per element in all layers
can be made constant, thus allowing a logarithmic complexity scaling of the search.
One way to form such a layered structure is to explicitly set links with different distance scales
by artificially introducing layers. For every element we define an integer level, which determines the
maximum layer to which the element belongs. For all elements present in a layer, a
proximity graph (i.e., a graph containing only "short" links that approximates the Delaunay graph) is
built incrementally. If we set an exponentially decaying probability for an item's level, we get a
logarithmic scaling of the expected number of layers in the structure. The search procedure is an iterative
greedy search starting from the top layer. If we merge the connections from all layers, the
structure becomes similar to the NSW graph (in this case the level can be put in
correspondence with the node degree in the NSW). Note that, in contrast to the NSW, the Hierarchical
NSW construction algorithm does not require the elements to be inserted in random order.
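Concretely, the level assignment used later in alg. 1 is a two-liner (a sketch; level_mult corresponds to the levelMult parameter):

import math
import random

def random_level(level_mult):
    # P(level >= l) = exp(-l / level_mult): an exponentially decaying
    # distribution, so the expected number of layers grows as O(log N).
    # 1.0 - random.random() lies in (0, 1], which avoids log(0).
    return math.floor(-math.log(1.0 - random.random()) * level_mult)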
The Hierarchical NSW idea is also very similar to the well-known 1D probabilistic skip list structure
[36] and can be described using its terms. The major difference is that we generalize the
structure by replacing the linked list with proximity graphs. The Hierarchical NSW approach can thus
utilize the same methods for building distributed approximate search/overlay structures
[40].
For the selection of the proximity graph connections we utilized a heuristic that uses the distances
between the candidate elements to create connections in diverse directions (a similar algorithm
was utilized in the spatial approximation tree [5] to select the tree children), instead of just
using the closest neighbors. The heuristic examines the candidates starting from the closest and
creates a connection to a candidate only if it is closer to the base element than it is to any of
the already connected elements (see the Algorithm section for details). When the number of
candidates is large enough, the heuristic allows obtaining the exact relative neighborhood graph
[41] (a minimal subgraph of the Delaunay graph deducible using only the distances between the
nodes) as a subgraph, thus easily keeping a globally connected component, even in the case of highly
clustered data (see fig. 2 for illustration). Note that the heuristic creates many extra
connections compared to the relative neighborhood graph, allowing control of the number of
connections, which is important for search performance. For the 1D case the heuristic allows
obtaining the exact Delaunay graph (which coincides with the relative neighborhood graph in this case)
by using only the distances between the elements, thus making a direct transition from the
Hierarchical NSW to the 1D probabilistic skip list algorithm.
Fig. 1. Illustration of the Hierarchical NSW idea. The search starts from an element in the top layer (shown red). Red arrows show the direction of the greedy algorithm from the entry point to the query (shown green). The characteristic radius of the links decreases from the top layer (Layer=2) down to the ground layer (Layer=0).

Fig. 2. Illustration of the heuristic used to select the neighbors in the proximity graph. The data in the example consists of two isolated clusters. A new element is being inserted on the boundary of Cluster 1. All of the closest neighbors of the new element belong to the same (first) cluster, thus missing the Delaunay graph links between the clusters. The heuristic, however, selects an element e2 from the other cluster, maintaining the global connectivity in case the inserted element is the closest to e2 compared to any other element from Cluster 1.
Algorithm
The network construction algorithm is based on the sequential insertion of the metric elements into
the structure. For every inserted element an integer maximum layer level is randomly selected
with an exponentially decaying probability distribution (normalized by the levelMult parameter,
see alg. 1).
The first phase of the insertion process starts from the top layer by greedily traversing the
graph in order to find the closest neighbor in that layer. After the algorithm finds a local
minimum in a layer, it continues the search from the next layer, using the closest
neighbors found at the previous layer as enter points, and the process repeats. The closest neighbors at
each layer are found by a variant of the greedy search algorithm described in alg. 2, which is
updated compared to the one described in [32]. To obtain the approximate K nearest neighbors in some
layer, a dynamic list of the ef closest found elements (initially filled with the enter
points) is kept during the search. The list is updated at each step by evaluating the
neighborhood of the closest previously non-evaluated element in the list, until the
neighborhood of every element in the list has been evaluated. This stop condition avoids
bloating the priority queues by discarding candidate elements that do not fit in the list. The
distinctions from the algorithm described in [32] (along with queue optimizations) are that: 1) the
enter point is a fixed parameter; 2) instead of changing the number of multi-searches, the
quality of the search is controlled by a different parameter, ef (which was set to K in ref. [32]).
During the first phase of the search the ef parameter is set to 1 (simple greedy search).
When the search reaches a layer equal to or lower than level, the second phase of the construction
algorithm starts, which differs in two points: 1) the ef parameter is increased from 1 to
efConstruction in order to control the recall of the greedy search procedure; 2) the closest
neighbors found at each layer are also used as candidates for the connections of the inserted
element.
Two methods for selecting M neighbors from the candidates were considered for the
algorithm: simple connection to the closest elements (alg. 3) and the heuristic that uses the
distances between the candidate elements to create connections in diverse directions (alg. 4),
described in the "Core idea" section. The maximum number of connections that an element can
have per layer is defined by the parameter Mmax for every layer higher than zero (a special
parameter Mmax0 is used separately for the ground layer). If a node is already full at the
moment a new connection is made, its connection list is shrunk by excluding a
neighbor chosen by the same heuristics described in algs. 3-4.
The insertion procedure ends when the connections of the inserted element are established
on the ground (zero) layer.
Algorithm 1
Insertion (object q, integer: M, efConstruction, levelMult)
1 Set [object] tempRes, candidates, visitedSet, enterPoints=[enterpoint]
2 integer level=floor(-log(random(0..1))*levelMult) // select a random level
3 for i=maxLayer downto level+1 do:
4   tempRes=SearchAtLayer (q, enterPoints, M, 1, i)
5   enterPoints=closest elements from tempRes
6 for i=min(maxLayer,level) downto 0 do:
7   tempRes=SearchAtLayer (q, enterPoints, M, efConstruction, i)
8   select best M elements from tempRes by using a heuristic // alg. 3 or alg. 4
9   bidirectionally connect the best M elements from tempRes to q
10  shrink the connection lists of the affected elements
11  enterPoints=closest elements from tempRes
12 if (level > maxLayer) do: // update the enterpoint
13  maxLayer=level
14  enterpoint=q
Algorithm 2
SearchAtLayer (object q, Set [object] enterPoints, integer: M, ef, layer)
1 Set [object] visitedSet
2 priority_queue [object] candidates (closer - first), result (further - first)
3 candidates, visitedSet, result ← enterPoints
4 repeat:
5   object c=candidates.top()
6   candidates.pop()
7   // check the stop condition:
8   if d(c,q)>d(result.top(),q) do:
9     break
10  // update the list of candidates:
11  for_each object e from c.friends(layer) do:
12    if e is not in visitedSet do:
13      add e to visitedSet
14      if d(e,q)<d(result.top(),q) or result.size()<ef do:
15        add e to candidates, result
16        if result.size()>ef do:
17          result.pop()
18 return best ef elements from result
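A Python transcription of alg. 2 might look as follows (our own sketch: nodes are assumed to be comparable ids, friends(c, layer) stands for the adjacency lists of the given layer, and dist(a, b) for the metric; the unused M argument of the pseudocode is omitted):

import heapq

def search_at_layer(q, enter_points, ef, layer, friends, dist):
    # Greedy beam search within one layer. candidates is a min-heap
    # (closest first); result emulates a max-heap (furthest first)
    # by storing negated distances.
    visited = set(enter_points)
    candidates = [(dist(p, q), p) for p in enter_points]
    heapq.heapify(candidates)
    result = [(-d, p) for d, p in candidates]
    heapq.heapify(result)
    while candidates:
        d_c, c = heapq.heappop(candidates)
        if d_c > -result[0][0]:
            break  # closest candidate is further than the worst result: stop
        for e in friends(c, layer):
            if e not in visited:
                visited.add(e)
                d_e = dist(e, q)
                if d_e < -result[0][0] or len(result) < ef:
                    heapq.heappush(candidates, (d_e, e))
                    heapq.heappush(result, (-d_e, e))
                    if len(result) > ef:
                        heapq.heappop(result)  # drop the current furthest
    return [p for _, p in sorted((-nd, p) for nd, p in result)]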
Algorithm 3
SelectNeighbors_simple(object baseElement, Set [object] candidates, integer M)
1 return M closest elements from candidates
Algorithm 4
SelectNeighbors_heuristic(object baseElement, Set [object] candidates, integer M)
Set [object] result, tempList
1 extend the neighborhood of candidates (optional)
2 sort candidates so that items closer to baseElement come first
3 for_each object e from candidates do:
4   if e is closer to baseElement compared to any element from result do:
5     add e to result
6   else do:
7     add e to tempList
8   if result.size()>=M do:
9     break
10 // (optionally) add some of the discarded connections:
11 sort tempList so that items closer to baseElement come first
12 for_each object e from tempList do:
13   add e to result
14   if result.size()>=M do:
15     break
16 return result
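The core loop of alg. 4, without the optional candidate extension and the discarded-connection backfill, can be transcribed as (a sketch; dist is assumed to be the metric):

def select_neighbors_heuristic(base, candidates, M, dist):
    # Examine candidates from closest to furthest; keep a candidate only
    # if it is closer to the base element than it is to every neighbor
    # already selected. With enough candidates this yields a relative
    # neighborhood graph subgraph [41], preserving links between clusters.
    result = []
    for e in sorted(candidates, key=lambda c: dist(base, c)):
        if len(result) >= M:
            break
        if all(dist(base, e) < dist(e, r) for r in result):
            result.append(e)
    return result

In the two-cluster example of fig. 2, the first candidate from the second cluster passes this test even though many first-cluster elements are closer to the base, which is what keeps the graph globally connected.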
The K-ANNS algorithm used in the Hierarchical NSW is presented in alg. 5. It is roughly equivalent to the
insertion algorithm for an item with level=0. The difference is that the closest neighbors found at the ground
layer, which would have been used as candidates for the connections, are now returned as the search result. The
quality of the result is controlled by the ef parameter (corresponding to efConstruction in the
construction algorithm).
Algorithm 5
K-NNSearch (object query, integer: K, ef)
1 Set [object] tempRes, enterPoints=[enterpoint]
2 for i=maxLayer downto 1 do:
3   tempRes=SearchAtLayer (query, enterPoints, M, 1, i)
4   enterPoints=closest elements from tempRes
5 tempRes=SearchAtLayer (query, enterPoints, M, ef, 0)
6 return best K of tempRes
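Combined with search_at_layer above, alg. 5 reduces to a short two-phase routine (a sketch; index is a hypothetical object bundling max_layer, enter_point, friends and dist):

def knn_search(index, query, K, ef):
    # Phase 1: simple greedy descent (ef=1) through the upper layers.
    ep = [index.enter_point]
    for layer in range(index.max_layer, 0, -1):
        ep = search_at_layer(query, ep, 1, layer, index.friends, index.dist)[:1]
    # Phase 2: beam search of width ef on the ground layer.
    result = search_at_layer(query, ep, ef, 0, index.friends, index.dist)
    return result[:K]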
Performance evaluation
Influence of parameters
The construction parameters levelMult and Mmax0 are responsible for maintaining the
small-world navigability of the constructed graphs. Setting levelMult to zero (this corresponds to
a single layer in the graph) and Mmax0 to M leads to the production of directed k-NN graphs with the power-law
search complexity well studied before for K-ANN search [16, 21] (assuming alg. 3 is used
for neighbor selection). Setting levelMult to zero and Mmax0 to infinity leads to the production of
NSW graphs with polylogarithmic complexity [30, 32]. Finally, setting levelMult to some non-zero
value leads to the emergence of controllable-hierarchy graphs with logarithmic search
complexity through the introduction of layers (see the Algorithm section).
To achieve the optimum performance advantage of the controllable hierarchy, the overlap
between neighbors on different layers (i.e., the fraction of an element's neighbors that also belong
to other layers) has to be small. In order to decrease the overlap we need to decrease
levelMult. However, at the same time, decreasing levelMult increases the average
number of hops during a greedy search on each layer, which negatively affects the performance.
This leads to the existence of an optimal value for the levelMult parameter.
An obvious choice for the optimal levelMult is 1/log(M): this makes the generated
level a dimensionless quantity (assuming that M has some specific units). The simulations done
on an Intel Core i5-2400 CPU agree well with this assumption, demonstrating a very large
speedup on low-dimensional data when increasing levelMult from zero (see fig. 3 for 1-NN
searches, K=1, on 10M random d=4 vectors; the suggested value for levelMult is shown by an
arrow). It is hard to expect the same behavior for high-dimensional data, since in this case the k-NN
graph already has very short greedy algorithm paths [20]. Surprisingly, increasing
levelMult from zero leads to a measurable increase in speed even on very high-dimensional data
(100k dense random d=1024 vectors, see the plot in fig. 4) and does not introduce any penalty for
the Hierarchical NSW approach. For mid-dimensional data, such as SIFT vectors [2], the
performance advantage of increasing levelMult is moderate (see fig. 5 for the 10-NN search
performance on 5 million 128-dimensional SIFT vectors from the learning set of BIGANN [42]).
Fig. 3. Query time vs the levelMult parameter for 10M random d=4 vectors (M=6, Mmax0=12, recall 0.9, 1-NN). The autoselected value for levelMult is shown by an arrow.

Fig. 4. Query time vs the levelMult parameter for 100k random d=1024 vectors (M=20, Mmax0=40, recall 0.9, 1-NN). The autoselected value for levelMult is shown by an arrow.
Selection of Mmax0 also has a strong influence on the search performance, especially in the case
of a high-quality (high-recall) search. Simulations show that setting Mmax0 to M (this corresponds
to k-NN graphs on each layer if the neighbor selection heuristic is not used) leads to a very
strong performance penalty at high recall, while setting Mmax0 too large leads to excessive long-range
links in the base (zero) layer. Simulations also suggest that 2M is a good choice for Mmax0.
Fig. 6 presents the results of the 10-NN search performance on the 5M SIFT dataset
depending on the Mmax0 parameter. The suggested value gives performance close to optimal at
different recalls.
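Both rules can be folded into a small helper (a sketch of the suggested defaults; the natural logarithm is assumed, matching the level formula of alg. 1):

import math

def default_params(M):
    # levelMult = 1/ln(M) makes the generated level dimensionless;
    # Mmax0 = 2*M is a good choice for the ground layer.
    return {"levelMult": 1.0 / math.log(M), "Mmax0": 2 * M}

For M=20 this gives levelMult ≈ 0.33, the value used in fig. 6.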
Fig. 5. Query time vs the levelMult parameter for 5M d=128 SIFT vectors (M=16, Mmax0=32, recall 0.9, 1-NN). The autoselected value for levelMult is shown by an arrow.

Fig. 6. Query time vs the Mmax0 parameter for 5M SIFT vectors (10-NN, M=20, levelMult=0.33, recall 0.4, 0.8 and 0.94). The autoselected value for Mmax0 is shown by an arrow.
Selection of efConstruction is straightforward. As suggested in [32], it has to be large
enough to produce a K-ANNS recall close to unity during the construction process (0.95 is enough
for most use cases). Just as in [32], this parameter can potentially be auto-configured by
using sample data.
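To make this concrete, one possible auto-configuration loop is sketched below (our own sketch, not the paper's implementation; it doubles the search-time ef until the 1-NN recall on held-out sample queries reaches the target, with exact answers precomputed, e.g., by the knn_bruteforce routine from the introduction; since ef at search time plays the same role as efConstruction at construction time, see alg. 5, the same loop applies to either parameter):

def tune_ef(index, sample_queries, true_neighbors, target_recall=0.95):
    # true_neighbors[i] holds the exact nearest neighbor of sample_queries[i].
    ef = 10
    while True:
        hits = sum(knn_search(index, q, 1, ef)[0] == nn
                   for q, nn in zip(sample_queries, true_neighbors))
        if hits / len(sample_queries) >= target_recall or ef >= 1024:
            return ef
        ef *= 2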
In all of the considered cases, using the heuristic for proximity graph neighbor selection
(alg. 4) leads to higher or equal search performance compared to the naïve connection to the
nearest neighbors (alg. 3). The effect is very strong for low-dimensional data, at high recall for
mid-dimensional data, and for highly clustered data (conceptually, a discontinuity can
be regarded as a local low-dimensional feature); see the comparison in fig. 7. When the
closest neighbors are used as connections for the proximity graph, the Hierarchical NSW algorithm fails
to achieve a high recall on clustered data because the search gets stuck at the cluster boundaries.
With the heuristic, by contrast, the introduced clustering leads to even higher performance. For uniform
and very high-dimensional data there is little difference between the neighbor selection
methods, possibly due to the fact that in this case almost all of the nearest neighbors are
selected by the heuristic anyway.
The only meaningful construction parameter left to the user is the M parameter. A reasonable
range of M is from 5 to 48. Simulations show that smaller M produces better results at lower
recalls and/or for lower-dimensional data, while bigger M is better for high recall and/or high-dimensional
data (see fig. 8 for illustration).
The construction process can be easily parallelized with very few synchronization points.
Building a high-quality index (efConstruction=200, M=20) in a multithreaded regime for the 1M SIFT
data from [42] with 40 parallel threads on four Xeon E5-4650 v2 CPUs takes about 1-2 minutes
with the current implementation, which lacks many of the optimizations used in the search
algorithm.
Fig. 7. Effect of the neighbor selection method (baseline corresponds to alg. 3, heuristic to alg. 4) on clustered (100 random isolated clusters) and non-clustered d=10 random vector data (10M vectors, M=16).

Fig. 8. Query time vs recall for different values of the parameter M for the Hierarchical NSW on the 5M SIFT dataset (M=2, 3, 6, 12, 20, 40).
Comparison with basic NSW
The Hierarchical NSW algorithm is implemented on top of the Non-Metric Space Library. Due to
several limitations posed by the library, to achieve better performance the implementation
uses its own versions of the distance functions, together with C-style memory management, at the
search phase. For the baseline NSW algorithm we used the version from NMSLIB 1.1, which is
slightly faster than the implementation tested in [34, 35], to demonstrate the
improvements in speed and algorithmic complexity.
Figure 9 presents a comparison of the Hierarchical NSW to the basic NSW algorithm on d=4
random hypercube data, made on an i5-2400 Intel CPU (10-NN search). The M parameter was
set to 6 for both algorithms. The Hierarchical NSW algorithm uses many fewer distance
computations during a search on this dataset, while the advantage in actual performance is even
higher (more than two orders of magnitude for high recall values), mostly due to a better
algorithm implementation.
Fig. 9. Performance comparison of the NSW and the Hierarchical NSW on a dataset of 10 million 4-dimensional random vectors: queries per second (left) and number of distance computations (right) versus recall error (1-recall).
The scalings of the algorithms on a d=8 random hypercube dataset for a 10-NN search with a
fixed recall of 0.95 are presented in fig. 10. The M parameter was set to 6 for both algorithms. It
clearly follows that the Hierarchical NSW algorithm has complexity scaling no worse than
logarithmic in this setting and outperforms the NSW algorithm at any dataset size.
Fig. 10. Comparison between the NSW and the Hierarchical NSW in terms of complexity scaling with the dataset size: number of distance computations (left) and query time (right). The tests were performed on d=8 random vectors (recall 0.95, 10-NN).
With a rise of the dataset dimensionality, the scaling with the dataset size changes to a power
law at small sizes, with a transition to logarithmic scaling at some point. Scalings for random d=24 vectors
(number of distance computations) and random d=32 vectors (query time in milliseconds) are
presented in fig. 11, demonstrating the transition from power-law to logarithmic scaling at
relatively high dimensions.
Fig. 11. Scaling of the number of distance computations (left, random d=24 vectors, M=28, recall 0.9, 10-NN) and of the query time (right, random d=32 vectors, M=20, recall 0.9, 10-NN) with the dataset size.
Comparison to rival methods
Comparing the performance of K-ANNS algorithms is a nontrivial task, since the state of the art is
constantly changing as new algorithms and implementations emerge. In this work we
concentrated on a comparison with the state-of-the-art algorithms that have open source
implementations, which is beneficial for the users. An implementation of the Hierarchical NSW
algorithm presented in this paper is also distributed as a part of the open source Non-Metric
Space Library [43].
For the comparison on vector data we used the popular K-ANNS benchmark from [44] as the base
system of our comparison. The testing system utilizes the python bindings of the algorithms and
consecutively runs the K-ANN search for one thousand queries (randomly extracted from the
initial dataset) with preset algorithm parameters, producing an output containing the recall and
the average time of a single search. The considered algorithms are:
1. FLANN 1.8.4 [7]. A popular library containing several algorithms, also used in
OpenCV [45]. We used the built-in auto-tuning procedure with several reruns to infer
the best parameters.
2. Annoy, 02.02.2016 build [8]. A new but already popular algorithm based on a forest of
random projection trees.
3. VP-tree. A general metric space algorithm implemented as a part of the Non-Metric Space
Library 1.1 [43].
4. FALCONN, version 1.2. A new efficient LSH algorithm for cosine similarity data [46].
The test parameters for the VP-tree, FALCONN and Annoy were taken from [44]. The comparison
was done on a 4-CPU Intel Xeon E5-4650 v2 system with 120 GB of RAM under Debian OS. For
every algorithm we carefully chose the best results at every recall range to evaluate the best
possible performance. All tests were done in a single-thread regime. The Hierarchical NSW was
compiled with GCC 5.3 using the -Ofast optimization flag.
Used datasets (vector data):
• SIFT dataset consisting of one million 128-dimensional vectors from [42].
• GloVe dataset consisting of 1.2 million 100-dimensional word embeddings trained on tweets [47]. The cosine similarity was used as the distance function.
• CoPhIR dataset [48] consisting of 2 million 272-dimensional MPEG-7 features extracted from images.
• A dataset of 30 million random points in the unit 4-dimensional cube with Euclidean distance (to test the performance in a low-dimensional case).
• A dataset containing 60 thousand handwritten digit images in a 784-dimensional vector space from the MNIST database [49].
For all of the datasets except GloVe we used the L2 distance. For GloVe we used the cosine
similarity.
Results for the vector data are presented in fig. 12. For the SIFT, GloVe and CoPhIR datasets the
Hierarchical NSW algorithm clearly outperforms the rivals by a large margin. For low-dimensional
data (d=4) the Hierarchical NSW is slightly faster than Annoy at high recall, while strongly
outperforming the other algorithms.
For the comparison in the case of more general spaces with no constraints on the data, we used
the built-in testing system of the Non-Metric Space Library, repeating a subset of the tests from the
review [35]. The evaluated algorithms included the VP-tree, permutation techniques (NAPP and
brute-force filtering) [43, 50-52], the basic NSW algorithm and NNDescent-produced proximity
graphs [21] (paired with the NSW graph search algorithm). For every dataset the test includes
the results of either NSW or NNDescent, depending on which structure performed better. No
custom distance functions or special memory management were used in this case, leading to
some performance loss for the Hierarchical NSW.
Used datasets (non-metric data):
• Wiki-sparse dataset containing 4 million sparse 10^5-dimensional TF-IDF (term frequency-inverse document frequency) vectors (created via GENSIM [53]) with the sparse cosine distance.
• Wiki-128 and Wiki-8 datasets consisting of 2 million dense vectors of topic histograms created from the sparse TF-IDF vectors of the wiki-sparse dataset (created via GENSIM [53]). The Jensen-Shannon (JS) divergence was used as the distance function.
• ImageNet dataset containing a million signatures extracted from LSVRC-2014 with the SQFD (signature quadratic form distance) [54].
• 1M DNA (deoxyribonucleic acid) dataset sampled from the Human Genome. The employed distance function was the normalized Levenshtein distance.
Further details of the dataset origins and processing can be found in the original work [35]. The
parameters of the rival methods were taken from the sample scripts of the Non-Metric Space
Library [43].
The results are presented in fig. 13. The Hierarchical NSW algorithm significantly improves the
performance of the NSW algorithm and is the leader on every tested dataset. The strongest
enhancement, by almost 3 orders of magnitude, is observed for the dataset with the lowest
dimensionality, wiki-8 with the JS-divergence. This is an important result that demonstrates the
robustness of the Hierarchical NSW, as for the original NSW this dataset was a stumbling block.
Note that for wiki-8, to nullify the effect of implementation, the results are presented as the
number of distance computations instead of the CPU time.
An indirect comparison to the competitive graph search techniques [13, 14] is presented in fig. 14.
We were not able to make a direct comparison, since there are no open source
implementations of the rival algorithms [13, 14]. The timings of a 10-NN search on the one million
SIFT dataset from [42] for different recall values were taken from [13], where a 3.4 GHz Intel
CPU was used. For the test with the Hierarchical NSW we used a 3.4 GHz Intel i5-3570K CPU,
which achieved FLANN performance close to the one from [13]; the latter is an additional
justification for the validity of such a comparison. The plot clearly shows that the Hierarchical
NSW algorithm outperforms the algorithms from [13, 14] in this setting, especially in the case of high
recall.
Fig. 12. Results of the comparison with open source implementations of K-ANNS algorithms (Hierarchical NSW, Annoy, VP-tree, FLANN, FALCONN) on five datasets for 10-NN searches: 1M SIFT, 2M CoPhIR, 1.2M GloVe (shown both as recall and as recall error), 30M random d=4 vectors, and 60k MNIST.
Fig. 13. Results of the comparison with general space K-ANNS algorithms from the Non-Metric Space Library on five datasets for 10-NN searches: 2M Wiki-128 (JS-divergence), 2M Wiki-8 (JS-divergence, plotted as the number of distance computations), 1M ImageNet (SQFD), 4M Wiki-sparse (cosine similarity), and 1M DNA (normalized Levenshtein distance).
Fig. 14. Results of the indirect comparison with the graph methods from [13, 14] (Cartesian concatenation + kNN graph, iterative graph search) on the 1M SIFT dataset from [42].
Complexity analysis
Search complexity
The complexity of a single search can be divided into the complexities of two search phases. The first
phase is used to coarsely find the closest element. The second phase corrects the error of the
first phase and extracts the other K-1 nearest neighbors. During the first phase we make a
constant number of steps on each layer (because the element level is uncorrelated with the
distances), while the maximum layer number scales as log(N). Thus, the complexity of the first
phase is logarithmic by the structure design. Intuitively, if the size of the dataset is large
enough, the complexity of the second phase does not depend on the dataset size, since the search
explores the graph only locally. Under this assumption the overall complexity scaling is log(N),
in agreement with simulations on low-dimensional datasets. For small-size, high-dimensional
datasets the assumption of locality during the first stage may not hold, and the search
complexity is then determined by the second stage, having at best sub-linear scalability.
Element removal complexity
If we remove an element from the structure, we have to update the connections of its
neighbors, which is roughly equivalent to the second phase of the search for each of the
updated elements. Thus, under the same assumptions, in the limit of a large dataset element
removal has a complexity independent of N.
Construction complexity
The construction complexity is defined by the search complexity, since the insertion of an
element is just a sequence of K-ANN searches at different layers. The number of searches is
equal to the element's level, which is close to unity on average. This means that the insertion
cost is roughly equal to the search cost, and thus, for relatively low-dimensional datasets, the
construction time scales as N∙log(N).
Memory cost
The total memory cost of the structure is determined by the number of element connections
and is about 2M∙number_of_bytes_per_link per object. If we limit the maximum total number of
elements to approximately four billion, we can use four-byte unsigned integers to store the
connections. Tests suggest that typical close-to-optimal M values usually lie in the range between
6 and 48. This means that the typical memory requirements for the index (excluding the size of
the data) are about 48-384 bytes per object.
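These bounds follow directly from the formula: with four-byte links, 2∙6∙4 = 48 bytes per object for M=6 and 2∙48∙4 = 384 bytes per object for M=48.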
Discussion
By using a structural decomposition of navigable small world graphs together with the smart
neighbor selection heuristic, the proposed Hierarchical NSW approach overcomes several
important problems of the basic NSW structure, advancing the state of the art in K-ANN search. The
Hierarchical NSW offers excellent performance and wins on a large variety of datasets,
surpassing the rivals by a large margin in the case of high-dimensional data. Even on the datasets
where the previous version lost by orders of magnitude, the Hierarchical NSW was able to
come first. The Hierarchical NSW can also be used as an efficient method for getting approximate K-NN
and relative neighborhood graphs, which are byproducts of the structure construction.
The robustness of the approach is a strong feature which makes it very attractive for practical
applications. The algorithm is applicable in generalized metric spaces and performs best on every
dataset tested in this paper, thus eliminating the need for a complicated selection of the
best algorithm for a specific problem. We stress the importance of the algorithm's robustness,
since real data may have a complex structure with different effective dimensionality across the
scales. For instance, a high-dimensional dataset can consist of a large number of clusters arranged
along a line, thus being low-dimensional at large distance scales or, equivalently, at relatively
small dataset sizes. In order to perform efficient search in such a dataset, an approximate
nearest neighbor algorithm has to work well in both the high- and the low-dimensionality regimes.
There are several ways to further increase the efficiency and applicability of the Hierarchical
NSW approach. There is still one meaningful parameter left which strongly affects the
construction of the index: the number of added connections per layer, M. Potentially this
parameter can be inferred directly by using different heuristics [5, 55]. It would also be
interesting to test the Hierarchical NSW on the BIGANN dataset, which consists of 1 billion
SIFT feature vectors and is becoming a popular benchmark for such algorithms [13, 56-58].
One of the apparent shortcomings of the proposed approach compared to the NSW is the loss
of the possibility of distributed search. The search in the Hierarchical NSW structure always starts
from the top layer, thus the structure cannot be made distributed by using the same techniques
as described in [32], due to the excessive load on the higher-layer elements. Simple workarounds
can be used to distribute the structure, such as partitioning the data across cluster nodes as
studied in [7]; however, in this case the total parallel throughput of the system does not scale
well with the number of computer nodes.
Still, there are other known ways to make this particular structure distributed. The
Hierarchical NSW is conceptually very similar to the well-known one-dimensional exact-search
probabilistic skip list structure, and thus can use the same techniques to make the structure
distributed [40]. Potentially this can lead to even better distributed performance compared to
the base NSW, due to the logarithmic scalability and an ideally uniform load on the nodes.
Acknowledgements
We thank Leonid Boytsov for many helpful discussions, assistance with Non-Metric Space
Library integration and comments on the manuscript. We also thank Valery Kalyagin for support
of this work.
The reported study was funded by RFBR, according to the research project No.
16-31-60104 mol_а_dk.
References
[1] S. Cost and S. Salzberg, "A Weighted Nearest Neighbor Algorithm for Learning with Symbolic
Features," Machine Learning, vol. 10, pp. 57-78, 1993.
[2] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International journal of
computer vision, vol. 60, pp. 91-110, 2004.
[3] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman, "Indexing by
Latent Semantic Analysis," J. Amer. Soc. Inform. Sci., vol. 41, pp. 391-407, 1990.
[4] P. N. Yianilos, "Data structures and algorithms for nearest neighbor search in general metric
spaces," in SODA, 1993, pp. 311-321.
[5] G. Navarro, "Searching in metric spaces by spatial approximation," The VLDB Journal, vol. 11, pp.
28-46, 2002.
[6] E. S. Tellez, G. Ruiz, and E. Chavez, "Singleton indexes for nearest neighbor search," Information
Systems, 2016.
[7] M. Muja and D. G. Lowe, "Scalable nearest neighbor algorithms for high dimensional data,"
Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 36, pp. 2227-2240, 2014.
[8] Annoy: Approximate Nearest Neighbors in C++/Python optimized for memory usage and
loading/saving to disk, 2016. Available: https://github.com/spotify/annoy
[9] M. E. Houle and M. Nett, "Rank-based similarity search: Reducing the dimensional dependence,"
Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 37, pp. 136-150, 2015.
[10] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt, "Practical and optimal LSH for
angular distance," in Advances in Neural Information Processing Systems, 2015, pp. 1225-1233.
[11] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of
dimensionality," in Proceedings of the thirtieth annual ACM symposium on Theory of computing,
1998, pp. 604-613.
[12] S. Arya and D. M. Mount, "Approximate Nearest Neighbor Queries in Fixed Dimensions," in
SODA, 1993, pp. 271-280.
[13] J. Wang, J. Wang, G. Zeng, R. Gan, S. Li, and B. Guo, "Fast neighborhood graph search using
cartesian concatenation," in Multimedia Data Mining and Analytics, ed: Springer, 2015, pp. 397-
417.
[14] J. Wang and S. Li, "Query-driven iterated neighborhood graph search for large scale indexing," in
Proceedings of the 20th ACM international conference on Multimedia, 2012, pp. 179-188.
[15] Z. Jiang, L. Xie, X. Deng, W. Xu, and J. Wang, "Fast Nearest Neighbor Search in the Hamming
Space," in MultiMedia Modeling, 2016, pp. 325-336.
[16] E. Chávez and E. S. Tellez, "Navigating k-nearest neighbor graphs to solve nearest neighbor
searches," in Advances in Pattern Recognition, ed: Springer, 2010, pp. 270-280.
[17] K. Aoyama, K. Saito, H. Sawada, and N. Ueda, "Fast approximate similarity search based on
degree-reduced neighborhood graphs," in Proceedings of the 17th ACM SIGKDD international
conference on Knowledge discovery and data mining, 2011, pp. 1055-1063.
[18] G. Ruiz, E. Chávez, M. Graff, and E. S. Tellez, "Finding Near Neighbors Through Local Search," in
Similarity Search and Applications, ed: Springer, 2015, pp. 103-109.
[19] R. Paredes, "Graphs for metric space searching," PhD thesis, Dept. of Computer Science,
University of Chile, Chile, 2008. Tech Report TR/DCC-2008-10. Available:
http://www.dcc.uchile.cl/~raparede/publ/08PhDthesis.pdf
[20] C. C. Cartozo and P. De Los Rios, "Extended navigability of small world networks: exact results
and new insights," Physical review letters, vol. 102, p. 238703, 2009.
[21] W. Dong, C. Moses, and K. Li, "Efficient k-nearest neighbor graph construction for generic
similarity measures," in Proceedings of the 20th international conference on World wide web,
2011, pp. 577-586.
[22] J. M. Kleinberg, "Navigation in a small world," Nature, vol. 406, pp. 845-845, 2000.
[23] J. Kleinberg, "The small-world phenomenon: An algorithmic perspective," in Proceedings of the
thirty-second annual ACM symposium on Theory of computing, 2000, pp. 163-170.
[24] J. Travers and S. Milgram, "An experimental study of the small world problem," Sociometry, pp.
425-443, 1969.
[25] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393,
pp. 440-442, 1998.
[26] Y. Lifshits and S. Zhang, "Combinatorial algorithms for nearest neighbors, near-duplicates and
small-world design," in Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete
Algorithms, 2009, pp. 318-326.
[27] A. Karbasi, S. Ioannidis, and L. Massoulie, "From Small-World Networks to Comparison-Based
Search," Information Theory, IEEE Transactions on, vol. 61, pp. 3056-3074, 2015.
[28] O. Beaumont, A.-M. Kermarrec, and É. Rivière, "Peer to peer multidimensional overlays:
Approximating complex structures," in Principles of Distributed Systems, ed: Springer, 2007, pp.
315-328.
[29] O. Beaumont, A.-M. Kermarrec, L. Marchal, and É. Rivière, "VoroNet: A scalable object network
based on Voronoi tessellations," in Parallel and Distributed Processing Symposium, 2007. IPDPS
2007. IEEE International, 2007, pp. 1-10.
[30] Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, "Scalable distributed algorithm for
approximate nearest neighbor search problem in high dimensional general metric spaces," in
Similarity Search and Applications, ed: Springer Berlin Heidelberg, 2012, pp. 132-147.
[31] A. Ponomarenko, Y. Malkov, A. Logvinov, and V. Krylov, "Approximate Nearest Neighbor Search
Small World Approach," in International Conference on Information and Communication
Technologies & Applications, Orlando, Florida, USA, 2011.
[32] Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, "Approximate nearest neighbor
algorithm based on navigable small world graphs," Information Systems, vol. 45, pp. 61-68,
2014.
[33] Y. A. Malkov and A. Ponomarenko, "Growing homophilic networks are natural navigable small
worlds," arXiv preprint arXiv:1507.06529, 2015.
[34] A. Ponomarenko, N. Avrelin, B. Naidan, and L. Boytsov, "Comparative Analysis of Data Structures
for Approximate Nearest Neighbor Search," In Proceedings of The Third International Conference
on Data Analytics, 2014.
[35] B. Naidan, L. Boytsov, and E. Nyberg, "Permutation search methods are efficient, yet faster
search is possible," VLDB Procedings, vol. 8, pp. 1618-1629, 2015.
[36] W. Pugh, "Skip lists: a probabilistic alternative to balanced trees," Communications of the ACM,
vol. 33, pp. 668-676, 1990.
[37] M. Boguna, D. Krioukov, and K. C. Claffy, "Navigability of complex networks," Nature Physics,
vol. 5, pp. 74-80, 2009.
[38] D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguná, "Hyperbolic geometry of
complex networks," Physical Review E, vol. 82, p. 036106, 2010.
[39] A. Gulyás, J. J. Bíró, A. Kőrösi, G. Rétvári, and D. Krioukov, "Navigable networks as Nash
equilibria of navigation games," Nature Communications, vol. 6, p. 7651, 2015.
[40] M. T. Goodrich, M. J. Nelson, and J. Z. Sun, "The rainbow skip graph: a fault-tolerant constant-
degree distributed data structure," in Proceedings of the seventeenth annual ACM-SIAM
symposium on Discrete algorithm, 2006, pp. 384-393.
[41] G. T. Toussaint, "The relative neighbourhood graph of a finite planar set," Pattern recognition,
vol. 12, pp. 261-268, 1980.
[42] H. Jegou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," Pattern
Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, pp. 117-128, 2011.
[43] L. Boytsov and B. Naidan, "Engineering Efficient and Effective Non-metric Space Library," in
Similarity Search and Applications, ed: Springer, 2013, pp. 280-293.
[44] ANN benchmark. Available: https://github.com/erikbern/ann-benchmarks
[45] K. Pulli, A. Baksheev, K. Kornyakov, and V. Eruhimov, "Real-time computer vision with OpenCV,"
Communications of the ACM, vol. 55, pp. 61-69, 2012.
[46] A. Andoni and I. Razenshteyn, "Optimal Data-Dependent Hashing for Approximate Near
Neighbors," presented at the Proceedings of the Forty-Seventh Annual ACM on Symposium on
Theory of Computing, Portland, Oregon, USA, 2015.
[47] J. Pennington, R. Socher, and C. D. Manning, "Glove: Global vectors for word representation,"
Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014), vol. 12,
pp. 1532-1543, 2014.
[48] P. Bolettieri, A. Esuli, F. Falchi, C. Lucchese, R. Perego, T. Piccioli, et al., "CoPhIR: a test collection
for content-based image retrieval," arXiv preprint arXiv:0905.4627, 2009.
[49] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document
recognition," Proceedings of the IEEE, vol. 86, pp. 2278-2324, 1998.
[50] E. Chávez, M. Graff, G. Navarro, and E. Téllez, "Near neighbor searching with K nearest
references," Information Systems, vol. 51, pp. 43-61, 2015.
[51] E. C. Gonzalez, K. Figueroa, and G. Navarro, "Effective proximity retrieval by ordering
permutations," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, pp.
1647-1658, 2008.
[52] E. S. Tellez, E. Chávez, and G. Navarro, "Succinct nearest neighbor search," Information Systems,
vol. 38, pp. 1019-1030, 2013.
[53] R. Řehůřek and P. Sojka, "Software framework for topic modelling with large corpora," in
Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010.
[54] C. Beecks, "Distance-based similarity models for content-based multimedia retrieval,"
Hochschulbibliothek der Rheinisch-Westfälischen Technischen Hochschule Aachen, 2013.
[55] A. Ponomarenko, "Query-Based Improvement Procedure and Self-Adaptive Graph Construction
Algorithm for Approximate Nearest Neighbor Search," in Similarity Search and Applications, ed:
Springer, 2015, pp. 314-319.
[56] M. Norouzi, A. Punjani, and D. J. Fleet, "Fast exact search in Hamming space with multi-index
hashing," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 36, pp. 1107-
1119, 2014.
[57] W. Zhou, C. Yuan, R. Gu, and Y. Huang, "Large Scale Nearest Neighbors Search Based on
Neighborhood Graph," in Advanced Cloud and Big Data (CBD), 2013 International Conference
on, 2013, pp. 181-186.
[58] A. Babenko and V. Lempitsky, "The inverted multi-index," in Computer Vision and Pattern
Recognition (CVPR), 2012 IEEE Conference on, 2012, pp. 3069-3076.
... Another core component in RAG systems is retrieval, which identifies information from external knowledge databases. A common approach to performing this retrieval is vector search, which has become the cornerstone of recent information retrieval systems [39,73]. Vector search enables the system to assess semantic relevance by encoding both documents and queries as high-dimensional vectors (e.g., hundreds to thousands dimensions), where proximity in this vector space reflects semantic similarity. ...
... In practice, vector search retrieves the K most similar vectors to a given D-dimensional query vector from a database Y populated with many D-dimensional vectors. This similarity is typically computed using metrics such as L2 distance or cosine similarity [39,73]. Since exact K Nearest Neighbor (KNN) search is costly on large-scale datasets, real-world vector search systems adopt Approximate Nearest Neighbor (ANN) search algorithms, which provide a scalable alternative to exact KNN by trades recall for much higher system performance. ...
... as graph-based search algorithms [29,30,70,72,73,107,110], due to its memory efficiency (e.g., one byte can represent 4∼16 dimensions in PQ [39,43,47]) -a crucial advantage when RAG systems operate on large databases, sometimes containing up to 64 billion vectors (92 TB before quantization) [21,96]. ...
Preprint
Full-text available
Retrieval-augmented generation (RAG), which combines large language models (LLMs) with retrievals from external knowledge databases, is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. In this paper, we make three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. Our evaluation shows that RAGO achieves up to a 2x increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.
... IoT-RAG-SE handles heterogeneous real-time IoT sensing data and routes user queries based on temporal and spatial factors. Utilizing the Sentence-BERT [31] method and a vector database based on the Hierarchical Navigable Small World (Hierarchical NSW, HNSW) [32] approach, as illustrated in Fig. 3, IoT-RAG-SE carries out two main processes: embedding service descriptions and performing semantic search. ...
... First, the query passes through the same data pipeline: tokenization, embedding, pooling, and normalization, producing the normalized embedding vector of the query. Then, an HNSW [32] semantic search is performed within the vector database between the query and the stored service descriptions to determine the top k similar services. Finally, with the top k service names and the query location, the real-time IoT database routes the query, returning nearby IoT data documents that are interpretable by LLMs. ...
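A sketch of this embed-normalize-search pipeline using hnswlib, a widely used HNSW implementation; the embed() stub stands in for Sentence-BERT, and the texts and all index parameters are illustrative assumptions, not the cited system's values.

```python
# Sketch: normalized embeddings are indexed with HNSW, then queried for the
# top-k similar service descriptions. embed() is a stand-in for Sentence-BERT.
import numpy as np
import hnswlib

def embed(texts):
    rng = np.random.default_rng(42)                         # fake embeddings
    v = rng.standard_normal((len(texts), 384)).astype(np.float32)
    return v / np.linalg.norm(v, axis=1, keepdims=True)     # L2-normalize

descriptions = ["traffic camera feed", "air quality sensor", "parking occupancy"]
vecs = embed(descriptions)

index = hnswlib.Index(space="cosine", dim=vecs.shape[1])
index.init_index(max_elements=10_000, M=16, ef_construction=200)
index.add_items(vecs, np.arange(len(descriptions)))
index.set_ef(50)                                            # search accuracy/speed knob

labels, dists = index.knn_query(embed(["nearby road congestion"]), k=2)
print([descriptions[i] for i in labels[0]])                 # top-k similar services
```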
Preprint
Full-text available
The Internet of Things (IoT) has enabled diverse devices to communicate over the Internet, yet the fragmentation of IoT systems limits seamless data sharing and coordinated management. We have recently introduced SensorsConnect, a unified framework to enable seamless content and sensor data sharing in collaborative IoT systems, inspired by how the World Wide Web (WWW) enabled a shared and accessible space for information among humans. This paper presents the IoT Agentic Search Engine (IoT-ASE), a real-time search engine tailored for IoT environments. IoT-ASE leverages Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) techniques to address the challenge of searching vast, real-time IoT data, enabling it to handle complex queries and deliver accurate, contextually relevant results. We implemented a use-case scenario in Toronto to demonstrate how IoT-ASE can improve service quality recommendations by leveraging real-time IoT data. Our evaluation shows that IoT-ASE achieves 92% accuracy in retrieving intent-based services and produces responses that are concise, relevant, and context-aware, outperforming generalized responses from systems like Gemini. These findings highlight the potential of IoT-ASE to make real-time IoT data accessible and support effective, real-time decision-making.
... In the retrieval module, the construction of the database serves as the starting point for all subsequent steps. A knowledge library (containing all knowledge documents) first undergoes data cleaning and partitioning, followed by question splitting, after which an index is constructed with HNSW (Hierarchical Navigable Small World) [13]. Through this process, the documents are converted into question-and-answer pairs, whose key values are transformed into vectors and stored in a vector database. ...
... Vector databases specialize in storing vectors efficiently [4,6]. They are also able to maintain high-dimensional vector indices in real time. ...
... Unlike COLMAP's BA, which incurs a substantial computational cost as the number of images increases, Hierarchical Weighted Local BA significantly reduces time complexity, ensuring efficient processing. Additionally, On-the-Fly SfM can achieve online feature matching via global feature retrieval on HNSW (Hierarchical Navigable Small World) [43], enabling efficient dynamic updates for the matching matrix. On-the-fly SfM allows for near real-time pose estimation and sparse point cloud generation immediately after capturing a new image, ensuring a continuous and timely input for our approach. ...
Preprint
Full-text available
3D Gaussian Splatting (3DGS) achieves high-fidelity rendering with fast real-time performance, but existing methods rely on offline training after full Structure-from-Motion (SfM) processing. In contrast, this work introduces On-the-Fly GS, a progressive framework enabling near real-time 3DGS optimization during image capture. As each image arrives, its pose and sparse points are updated via on-the-fly SfM, and newly optimized Gaussians are immediately integrated into the 3DGS field. We propose a progressive local optimization strategy that prioritizes each new image and its neighbors according to their overlap, so that the new image and its overlapping images receive more training. To further stabilize training across old and new images, an adaptive learning rate schedule balances the iterations and the learning rate. Moreover, to maintain the overall quality of the 3DGS field, an efficient global optimization scheme prevents overfitting to the newly added images. Experiments on multiple benchmark datasets show that our On-the-Fly GS reduces training time significantly, optimizing each new image in seconds with minimal rendering loss, offering the first practical step toward rapid, progressive 3DGS reconstruction.
... We implemented our system as a web application using SvelteKit. For sentence retrieval, we used WeaviateDB as our vector database, using its built-in text embedding capabilities and approximate k-Nearest Neighbor (kNN) search implementation [35]. The system was deployed on a workstation with 12 GB of GPU memory and made remotely accessible to study participants. ...
Preprint
Full-text available
Many communities, including the scientific community, develop implicit writing norms. Understanding them is crucial for effective communication with that community. Writers gradually develop an implicit understanding of norms by reading papers and receiving feedback on their writing. However, it is difficult to both externalize this knowledge and apply it to one's own writing. We propose two new writing support concepts that reify document and sentence-level patterns in a given text corpus: (1) an ordered distribution over section titles, and (2) many contextually relevant sentences retrieved for the user's draft and cursor location. Recurring words in the latter are algorithmically highlighted to help users see any emergent norms. Study results (N=16) show that participants revised the structure and content using these concepts, gaining confidence in aligning with or breaking norms after reviewing many examples. These results demonstrate the value of reifying distributions over other authors' writing choices during the writing process.
... Faiss [10] addresses this by employing approximate nearest-neighbor (ANN) search, enabling large-scale similarity retrieval. Similarly, HNSW graphs [15] optimize nearest-neighbor retrieval using multi-layer navigable small-world graphs, but these methods are approximate rather than exact. FastNN, introduced in MASt3R, aimed to reduce this computational overhead while preserving accuracy, but it still remains a bottleneck in dense matching pipelines. ...
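For concreteness, a minimal sketch contrasting Faiss's exact index with its HNSW index; the dataset sizes and the parameters M and efSearch are illustrative assumptions, not those of any cited pipeline.

```python
# Minimal Faiss sketch: brute-force search vs. an HNSW graph index.
import numpy as np
import faiss

d = 64
xb = np.random.default_rng(0).standard_normal((50_000, d)).astype(np.float32)
xq = (xb[:5] + 0.01).astype(np.float32)  # queries near known database points

exact = faiss.IndexFlatL2(d)             # exact (brute-force) baseline
exact.add(xb)

ann = faiss.IndexHNSWFlat(d, 32)         # HNSW graph, M=32 links per node
ann.hnsw.efSearch = 64                   # larger ef -> higher recall, slower
ann.add(xb)

_, I_exact = exact.search(xq, 5)
_, I_ann = ann.search(xq, 5)
print((I_exact == I_ann).mean())         # agreement rate, a rough recall proxy
```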
Preprint
Full-text available
Image matching is a key component of modern 3D vision algorithms, essential for accurate scene reconstruction and localization. MASt3R redefines image matching as a 3D task by leveraging DUSt3R and introducing a fast reciprocal matching scheme that accelerates matching by orders of magnitude while preserving theoretical guarantees. This approach has gained strong traction, with DUSt3R and MASt3R collectively cited over 250 times in a short span, underscoring their impact. However, despite its accuracy, MASt3R's inference speed remains a bottleneck. On an A40 GPU, latency per image pair is 198.16 ms, mainly due to computational overhead from the ViT encoder-decoder and Fast Reciprocal Nearest Neighbor (FastNN) matching. To address this, we introduce Speedy MASt3R, a post-training optimization framework that enhances inference efficiency while maintaining accuracy. It integrates multiple optimization techniques: FlashMatch, an approach leveraging FlashAttention v2 with tiling strategies for improved efficiency; computation graph optimization via layer and tensor fusion with kernel auto-tuning in TensorRT (GraphFusion); and a streamlined FastNN pipeline that reduces memory access time from quadratic to linear while accelerating block-wise correlation scoring through vectorized computation (FastNN-Lite). Additionally, it employs mixed-precision inference with FP16/FP32 hybrid computations (HybridCast), achieving speedup while preserving numerical precision. Evaluated on Aachen Day-Night, InLoc, 7-Scenes, ScanNet1500, and MegaDepth1500, Speedy MASt3R achieves a 54% reduction in inference time (198 ms to 91 ms per image pair) without sacrificing accuracy. This advancement enables real-time 3D understanding, benefiting applications like mixed reality navigation and large-scale 3D scene reconstruction.
Article
The lack of retrieval methods tailored to the textile field makes it difficult for general-purpose retrieval methods to meet the precision requirements of plaid fabric manufacturing enterprises. Based on the characteristics of plaid fabrics, this paper presents a novel image retrieval method for plaid fabrics using a convolutional neural network (CNN) with an attention mechanism. Global and local deep features are extracted by combining the attention mechanism with the CNN branches to fully characterize the plaid fabric images, which are then fused by an orthogonal fusion module. To reduce the amount of training data, a novel training strategy is designed to optimize the feature extraction and fusion tasks. The Annoy algorithm is used as the similarity measurement method to balance retrieval precision and efficiency. To verify the proposed scheme, over 44,000 fabric samples were collected from the factory to build the image database as the benchmark. Experiments show that precision and recall at rank five are up to 77.5% and 57.1%, respectively, and the mean average precision is up to 0.758. The results show that the proposed method is effective and efficient; it can serve as a reference for fabric manufacturing and leverage historical production experience.
Article
Full-text available
Navigability, an ability to find a logarithmically short path between elements using only local information, is one of the most fascinating properties of real-life networks. However, the exact mechanism responsible for the formation of navigation properties has remained unknown. We show that navigability can be achieved using only two ingredients present in the majority of networks, network growth and local homophily, giving a persuasive answer to how navigation appears in real-life networks. A very simple algorithm produces hierarchical self-similar optimally wired navigable small world networks with an exponential degree distribution by using only local information. Adding preferential attachment produces a scale-free network which has shorter greedy paths, but worse (power law) scaling of the information extraction locality (algorithmic complexity of a search). Introducing saturation of the preferential attachment leads to a truncated scale-free degree distribution that offers a good tradeoff between these parameters and can be useful for practical applications. Several features of the model are observed in real-life networks, in particular in brain neural networks, supporting earlier suggestions that they are navigable.
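A toy sketch of the growth-plus-homophily mechanism described above, under the assumption that homophily is realized by linking each arriving element to its M nearest predecessors; the value of M and the 2-D uniform data are illustrative choices, not the paper's exact model.

```python
# Toy growth + local homophily: each arriving point links to its M nearest
# predecessors. Early arrivals accumulate many links, giving hub-like nodes.
import numpy as np

rng = np.random.default_rng(1)
M, n = 3, 2_000
pts = rng.random((n, 2))                 # elements arrive one at a time
deg = np.zeros(n, dtype=int)
for i in range(1, n):
    d = np.linalg.norm(pts[:i] - pts[i], axis=1)
    nearest = np.argsort(d)[:M]          # homophily: closest existing nodes
    deg[i] += len(nearest)               # new node gains up to M links
    deg[nearest] += 1                    # existing endpoints gain one link each

print("mean degree:", deg.mean(), "max degree:", deg.max())
```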
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Conference Paper
Recent years have witnessed growing interest in computing compact binary codes and binary visual descriptors to alleviate the heavy computational costs in large-scale visual search. However, it is still computationally expensive to linearly scan large-scale databases for nearest neighbor (NN) search. In [15], a new approximate NN search algorithm is presented. With the concept of bridge vectors, which correspond to the cluster centers in Product Quantization [10], and the augmented neighborhood graph, it is possible to adopt an extract-on-demand strategy at the online querying stage to search with priority. This paper generalizes the algorithm to the Hamming space with an alternative version of k-means clustering. Despite its simplicity, our approach achieves competitive performance compared to the state-of-the-art methods, i.e., MIH and FLANN, in terms of search precision, accessed data volume and average querying time.
Conference Paper
We present a new approach for efficient approximate nearest neighbor (ANN) search in high dimensional spaces, extending the idea of Product Quantization. We propose a two-level product and vector quantization tree that reduces the number of vector comparisons required during tree traversal. Our approach also includes a novel highly parallelizable re-ranking method for candidate vectors by efficiently reusing already computed intermediate values. Due to its small memory footprint during traversal, the method lends itself to an efficient, parallel GPU implementation. This Product Quantization Tree (PQT) approach significantly outperforms recent state of the art methods for high dimensional nearest neighbor queries on standard reference datasets. Ours is the first work that demonstrates GPU performance superior to CPU performance on high dimensional, large scale ANN problems in time-critical real-world applications, like loop-closing in videos.
Conference Paper
This paper considers the problem of approximate nearest neighbor search in the compressed domain. We introduce polysemous codes, which offer both the distance estimation quality of product quantization and the efficient comparison of binary codes with Hamming distance. Their design is inspired by algorithms introduced in the 1990s to construct channel-optimized vector quantizers. At search time, this dual interpretation accelerates the search: most of the indexed vectors are filtered out with Hamming distance, letting only a fraction of the vectors be ranked with an asymmetric distance estimator. The method is complementary with a coarse partitioning of the feature space such as the inverted multi-index. This is shown by our experiments performed on several public benchmarks such as the BIGANN dataset comprising one billion vectors, for which we report state-of-the-art results for query times below 0.3 millisecond per core. Last but not least, our approach allows the approximate computation of the k-NN graph associated with the Yahoo Flickr Creative Commons 100M, described by CNN image descriptors, in less than 8 hours on a single machine.
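The filter-then-rank idea can be illustrated schematically; the sketch below uses random sign bits rather than true polysemous codes, so it shows only the two-stage flow (Hamming pruning, then float re-ranking), with all sizes and thresholds assumed.

```python
# Schematic two-stage search: Hamming filtering on binary codes, then exact
# float re-ranking of the survivors. Not the polysemous-code construction.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20_000, 64, 10
xb = rng.standard_normal((n, d)).astype(np.float32)
q = rng.standard_normal(d).astype(np.float32)

codes = np.packbits(xb > 0, axis=1)      # 64 sign bits -> 8 bytes per vector
qcode = np.packbits(q > 0)

hamming = np.unpackbits(codes ^ qcode, axis=1).sum(axis=1)   # XOR + popcount
cutoff = np.partition(hamming, n // 50)[n // 50]             # keep ~best 2%
survivors = np.flatnonzero(hamming <= cutoff)

dists = np.linalg.norm(xb[survivors] - q, axis=1)            # re-rank survivors
print(survivors[np.argsort(dists)[:k]])                      # final top-k ids
```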
Conference Paper
Proximity searching can be formulated as an optimization problem, where the goal is to find the object minimizing the distance to a given query by traversing a graph with a greedy algorithm. This formulation can be traced back to early approaches defined for vector spaces, and to more recent ones defined for the more general setup of metric spaces. In this paper we introduce three searching algorithms that generalize greedy traversal to other forms of local search, and experimentally show that our approach significantly improves the state of the art. In particular, our contributions offer excellent trade-offs among speed, recall and memory usage, making our algorithms suitable for real-world applications. As a byproduct, we present an open source implementation of most of the near neighbor search algorithms in the literature.
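As a rough illustration of local search beyond plain greedy descent, the sketch below runs greedy walks on a kNN graph from several random starts and keeps the best local minimum reached; the graph construction and all parameters are assumptions for illustration, not the paper's algorithms.

```python
# Greedy descent on a kNN graph with random restarts, a simple local-search
# variant: each walk ends at a local minimum, and the best one is returned.
import numpy as np

rng = np.random.default_rng(2)
pts = rng.random((1_000, 8)).astype(np.float32)
K = 8
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
np.fill_diagonal(D, np.inf)               # exclude self-loops
knn = np.argsort(D, axis=1)[:, :K]        # K-NN graph over the dataset

def greedy(start, q):
    """Follow edges downhill in distance to q until no neighbor is closer."""
    cur = start
    while True:
        best = min(knn[cur], key=lambda j: np.linalg.norm(pts[j] - q))
        if np.linalg.norm(pts[best] - q) >= np.linalg.norm(pts[cur] - q):
            return cur                     # local minimum reached
        cur = best

q = rng.random(8).astype(np.float32)
results = [greedy(int(s), q) for s in rng.integers(0, len(pts), size=8)]
print(min(results, key=lambda i: np.linalg.norm(pts[i] - q)))
```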
Conference Paper
The nearest neighbor search problem has been well known since the 1960s, and many approaches have been proposed. One is to build a graph over the set of objects from a given database and use a greedy walk as the basis for a search algorithm. If the greedy walk is able to find the nearest neighbor in the graph starting from any vertex in a small number of steps, such a graph is called a navigable small world. In this paper we propose a new algorithm for building graphs with navigable small world properties. The main advantage of the proposed algorithm is that it is free of input parameters and is able to adapt on the fly to any changes in the distribution of the data. The algorithm is based on the idea of removing local minima by adding new edges; we use this idea to improve the search properties of the structure by using the set of queries observed at the execution stage. An empirical study of the proposed algorithm and a comparison with previous works are reported in the paper.