Medical Image Retrieval via Nearest Neighbor Search
on Pre-trained Image Features
Deepak Gupta, Russell Loane, Soumya Gayen, Dina Demner-Fushman
Lister Hill National Center for Biomedical Communications
National Library of Medicine, National Institutes of Health
Bethesda, MD, USA
Abstract
Nearest neighbor search (NNS) aims to locate the point in high-dimensional
space that is closest to a given query point. The brute-force approach to finding
the nearest neighbor becomes computationally infeasible when the number
of points is large. NNS has multiple applications in medicine, such as
searching large medical imaging databases, disease classification, and diagnosis.
With a focus on medical imaging, this paper proposes DenseLinkSearch, an
effective and efficient algorithm that searches and retrieves the relevant images
from heterogeneous sources of medical images. Towards this, given a medical
database, the proposed algorithm builds an index that consists of pre-computed
links of each point in the database. The search algorithm utilizes the index to
efficiently traverse the database in search of the nearest neighbor. We extensively
tested the proposed NNS approach and compared the performance with state-
of-the-art NNS approaches on benchmark datasets and our created medical
image datasets. The proposed approach outperformed the existing approaches
in terms of retrieving accurate neighbors and retrieval speed. We also explore
the role of medical image feature representation in content-based medical image
retrieval tasks. We propose a Transformer-based feature representation technique
that outperformed the existing pre-trained Transformer-based approaches on
the CLEF 2011 medical image retrieval task. The source code and datasets of our
experiments are available at https://github.com/deepaknlp/DLS.

Corresponding author. Email addresses: deepak.gupta@nih.gov (Deepak Gupta), russellloane@gmail.com (Russell Loane), soumya.gayen@nih.gov (Soumya Gayen), ddemner@mail.nih.gov (Dina Demner-Fushman)

Preprint submitted to Artificial Intelligence in Medicine, October 6, 2022
arXiv:2210.02401v1 [cs.CV] 5 Oct 2022
Keywords: Content-based image retrieval, Nearest neighbor search, Image
feature representation, Indexing and Searching in High Dimensions
1. Introduction
Over the past few decades, medical imaging has significantly improved
healthcare services. Medical imaging helps to save lives, increase life expectancy,
lower mortality rates, reduce the need for exploratory surgery, and shorten
hospital stays. With medical imaging, the physician makes better medical
decisions regarding diagnosis and treatment. Medical imaging procedures are
non-invasive and painless and often do not necessitate any particular preparation
beforehand. With the growing demand for medical imaging, the workload of
radiologists has increased significantly over the past decades. Mayo Clinic has
observed a ten-fold increase in the demand for radiology imaging from just over
9 million in 1999 to more than 94 million in 2010 (McDonald et al., 2015). To
meet the growing demand, radiologists must process one image every three to
four seconds (McDonald et al., 2015). Consequently, the increase in workload
may lead to the incorrect interpretation of the radiology images and compromise
the quality and safety of patient care.
Recent advances in the artificial intelligence (AI) fields of computer
vision and machine learning have the potential to quickly interpret and analyze
different forms of medical images (Lambin et al., 2012; Gupta et al., 2021;
Yu et al., 2020) and videos (Gupta et al., 2022; Gupta & Demner-Fushman,
2022). Content-based image retrieval (CBIR) is one of the key tasks in analyzing
medical images. It involves indexing the large-scale medical-image datasets and
retrieving visually similar images from the existing datasets. With an efficient
CBIR system, one can browse, search, and retrieve from the databases images
that are visually similar to the query image.
CBIR systems are used to support cancer diagnosis (Wei et al., 2009; Bressan
et al., 2019), diagnosis of infectious diseases (Zhong et al., 2021), analysis of the
central nervous system (Mesbah et al., 2015; Conjeti et al., 2016; Li et al., 2018b),
biomedical image archives (Antani et al., 2004), and malaria parasite detection (Khan
et al., 2011; Rajaraman et al., 2018; Kassim et al., 2020). Given the growing
size of the medical imaging databases, efficiently finding the relevant images
is still an important issue to address. Consider a large-scale medical imaging
database with hundreds of thousands to millions of medical images, in which
each image is represented by high-dimensional (thousands of features) dense
vectors. Searching over the millions of images in such high-dimensional space
requires an efficient search. The features used to represent the image are another
key aspect that affects the image search results. Image features with limited
expressive ability often fail to discriminate between images with near-similar
visual appearance. The role of image features becomes more prominent with
image search applications that search over millions of images and demand a
higher degree of precision. To address the aforementioned challenges, we focus
on developing an algorithm that can efficiently search over millions of medical
images. We also examine the role of image features in obtaining relevant and
similar images from large-scale medical imaging datasets.
This study presents DenseLinkSearch an efficient algorithm to search and
retrieve the relevant images from the heterogeneous sources of medical images
and nearest neighbor search benchmark datasets. We first index the feature
vectors of the images. The indexing produces a graph with feature vectors as
vertices and the Euclidean distance between endpoint vectors as edges. In the
literature, tree-based data structures have been used to build indexes to speed
up search and retrieval. Beygelzimer et al. (2006) proposed the Cover Tree, which was
specifically designed to speed up the nearest neighbor search by
efficiently building the index. We compare our proposed DenseLinkSearch
with the existing tree-based and approximate nearest neighbor approaches and
provide a detailed quantitative analysis.
To evaluate the proposed DenseLinkSearch algorithm, we collected 12,851,263
medical images from the Open-i¹ biomedical search engine. We extend our
experiments to 11 benchmark NNS datasets (Artificial, Faces, Corel, MNIST,
FMNIST, TinyImages, CovType, Twitter, YearPred, SIFT, and GIST). The
experimental results show that our proposed DenseLinkSearch is more efficient
and accurate in finding the nearest neighbors in comparison to the existing
approaches.
We summarize the contributions of our study as follows:

1. We devise a robust nearest neighbor search algorithm, DenseLinkSearch, to efficiently search large-scale datasets in which the data points are often represented by high-dimensional vectors. To perform the search, we develop an indexing technique that processes the dataset and builds a graph to store the link information of each data point present in the dataset. The created graph, in the form of an index, is used to quickly scan over the millions of data points in search of the nearest neighbors of the query data point.

2. We also perform an extensive study on the role of the features that are used to represent medical images in the dataset. To assess the effectiveness of the features in retrieving the relevant images, we explore multiple deep neural-based features such as ResNet, ViT, and ConvNeXt and analyze their effectiveness in accurately representing the images in high-dimensional spaces.

3. We demonstrate the effectiveness of our proposed DenseLinkSearch on the newly created Open-i medical imaging datasets and eleven other benchmark NNS datasets. The results show that our proposed NNS technique accurately searches the nearest neighbors orders of magnitude faster than comparable algorithms.

¹ https://openi.nlm.nih.gov/
2. Related Work
2.1. Content-based Image Retrieval
Content-based image retrieval focuses on retrieving images by considering the
visual content of the image, such as color, texture, shape, size, intensity, and
location. In medical image retrieval, for instance, Xue et al. (2008) introduced the
CervigramFinder system that operates on cervicographic images and aims to
find similar images in the database as per the user-defined region. The system
extracted color, texture, and size as the visual features. Antani et al. (2007)
developed SPIRS-IRMA that combines the capability of IRMA (Lehmann et al.,
2004) system (global image data) and SPIRS (Hsu et al., 2007) system (local
region-of-interest image data) to facilitate retrieval based not only on the whole
image but also on local image features so that users can retrieve images that are
not only similar in terms of their overall appearance but also similar in terms
of the pathology that is displayed locally. Depeursinge et al. (2011) proposed a
3D localization system based on lung anatomy that is used to localize low-level
features used for CBIR. The image retrieval task of the Conference and Labs
of the Evaluation Forum (ImageCLEF) has organized multiple medical image
retrieval tasks (Clough et al., 2004; Müller et al., 2009; Kalpathy-Cramer et al.,
2011; Müller et al., 2012) from 2004 to 2013. ImageCLEF has provided
a venue for the researcher to present their findings and engage in head-to-head
comparisons of the efficiency of their medical image retrieval strategies. Over the
years, the participants at ImageCLEF made use of a diverse selection of local
and global textural features. These included the Tamura features: coarseness,
contrast, directionality, line-likeness, regularity, and roughness. Multiple filters
such as Gabor, Haar, and Gaussian filters have been used to generate a diverse
set of visual features. The visual features (Kalpathy-Cramer et al., 2015) for
medical image retrieval are also generated using Haralick’s co-occurrence matrix
and fractal dimensions.
Rahman et al. (2008) proposed a content-based image retrieval framework
that deals with diverse collections of medical images of different modalities,
anatomical regions, acquisition views, and biological systems. They extracted
low-level image features such as the MPEG (Moving Picture Experts Group)-7
based Edge Histogram Descriptor (EHD) and Color Layout Descriptor (CLD) to
represent the images. Further, Rahman et al. (2011) presented an image retrieval
framework based on image filtering and image similarity fusion. The framework
utilizes a support vector machine (SVM) (Cortes & Vapnik, 1995) to predict
the category of query images and images stored in the database. In this way,
the irrelevant images are filtered out, which leads to a reduced search space for
image similarity matching. A three-stage approach for human brain magnetic
resonance image retrieval was introduced by Nazari & Fatemizadeh (2010). In
the first stage, the gray level co-occurrence matrix (GLCM) (Haralick, 1979)
was constructed; thereafter, the image features were extracted by computing
Energy, Entropy, Contrast, Inverse Difference Moment, Variance, Sum
Average, Sum Entropy, Sum Variance, Difference Variance, Difference Entropy,
and the Information Measure of Correlation. Principal component analysis (PCA)
was used for feature reduction in the second stage. An SVM classifier was used
in the last stage to perform decision-making.
With the success of convolutional neural networks (CNNs) for image classification,
CNN-based pre-trained models (Simonyan & Zisserman, 2014; He et al.,
2016; Huang et al., 2017; Szegedy et al., 2016, 2015) became the de-facto
architecture for image classification, feature extraction, and analysis. Qayyum et al.
(2017) proposed a deep learning-based framework for medical image retrieval
tasks. A deep convolutional neural network was trained for the medical image
classification, and the trained model was used to extract the image features. Cao
et al. (2014) developed a deep Boltzmann machine-based multimodal learning
model for fusion of the visual and textual information. The proposed multimodal
approach enabled searches of the most relevant images for a given image query.
Off-the-shelf pre-trained language models were used to extract the image
features for the open domain (Chen et al., 2021). The straightforward approach
(Sharif Razavian et al., 2014; Gong et al., 2014; Babenko et al., 2014) is to extract
the image features from the last fully-connected layer; however, it may include
irrelevant patterns or background clutter. In another strategy (Yue-Hei Ng
et al., 2015; Razavian et al., 2016; Lou et al., 2018), the image features are
extracted from convolutional layers that preserve more structural details. In
order to extract global and local features for the images, the layer-level (Do &
Cheung, 2017; Cao et al., 2020; Yu et al., 2017; Zhang et al., 2019) feature fusion
mechanism has been adapted to complement each feature for the image retrieval
task.
2.2. Nearest Neighbor Search

Over the last several decades, numerous optimization strategies aimed at speeding up the nearest neighbor search have been presented. Tree-based methods have been widely used to speed up the NNS process. In the tree-based approach, a tree-like data structure is used to organize the data points in such a way that it can be efficiently traversed in search of the nearest neighbors for the input query. In general, once the tree structure has been built, the triangle inequality is used to filter out the nodes of the tree that cannot be the nearest neighbor, thereby reducing the computation and speeding up the search process. Friedman et al. (1977) proposed the first space-partitioning tree, KD Tree, which uses a depth-first tree traversal technique, followed by backtracking, to locate the nearest neighbor in logarithmic time. At each node of the KD Tree, $d$-dimensional data points are recursively partitioned into two sets by splitting along one dimension of the data. The split value is often chosen to be the median value along the dimension being split. This leads to the points being evenly distributed over axis-aligned hyper-rectangles. Choosing the split value and the split dimension are the key challenges in the KD Tree. To alleviate these issues, the Ball Tree (Fukunaga & Narendra, 1975; Omohundro, 1989) was proposed, which considers hyper-spheres instead of hyper-rectangles to form clusters of the data points in high-dimensional spaces. The Ball Tree computes the centroid of the whole data set, which is then used to recursively partition the data set into two subgroups. It uses the triangle inequality to prune a ball and all data points within the ball while searching for nearest neighbors. Other tree-based approaches such as PCA Tree (Sproull, 1991), VP Tree (Yianilos, 1993), M Tree (Ciaccia et al., 1997), R Tree (Kamel & Faloutsos, 1993), and Cover Tree (Beygelzimer et al., 2006) have been introduced in the nearest neighbor search literature. Most tree-based methods for nearest-neighbor search are well-suited for low-dimensional data points; however, they perform poorly in high-dimensional spaces.
Wang (2011) observed that the performance of tree-based methods is con-
sidered to be satisfactory if the search for nearest neighbors requires only a
few nodes at each level of the tree. However, in the case of high dimensional
data, these approaches lose their effectiveness as the histograms of distances
and 1-Lipschitz function values become concentrated (Pestov, 2012; Boytsov
& Naidan, 2013). In this case, indexes built with a clustering-based partition
technique seem to perform better than the tree-based indexes. By following
the clustering-based partition scheme, numerous approaches (Prerau & Eskin,
2000; Wang, 2011; Almalawi et al., 2015) have been proposed to find the nearest
neighbors in high dimensional spaces. In clustering-based indexes, the data
points form multiple clusters. While searching for the nearest neighbors, the
triangle inequality can be applied to prune the clusters that cannot hold the
nearest neighbors.
In another line of research, lower bound-based methods for efficient nearest
neighbor search have been proposed. The key idea of the lower bound-based
methods is to reduce the distance computation between the query and the candi-
date data points, which leads to an efficient NNS. Liu et al. (2018) proposed two
lower bounds: progressive lower bound (PLB) and statistical lower bound (SLB),
that aim to reduce the distance computation and accelerate the approximate
NNS of the HNSW indexing method (Malkov et al., 2014). Hwang et al. (2012)
consider the mean and variance of data points to derive the lower bounds that
significantly reduce the distance computations. Further, Hwang et al. (2018)
introduced product quantized translation that aims to eliminate nearest neighbor
candidates effectively using their Euclidean distance lower bounds in nonlinear
embedded spaces. Jeong et al. (2006) and Li et al. (2018a) also utilized the lower-bound
strategy to reduce the distance computations. Recently, Zhang et al.
(2022) introduced the concept of block vectors based on lower bounds that
reduce the expensive distance computation. Further, they designed a multilevel
lower bound that computes the lower bound step-by-step and makes use of the
multistep filtering technique to speed up the search further.
3. Proposed Approach
3.1. Background
In the nearest neighbor search, given a set of $N$ data points $\mathcal{D} = \{v_1, v_2, \ldots, v_N\}$ where $v_i \in \mathbb{R}^d$, and a distance metric $dist(v_i, v_j)$ for points $v_i$ and $v_j$, for any given query point $q \in \mathbb{R}^d$, the goal is to find the nearest point $v^*$ in the data points such that:

$$v^* = \arg\min_{v_i \in \mathcal{D}} dist(v_i, q) \tag{1}$$
A variant of NNS is the $k$-nearest neighbors (kNN) problem, which aims to find the $k$ nearest points in $\mathcal{D}$ to a query $q$, where $k$ is a constant. A brute-force algorithm requires computing the distance between the query point $q$ and each data point in $\mathcal{D}$, resulting in $O(N)$ time. In applications where the number of points is large and each data point is high-dimensional, it is not computationally feasible to use the brute-force algorithm. Therefore, our goal is to process the data points in advance, which can reduce the computational complexity and quickly find the nearest neighbors of the queries.
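To make the baseline concrete, a minimal brute-force kNN sketch in Python/NumPy follows (the helper name and random data are ours, for illustration only); index-based methods such as the one proposed here aim to avoid this full scan:

```python
import numpy as np

def brute_force_knn(D, q, k=10):
    """Return indices and distances of the k points in D nearest to q.

    D: (N, d) array of data points; q: (d,) query vector.
    Computes all N Euclidean distances, i.e. O(N) distance evaluations.
    """
    dists = np.linalg.norm(D - q, axis=1)  # distance to every point
    idx = np.argpartition(dists, k)[:k]    # k smallest, unordered
    idx = idx[np.argsort(dists[idx])]      # order them by distance
    return idx, dists[idx]

# Example: 10,000 points in 128 dimensions
rng = np.random.default_rng(0)
D = rng.standard_normal((10_000, 128))
q = rng.standard_normal(128)
neighbors, distances = brute_force_knn(D, q, k=10)
```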
3.2. Nearest Neighbor Search
We propose an effective approach to finding the kNN in high-dimensional vector space. The approach deals with building the index and finding the nearest neighbors of a given query with a time-efficient DenseLinkSearch algorithm. In this section, we describe the indexing algorithm and the DenseLinkSearch algorithm in detail.
3.2.1. Summary of the Approach
Given a high-dimensional dataset, our proposed approach first builds the index considering all the data points from the dataset. Formally, the indexing algorithm builds the descend and spread links (to be introduced shortly) for each data point by considering the $K_{index}$ nearest neighbors. We use the notation $K_{index}$ to indicate the number of nearest neighbors embedded in the index. In the indexing process, each vertex² forms links to the other vertices, making a cluster of links that appear to be dense links. While searching for the neighbors nearest to a query after the index is built, the DenseLinkSearch algorithm follows a two-stage approach. In the first stage, it follows the descend links of the closest indexed vector to the query vector, and in the second stage, it follows the spread links of the closest indexed vector to surround the query vector. While searching for the neighbors, the algorithm spends the majority of the calculation time in the spread stage to find the $K_{search}$ nearest neighbors. We use the notation $K_{search}$ to indicate the number of nearest neighbors of the query image that have to be searched in the dataset.
3.2.2. Indexing Algorithm
Given a dataset $\mathcal{D}$ with $N$ vectors, the algorithm builds the index by considering the $K_{index}$ nearest neighbors of each vector in the data. The indexing algorithm constructs an index graph $G$ whose vertices are the vectors and whose edges hold the Euclidean distances between their two endpoints. The vectors are added to the index one by one. Once a vector is added to the index, it is called a node (vertex), so all nodes are vectors. During indexing, each vector has a spherical neighborhood that encloses neighboring vectors. Neighborhoods contract from infinity (the entire set of data points) at the start of indexing to just the nearest neighbors at the end. When a vector becomes a node, its final neighborhood (the $K_{index}$ nearest neighbors) is calculated, so its neighborhood shrinks to a minimum.

² We use the terms 'vertex', 'vector', and 'data point' interchangeably in this paper.
[Figure 1 graphic. Panels: (a) data point representation; (b) a snapshot of index construction; (c) final index construction; (d) index links with distances. Panel (d) lists, for each node, its links and distances: A: (B, 32), (E, 42); B: (D, 24), (E, 30), (A, 32), (C, 39); C: (B, 39), (D, 41); D: (B, 24), (E, 29), (C, 41); E: (D, 29), (B, 30), (A, 42).]

Figure 1: Illustration of the indexing algorithm. Subfigure (a) demonstrates the image vector representation in high-dimensional space. The indexing algorithm considers each vector representation as a vertex and the distance between two vertices as an edge of the index graph. The indexing starts with a random vertex and calls that vertex a node. The algorithm initializes, for each vector, a heap data structure of size $K_{index} = 2$ and a list. The heap holds links to the nearest vectors known so far. The list holds links to other vectors that consider this vector near. The top link in a heap defines the neighborhood radius, also known as the near distance. Subfigure (b) represents a snapshot of the index after the data point E became a node and its heap and list were updated based on the distances between the node and its neighbors. Subfigure (c) demonstrates each node's heap and list items and the index graph with nodes and edges after the index is built. The created links and distances are shown in subfigure (d). The final index contains the descend and spread links for each node in the dataset. The descend links contain the endpoint vertices and distances from the heap just before the vertex becomes a node while building the index. Similarly, the spread links of a given vertex store the endpoint vertices and distances from the heap and list once the algorithm ends.
Variable | Description
$R_v$ | Neighborhood radius, also known as near distance.
$C_v$ | Distance to the closest indexed vector.
$H_v$ | Heap holding the links to the near neighbors of vector $v$. The top of the heap $H_v$ is the link to the furthest of the near neighbors.
$L_v$ | List holding the links to the far neighbors of vector $v$. This list holds links to other vectors that consider this vector near.
$I_v$ | List holding the finished index links for vector $v$.

Table 1: Description of the variables for each vector $v \in \mathcal{D}$.
In this process, links are created to neighbors, often causing their neighborhoods to shrink a little as well. Initially, all neighborhoods are huge and overlap, so links are created to all nodes. Later on, neighborhoods overlap with fewer and fewer other neighborhoods, creating fewer links. Shrinking neighborhoods allows indexing to occur with much less than $O(N^2)$ distance calculations.
The proposed indexing algorithm has the following steps to build the index:
1. Initialization: For each vector $v$ in the dataset $\mathcal{D}$, we initialize the variables listed in Table 1. We also initialize a global max-heap $H$ of size $N$ that stores, at any point during the indexing process, the distance of the vector that is furthest from all previously indexed nodes. To hold the final index, we initialize a list $I$ that stores the list of links $I_v$ for each vector $v$ in the dataset $\mathcal{D}$.
2. Link Creation: To start the indexing, we choose a random vector from the dataset, which becomes a node of the graph $G$. The heap $H$ tracks the nearest known distance $C_v$ for all unindexed vectors. The top of the heap $H$ is the vector that is furthest from all indexed nodes. This vector is known as the create vector and will become the next node in the indexing process. The distance from the create vector to its nearest node is known as the create distance. When a vector becomes a node, it is removed from the heap $H$, and we also copy all the links from the local heap $H_v$ of the vector $v$ and add them to the list $I_v$. The stored links are long at the beginning of indexing, get shorter as indexing progresses, and provide a multiply connected network of links at all length scales for moving around the dataset in the descend stage of the DenseLinkSearch. During the search, this network of links makes it possible to descend from the root node to a node close to the query vector in $O(\log N)$ steps. These network links are called descend links.

For a vector $v$, the near distance is defined by the link at the top of the local heap $H_v$, which is the distance to the furthest of the $K_{index}$ nearest neighbors linked so far. If the heap is not yet full, the near distance is infinite. A vector $A$ considers another vector $B$ near if $B$ is within $A$'s near distance $R_A$ (also known as the neighborhood radius). Since vectors have different neighborhood radii, it is often the case that a vector $A$ considers a vector $B$ near, but $B$ does not consider $A$ near. For the first $K_{index}$ nodes indexed, no vector's heap is full, and every vector considers every other vector near. Further indexing adds links to vector heaps, pushing out the longest link and therefore shrinking the near distance. When neighborhoods shrink to the point that neither end of a link considers the other end near, the link is dropped from the vectors' heaps and lists.

Further, to create the links for the new node, we search for the potential neighbors that are not yet nodes, i.e., not yet indexed. A link between vectors $A$ and $B$ is only created if one or both of the vectors consider the other one near. The new link is added to the heap $H_A$ of vector $A$ if $A$ considers $B$ near, i.e., the distance $d_{A,B} < R_A$; otherwise it is added to the list $L_A$ of vector $A$. Likewise, it is added to the heap $H_B$ of vector $B$ if $B$ considers $A$ near, i.e., $d_{A,B} < R_B$; otherwise it is added to the list $L_B$ of vector $B$. Adding a link to a full heap pushes the top vector out, shrinking the near distance $R_A$. The link that is pushed out is moved to the list if the other endpoint vector still considers this one near. If neither endpoint considers the other near, the link is dropped.

When a vector becomes a node, all $K_{index}$ of its existing descend links are to existing nodes. Due to the node creation order enforced by the global heap $H$, the early nodes will have long links to far-away nodes. Creation always chooses the vector that is furthest from all existing nodes, so the create distance continually shrinks. Therefore, the nodes created later in the indexing process will have mid-range links to nearer nodes, and the last nodes created will have short links to their nearest neighbors.
3. Post-processing: At the end of indexing, all nodes have $K_{index}$ links to their nearest neighbors. These nearest neighbor links are also stored, providing a dense mesh of short links between nearest neighbors. These mesh links make it possible to spread out from a node close to a given query vector and find all the query's nearest neighbors. These mesh links are called spread links. The same link may be both a descend and a spread link, and no real distinction is made during the search. At the end of indexing, all links are unique and sorted by length to optimize the search.

We have illustrated the indexing algorithm with a running example in Fig. 1 and provided detailed pseudocode for the indexing algorithm in the Appendix.
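To make the link-creation rule concrete, the following is a simplified Python sketch under our own naming; it captures only the per-vector heap/list bookkeeping described above, not the authors' full indexing pseudocode, which is in the Appendix:

```python
import heapq

K_INDEX = 2  # nearest neighbors kept per vector, as in the Fig. 1 example

class Vector:
    def __init__(self, vid):
        self.vid = vid
        self.heap = []  # max-heap of (-distance, other_id): the near links
        self.list = []  # links to vectors that consider this vector near

    def near_distance(self):
        # Neighborhood radius R_v: distance of the furthest stored near link;
        # infinite while the heap is not yet full.
        return float("inf") if len(self.heap) < K_INDEX else -self.heap[0][0]

def add_link(a, b, dist):
    """Create a link between a and b if either end considers the other near."""
    a_near = dist < a.near_distance()
    b_near = dist < b.near_distance()
    if not (a_near or b_near):
        return  # neither end considers the other near: drop the link
    for v, other, near in ((a, b, a_near), (b, a, b_near)):
        if near:
            heapq.heappush(v.heap, (-dist, other.vid))
            if len(v.heap) > K_INDEX:
                # Pushing into a full heap evicts the longest link, shrinking
                # the near distance R_v. (The full algorithm moves the evicted
                # link to the list if the other endpoint still considers this
                # one near; this sketch simply drops it.)
                heapq.heappop(v.heap)
        else:
            v.list.append((dist, other.vid))
```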
3.2.3. DenseLinkSearch Algorithm
Given the dataset $\mathcal{D}$ with $N$ vectors and its index containing links $I$, the DenseLinkSearch algorithm finds the $K_{search}$ nearest neighbors of the query vector $q$ using the following steps:

1. Initialization: First, we initialize a global lookup table $L$ that stores each vector $v \in \mathcal{D}$ as a key and the distance $d_{v,q}$ between vector $v$ and the query vector $q$ as the value. We also initialize a global heap $H$ of size $K_{search}$ to hold the $K_{search}$ nearest neighbors of the query vector $q$. We start the search from the root vector of the dataset $\mathcal{D}$ and compute the distance between the root vector and the query vector. Initially, the root vector is the nearest neighbor of the query; therefore, it is pushed into the global heap $H$ and also recorded in $L$.
[Figure 2 graphic. Panels (a)-(d) trace a search for query Q over the nodes A-E of Fig. 1: panels (a)-(c) show the Descend stage and panel (d) the Spread stage.]

Figure 2: Demonstration of DenseLinkSearch using the index built in Fig. 1. DenseLinkSearch operates in Descend and Spread stages to find the $K_{search}$ nearest neighbors. The input to the search algorithm is a query vector Q and the pre-built index with its links. In subfigure (a), the search starts with a random node (here E) and uses the links of node E to compute the distance between the query vector and each node listed in the links of E. In each step of the Descend stage, the algorithm keeps track of the vector closest to the query. A heap $H$ of size $K_{search}$ is also maintained to track the $K_{search}$ nearest neighbors throughout the search process. In subfigure (b), the distances between the query vector and the links (A, B, and D) of E are computed, and the closest vector is updated accordingly. Subfigure (c) shows that the closest node A is chosen for the next Descend step. Since the distances (23 and 32) between the query and the links of node A are greater than the distance (18) between the query and node A, the search switches to the Spread stage. The Spread stage is demonstrated in subfigure (d), where we aim to compute the distances between the query and the links (B and E) of node A, which have already been computed. At the end of the algorithm, the $K_{search}$ nearest neighbors can be found in the heap $H$.
We also initialize two global variables $V_C$ and $D_C$, which keep track of the nearest vector and the closest distance to the query vector found at any point during the NNS.
2. Descend Stage: The Descend stage utilizes the index $I$ built during the indexing process. During the search, we start with the closest vector $V_C$ and retrieve all the links $L$ of $V_C$ from $I$. We traverse each link $l \in L$ and compute the distance between the vector associated with link $l$ and the query vector $q$. During the traversal of the links, we find the local nearest vector and nearest distance from $L$ and update the global $V_C$ and $D_C$. We repeat the Descend step as long as we keep getting closer to the query (the closest distance keeps shrinking). Then, we switch to the Spread stage.
3. Spread Stage: In the Spread stage, we traverse the global heap $H$ to find the closest vector and its distance to the query vector $q$. Specifically, for each vector $v \in H$, we extract its links $L$ from the index and traverse each link $l \in L$ in search of the vector closest to the query vector $q$. We continue traversing $H$ until we find a new closest vector $V_N$ whose distance $D_N$ is smaller than the distance $D_L$ of the maximum-distance vector $V_L$ recorded in $H$. Once we find the closest vector $V_N$ with its distance $D_N$ to the query, we compare $D_N$ with the global closest distance $D_C$. If $D_N$ is smaller than $D_C$, we update the global closest vector $V_C$ and closest distance $D_C$ and perform the Descend stage. Otherwise, if $D_N$ is smaller than $D_L$, we perform the Spread stage again.
We have illustrated the DenseLinkSearch algorithm with a running example in Fig. 2 and provided detailed pseudocode for the DenseLinkSearch algorithm in the Appendix.
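The following is a simplified Python sketch of the two-stage search loop under our own naming and data layout; the authoritative version is the pseudocode in the Appendix:

```python
import heapq
import numpy as np

def dense_link_search(query, vectors, links, root, k_search=10):
    """Simplified DenseLinkSearch sketch.

    vectors: dict id -> np.ndarray; links: dict id -> iterable of neighbor ids
    (the descend/spread links from the index); root: starting node id.
    Returns the k_search nearest (distance, id) pairs found.
    """
    dist_to = {root: float(np.linalg.norm(vectors[root] - query))}  # lookup L
    heap = [(-dist_to[root], root)]       # max-heap H of k_search best so far
    v_c, d_c = root, dist_to[root]        # globally closest node/distance

    def visit(node):
        # Compute distances to all linked vectors of `node`, updating H and L.
        best_v, best_d = None, float("inf")
        for nb in links[node]:
            if nb in dist_to:
                continue
            d = float(np.linalg.norm(vectors[nb] - query))
            dist_to[nb] = d
            heapq.heappush(heap, (-d, nb))
            if len(heap) > k_search:
                heapq.heappop(heap)       # drop the furthest candidate
            if d < best_d:
                best_v, best_d = nb, d
        return best_v, best_d

    # Descend stage: follow links while we keep getting closer to the query.
    improved = True
    while improved:
        v_n, d_n = visit(v_c)
        improved = d_n < d_c
        if improved:
            v_c, d_c = v_n, d_n

    # Spread stage: expand the current candidates until no newly visited
    # vector beats the worst distance kept in the heap.
    changed = True
    while changed:
        changed = False
        for _, node in list(heap):
            v_n, d_n = visit(node)
            worst = -heap[0][0]
            if d_n < worst:
                changed = True
                if d_n < d_c:             # got globally closer again
                    v_c, d_c = v_n, d_n
    return sorted((-nd, i) for nd, i in heap)
```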
3.3. Image Feature Representation
To examine the effectiveness of the image feature representation, we performed extensive experiments considering multiple feature extraction models. We also explored different feature aggregation techniques to assess their significance in image-based retrieval. In this section, we discuss the feature extraction models and feature aggregators in detail.
3.3.1. Medical Image Feature Extractors
1. Deep Residual Network: Gradient degradation is a key challenge in training deep neural networks: training error increases as layers are added to the network, lowering the accuracy of the model. As the number of layers in the network increases, the gradient computed in the back-propagation step starts to diminish. This problem of vanishing derivatives in deep neural networks is called the vanishing gradient problem (Hochreiter, 1998). To overcome these problems, He et al. (2016) introduced a deep residual learning framework with deep convolutional neural networks (CNNs) that reformulates the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. Given the input $x$ received from the previous layer, the residual learning framework recasts the original mapping into $F(x) + x$. The formulation $F(x) + x$ can be realized by feed-forward neural networks with "shortcut connections". These connections perform identity mapping, and their outputs are summed with the stacked layers' outputs (see the sketch after this list). In this study, we utilize the pre-trained ResNet50 model as a feature extractor.
2. Vision Transformers: Inspired by the success of the Transformer architecture (Vaswani et al., 2017) in Natural Language Processing (NLP), Dosovitskiy et al. (2020) explored the Transformer architecture for images and developed the Vision Transformer (ViT). With the Transformer, images are treated like sequences of tokens, as in NLP, by splitting the images into patches. The patch embedding is added to a position embedding and passed as input to the Transformer layers. The study conducted by Dosovitskiy et al. (2020) shows that a Transformer applied directly to sequences of image patches achieves better results on image classification tasks than CNNs while using fewer computational resources to train. We utilize the pre-trained ViT-Base, ViT-Large, and ViT-Huge models as image feature extractors.
3. ConvNeXt: ConvNeXt (Liu et al., 2022) is a family of pure CNNs developed by adopting several design decisions from Transformers. The ConvNeXt family incorporates the following key design decisions:

(a) ConvNeXt follows the stage compute ratio strategy of Swin Transformers (Liu et al., 2021). To adopt the strategy, it adjusts the number of blocks in each stage of the network from (3, 4, 6, 3) in ResNet50 to (3, 3, 9, 3) in ConvNeXt.

(b) ConvNeXt adopts the grouped convolution approach of ResNeXt (Xie et al., 2017) and uses depth-wise convolution, a special case of grouped convolution where the number of groups equals the number of channels.

(c) ConvNeXt considers larger kernel sizes for the convolution operations. Varying the kernel size from 3 to 11, the authors found an optimal kernel size of 7 × 7.

(d) Additionally, ConvNeXt replaces ReLU with GELU (Hendrycks & Gimpel, 2016), uses fewer activation functions, and uses fewer batch normalization (Santurkar et al., 2018) layers.

In this work, we utilized pre-trained ConvNeXt-B, ConvNeXt-L, and ConvNeXt-XL as image feature extractors.
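As an illustration of the residual reformulation $F(x) + x$ described in item 1 above, here is a minimal PyTorch-style residual block (an illustrative basic block of our own, not the exact ResNet50 bottleneck used as the feature extractor):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Illustrative residual block: output = F(x) + x (identity shortcut)."""
    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions with batch norm
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The shortcut performs identity mapping; its output is summed
        # with the stacked layers' output.
        return self.relu(self.f(x) + x)

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # shape preserved: (1, 64, 56, 56)
```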
3.3.2. Feature Aggregations
Given an image $M$ and an image feature extractor $G(M; \theta)$, where $\theta$ denotes the feature extractor parameters that are frozen during feature extraction, we first pre-process the image and transform it into $X \in \mathbb{R}^{c \times w \times h}$, where $c$, $w$, and $h$ are the number of channels, the width, and the height of the image, respectively. The feature extractor³ $G$ takes $X$ as input and produces a 3D tensor $x \in \mathbb{R}^{K \times W \times H}$, where $K$ is the number of feature maps in the last layer of the feature extractor, and $W$ and $H$ refer to the width and height of each feature map, respectively.
1. Max Pooling: In max pooling, the maximum value of the spatial feature activations of each feature map is used for the feature representation. Formally,

$$f^{(max)} = [f_1^{(max)}, f_2^{(max)}, \ldots, f_K^{(max)}]^\top \tag{2}$$

where $f_k^{(max)} = \max_{y \in x_k}(y)$ is the value obtained after applying the $\max$ operation to the $k$-th feature map $x_k \in \mathbb{R}^{W \times H}$.

³ We focus here on CNN-based feature extractors.
2. Sum Pooling: Sum pooling obtains the feature representation by summing the spatial feature activations of each feature map. Formally,

$$f^{(sum)} = [f_1^{(sum)}, f_2^{(sum)}, \ldots, f_K^{(sum)}]^\top \tag{3}$$

where $f_k^{(sum)} = \sum_{y \in x_k} y$ is the value obtained after applying the $sum$ operation to the $k$-th feature map $x_k \in \mathbb{R}^{W \times H}$.
3. Mean Pooling: In mean pooling, the average value of the spatial feature activations of each feature map is used for the feature representation. Formally,

$$f^{(mean)} = [f_1^{(mean)}, f_2^{(mean)}, \ldots, f_K^{(mean)}]^\top \tag{4}$$

where $f_k^{(mean)} = \mathrm{mean}_{y \in x_k}(y)$ is the value obtained after applying the $mean$ operation to the $k$-th feature map $x_k \in \mathbb{R}^{W \times H}$.
4. Generalized Mean Pooling: This is a generalization of the pooling techniques in which the generalized mean (Dollár et al., 2009; Radenović et al., 2018) is used to derive the pooling function. Formally,

$$f^{(gmean)} = [f_1^{(gmean)}, f_2^{(gmean)}, \ldots, f_K^{(gmean)}]^\top \tag{5}$$

$$f_k^{(gmean)} = \left( \frac{1}{|x_k|} \sum_{y \in x_k} y^{p_k} \right)^{1/p_k} \tag{6}$$

where $p_k$ is a hyper-parameter.
5. Spatial-wise Attention: In spatial-wise attention, we model the importance of each spatial position by computing a weight for it; the final feature is generated by taking these spatial weights into account. We build an attention matrix $\alpha \in \mathbb{R}^{W \times H}$. The sum of any row or column of the matrix $\alpha$ is 1, which signifies the importance of each spatial position. First, we compute the weight matrix $w \in \mathbb{R}^{W \times H}$ as follows:

$$w = \sum_{k=1}^{K} x_k \tag{7}$$

Then we apply row-wise softmax and column-wise softmax operations on $w$ to obtain the attention matrix $\alpha$. The final aggregated features $f^{(spatial)}$ are obtained as follows:

$$f^{(spatial)} = [f_1^{(spatial)}, f_2^{(spatial)}, \ldots, f_K^{(spatial)}]^\top \tag{8}$$

where $f_k^{(spatial)} = \mathrm{mean}(x_k \odot \alpha)$ is obtained by applying element-wise multiplication to $x_k$ and $\alpha$; the $mean$ operation is used to aggregate the feature activations.
6. Channel-wise Attention: In channel-wise attention, we model the importance of each feature map (channel) by computing a weight for it; the final feature is generated by taking these channel weights into account. We build an attention vector $\beta \in \mathbb{R}^{K}$ whose elements sum to 1, which signifies the importance of each feature map. Similar to spatial-wise attention, we first compute the weight vector $w \in \mathbb{R}^{K}$ as follows:

$$w_k = \sum_{i=1}^{W} \sum_{j=1}^{H} a_{i,j}^{k} \tag{9}$$

where $a_{i,j}^{k}$ is an element of the $k$-th feature map. Then we apply the softmax operation on $w$ to obtain the attention vector $\beta$. The final aggregated features $f^{(channel)}$ are obtained as follows:

$$f^{(channel)} = [f_1^{(channel)}, f_2^{(channel)}, \ldots, f_K^{(channel)}]^\top \tag{10}$$

where $f_k^{(channel)} = \mathrm{mean}(x_k \times \beta_k)$ is obtained by scalar multiplication of $x_k$ with $\beta_k$; the $mean$ operation is used to aggregate the feature activations.
For the ConvNeXt feature extractor, we performed a detailed investigation of the ConvNeXt layers. We observe that the pre-trained LayerNorm (Ba et al., 2016) weights from the last convolution layer can be exploited to obtain a better feature representation. Therefore, we pool the ConvNeXt features as follows:

$$f^{(mean)} = [f_1^{(mean)}, f_2^{(mean)}, \ldots, f_K^{(mean)}]^\top$$
$$f^{(lmean)} = \mathrm{LayerNorm}(\sigma(f^{(mean)})) \tag{11}$$

where $f_k^{(mean)} = \mathrm{mean}_{y \in x_k}(y)$ is the value obtained after applying the $mean$ operation to the $k$-th feature map $x_k \in \mathbb{R}^{W \times H}$, $\sigma$ denotes the sigmoid activation function, and $\mathrm{LayerNorm}(\cdot)$ denotes the LayerNorm operation whose weights are initialized with the pre-trained weights from the ConvNeXt model.
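The simple pooling aggregators of Eqs. (2)-(6) can be sketched in a few lines of NumPy (an illustrative helper of our own; the attention-based aggregators follow the same pattern with the weights of Eqs. (7)-(10)):

```python
import numpy as np

def aggregate(x, method="mean", p=2.0):
    """Pool a feature tensor x of shape (K, W, H) into a K-dim vector.

    Implements the max / sum / mean / generalized-mean pooling of
    Eqs. (2)-(6); p is the GeM hyper-parameter p_k (set to 2 in this work).
    Assumes non-negative activations for GeM (e.g. after ReLU).
    """
    K = x.shape[0]
    flat = x.reshape(K, -1)              # each row is one feature map x_k
    if method == "max":
        return flat.max(axis=1)
    if method == "sum":
        return flat.sum(axis=1)
    if method == "mean":
        return flat.mean(axis=1)
    if method == "gmean":
        return np.power(flat, p).mean(axis=1) ** (1.0 / p)
    raise ValueError(method)

x = np.abs(np.random.randn(1536, 9, 9))  # e.g. a ConvNeXt-L feature map
f = aggregate(x, "gmean", p=2.0)         # 1536-dim image descriptor
```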
4. Experimental Setup
This section describes the experimental setups for NNS and medical image
feature extractors for image retrieval.
4.1. Nearest Neighbor Search
4.1.1. Index and DenseLinkSearch Computational Details
We ran all the experiments on a computing node with 72 CPUs (36 cores) and 256 GB RAM. DenseLinkSearch has two key hyper-parameters, $K_{index}$ and $K_{search}$. The former denotes the number of nearest neighbors calculated and kept in the index, while the latter represents the number of nearest neighbors found by DenseLinkSearch during the search. We provide the best hyper-parameter values for each NNS approach in Tables A.7 and A.8 in the Appendix.
4.1.2. Datasets for NNS
We evaluated our NNS approach on multiple benchmark datasets from the UCI repository⁴ and from the ANN-Benchmark⁵. The datasets are diverse in the number of samples (minimum 10K, maximum 12.85M) as well as in the dimensionality of the samples (minimum 20, maximum 1,536). The MNIST, FMNIST, SIFT, and GIST datasets come with a train and test split; we split the remaining datasets into train and test ourselves. We use the training datasets to build the indexes and the test datasets to find the 10 nearest neighbors for each query. We provide the details of each dataset along with their sizes and dimensions in Table 2.

⁴ https://archive.ics.uci.edu/ml/index.php
⁵ https://github.com/erikbern/ann-benchmarks
OpenI Datasets. The OpenI datasets comprise images from the Open Access (OA) subset of PubMed Central⁶ (PMC). We curated around 4,222,779 articles from PMC OA and extracted 14,908,095 images from the curated articles. The scientific articles also contain multi-panel images. The multi-panel images were further split into single-panel images using a panel segmentation approach (Demner-Fushman et al., 2012), resulting in a total of 24,625,466 images. These images also contain graphical illustrations such as charts and graphs. We employed our in-house modality detector (Rahman et al., 2013) that categorizes each image into one of eight modality classes. We excluded the images that were predicted as graphical and only considered medical non-graphical images. This process yields 12,851,263 images. The medical image retrieval experiments described in Section 5.2 show that the ConvNeXt-L (IN-22K) model outperforms the existing medical image feature extractors. Therefore, we utilized ConvNeXt-L (IN-22K) to generate the image features for the OpenI dataset; we call this the OpenI-ConvNeXt dataset. To assess the role of feature dimensionality for the OpenI dataset, we also generated image features from the ResNet50 model with 512 dimensions; we call this the OpenI-ResNet dataset.

⁶ https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
4.1.3. Evaluation Metrics for NNS
We evaluate the performance of NNS using the following two metrics:

1. Recall@k (R@k): The ratio of true nearest neighbors retrieved in the top-k nearest neighbors returned by the algorithm to the k true nearest neighbors in the test set. In this study, we focus on retrieving the 10 nearest neighbors; therefore, the metric is R@10.

2. Average time per query (ATPQ): The average time per query taken by an NNS approach to retrieve the 10 nearest neighbors. We report this metric in milliseconds.
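Both metrics are straightforward to compute; for example, R@k reduces to a set intersection (an illustrative helper of our own, not part of the released code):

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """R@k: fraction of the k true nearest neighbors found in the top-k results."""
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k

# Example: 8 of the 10 true neighbors retrieved -> R@10 = 0.8
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 11, 12],
                  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], k=10))
```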
Dataset | # Samples | Dimension | # Build Index | # Query
Artificial (Botta et al., 1993) | 10,000 | 40 | 9,000 | 1,000
Faces (Dua & Graff, 2017) | 10,304 | 20 | 9,304 | 1,000
Corel (Ortega et al., 1998) | 68,040 | 32 | 58,040 | 10,000
MNIST (LeCun et al., 1998) | 70,000 | 784 | 60,000 | 10,000
FMNIST (Xiao et al., 2017) | 70,000 | 784 | 60,000 | 10,000
TinyImages (Dua & Graff, 2017) | 100,000 | 384 | 90,000 | 10,000
CovType (Blackard & Dean, 1999) | 581,012 | 54 | 571,012 | 10,000
Twitter (Dua & Graff, 2017) | 583,250 | 78 | 573,250 | 10,000
YearPred (Dua & Graff, 2017) | 515,345 | 90 | 505,345 | 10,000
SIFT (Jegou et al., 2010) | 1,000,000 | 128 | 990,000 | 10,000
GIST (Jegou et al., 2010) | 1,000,000 | 960 | 999,000 | 1,000
OpenI-ResNet (Ours) | 12,851,263 | 512 | 12,841,263 | 10,000
OpenI-ConvNeXt (Ours) | 12,851,263 | 1,536 | 12,841,263 | 10,000

Table 2: Details of the benchmark and our created OpenI datasets used for the nearest neighbor search experiments.
4.1.4. Baseline NNS Methods
We compare the performance of our proposed DenseLinkSearch (DLS) approach with the following competitive NNS methods:

1. KD Tree (Friedman et al., 1977) and Ball Tree (Fukunaga & Narendra, 1975): Tree-based approaches used to find the nearest neighbors, discussed in Section 2. We use the scikit-learn (Pedregosa et al., 2011) implementation⁷ of KD Tree and Ball Tree.

2. RP Forest (Yan et al., 2019): Built on the concept of the random projection tree (Dasgupta & Freund, 2008), where nearest neighbors are found by combining multiple trees, each constructed recursively through a series of random projections. We utilized the rpForest implementation⁸.

3. Facebook Artificial Intelligence Similarity Search (FAISS): FAISS (Johnson et al., 2019) is an approximate nearest neighbor library implementing Locality Sensitive Hashing (LSH) (Datar et al., 2004) based indexing, the Inverted File (IVF) (Babenko & Lempitsky, 2014), and Product Quantization (Jegou et al., 2010). We use the FAISS implementation⁹; a generic usage sketch appears after this list.

4. Annoy (Bernhardsson): We utilized the approximate nearest neighbor method Annoy¹⁰, which strives to minimize memory usage.

5. Multiple Random Projection Trees (MRPT) (Hyvönen et al., 2016): In MRPT, multiple random projection trees are combined by a voting scheme. The overall idea is to exploit the redundancy in a large number of candidate data points and eventually reduce the number of expensive exact distance computations using independently generated random projections. We utilized the official implementation¹¹.

6. Hierarchical Navigable Small World (HNSW) (Malkov & Yashunin, 2018): A graph-based approximate nearest neighbor approach. HNSW builds a multi-layer structure consisting of a hierarchical set of proximity graphs for nested subsets of the stored elements. To obtain the results from HNSW, we utilized the official implementation¹².

7. Scalable Nearest Neighbors (ScaNN) (Guo et al., 2020): A quantization-based nearest neighbor search method that computes the approximate distance between each data point and the query vector. We utilized the official implementation¹³.

⁷ https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors
⁸ https://github.com/lyst/rpforest
⁹ https://github.com/facebookresearch/faiss
¹⁰ https://github.com/spotify/annoy
¹¹ https://github.com/vioshyvo/mrpt
¹² https://github.com/nmslib/hnswlib
¹³ https://github.com/google-research/google-research/tree/master/scann
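As an illustration of how such a baseline is typically invoked, here is a generic FAISS-IVF usage sketch; the data, nlist, and nprobe values are placeholders, and the hyper-parameters actually used in our runs are listed in Tables A.7 and A.8:

```python
import numpy as np
import faiss

d = 128
xb = np.random.random((100_000, d)).astype("float32")  # database vectors
xq = np.random.random((10, d)).astype("float32")       # query vectors

# FAISS-IVF: a coarse quantizer plus an inverted-file index
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024)  # 1024 Voronoi cells
index.train(xb)                                 # learn the coarse centroids
index.add(xb)
index.nprobe = 16                               # cells visited per query
distances, ids = index.search(xq, 10)           # 10 nearest neighbors
```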
4.2. Medical Image Feature Extractors for Image Retrieval Evaluation
4.2.1. Feature Extractors Details and Hyper-parameters
To extract features for the images, we use pre-trained weights from the ResNet, ViT, and ConvNeXt models. For ResNet, we use pre-trained ResNet50 weights¹⁴. We extracted the features from the conv5_block1_2_conv layer of the ResNet50 model, which returns an output tensor of shape 512 × 7 × 7. For ViT, we experiment with the ViT-Base (dimension 768), ViT-Large (dimension 1024), and ViT-Huge (dimension 1280) models. Since ViT is a Transformer model, the [CLS] token representation is taken as the image feature representation. For ConvNeXt, we experiment with several variants, mainly those trained on the ImageNet-1K (Russakovsky et al., 2015) dataset and pre-trained on the ImageNet-22K (Ridnik et al., 2021) dataset. We experiment with the ConvNeXt-B, ConvNeXt-L, and ConvNeXt-XL models, which output feature maps of shapes 1024 × 9 × 9, 1536 × 9 × 9, and 2048 × 9 × 9, respectively. We utilized the pre-trained weights of the ViT and ConvNeXt models from timm (Wightman, 2019). For generalized mean pooling, we set the hyper-parameter $p_k$ to 2.

¹⁴ https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50
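A minimal feature-extraction sketch with timm follows; the model name and input size here are illustrative, and the exact checkpoint names for the IN-22K variants depend on the installed timm version:

```python
import timm
import torch

# Model name is an assumption; check timm's model registry for the exact
# ConvNeXt-L (IN-22K) checkpoint name available in your timm version.
model = timm.create_model("convnext_large", pretrained=True, num_classes=0)
model.eval()

x = torch.randn(1, 3, 224, 224)      # a pre-processed image batch
with torch.no_grad():
    fmap = model.forward_features(x)  # (1, K, W, H) feature maps
    feat = fmap.mean(dim=(2, 3))      # e.g. mean pooling to a K-dim vector
```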
4.2.2. Image Retrieval Dataset and Evaluation

We used the ImageCLEF 2011 dataset (Kalpathy-Cramer et al., 2011). The dataset consists of 231,000 images from PubMed Central articles, 30 textual and visual queries, and relevance judgments. For the visual queries, 2-3 enumerated sample images are provided. In this study, we used only the first image as the visual query to retrieve the relevant images from the pool of 231,000 images. For each image, we extract the features from the pre-trained models and rank the images based on the cosine similarity between the query image and each database image. As in the ImageCLEF evaluations, we evaluated the performance using Mean Average Precision, Precision@k, R-precision, and binary preference. The metrics are defined as follows:
1. Precision at k (P@k): The proportion of system-retrieved images that are correct:

$$P@k = \frac{\text{\# of correctly retrieved images in the top-}k}{\text{\# of retrieved images}} \tag{12}$$

2. Mean Average Precision (MAP): The average precision averaged across a set of queries:

$$MAP = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} P@k_j \tag{13}$$

where $Q$ is the total set of queries and $m_j$ is the number of correct images returned for the $j$-th query.

3. R-precision (Rprec) (Buckley & Voorhees, 2004): The precision of the retrieval system after $R$ documents are retrieved, where $R$ is the number of relevant images for the given image query.

4. Binary Preference (bpref) (Buckley & Voorhees, 2004): A preference relation measuring whether judged relevant images are retrieved ahead of judged irrelevant images. It is defined as follows:

$$bpref = \frac{1}{R} \sum_{r} \left( 1 - \frac{|n \text{ ranked higher than } r|}{R} \right) \tag{14}$$

where $N$ and $R$ are the numbers of judged non-relevant and relevant images, respectively, $r$ is a relevant image, and $n$ is a member of the first $R$ judged non-relevant images as retrieved by the system. We use the trec_eval evaluation tool¹⁵ to compute the aforementioned metrics.

¹⁵ https://trec.nist.gov/trec_eval/
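For illustration, cosine-similarity ranking and per-query average precision (the building block of Eq. (13)) can be sketched as follows (our own helpers, not the trec_eval implementation):

```python
import numpy as np

def rank_by_cosine(query_feat, db_feats):
    """Rank database images by cosine similarity to the query feature."""
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    return np.argsort(-(db @ q))  # indices, most similar first

def average_precision(ranked_ids, relevant):
    """AP for one query: mean of P@k at the rank of each relevant image found."""
    hits, precisions = 0, []
    for k, img in enumerate(ranked_ids, start=1):
        if img in relevant:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

# MAP is the mean of average_precision over all queries.
```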
5. Results and Discussion
5.1. Nearest Neighbor Search
The detailed results comparing the proposed DLS with tree-based and approximate NNS methods are shown in Tables 3 and 4, respectively. The best-performing FAISS approach on each dataset is shown in Table 4, and the performance of the individual FAISS approaches is shown in Table B.9, where we highlight the FAISS-family approach that yields the best performance on each dataset. To report the results (ATPQ and R@10) of the DLS approach on each dataset, we ran the experiments three times and report the mean values.
For the Faces dataset, our approach outperformed the most competitive approaches (ScaNN and MRPT) with an ATPQ of 0.10 ms and an R@10 of 99.20%. The tree-based approaches also reported R@10 ≥ 99%; however, their ATPQ was very high (0.216 ms for KD Tree and 0.214 ms for Ball Tree). Similarly, for the datasets with size ≤ 70,000 (MNIST, FMNIST, Corel), our approach outperformed the counterpart NNS approaches. We observe that the approximate NNS approaches MRPT and HNSW also reported competitive R@10 values (≥ 99%); however, the proposed DLS approach has a lower ATPQ with ≥ 99% R@10 on the MNIST, FMNIST, and Corel datasets. On TinyImages (size = 100,000 and dimension = 384), our approach reported an ATPQ of 3.7 ms with an R@10 of 99.10%; the closest competitive approach, ScaNN, reported an ATPQ of 0.649 ms, but its R@10 was 98.6%. The other competitive approach, MRPT, obtained an R@10 of 99.99% on the TinyImages dataset, but its ATPQ was 14.31 ms.

We also analyze the results on the CovType, Twitter, and YearPred datasets, which have dataset sizes ≥ 500,000. For CovType, our approach has an ATPQ of 0.36 ms with an R@10 of 99.1%. The closest NNS approach, HNSW, has an ATPQ of 0.024 ms with an R@10 of 98.63%. Another approximate NNS approach, Annoy, obtained an R@10 of 99.99% with an ATPQ of 0.575 ms. Our proposed DLS approach obtained an ATPQ of 0.42 ms with an R@10 of 99.5% on the Twitter dataset; HNSW recorded an ATPQ of 0.051 ms with an R@10 of 92.79%. We observe similar patterns on the large-scale datasets (SIFT, GIST, OpenI-ResNet, OpenI-ConvNeXt) with high dimensions and dataset sizes in the millions. Our approach outperformed the existing competitive approaches on these high-dimension datasets as well.
Dataset | Brute Search | KD Tree | Ball Tree | RP Forest | DLS
Artificial | 0.860 / 100 | 0.471 / 100.0 | 0.471 / 100.0 | 0.776 / 47.79 | 0.460 / 99.30
Faces | 0.560 / 100 | 0.216 / 99.98 | 0.214 / 99.98 | 0.934 / 48.13 | 0.100 / 99.20
Corel | 4.200 / 100 | 2.432 / 99.99 | 2.383 / 99.99 | 6.880 / 68.31 | 0.090 / 99.50
MNIST | 100 / 100 | 87.30 / 100.0 | 87.24 / 100.0 | 31.48 / 71.74 | 0.850 / 99.00
FMNIST | 100 / 100 | 88.92 / 100.0 | 88.70 / 100.0 | 38.20 / 47.17 | 0.780 / 99.30
CovType | 74.00 / 100 | 46.21 / 99.99 | 46.14 / 99.99 | 40.43 / 73.93 | 0.360 / 99.10
TinyImages | 76.00 / 100 | 62.93 / 99.99 | 63.01 / 99.99 | 34.16 / 35.35 | 3.700 / 99.10
Twitter | 110.0 / 100 | 70.19 / 99.60 | 70.06 / 99.60 | 78.01 / 18.68 | 0.420 / 99.50
YearPred | 120.0 / 100 | 74.08 / 99.99 | 74.09 / 99.99 | 58.48 / 25.30 | 2.000 / 99.30
SIFT | 280.0 / 100 | 215.4 / 99.93 | 216.0 / 99.93 | 184.5 / 99.21 | 1.600 / 99.70
GIST | 2200 / 100 | 1920 / 99.91 | 1919 / 99.91 | 735.1 / 35.70 | 36.00 / 99.10

Table 3: Comparison of the proposed DLS approach with multiple tree-based nearest neighbor approaches on benchmark datasets. Each cell reports ATPQ (ms) / R@10 (%). The highlighted cells represent R@10 ≥ 99% for which the ATPQ is the lowest amongst all the approaches. Due to the memory footprint overhead, we could not run the experiments with tree-based approaches on the OpenI datasets.
Dataset
Method Brute Search FAISS Annoy MRPT HNSW ScaNN DLS
ATPQ
(ms)
R@10
(%)
ATPQ
(ms)
R@10
(%)
ATPQ
(ms)
R@10
(%)
ATPQ
(ms)
R@10
(%)
ATPQ
(ms)
R@10
(%)
ATPQ
(ms)
R@10
(%)
ATPQ
(ms)
R@10
(%)
Artificial 0.860 100 3.754 49.63 1.034 99.20 0.123 100.0 0.037 71.39 0.033 96.65 0.460 99.30
Faces 0.560 100 0.044 89.46 0.992 99.41 0.111 99.97 0.019 86.51 0.025 96.66 0.100 99.20
Corel 4.200 100 0.217 90.50 1.306 99.92 0.657 99.99 0.024 96.62 0.078 57.95 0.090 99.50
MNIST 100.0 100 4.683 86.02 2.235 99.29 17.28 100.0 0.124 93.42 0.878 100.0 0.850 99.00
FMNIST 100.0 100 4.986 92.99 2.259 99.06 17.61 100.0 0.119 94.04 0.898 100.0 0.780 99.30
CovType 74.00 100 6.532 99.22 0.575 99.99 15.18 99.99 0.024 98.63 0.081 17.25 0.360 99.10
TinyImages 76.00 100 3.074 73.11 3.692 86.27 14.31 99.99 0.140 63.16 0.649 98.96 3.700 99.10
Twitter 110.0 100 10.19 97.23 2.056 98.81 20.22 98.95 0.051 92.79 0.038 2.228 0.420 99.50
YearPred 120.0 100 7.299 90.48 1.995 97.28 19.50 99.99 0.077 79.95 0.895 5.818 2.000 99.30
SIFT 280.0 100 13.41 89.00 2.542 92.57 51.61 99.92 0.106 79.48 3.625 98.72 1.600 99.70
GIST 2200 100 140.9 78.59 9.081 69.19 368.9 99.89 0.718 54.32 26.21 96.20 36.00 99.10
OpenI-ResNet 19760 100 869.72 85.80 11.96 87.03 2815.4 99.83 1.187 68.74 178.83 98.68 18.92 99.20
OpenI-ConvNeXt 3423 100 2475.7 90.20 18.35 82.23 7993.4 99.48 0.7852 54.32 530.20 99.10 100.2 97.89
Table 4: Comparison of the proposed DLS approach with approximate nearest neighbor approaches on benchmark datasets. The performance of the best approach from the FAISS family on each dataset is reported under the FAISS column; FAISS-IVF performed best amongst all the FAISS approaches for every dataset except the Artificial dataset. Average time per query (ATPQ) is reported in milliseconds, and Recall@10 (R@10) is reported in percentage. Highlighted cells mark the approaches achieving R@10 ≥ 99% (or ≥ 97%) with the lowest ATPQ amongst all the approaches. Lower ATPQ and higher R@10 are better.
To summarize, the tree-based approaches consistently produced ≥ 99% R@10; however, their ATPQ values were high, which makes them unsuitable for real-time applications where latency and accuracy matter equally. In comparison, our proposed DLS approach outperformed the approximate NNS approaches in terms of lower ATPQ with ≥ 99% R@10 on 11 out of 13 datasets.
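For concreteness, the two quantities reported throughout this section can be measured with a small harness such as the sketch below; the random data, query set, and the `search_fn` name are illustrative, and the brute-force pass provides the exact ground truth against which R@10 is computed.

```python
# Minimal sketch of the evaluation protocol: Recall@10 against brute-force
# ground truth and average time per query (ATPQ). `search_fn` is any NNS
# routine returning the ids of the k nearest vectors for a query.
import time
import numpy as np

def evaluate(search_fn, data, queries, k=10):
    # Brute-force ground truth: the exact k nearest neighbors per query.
    truth = []
    for q in queries:
        d = np.linalg.norm(data - q, axis=1)
        truth.append(set(np.argsort(d)[:k]))

    start = time.perf_counter()
    results = [search_fn(q, k) for q in queries]
    atpq_ms = 1000 * (time.perf_counter() - start) / len(queries)

    recall = np.mean([len(truth[i] & set(results[i])) / k
                      for i in range(len(queries))])
    return atpq_ms, 100 * recall  # ATPQ in ms, R@10 in percent

# Sanity check with brute force itself as the search function:
rng = np.random.default_rng(0)
data, queries = rng.normal(size=(10000, 64)), rng.normal(size=(100, 64))
brute = lambda q, k: np.argsort(np.linalg.norm(data - q, axis=1))[:k]
print(evaluate(brute, data, queries))  # expect R@10 = 100%
```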
5.1.1. Effect of Kindex and Ksearch
To analyze the effect of the hyper-parameters Kindex and Ksearch of our proposed DLS approach, we performed a detailed sensitivity analysis on each benchmark dataset with combinations of Kindex and Ksearch values. We observed that search speed is slower when Kindex increases (i.e., more nearest neighbor links are included in the index) and when Ksearch increases (i.e., more nearest neighbors are found with spread during search). In contrast, R@10 improves as Kindex and Ksearch increase. Similar patterns were observed across all the datasets, which confirms that there is a trade-off between search speed and recall; Kindex and Ksearch should therefore be chosen for the DLS approach as needed. To perform this sensitivity analysis on the OpenI datasets, we chose a subset of each dataset having a size of 1,000,000 and selected combinations of Kindex and Ksearch. We provide the charts for both OpenI datasets in Figs. 3 and 4.
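A sweep of this kind can be scripted as below. The `build_index` and `search` functions are hypothetical stand-ins for a DLS implementation (e.g., the one released with this paper) and are not a published API; the sketch reuses the `evaluate` helper from the earlier snippet.

```python
# Hypothetical sweep over Kindex and Ksearch to trace the speed/recall
# trade-off; build_index pre-computes the links once per Kindex value.
for k_index in (20, 40, 60, 80):
    index = build_index(data, k_index)             # assumed DLS indexing call
    for k_search in (10, 20, 30, 40, 50, 60):
        atpq, r10 = evaluate(
            lambda q, k: search(q, data, index, k_search)[:k],  # assumed call
            data, queries)
        print(f"Kindex={k_index} Ksearch={k_search}: "
              f"{atpq:.2f} ms/query, R@10={r10:.1f}%")
```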
Figure 3: The effect of Kindex and Ksearch on ATPQ (a) and Recall@10 (b) on 1,000,000 samples of the OpenI-ConvNeXt dataset (curves for Kindex values between 20 and 160, with Ksearch varied from 10 to 60).
Figure 4: The effect of Kindex and Ksearch on ATPQ (a) and Recall@10 (b) on 1,000,000 samples of the OpenI-ResNet dataset (curves for Kindex values between 30 and 210, with Ksearch varied from 10 to 60).
5.1.2. Effect of dataset size and feature dimensions
To understand the effect of the dataset size and feature dimensions on the DLS approach, we performed an in-depth analysis by varying the dataset size from 250K to 12.85M on the OpenI datasets, which have feature dimensions of 512 and 1,536. We observe that the average time per query increases with dataset size, as the algorithm needs to perform more distance computations to retrieve the nearest neighbors. With respect to the increase in feature dimensions from 512 to 1,536, we notice that the average time per query increases by 2.27, 3.87, and 4.74 times at Ksearch = 10 for dataset sizes of 1M, 3M, and 12M, respectively. We provide the trends for both OpenI datasets in Figs. 5 and 6.
5.2. Medical Image Feature Extractors
Table 5 presents the results from multiple feature extractors on the ImageCLEF 2011 medical image retrieval task.
For the ResNet50 model, mean and sum pooling performed better than max pooling on multiple metrics, with maximum MAP and P@20 scores of 0.0518 and 0.1500, respectively, obtained using mean pooling with the ResNet50 model. We observed that the ViT-Base model performs better than its counterpart ViT-Large and ViT-Huge models.
Figure 5: Effect of the sample size (250K to 12M) on the average time per query for the OpenI-ResNet dataset, for Ksearch values between 10 and 60.
Figure 6: Effect of the sample size (250K to 12M) on the average time per query for the OpenI-ConvNeXt dataset, for Ksearch values between 20 and 100.
The ViT-Base model achieves the maximum MAP of 0.0713, compared to ResNet-Mean, which achieved a MAP score of 0.0518. The ConvNeXt-L (IN-22K) model achieved the maximum evaluation scores amongst all the ConvNeXt variants and their counterpart feature extractors. It outperformed the best ResNet model by 0.032, 0.02, and 0.0674 in terms of MAP, P@20, and bpref, respectively, and the best ViT model by 0.0125, 0.02, and 0.0126 on the same metrics. Our results are somewhat higher than those of the best performing system at CLEF 2011 (0.05, 0.0533, and 0.1346 in terms of MAP, P@10, and bpref, respectively).
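Globally pooled features of the kind compared above can be extracted, for instance, with the timm library (Wightman, 2019). The sketch below is illustrative rather than a record of our exact pipeline: the checkpoint name is an assumption and may differ across timm versions, and available names can be listed with timm.list_models.

```python
# Sketch of extracting globally pooled ConvNeXt features with timm.
# The checkpoint name 'convnext_large_in22k' is indicative; check
# timm.list_models('convnext*', pretrained=True) for your timm version.
import timm
import torch

model = timm.create_model('convnext_large_in22k', pretrained=True,
                          num_classes=0)   # num_classes=0 -> pooled features
model.eval()

cfg = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**cfg)  # preprocessing for real images

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image
    feats = model(x)                        # (1, 1536) feature vector
print(feats.shape)
```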
We report the ConvNeXt-L (IN-22K) results with mean pooling in the first row of Table 6. We then aggregated the feature maps with the other pooling strategies discussed in Section 3.3.2 and report those results in Table 6 as well.
Models MAP P@5 P@10 P@20 Rprec bpref
ResNet with Max-pooling 0.0463 0.1733 0.1533 0.1383 0.0841 0.1455
ResNet with Mean-pooling 0.0518 0.2000 0.1900 0.1500 0.0838 0.1479
ResNet with Sum-pooling 0.0517 0.2000 0.1900 0.1500 0.0838 0.1479
ViT-Base 0.0713 0.2133 0.1833 0.1433 0.1109 0.2027
ViT-Large 0.0679 0.1933 0.1833 0.1500 0.0907 0.1934
ViT-Huge 0.0666 0.1600 0.1467 0.1450 0.1146 0.1962
ConvNeXt-B 0.0459 0.1067 0.1033 0.0933 0.0759 0.1595
ConvNeXt-B (IN-22K) 0.0699 0.2133 0.1667 0.1500 0.1037 0.2041
ConvNeXt-L 0.0450 0.1267 0.1133 0.0950 0.0722 0.1568
ConvNeXt-L (IN-22K) 0.0838 0.2333 0.2033 0.1700 0.1186 0.2153
ConvNeXt-XL (IN-22K) 0.0802 0.2067 0.1700 0.1617 0.1091 0.2146
Kalpathy-Cramer et al. (2011) 0.0338 0.1500 0.0807
Table 5: Performance comparison of feature extractors on the CLEF 2011 medical image retrieval task. IN-22K refers to models pre-trained on the ImageNet-22K dataset.
We did not observe any improvements with the other feature aggregation strategies over mean pooling. Since the mean pooling operation performed by the ConvNeXt-L model also utilizes the pre-trained LayerNorm parameters, we additionally performed the experiments with the pre-trained LayerNorm parameters. With this approach, we recorded improvements on multiple evaluation metrics using the Generalized Mean pooling strategy over mean pooling: gains of 0.0134, 0.02, and 0.01 in terms of P@5, P@10, and P@20, respectively.
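For reference, Generalized Mean (GeM) pooling reduces a feature map of shape (C, H, W) to a C-dimensional vector via ((1/HW) Σ_i x_i^p)^(1/p), recovering mean pooling at p = 1 and approaching max pooling as p grows (Radenović et al., 2018). A minimal PyTorch sketch, with an illustrative p = 3:

```python
# Minimal sketch of Generalized Mean (GeM) pooling over a feature map of
# shape (B, C, H, W); p=3 is a common default, and p=1 recovers mean pooling.
import torch

def gem_pool(feat, p=3.0, eps=1e-6):
    # Clamp for numerical stability, raise to p, average spatially,
    # then take the p-th root: ((1/HW) * sum x^p)^(1/p).
    return feat.clamp(min=eps).pow(p).mean(dim=(-2, -1)).pow(1.0 / p)

fmap = torch.rand(1, 1536, 7, 7)      # stand-in ConvNeXt-L feature map
vec = gem_pool(fmap)                  # (1, 1536) image descriptor
print(vec.shape)
```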
5.3. Effect of Medical Image Features and DenseLinkSearch on the Open-i Service
The Open-i service (https://openi.nlm.nih.gov/) at the National Library of Medicine enables search and retrieval of abstracts and images from the open-source literature and biomedical image collections. We aim to improve the Open-i image search service with the proposed medical feature extraction and DenseLinkSearch approaches.
Models MAP P@5 P@10 P@20 Rprec bpref
ConvNeXt-L (IN-22K) with Mean pooling 0.0838 0.2333 0.2033 0.1700 0.1186 0.2153
Pooling: Max 0.0279 0.1267 0.1067 0.0933 0.0542 0.1290
Pooling: Sum 0.0698 0.2000 0.1800 0.1500 0.1033 0.1926
Pooling: Spatial-wise Attention 0.0677 0.2133 0.1733 0.1533 0.1024 0.1923
Pooling: Channel-wise Attention 0.0002 0.0000 0.0000 0.0000 0.0031 0.0143
Pooling: Generalized Mean 0.0521 0.1600 0.1167 0.1117 0.0833 0.1706
LayerNorm with Pooling: Max 0.0289 0.1133 0.1100 0.0900 0.0614 0.1133
LayerNorm with Pooling: Sum 0.0836 0.2467 0.2133 0.1833 0.1126 0.2100
LayerNorm with Pooling: Spatial-wise Attention 0.0821 0.2333 0.1967 0.1783 0.1110 0.2086
LayerNorm with Pooling: Channel-wise Attention 0.0046 0.0200 0.0233 0.0200 0.0162 0.0554
LayerNorm with Pooling: Generalized Mean 0.0838 0.2467 0.2233 0.1800 0.1153 0.2129
Table 6: Performance comparison of different feature aggregation (pooling) techniques on the CLEF 2011 medical image retrieval task. Feature maps are extracted from the ConvNeXt-L (IN-22K) model, and the respective pooling operation is applied to produce the reported numbers. The first-row results use the mean pooling operation.
Figure 7: Comparison of the nearest images retrieved by the existing Open-i system (a) and our proposed system (b). The first image in each subfigure is the query image, and the remaining images are the nearest images in decreasing order of their similarity to the query image.
Fig. 7 shows the effect of the proposed approaches for retrieving similar images. It clearly illustrates that the proposed medical feature extraction with the effective DenseLinkSearch retrieves similar retinal images given an image of a retina as the query, while the current Open-i system retrieves mostly brain images as the nearest images for the same retinal query.
6. Conclusion
In this paper, we introduced an effective approach to nearest neighbor search that traverses the links created during indexing to reduce the number of distance computations required for NNS. We comprehensively evaluated multiple state-of-the-art approximate NNS algorithms and tree-based algorithms on multiple benchmark datasets. The proposed DenseLinkSearch outperformed the existing approaches, achieving ≥ 99% R@10 with the lowest average time per query on most of the benchmark datasets. In addition, we explored the role of image feature extractors in medical image retrieval tasks. We experimented with multiple pre-trained vision models and devised an effective feature extraction approach that outperformed the pre-trained vision model-based image feature extractors by a fair margin on multiple evaluation metrics.
Acknowledgements
This work was supported by the intramural research program at the U.S.
National Library of Medicine, National Institutes of Health (NIH), and utilized
the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
References
Almalawi, A. M., Fahad, A., Tari, Z., Cheema, M. A., & Khalil, I. (2015). kNNVWC: An efficient k-nearest neighbors approach based on various-widths clustering. IEEE Transactions on Knowledge and Data Engineering, 28, 68–81.
Antani, S., Long, L. R., & Thoma, G. R. (2004). Content-based image retrieval for large biomedical image archives. In MEDINFO 2004 (pp. 829–833). IOS Press.
Antani, S. K., Deserno, T. M., Long, L. R., Güld, M. O., Neve, L., & Thoma, G. R. (2007). Interfacing global and local CBIR systems for medical image retrieval. In Bildverarbeitung für die Medizin 2007 (pp. 166–171). Springer.
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
Babenko, A., & Lempitsky, V. (2014). The inverted multi-index. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 1247–1260.
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In European Conference on Computer Vision (pp. 584–599). Springer.
Bernhardsson, E. Annoy. URL: https://github.com/spotify/annoy.
Beygelzimer, A., Kakade, S., & Langford, J. (2006). Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning (pp. 97–104).
Blackard, J. A., & Dean, D. J. (1999). Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24, 131–151.
Botta, M., Giordana, A., & Saitta, L. (1993). Learning fuzzy concept definitions. In [Proceedings 1993] Second IEEE International Conference on Fuzzy Systems (pp. 18–22). IEEE.
Boytsov, L., & Naidan, B. (2013). Learning to prune in metric and non-metric spaces. Advances in Neural Information Processing Systems, 26.
Bressan, R. S., Bugatti, P. H., & Saito, P. T. (2019). Breast cancer diagnosis through active learning in content-based image retrieval. Neurocomputing, 357, 1–10.
Buckley, C., & Voorhees, E. M. (2004). Retrieval evaluation with incomplete information. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 25–32).
Cao, B., Araujo, A., & Sim, J. (2020). Unifying deep local and global features for image search. In European Conference on Computer Vision (pp. 726–743). Springer.
Cao, Y., Steffey, S., He, J., Xiao, D., Tao, C., Chen, P., & Müller, H. (2014). Medical image retrieval: a multimodal approach. Cancer Informatics, 13, CIN–S14053.
Chen, W., Liu, Y., Wang, W., Bakker, E. M., Georgiou, T., Fieguth, P., Liu, L., & Lew, M. (2021). Deep image retrieval: A survey. ArXiv.
Ciaccia, P., Patella, M., & Zezula, P. (1997). M-tree: An efficient access method for similarity search in metric spaces. In VLDB (pp. 426–435), volume 97.
Clough, P., Sanderson, M., & Müller, H. (2004). The CLEF cross language image retrieval track (ImageCLEF) 2004. In International Conference on Image and Video Retrieval (pp. 243–251). Springer.
Conjeti, S., Mesbah, S., Negahdar, M., Rautenberg, P. L., Zhang, S., Navab, N., & Katouzian, A. (2016). Neuron-Miner: an advanced tool for morphological search and retrieval in neuroscientific image databases. Neuroinformatics, 14, 369–385.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
Dasgupta, S., & Freund, Y. (2008). Random projection trees and low dimensional manifolds. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing (pp. 537–546).
Datar, M., Immorlica, N., Indyk, P., & Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry (pp. 253–262).
Demner-Fushman, D., Antani, S., Simpson, M., & Thoma, G. R. (2012). Design and development of a multimodal biomedical information retrieval system. Journal of Computing Science and Engineering, 6, 168–177.
Depeursinge, A., Zrimec, T., Busayarat, S., & Müller, H. (2011). 3D lung image retrieval using localized features. In Medical Imaging 2011: Computer-Aided Diagnosis (p. 79632E). International Society for Optics and Photonics, volume 7963.
Do, T.-T., & Cheung, N.-M. (2017). Embedding based on function approximation for large scale image search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 626–638.
Dollár, P., Tu, Z., Perona, P., & Belongie, S. (2009). Integral channel features.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S. et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
Dua, D., & Graff, C. (2017). UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.
Friedman, J. H., Bentley, J. L., & Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS), 3, 209–226.
Fukunaga, K., & Narendra, P. M. (1975). A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 100, 750–753.
Gong, Y., Wang, L., Guo, R., & Lazebnik, S. (2014). Multi-scale orderless pooling of deep convolutional activation features. In European Conference on Computer Vision (pp. 392–407). Springer.
Guo, R., Sun, P., Lindgren, E., Geng, Q., Simcha, D., Chern, F., & Kumar, S. (2020). Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning (pp. 3887–3896). PMLR.
Gupta, D., Attal, K., & Demner-Fushman, D. (2022). A dataset for medical instructional video classification and question answering. arXiv preprint arXiv:2201.12888.
Gupta, D., & Demner-Fushman, D. (2022). Overview of the MedVidQA 2022 shared task on medical video question-answering. In Proceedings of the 21st Workshop on Biomedical Language Processing (pp. 264–274). Dublin, Ireland: Association for Computational Linguistics. URL: https://aclanthology.org/2022.bionlp-1.25. doi:10.18653/v1/2022.bionlp-1.25.
Gupta, D., Suman, S., & Ekbal, A. (2021). Hierarchical deep multi-modal network for medical visual question answering. Expert Systems with Applications, 164, 113993.
Haralick, R. M. (1979). Statistical and structural approaches to texture. Proceedings of the IEEE, 67, 786–804.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).
Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6, 107–116.
Hsu, W., Long, L. R., Antani, S. et al. (2007). SPIRS: a framework for content-based image retrieval from large biomedical databases. MedInfo, 12, 188–192.
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4700–4708).
Hwang, Y., Baek, M., Kim, S., Han, B., & Ahn, H.-K. (2018). Product quantized translation for fast nearest neighbor search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Hwang, Y., Han, B., & Ahn, H.-K. (2012). A fast nearest neighbor search algorithm by nonlinear embedding. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (pp. 3053–3060). IEEE.
Hyvönen, V., Pitkänen, T., Tasoulis, S., Jääsaari, E., Tuomainen, R., Wang, L., Corander, J., & Roos, T. (2016). Fast nearest neighbor search through sparse random projections and voting. In Big Data (Big Data), 2016 IEEE International Conference on (pp. 881–888). IEEE.
Jegou, H., Douze, M., & Schmid, C. (2010). Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 117–128.
Jeong, S., Kim, S.-W., Kim, K., & Choi, B.-U. (2006). An effective method for approximating the euclidean distance in high-dimensional space. In International Conference on Database and Expert Systems Applications (pp. 863–872). Springer.
Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7, 535–547.
Kalpathy-Cramer, J., de Herrera, A. G. S., Demner-Fushman, D., Antani, S., Bedrick, S., & Müller, H. (2015). Evaluating performance of biomedical image retrieval systems: an overview of the medical image retrieval task at ImageCLEF 2004–2013. Computerized Medical Imaging and Graphics, 39, 55–61.
Kalpathy-Cramer, J., Müller, H., Bedrick, S., Eggel, I., de Herrera, A. G. S., & Tsikrika, T. (2011). Overview of the CLEF 2011 medical image classification and retrieval tasks. In CLEF (Notebook Papers/Labs/Workshop) (pp. 97–112).
Kamel, I., & Faloutsos, C. (1993). Hilbert R-tree: An improved R-tree using fractals. Technical Report.
Kassim, Y. M., Palaniappan, K., Yang, F., Poostchi, M., Palaniappan, N., Maude, R. J., Antani, S., & Jaeger, S. (2020). Clustering-based dual deep learning architecture for detecting red blood cells in malaria diagnostic smears. IEEE Journal of Biomedical and Health Informatics, 25, 1735–1746.
Khan, M. I., Acharya, B., Singh, B. K., & Soni, J. (2011). Content based image retrieval approaches for detection of malarial parasite in blood images. International Journal of Biometrics and Bioinformatics (IJBB), 5, 97.
Lambin, P., Rios-Velazquez, E., Leijenaar, R., Carvalho, S., Van Stiphout, R. G., Granton, P., Zegers, C. M., Gillies, R., Boellard, R., Dekker, A. et al. (2012). Radiomics: extracting more information from medical images using advanced feature analysis. European Journal of Cancer, 48, 441–446.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
Lehmann, T. M., Güld, M. O., Thies, C., Fischer, B., Spitzer, K., Keysers, D., Ney, H., Kohnen, M., Schubert, H., & Wein, B. B. (2004). Content-based image retrieval in medical applications. Methods of Information in Medicine, 43, 354–361.
Li, M., Zhang, Y., Sun, Y., Wang, W., Tsang, I. W., & Lin, X. (2018a). An efficient exact nearest neighbor search by compounded embedding. In International Conference on Database Systems for Advanced Applications (pp. 37–54). Springer.
Li, Z., Zhang, X., Müller, H., & Zhang, S. (2018b). Large-scale retrieval for medical image analytics: A comprehensive review. Medical Image Analysis, 43, 66–84.
Liu, Y., Wei, H., & Cheng, H. (2018). Exploiting lower bounds to accelerate approximate nearest neighbor search on high-dimensional data. Information Sciences, 465, 484–504.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022).
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11976–11986).
Lou, Y., Bai, Y., Wang, S., & Duan, L.-Y. (2018). Multi-scale context attention network for image retrieval. In Proceedings of the 26th ACM International Conference on Multimedia (pp. 1128–1136).
Malkov, Y., Ponomarenko, A., Logvinov, A., & Krylov, V. (2014). Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 45, 61–68.
Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 824–836.
McDonald, R. J., Schwartz, K. M., Eckel, L. J., Diehn, F. E., Hunt, C. H., Bartholmai, B. J., Erickson, B. J., & Kallmes, D. F. (2015). The effects of changes in utilization and technological advancements of cross-sectional imaging on radiologist workload. Academic Radiology, 22, 1191–1198.
Mesbah, S., Conjeti, S., Kumaraswamy, A., Rautenberg, P., Navab, N., & Katouzian, A. (2015). Hashing forests for morphological search and retrieval in neuroscientific image databases. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 135–143). Springer.
Müller, H., de Herrera, A. G. S., Kalpathy-Cramer, J., Demner-Fushman, D., Antani, S. K., & Eggel, I. (2012). Overview of the ImageCLEF 2012 medical image retrieval and classification tasks. In CLEF (Online Working Notes/Labs/Workshop) (pp. 1–16).
Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn, C. E., & Hersh, W. (2009). Overview of the CLEF 2009 medical image retrieval track. In Workshop of the Cross-Language Evaluation Forum for European Languages (pp. 72–84). Springer.
Nazari, M. R., & Fatemizadeh, E. (2010). A CBIR system for human brain magnetic resonance image indexing. International Journal of Computer Applications, 7, 33–37.
Omohundro, S. M. (1989). Five balltree construction algorithms. International Computer Science Institute Berkeley.
Ortega, M., Rui, Y., Chakrabarti, K., Porkaew, K., Mehrotra, S., & Huang, T. S. (1998). Supporting ranked boolean similarity queries in MARS. IEEE Transactions on Knowledge and Data Engineering, 10, 905–925.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pestov, V. (2012). Indexability, concentration, and VC theory. Journal of Discrete Algorithms, 13, 2–18.
Prerau, M. J., & Eskin, E. (2000). Unsupervised anomaly detection using an optimized k-nearest neighbors algorithm. Undergraduate Thesis, Columbia University.
Qayyum, A., Anwar, S. M., Awais, M., & Majid, M. (2017). Medical image retrieval using deep convolutional neural network. Neurocomputing, 266, 8–20.
Radenović, F., Tolias, G., & Chum, O. (2018). Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41, 1655–1668.
Rahman, M. M., Antani, S. K., & Thoma, G. R. (2011). A learning-based similarity fusion and filtering approach for biomedical image retrieval using SVM classification and relevance feedback. IEEE Transactions on Information Technology in Biomedicine, 15, 640–646.
Rahman, M. M., Desai, B. C., & Bhattacharya, P. (2008). Medical image retrieval with probabilistic multi-class support vector machine classifiers and adaptive similarity fusion. Computerized Medical Imaging and Graphics, 32, 95–108.
Rahman, M. M., You, D., Simpson, M. S., Antani, S. K., Demner-Fushman, D., & Thoma, G. R. (2013). Multimodal biomedical image retrieval using hierarchical classification and modality fusion. International Journal of Multimedia Information Retrieval, 2, 159–173.
Rajaraman, S., Antani, S. K., Poostchi, M., Silamut, K., Hossain, M. A., Maude, R. J., Jaeger, S., & Thoma, G. R. (2018). Pre-trained convolutional neural networks as feature extractors toward improved malaria parasite detection in thin blood smear images. PeerJ, 6, e4568.
Razavian, A. S., Sullivan, J., Carlsson, S., & Maki, A. (2016). Visual instance retrieval with deep convolutional networks. ITE Transactions on Media Technology and Applications, 4, 251–258.
Ridnik, T., Ben-Baruch, E., Noy, A., & Zelnik-Manor, L. (2021). ImageNet-21K pretraining for the masses. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115, 211–252. doi:10.1007/s11263-015-0816-y.
Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help optimization? Advances in Neural Information Processing Systems, 31.
Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 806–813).
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sproull, R. F. (1991). Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica, 6, 579–589.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–9).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2818–2826).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wang, X. (2011). A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality. In The 2011 International Joint Conference on Neural Networks (pp. 1293–1299). IEEE.
Wei, L., Yang, Y., & Nishikawa, R. M. (2009). Microcalcification classification assisted by content-based image retrieval for breast cancer diagnosis. Pattern Recognition, 42, 1126–1132.
Wightman, R. (2019). PyTorch image models. https://github.com/rwightman/pytorch-image-models. doi:10.5281/zenodo.4414861.
Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1492–1500).
Xue, Z., Long, L. R., Antani, S., Jeronimo, J., & Thoma, G. R. (2008). A web-accessible content-based cervicographic image retrieval system. In Medical Imaging 2008: PACS and Imaging Informatics (pp. 46–54). SPIE, volume 6919.
Yan, D., Wang, Y., Wang, J., Wang, H., & Li, Z. (2019). K-nearest neighbor search by random projection forests. IEEE Transactions on Big Data, 7, 147–157.
Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (p. 311). SIAM, volume 66.
Yu, J., Zhu, Z., Wang, Y., Zhang, W., Hu, Y., & Tan, J. (2020). Cross-modal knowledge reasoning for knowledge-based visual question answering. Pattern Recognition, 108, 107563.
Yu, W., Yang, K., Yao, H., Sun, X., & Xu, P. (2017). Exploiting the complementary strengths of multi-layer CNN features for image retrieval. Neurocomputing, 237, 235–241.
Yue-Hei Ng, J., Yang, F., & Davis, L. S. (2015). Exploiting local features from deep networks for image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 53–61).
Zhang, H., Dong, Y., & Xu, D. (2022). Accelerating exact nearest neighbor search in high dimensional euclidean space via block vectors. International Journal of Intelligent Systems, 37, 1697–1722.
Zhang, Z., Xie, Y., Zhang, W., & Tian, Q. (2019). Effective image retrieval via multilinear multi-index fusion. IEEE Transactions on Multimedia, 21, 2878–2890.
Zhong, A., Li, X., Wu, D., Ren, H., Kim, K., Kim, Y., Buch, V., Neumark, N., Bizzo, B., Tak, W. Y. et al. (2021). Deep metric learning-based image retrieval system for chest radiograph and its clinical applications in COVID-19. Medical Image Analysis, 70, 101993.
Appendix A. Hyper-parameter Details
Ball Tree: Artificial leaf size=9K; Faces leaf size=4500; Corel leaf size=58K; MNIST leaf size=60K; FMNIST leaf size=45K; CovType leaf size=580K
KD Tree: Artificial leaf size=9K; Faces leaf size=6K; Corel leaf size=58K; MNIST leaf size=30K; FMNIST leaf size=45K; CovType leaf size=580K
RP Forest: Artificial leaf size=900, n trees=10; Faces leaf size=600, n trees=15; Corel leaf size=3800, n trees=15; MNIST leaf size=6K, n trees=10; FMNIST leaf size=4K, n trees=15; CovType leaf size=29K, n trees=20
FAISS-LSH: Artificial n bits=16384; Faces n bits=16384; Corel n bits=4096; MNIST n bits=4096; FMNIST n bits=4096; CovType n bits=4096
FAISS-IVF: n list=5 for all six datasets
FAISS-IVFPQfs: Artificial n list=5; Faces n list=5; Corel n list=10; MNIST n list=5; FMNIST n list=5; CovType n list=100
Annoy: n trees=500 for all datasets except CovType (n trees=300)
MRPT: no hyper-parameters tuned
HNSW: ef construction=100 for all datasets; M=50 (Artificial), M=10 (Faces), M=50 (Corel), M=50 (MNIST), M=100 (FMNIST), M=20 (CovType)
ScaNN: avq threshold=0.2 and dims per block=2 for all datasets; n leaves=1 except CovType (n leaves=5000)
DLS: Artificial Kindex=50, Ksearch=20; Faces Kindex=50, Ksearch=20; Corel Kindex=50, Ksearch=10; MNIST Kindex=40, Ksearch=15; FMNIST Kindex=50, Ksearch=10; CovType Kindex=40, Ksearch=50
Table A.7: The hyper-parameters that resulted in the best performance of the NNS approaches on the Artificial, Faces, Corel, MNIST, FMNIST, and CovType datasets.
Ball Tree: TinyImages leaf size=60K; Twitter leaf size=570K; YearPred leaf size=500K; SIFT leaf size=700K; GIST leaf size=700K; OpenI-ResNet NA; OpenI-ConvNeXt NA
KD Tree: TinyImages leaf size=60K; Twitter leaf size=570K; YearPred leaf size=500K; SIFT leaf size=500K; GIST leaf size=500K; OpenI-ResNet NA; OpenI-ConvNeXt NA
RP Forest: TinyImages leaf size=6K, n trees=15; Twitter leaf size=57K, n trees=10; YearPred leaf size=30K, n trees=15; SIFT leaf size=5K, n trees=200; GIST leaf size=5K, n trees=200; OpenI-ResNet NA; OpenI-ConvNeXt NA
FAISS-LSH: n bits=4096 for all seven datasets
FAISS-IVF: n list=5 for all datasets except OpenI-ConvNeXt (n list=25)
FAISS-IVFPQfs: n list=5 for all datasets except Twitter (n list=100)
Annoy: TinyImages n trees=1000; Twitter n trees=500; YearPred n trees=500; SIFT n trees=500; GIST n trees=1000; OpenI-ResNet n trees=1000; OpenI-ConvNeXt n trees=500
MRPT: no hyper-parameters tuned
HNSW: ef construction=100 for all datasets; M=100 (TinyImages), M=50 (Twitter), M=50 (YearPred), M=200 (SIFT), M=200 (GIST), M=100 (OpenI-ResNet), M=100 (OpenI-ConvNeXt)
ScaNN: avq threshold=0.2 and dims per block=2 for all datasets; n leaves=1 except Twitter (n leaves=75)
DLS: TinyImages Kindex=20, Ksearch=50; Twitter Kindex=30, Ksearch=20; YearPred Kindex=30, Ksearch=25; SIFT Kindex=40, Ksearch=30; GIST Kindex=40, Ksearch=55; OpenI-ResNet Kindex=110, Ksearch=35; OpenI-ConvNeXt Kindex=30, Ksearch=110
Table A.8: The hyper-parameters that resulted in the best performance of the NNS approaches on the remaining benchmark datasets (TinyImages, Twitter, YearPred, SIFT, GIST, and the OpenI datasets).
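As an illustration of how these settings map onto the baseline libraries, the sketch below builds the HNSW baseline with hnswlib using ef_construction=100 and M=50 from Table A.7; the random data here is a stand-in for a real feature matrix.

```python
# Sketch: configuring the HNSW baseline (hnswlib) with the Table A.7
# hyper-parameters ef_construction=100 and M=50.
import hnswlib
import numpy as np

data = np.random.rand(10000, 128).astype('float32')  # stand-in features
index = hnswlib.Index(space='l2', dim=128)
index.init_index(max_elements=len(data), ef_construction=100, M=50)
index.add_items(data)

labels, distances = index.knn_query(data[:5], k=10)  # 10-NN for 5 queries
```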
Appendix B. Additional Results
Dataset | FAISS-LSH: ATPQ (ms), R@10 (%) | FAISS-IVF: ATPQ (ms), R@10 (%) | FAISS-IVFPQfs: ATPQ (ms), R@10 (%)
Artificial 3.754 49.63 0.061 39.86 0.030 30.11
Faces 3.772 47.56 0.044 89.46 0.027 42.27
Corel 6.699 66.23 0.217 90.50 0.0260 31.22
MNIST 7.396 68.16 4.683 86.02 0.292 79.29
FMNIST 7.921 43.86 4.986 92.99 0.314 81.12
CovType 64.746 65.35 6.532 99.22 0.036 2.528
TinyImages 10.501 27.62 3.074 73.11 0.217 50.76
Twitter 65.131 17.62 10.192 97.23 0.042 00.78
YearPred 58.845 22.97 7.299 90.48 0.241 8.731
SIFT 115.584 73.38 13.410 89.00 1.020 57.02
GIST 118.554 25.42 140.993 78.59 10.896 52.84
OpenI-ResNet 1684.21 56.26 869.72 85.80 56.07 58.37
OpenI-ConvNeXt 1379.24 61.48 2475.75 90.20 171.90 75.01
Table B.9: Performance comparison of the FAISS implementations of multiple NNS approaches on benchmark datasets. The highlighted cells represent the approach with the maximum R@10 amongst the three FAISS variants.
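The FAISS baselines above can be instantiated through the FAISS index factory; the sketch below shows FAISS-IVF with n list=5. The factory strings for the LSH and fast-scan IVFPQ variants follow the same pattern, though the exact strings may vary across FAISS versions.

```python
# Sketch: building the FAISS-IVF baseline with n_list=5 via the index
# factory ("IVF5,Flat" means an inverted file with 5 lists over raw vectors).
import faiss
import numpy as np

d = 128
xb = np.random.rand(10000, d).astype('float32')  # stand-in database vectors
xq = np.random.rand(5, d).astype('float32')      # stand-in query vectors

index = faiss.index_factory(d, "IVF5,Flat")
index.train(xb)                                  # learn the coarse quantizer
index.add(xb)
D, I = index.search(xq, 10)                      # distances and ids of 10-NN
```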
Appendix C. Pseudocode of the Indexing and DenseLinkSearch Algorithms
Algorithm 1 Indexing Algorithm
1: function BuildIndex(D, Kindex)
2:   ▷ Build index from vector data D by finding Kindex nearest neighbors for each vector
3:   N = length(D)  ▷ Number of vectors in data
4:   I = list()  ▷ Initialize the index, a list of lists of links
5:   for each vector v in D do
6:     Rv = +∞  ▷ Neighborhood radius, also known as near distance
7:     Cv = +∞  ▷ Distance to closest indexed vector
8:     Hv = heap(Kindex)  ▷ Initialize max-heap Hv of size Kindex to hold links for vector v
9:     Lv = list()  ▷ Initialize list Lv to hold links for vector v
10:    Iv = list()  ▷ Initialize list Iv to hold finished index links for vector v
11:  end for
12:  H = heap(N)  ▷ Initialize max-heap H of size N to hold Cv for unindexed vectors
13:  v = H.pop()  ▷ Choose first vector as the root node
14:  for n = 1 to N do  ▷ For all N vectors
15:    for each link l in Hv do  ▷ Traverse each near link for vector v
16:      Iv.append(l)  ▷ Store longer links (AKA descend links) to index list for vector v
17:    end for
18:    CreateLinks(v)  ▷ Create nearest neighbor links for vector v
19:    v = H.pop()  ▷ Pop vector furthest from existing nodes to become next node
20:  end for
21:  for each vector v in D do
22:    for each link l in Hv do  ▷ Traverse each near link for vector v
23:      Iv.append(l)  ▷ Store nearest neighbor links (AKA spread links) to index list for vector v
24:    end for
25:    Iv.sort()  ▷ Sort links by increasing distance for vector v
26:    I.append(Iv)  ▷ Add the vector list Iv to the full index I
27:  end for
28:  return I  ▷ Return index for each vector in data D
29: end function
Algorithm 2 Creating Links
1: function CreateLinks(A)
2:   ▷ Create nearest neighbor links for vector A, after which it will be considered indexed
3:   P = CollectNeighbors(A)  ▷ Collect the 2nd neighbors of vector A
4:   for each vector B in P do
5:     dA,B = dist(A, B)  ▷ Compute distance between A and B
6:     if dA,B < RA or dA,B < RB then  ▷ Only create a link if either endpoint considers the other near
7:       if dA,B < CB then  ▷ Check if A is nearest to B
8:         CB = dA,B  ▷ Update distance to closest indexed vector
9:         H.heapify()  ▷ Update heap H, which has CB as a key for one of its vectors
10:      end if
11:      AddLink(HA, LA, B, dA,B)  ▷ Add a link to B to vector A's heap or list
12:      AddLink(HB, LB, A, dA,B)  ▷ Add a link to A to vector B's heap or list
13:    end if
14:  end for
15:  return
16: end function
Algorithm 3 Add Link
1: function AddLink(HX, LX, Y, dX,Y)
2:   ▷ Add a link X to Y to vector X's heap HX or list LX based on the distance dX,Y
3:   if dX,Y ≥ RX then  ▷ Check if Y is near X
4:     LX.append(Y, dX,Y)  ▷ If not, append new link to list LX
5:   else  ▷ If so, push new link into heap HX
6:     if HX.isFull() then  ▷ Check if heap HX is full
7:       V, dX,V = HX.pop()  ▷ If so, pop old top link to vector V
8:       if dX,V ≤ RV then  ▷ Check if V considers X near
9:         LX.append(V, dX,V)  ▷ If so, append old link to list LX
10:      else
11:        delete V, dX,V  ▷ If not, neither endpoint considers the other near; old link deleted
12:      end if
13:    end if
14:    HX.push(Y, dX,Y)  ▷ Push new link into heap HX
15:    RX = HX.peek()  ▷ Update neighborhood radius RX to new distance at top of heap HX
16:  end if
17:  return
18: end function
Algorithm 4 Collect Neighbors
1: function CollectNeighbors(A)
2:   ▷ Collect the 2nd neighbors of the vector A
3:   N2nd = list()  ▷ Initialize the 2nd neighbors set
4:   N1stH = list()  ▷ Initialize the 1st neighbors from heap set
5:   N1stL = list()  ▷ Initialize the 1st neighbors from list set
6:   for each link l in HA do  ▷ Traverse each link in heap HA of A
7:     B, dA,B = l  ▷ Extract vector and distance
8:     N1stH.append(B)  ▷ Add the vector B into N1stH
9:   end for
10:  for each link l in LA do  ▷ Traverse each link in list LA of A
11:    B, dA,B = l  ▷ Extract vector and distance
12:    if dA,B > RB then  ▷ Check if B considers A near
13:      LA.remove(l)  ▷ Neither endpoint considers the other near; drop link
14:    else
15:      N1stL.append(B)  ▷ Add the vector B into N1stL
16:    end if
17:  end for
18:  for each vector u in N1stH do  ▷ For each vector in N1stH
19:    for each link l in Hu do  ▷ Traverse each link in heap Hu of vector u
20:      v, du,v = l  ▷ Extract vector and distance
21:      N2nd.append(v)  ▷ Add the vector v into N2nd
22:    end for
23:    for each link l in Lu do  ▷ Traverse each link in list Lu of vector u
24:      v, du,v = l  ▷ Extract vector and distance
25:      if du,v > Rv then  ▷ Check if distance du,v is greater than vector v's near distance Rv
26:        Lu.remove(l)  ▷ Neither endpoint considers the other near; drop link
27:      else
28:        N2nd.append(v)  ▷ Add the vector v into N2nd
29:      end if
30:    end for
31:  end for
32:  for each vector u in N1stL do  ▷ For each vector in N1stL
33:    for each link l in Hu do  ▷ Traverse each link in heap Hu of vector u
34:      v, du,v = l  ▷ Extract vector and distance
35:      N2nd.append(v)  ▷ Add the vector v into N2nd
36:    end for
37:  end for
38:  for each vector u in N2nd do  ▷ For each vector in N2nd
39:    if N1stH.contains(u) or N1stL.contains(u) or u == A then
40:      N2nd.remove(u)  ▷ Do not include u if it is a 1st neighbor or A itself
41:    end if
42:  end for
43:  return N2nd
44: end function
Algorithm 5 Search Algorithm
1: function DenseLinkSearch(q, D, Ksearch, I)
2:   ▷ Find the Ksearch nearest neighbors to query q using index links I
3:   ▷ L, F, VC, and DC are shared with the Descend, Spread, and LinkTraverse procedures
4:   N = length(D)  ▷ Number of vectors in data and index
5:   L = [+∞] × N  ▷ Initialize lookup table to avoid repeating distance calculations
6:   F = [False] × N  ▷ Initialize flags to avoid repeating traversals
7:   H = heap(Ksearch)  ▷ Create results heap H of size Ksearch to hold nearest vectors
8:   v = D[0]  ▷ Start at root vector
9:   dv,q = dist(v, q)  ▷ Calculate distance between vector v and query vector q
10:  L[v] = dv,q  ▷ Store distance dv,q in lookup table
11:  H.push(v, dv,q)  ▷ Push the vector v into results heap H
12:  VC = v  ▷ Initialize closest vector
13:  DC = dv,q  ▷ Initialize closest distance
14:  Descend(q, H, I)  ▷ Call the Descend procedure
15:  return H  ▷ Return heap containing nearest neighbors of query vector
16: end function
Algorithm 6 Descend Stage
1: function Descend(q, H, I)
2:   ▷ Descend along links of the closest vector to query vector q and store results in heap H
3:   links = I[VC]  ▷ Get links for the closest vector
4:   VN, DN = LinkTraverse(q, VC, H, links)  ▷ Traverse links; return nearest vector and distance found
5:   if DN < DC then  ▷ Check if the closest vector has changed
6:     VC = VN  ▷ Update the closest vector and distance
7:     DC = DN
8:     Descend(q, H, I)  ▷ Repeat the Descend procedure
9:   else
10:    Spread(q, H, I)  ▷ Switch to the Spread procedure
11:  end if
12:  return
13: end function
Algorithm 7 Spread Stage
1: function Spread(q, H, I)
2:   ▷ Spread along links of the Ksearch nearest vectors to query vector q and store results in heap H
3:   DL = H.peek()  ▷ Get limit distance from top of results heap
4:   for each vector v in H do  ▷ For each vector v in heap H
5:     if F[v] == False then  ▷ Skip vector if it has already been traversed
6:       links = I[v]  ▷ Get index links for the near vector v
7:       VN, DN = LinkTraverse(q, v, H, links)  ▷ Traverse links; return nearest vector and distance found
8:       if DN < DL then  ▷ Check if a new near vector was found
9:         break  ▷ New near vector found; break out of loop
10:      end if
11:    end if
12:  end for
13:  if DN < DC then  ▷ Check if the closest vector has changed
14:    VC = VN  ▷ Update the closest vector and distance
15:    DC = DN
16:    Descend(q, H, I)  ▷ Switch to the Descend procedure
17:  else if DN < DL then  ▷ Check if a new near vector has been found
18:    Spread(q, H, I)  ▷ Repeat the Spread procedure
19:  end if
20:  return
21: end function
Algorithm 8 Link Traverse
1: function LinkTraverse(q, x, H, links)
2:   ▷ Follow the links of vector x, calculating distances to query vector q and storing results in heap H
3:   VN = null  ▷ Initialize nearest vector VN
4:   DN = +∞  ▷ Initialize nearest distance DN
5:   DT = H.peek()  ▷ Get distance from top of results heap
6:   for each link l in links do  ▷ For each link l of vector x
7:     v = l  ▷ Extract other endpoint vector v from link
8:     if L[v] == +∞ then  ▷ Skip vector if distance has already been calculated
9:       dv,q = dist(v, q)  ▷ Calculate distance between vector v and query vector q
10:      L[v] = dv,q  ▷ Store distance dv,q in lookup table
11:      if dv,q < DT then  ▷ If vector v is near enough
12:        H.push(v, dv,q)  ▷ Update results heap and top distance
13:        DT = dv,q
14:      end if
15:      if dv,q < DN then  ▷ If vector v is the nearest found so far
16:        VN = v  ▷ Update nearest vector and distance
17:        DN = dv,q
18:      end if
19:    end if
20:  end for
21:  F[x] = True  ▷ Set traverse flag
22:  return VN, DN  ▷ Return nearest vector and distance
23: end function
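To make the control flow of Algorithms 5 to 8 concrete, the following is a compact, runnable Python sketch of the search stage over a prebuilt link index. It is an illustrative re-implementation, not the authors' released code: `index[v]` is assumed to hold the link endpoints of vector v sorted by increasing distance, and minor details (e.g., the bounded-heap update) are simplified.

```python
# Illustrative Python sketch of the DenseLinkSearch descend/spread stages.
import heapq
import numpy as np

def dense_link_search(q, data, index, k_search):
    """Return the k_search nearest (distance, id) pairs for query q.

    data  : (N, d) array of database vectors
    index : list of N lists; index[v] holds neighbor ids of vector v
            sorted by increasing distance (the pre-computed links)
    """
    n = len(data)
    dist_cache = np.full(n, np.inf)        # lookup table L: avoid repeat distances
    traversed = np.zeros(n, dtype=bool)    # flags F: avoid repeat traversals
    results = []                           # bounded max-heap via (-dist, id)

    def push(v, d):
        if len(results) < k_search:
            heapq.heappush(results, (-d, v))
        elif d < -results[0][0]:
            heapq.heapreplace(results, (-d, v))

    def traverse(x):
        # Follow links of x (Algorithm 8); return nearest new vector found.
        v_near, d_near = None, np.inf
        for v in index[x]:
            if dist_cache[v] == np.inf:
                d = float(np.linalg.norm(data[v] - q))
                dist_cache[v] = d
                push(v, d)
                if d < d_near:
                    v_near, d_near = v, d
        traversed[x] = True
        return v_near, d_near

    # Start at the root vector (vector 0), as in Algorithm 5.
    d0 = float(np.linalg.norm(data[0] - q))
    dist_cache[0] = d0
    push(0, d0)
    v_c, d_c = 0, d0

    while True:
        # Descend: follow links of the closest vector found so far.
        v_n, d_n = traverse(v_c)
        if d_n < d_c:
            v_c, d_c = v_n, d_n
            continue
        # Spread: traverse links of the current k_search nearest vectors.
        found = False
        for neg_d, v in sorted(results, reverse=True):  # nearest first
            if not traversed[v]:
                v_n, d_n = traverse(v)
                d_limit = -results[0][0]                # current limit distance
                if d_n < d_c:
                    v_c, d_c = v_n, d_n                 # closest changed: descend
                    found = True
                    break
                if d_n < d_limit:
                    found = True                        # new near vector: re-spread
                    break
        if not found:
            break

    return sorted((-nd, v) for nd, v in results)
```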