
Incremental estimation of visual vocabulary size for image retrieval

Ilias Gialampoukidis, Stefanos Vrochidis and Ioannis Kompatsiaris

Abstract The growth of image databases over the last years has highlighted the need to represent an image collection efficiently and quickly. The majority of image retrieval and image clustering approaches have been based on the construction of a visual vocabulary in the so-called Bag-of-Visual-words (BoV) model, analogous to the Bag-of-Words (BoW) model for the representation of a collection of text documents. A visual vocabulary (codebook) is constructed by clustering all available visual features in an image collection, using k-means or approximate k-means, which require as input the number of visual words, i.e. the size of the visual vocabulary; this number is hard to tune and cannot be estimated directly from the total amount of visual descriptors. In order to avoid tuning or guessing the number of visual words, we propose an incremental estimation of the optimal visual vocabulary size, based on the DBSCAN-Martingale, which was introduced in the context of text clustering and is able to estimate the number of clusters efficiently, even for very noisy datasets. For a sample of images, our method estimates the potential number of very dense SIFT patterns for each image in the collection. The proposed approach is evaluated on an image retrieval and on an image clustering task, by means of Mean Average Precision and Normalized Mutual Information.

Ilias Gialampoukidis
Centre for Research and Technology Hellas - Information Technologies Institute, 6th km Charilaou-Thermi Road, 57001 Thermi-Thessaloniki, Greece, e-mail: heliasgj@iti.gr

Stefanos Vrochidis
Centre for Research and Technology Hellas - Information Technologies Institute, 6th km Charilaou-Thermi Road, 57001 Thermi-Thessaloniki, Greece, e-mail: stefanos@iti.gr

Ioannis Kompatsiaris
Centre for Research and Technology Hellas - Information Technologies Institute, 6th km Charilaou-Thermi Road, 57001 Thermi-Thessaloniki, Greece, e-mail: ikom@iti.gr


This is the accepted manuscript. The final published version appears at Springer in:

Proceedings of the INNS Conference on Big Data 2016

http://link.springer.com/chapter/10.1007/978-3-319-47898-2_4


1 Introduction

Image retrieval and image clustering are related tasks, since both need to search efficiently and quickly for nearest neighbors in an image collection. Taking into account that image collections are growing dramatically (e.g. Facebook, Flickr, etc.), both tasks, retrieval and clustering, become very challenging and traditional techniques scale poorly. Nowadays, there are many applications of image retrieval and image clustering which support image search, personal photo organization, etc.

Searching an image collection for similar images is strongly affected by the representation of the images. Spatial verification techniques for image comparison, like RANSAC and pixel-to-pixel comparisons, are computationally expensive and have been outperformed by the Bag-of-Visual-words (BoV) model, which is based on the construction of a visual vocabulary, also known as a visual codebook, of visual words [16], obtained by clustering all visual features. The visual vocabulary construction is motivated by the Bag-of-Words (BoW) model for a collection of text documents. The set of all visual descriptors in an image collection is clustered using the k-means algorithm, which may be replaced by approximate k-means methods [14] in order to reduce the computational cost of visual vocabulary construction. However, both exact and approximate k-means require as input the number of visual words k, which we shall estimate incrementally.

The visual vocabulary is usually constructed using an empirical number of visual words k, such as k = 4000 in [17]. The optimal number k is hard to tune in very large databases, and impossible to tune when no ground truth exists. An empirical guess of k may lead to the construction of visual codebooks which are not optimal when involved in an image retrieval or image clustering task. To that end, we propose a scalable estimation of the optimal number of visual words k in an incremental way, using a recent modification of DBSCAN [3], which also has a scalable and parallel implementation [6]. In the proposed framework, the final number of visual words is incrementally estimated on a sample of images and can therefore easily scale up to very large image collections, in the context of Big Data.

The most prominent visual features are SIFT descriptors [8], but several other methods have been proposed to represent an image, such as VLAD [7], GIST [11], Fisher vectors [12] or DCNN features [9]. In this work, we restrict our study to the estimation of the visual vocabulary size based on SIFT descriptors; a comparison in terms of the optimal visual features is beyond the scope of this work.

The main research contributions of this work are:

• Estimate the optimal size of a visual vocabulary
• Build the size estimate incrementally

Therefore, we are able to build efficient visual vocabularies without tuning the size or guessing a value for k. Our proposed method is a hybrid framework which combines the recent DBSCAN-Martingale [5] and k-means clustering. The proposed hybrid framework is evaluated on the image retrieval and image clustering problems: we initially provide an estimate of the number of visual words k, using the DBSCAN-Martingale, and then cluster all visual descriptors into k clusters, as traditionally done with k-means clustering.

In Section 2 we present the related work on visual vocabulary construction, and in Section 3 we briefly present the DBSCAN-Martingale estimator of the number of clusters. In Section 4, our proposed hybrid method for the construction of visual vocabularies is described in detail, and finally, in Section 5, it is evaluated on the image retrieval and image clustering tasks.

2 Related Work

The Bag-of-Visual-Words (BoV) model initially appeared in [16], where k-means clustering is applied for the construction of a visual vocabulary. The constructed visual vocabulary is then used for image retrieval purposes and is similar to the Bag-of-Words model, where a vocabulary of words is constructed, mainly for text retrieval, clustering and classification. In the BoV model, the image query and each image of the collection are represented as a sparse vector of term (visual word) occurrences, weighted by tf-idf scores. The similarity between the query and each image is calculated using the Mahalanobis distance or simply the Euclidean distance. However, there is no obvious value for the number of clusters k in the k-means clustering algorithm.

Other approaches to the construction of visual vocabularies include Approximate k-means (AKM) clustering, which offers scalable construction of visual vocabularies. Hierarchical k-means (HKM) was the first approximate method for fast and scalable construction of a visual vocabulary [10]: data points are clustered into k = 2 or k = 10 clusters using k-means, and then k-means is applied to each one of the newly generated clusters, using the same number of clusters k. After n steps (levels), the result is k^n clusters, as sketched below. HKM has been outperformed by AKM [14], where a forest of 8 randomized k-d trees provides approximate nearest neighbor search between points and the approximately nearest cluster centers. The use of 8 randomized k-d trees with skewed splits has recently been proposed in the special case of SIFT descriptors [4]. However, all AKM clustering methods require as input the number of clusters k, so an efficient estimation of k is necessary.
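For illustration, the recursive structure of HKM can be sketched in R with base-R kmeans; hkm_centers is a hypothetical helper written for this description, not the implementation of [10]:

# Sketch of hierarchical k-means (HKM): recursively split each cluster
# with a small k, so that 'levels' rounds yield up to k^levels leaf
# centroids, which serve as the visual words.
hkm_centers <- function(x, k = 10, levels = 2) {
  if (levels == 0 || nrow(x) <= k) return(list(colMeans(x)))
  cl <- kmeans(x, centers = k)$cluster   # split this node into k parts
  parts <- split.data.frame(x, cl)       # row-wise split of the matrix
  unlist(lapply(parts, hkm_centers, k = k, levels = levels - 1),
         recursive = FALSE)
}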

The need to estimate the number of visual words emerges from the computational cost of the k-means algorithm, in both exact and approximate k-means clustering [18]. Apart from being a time-consuming process, tuning the number of clusters k may significantly affect the performance of the image retrieval task [14]. Some studies assume a fixed value of k, such as k = 4000 in [17], but in general the choice of k varies from 10^3 up to 10^7, as stated in [11]. In another approach, 10 clusters are extracted using k-means for each one of the considered classes (categories), which are then concatenated in order to form a global visual vocabulary [19]. In contrast, we shall estimate the number of clusters using the DBSCAN-Martingale [5], which automatically estimates the number of clusters, based on an extension of DBSCAN [3], without a priori knowledge of the density parameter minPts of DBSCAN. The DBSCAN-Martingale generates a probability distribution over the number of clusters and has been applied to news clustering, in combination with LDA [5].

3 The DBSCAN-Martingale estimation of the number of clusters

In this section, we briefly describe the DBSCAN-Martingale, which has been introduced for the estimation of the number of clusters in a collection of text documents. DBSCAN [3] uses two parameters ε and minPts to cluster the points of a dataset without knowing the number of clusters. The DBSCAN-Martingale overcomes the tuning of the parameter ε and shows robustness to the variation of the parameter minPts [5].


Fig. 1: The number of clusters as estimated by the DBSCAN-Martingale on an illustrative dataset. The generated probability distribution states that it is most likely to have 5 clusters, although they appear at different density levels and there are points which do not belong to any of the clusters.

The estimation of the number of clusters is probabilistic: the method assigns a probability distribution over the number of clusters, so as to extract the clusters of all density levels. For each randomly generated density level ε, density-based clusters are extracted using the DBSCAN algorithm. The density levels ε_t, t = 1, 2, ..., T, are generated from the uniform distribution on the interval [0, ε_max] and sorted in increasing order.

Each density level ε_t provides one partitioning of the dataset, which is then formulated as an N × 1 clustering vector C_{DBSCAN(ε_t)}, for all stages t = 1, 2, ..., T, where N is the number of points to cluster. The clustering vector stores the cluster ID κ of the j-th point, i.e. C_{DBSCAN(ε_t)}[j] = κ.


At the beginning of the algorithm, no clusters have been detected. In the first stage (t = 1), all clusters detected by C_{DBSCAN(ε_1)} are kept, corresponding to the lowest density level ε_1. In the second stage (t = 2), some of the clusters detected by C_{DBSCAN(ε_2)} are new, and some have already been detected at the previous stage (t = 1). The DBSCAN-Martingale keeps only the newly detected clusters of the second stage (t = 2), i.e. groups of points that share a new cluster ID and have size greater than minPts. After T stages, we have progressively gained knowledge about the final number of clusters k̂, since by then all clusters have been extracted with high probability.
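To make this procedure concrete, one realization of the DBSCAN-Martingale can be sketched in R, the language of the publicly available implementation; the function below is an illustrative reconstruction of the steps described above, built on the dbscan package, not the original script:

library(dbscan)

# One DBSCAN-Martingale realization (illustrative reconstruction):
# clusters found at earlier (lower) density levels are kept, and at each
# new level only clusters formed by so-far-unassigned points survive.
dbscan_martingale <- function(x, eps_max, n_levels, minPts) {
  labels <- integer(nrow(x))                    # 0 = point not assigned yet
  k_hat  <- 0L                                  # clusters extracted so far
  for (eps in sort(runif(n_levels, 0, eps_max))) {  # increasing density levels
    c_t <- dbscan(x, eps = eps, minPts = minPts)$cluster
    for (kappa in setdiff(unique(c_t), 0)) {    # 0 marks DBSCAN noise
      new_members <- which(c_t == kappa & labels == 0)
      if (length(new_members) >= minPts) {      # keep sizable new clusters only
        k_hat <- k_hat + 1L
        labels[new_members] <- k_hat
      }
    }
  }
  k_hat
}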

The estimate k̂ of the number of clusters is a random variable, because of the randomness of the generated density levels ε_t, t = 1, 2, ..., T. Each realization of the DBSCAN-Martingale generates one estimate k̂, and the final estimate of the number of clusters has been proposed [5] as the majority vote over 10 realizations of the DBSCAN-Martingale. The percentage of realizations in which the DBSCAN-Martingale outputs exactly k̂ clusters forms a probability distribution, such as the one shown in Fig. 1.
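The majority vote over realizations can then be sketched as follows, reusing the illustrative dbscan_martingale helper from above:

# Majority vote over r independent realizations: the most frequent estimate
# of the number of clusters is returned (ties broken by the first maximum).
estimate_k <- function(x, eps_max, n_levels, minPts, r = 100) {
  ks <- replicate(r, dbscan_martingale(x, eps_max, n_levels, minPts))
  as.integer(names(which.max(table(ks))))
}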

4 Estimation of the visual vocabulary size using the DBSCAN-Martingale

Motivated by the DBSCAN-Martingale, which has been applied to several collections of text documents in the context of news clustering [5], we propose an estimation of the total number of visual words in an image collection, as shown in Fig. 2. The proposed method is incremental, since the final number of visual words is progressively estimated and updated whenever a new image is added to the collection.

Starting from the first image, keypoints are detected and SIFT descriptors [8] are extracted. Each visual feature is represented as a 128-dimensional vector, hence each image i is represented by a matrix M_i with 128 columns, whose number of rows equals the number of detected keypoints. On each matrix M_i, the 128-dimensional vectors are clustered using the DBSCAN-Martingale, which outputs the number of dense patterns in the set of visual features, as provided by several density levels. Assuming that 100 realizations of the DBSCAN-Martingale output k_1 for the first image, k_2 for the second image, and k_l for the l-th image (Fig. 2), the proposed optimal size of the visual vocabulary is:

k = \sum_{i=1}^{l} k_i    (1)

The DBSCAN-Martingale extracts clusters sequentially, combines them into one single clustering vector and outputs the most up-to-date estimate of the number of clusters in each realization. The DBSCAN-Martingale requires T iterations of the DBSCAN algorithm, which runs in O(n log n) when kd-tree data structures are employed for fast nearest neighbor search, and in O(n^2) without tree-based spatial indexing [1].


[Fig. 2 diagram: for each image, Keypoints → SIFT descriptors → DBSCAN-Martingale → visual words.]

Fig. 2: The estimation of the number of visual words in an image collection. Each image i contributes k_i visual words to the overall estimate of the visual vocabulary size.

We adopt the implementation of the DBSCAN-Martingale in R¹, which is available on GitHub², because the R script utilizes the dbscan package³, which runs DBSCAN in O(n log n). Thus, the overall complexity of the DBSCAN-Martingale is O(T n log n), where n is the number of visual descriptors per image. Assuming r realizations of the DBSCAN-Martingale per image and an image collection of l images, the overall estimation of the visual vocabulary size costs O(l r T n log n).

In order to reduce the complexity, we sample l_0 out of the l images to get an average number of visual words per image. The final number of visual words is estimated from a sample of images S = {i_1, i_2, ..., i_{l_0}} of size l_0, so the overall complexity becomes O(l_0 r T n log n). The final estimate of the number of visual words k̂ of Eq. (1) becomes:

\hat{k} = \frac{l}{l_0} \sum_{i \in S} k_i    (2)
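A minimal sketch of Eq. (2) in R, reusing the illustrative estimate_k helper of Section 3; the list sample_descr of per-image SIFT matrices and the default for n_levels are assumptions, while eps_max = 200, minPts in [5, 30] and r = 100 follow the settings reported in Section 5:

# Eq. (2): scale the sample's total number of dense SIFT patterns up to
# the full collection of l images; l_0 = length(sample_descr).
estimate_vocab_size <- function(sample_descr, l, eps_max = 200,
                                n_levels = 5, minPts = 10, r = 100) {
  k_i <- vapply(sample_descr, estimate_k, integer(1),
                eps_max = eps_max, n_levels = n_levels, minPts = minPts, r = r)
  round(l / length(sample_descr) * sum(k_i))
}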

We utilize the estimate k̂, provided by Eq. (2), in order to cluster all visual features into k̂ clusters using k-means clustering. Therefore, a visual vocabulary of size k̂ is constructed. After the construction of the visual vocabulary, as shown in Fig. 3, images are represented using term-frequency scores with inverse document frequency weighting (tf-idf) [16]:

\text{tf-idf}_{id} = \frac{n_{id}}{n_d} \log \frac{D}{n_i}    (3)

¹ https://www.r-project.org/
² https://github.com/MKLab-ITI/topic-detection/blob/master/DBSCAN_Martingale.r
³ https://cran.r-project.org/web/packages/dbscan/index.html


where n_{id} is the number of occurrences of visual word i in image d, n_d is the number of visual words in image d, n_i is the number of occurrences of visual word i in the whole image collection and D is the total number of images in the database.
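Eq. (3) can be computed for the whole collection at once, as in the following sketch, where the D × k̂ matrix of visual-word counts is assumed to be given and every image is assumed to contain at least one visual word:

# tf-idf of Eq. (3) for a D x k matrix 'counts' of visual-word occurrences
# (one image per row): tf = n_id / n_d and idf = log(D / n_i).
bov_tfidf <- function(counts) {
  tf  <- counts / rowSums(counts)              # n_id / n_d, per row
  idf <- log(nrow(counts) / colSums(counts))   # log(D / n_i), per column
  sweep(tf, 2, idf, `*`)                       # weight each visual word
}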

[Fig. 3 diagram: Image Collection → SIFT Extraction → DBSCAN-Martingale → Visual Vocabulary Construction (k-means or approximate k-means with k) → Image Representation → Application (Image Retrieval, Image Clustering).]

Fig. 3: The hybrid visual vocabulary construction framework, using the DBSCAN-Martingale for the estimation of k and either exact or approximate k-means clustering with k clusters. After the visual vocabulary is constructed, the collection of images is efficiently represented for any application, such as image retrieval or clustering.

In the following section, we test our hybrid visual vocabulary construction on the image retrieval and image clustering problems.

5 Experiments

We evaluate our method on the image retrieval and image clustering tasks, in which nearest neighbor search is performed in an unsupervised way. The selected datasets are the WANG⁴ 1K dataset and the Caltech⁵ 2.5K dataset with 2,516 images, with queries as described in [4] for the image retrieval task. The number of extracted visual descriptors (SIFT) is 505,834 and 769,546 128-dimensional vectors in the WANG 1K and Caltech 2.5K datasets, respectively. The number of topics is 10 for the WANG dataset and 21 for the Caltech dataset, which also allows image clustering experiments with the considered datasets. We selected these datasets because they are appropriate for performing both image retrieval and image clustering experiments, and tuning the number of visual words k can be done in reasonable processing time, so as to evaluate the visual vocabulary construction in terms of Mean Average Precision (MAP) and Normalized Mutual Information (NMI).

⁴ http://wang.ist.psu.edu/docs/related/
⁵ http://www.vision.caltech.edu/Image_Datasets/Caltech101/

[Fig. 4 plots, MAP or NMI vs. k for k up to 4000: (a) MAP for the WANG dataset; (b) NMI for the WANG dataset; (c) MAP for the Caltech dataset; (d) NMI for the Caltech dataset.]

Fig. 4: Evaluation using MAP and NMI on image retrieval and image clustering tasks for the WANG and Caltech datasets. The MAP and NMI scores obtained with our estimate k̂ are shown as the straight red line.

Keypoints are detected and SIFT descriptors are extracted using the LIP-VIREO toolkit⁶. For the implementation of the DBSCAN-Martingale we used the R script available on GitHub⁷, with ε_max = 200 and 100 realizations. We build one visual vocabulary for each number of visual words k, tuned over k ∈ {100, 200, 300, ..., 4000}. The parameter minPts is tuned from 5 to 30, and the final number of clusters per image is the one most robust to the variation of minPts. In k-means clustering, we allow a maximum of 20 iterations with 5 random initial starts.

⁶ http://pami.xmu.edu.cn/~wlzhao/lip-vireo.htm
⁷ https://github.com/MKLab-ITI/topic-detection/blob/master/DBSCAN_Martingale.r
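With k̂ at hand, the codebook itself follows from a single base-R call that mirrors this configuration; sift_all is an assumed matrix stacking all 128-dimensional SIFT descriptors of the collection, one per row:

# Build the visual vocabulary with the k-means settings stated above:
# at most 20 iterations and 5 random initial starts.
codebook <- kmeans(sift_all, centers = k_hat, iter.max = 20, nstart = 5)$centers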


Our estimates of the number of visual words are k̂ = 2180 and k̂ = 1840 for the WANG and Caltech datasets, respectively, given a sample of 200 images. The corresponding MAP and NMI scores are compared to the best MAP and NMI scores over k ∈ {100, 200, 300, ..., 4000}. The results are reported in Table 1, where apart from the best MAP and NMI scores, we also present the ratio of the MAP (NMI) provided by our estimate k̂ to the maximum observed MAP (NMI) score, denoted by r_MAP (r_NMI). In particular, on the WANG dataset, MAP is 96.42% of the best observed MAP and NMI is 94.91% of the best NMI. Similar behavior is observed on the Caltech dataset, where NMI reaches 95.36% and MAP 80.06% of the respective best scores. In Fig. 4 we observe that our incremental estimation method, when combined with k-means, approaches the highest observed MAP and NMI scores in all cases examined.

Table 1: Evaluation on image retrieval and image clustering tasks.

Dataset   Visual words            MAP     r_MAP   NMI     r_NMI
WANG      best k                  0.2040          0.3241
WANG      DBSCAN-Martingale k̂    0.1967  0.9642  0.3076  0.9491
Caltech   best k                  0.1560          0.4439
Caltech   DBSCAN-Martingale k̂    0.1249  0.8006  0.4233  0.9536

6 Conclusion

We presented an incremental estimation of the optimal size of the visual vocabulary, which efficiently estimates the number of visual words, and we evaluated the performance of the constructed visual vocabulary on an image retrieval and an image clustering task. The proposed hybrid framework utilizes the output of the DBSCAN-Martingale on a sample of SIFT descriptors as input to any exact or approximate k-means clustering for the construction of a visual vocabulary. The estimation is incremental, i.e. the final number of visual words is updated whenever a new sample image is used. A potential limitation of our approach could appear in the case where an image exists more than once in the image collection and therefore needlessly contributes extra visual words to the final estimate. However, if the sample of images on which the DBSCAN-Martingale is applied does not contain duplicate images, the overall estimation is not affected. In the future, we plan to test our method with other visual features and in the context of multimedia retrieval, where multiple modalities are employed.

Acknowledgements This work was supported by the projects MULTISENSOR (FP7-610411)

and KRISTINA (H2020-645012), funded by the European Commission.


References

1. Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999, June). OPTICS: ordering points to identify the clustering structure. In ACM Sigmod Record (Vol. 28, No. 2, pp. 49-60).
2. Devroye, L. (1986, December). Sample-based non-uniform random variate generation. In Proceedings of the 18th conference on Winter simulation (pp. 260-265). ACM.
3. Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD (Vol. 96, No. 34, pp. 226-231).
4. Gialampoukidis, I., Vrochidis, S., & Kompatsiaris, I. (2016, January). Fast visual vocabulary construction for image retrieval using skewed-split k-d trees. In MultiMedia Modeling (pp. 466-477). Springer International Publishing.
5. Gialampoukidis, I., Vrochidis, S., & Kompatsiaris, I. (2016, July). A hybrid framework for news clustering based on the DBSCAN-Martingale and LDA. In Machine Learning and Data Mining, New York, USA, accepted for publication.
6. He, Y., Tan, H., Luo, W., Feng, S., & Fan, J. (2014). MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data. Frontiers of Computer Science, 8(1), 83-99.
7. Jégou, H., Douze, M., Schmid, C., & Pérez, P. (2010, June). Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3304-3311). IEEE.
8. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91-110.
9. Markatopoulou, F., Mezaris, V., & Patras, I. (2015, September). Cascade of classifiers based on binary, non-binary and deep convolutional network descriptors for video concept detection. In Image Processing (ICIP), 2015 IEEE International Conference on (pp. 1786-1790). IEEE.
10. Mikolajczyk, K., Leibe, B., & Schiele, B. (2006, June). Multiple object class detection with a generative model. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (Vol. 1, pp. 26-36). IEEE.
11. Mikulik, A., Chum, O., & Matas, J. (2013). Image retrieval for online browsing in large image collections. In Similarity Search and Applications (pp. 3-15). Springer Berlin Heidelberg.
12. Perronnin, F., Sanchez, J., & Mensink, T. (2010). Improving the Fisher kernel for large-scale image classification. In Computer Vision-ECCV 2010 (pp. 143-156). Springer Berlin Heidelberg.
13. Philbin, J. (2010). Scalable object retrieval in very large image collections (Doctoral dissertation, Oxford University).
14. Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007, June). Object retrieval with large vocabularies and fast spatial matching. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on (pp. 1-8). IEEE.
15. Rawlings, J. O., Pantula, S. G., & Dickey, D. A. (1998). Applied regression analysis: a research tool. Springer Science & Business Media.
16. Sivic, J., & Zisserman, A. (2003, October). Video Google: A text retrieval approach to object matching in videos. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on (pp. 1470-1477). IEEE.
17. Van De Sande, K. E., Gevers, T., & Snoek, C. G. (2010). Evaluating color descriptors for object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9), 1582-1596.
18. Wang, J., Wang, J., Ke, Q., Zeng, G., & Li, S. (2015). Fast approximate k-means via cluster closures. In Multimedia Data Mining and Analytics (pp. 373-395). Springer International Publishing.
19. Zhang, J., Marszalek, M., Lazebnik, S., & Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2), 213-238.