Incremental estimation of visual vocabulary size
for image retrieval
Ilias Gialampoukidis, Stefanos Vrochidis and Ioannis Kompatsiaris
Abstract The increasing number of image databases over the last years has highlighted the need to represent an image collection efficiently and quickly. The majority of image retrieval and image clustering approaches have been based on the construction of a visual vocabulary in the so-called Bag-of-Visual-Words (BoV) model, analogous to the Bag-of-Words (BoW) model in the representation of a collection of text documents. A visual vocabulary (codebook) is constructed by clustering all available visual features in an image collection, using k-means or approximate k-means, which require as input the number of visual words, i.e. the size of the visual vocabulary; this number is hard to tune or to estimate directly from the total amount of visual descriptors. In order to avoid tuning or guessing the number of visual words, we propose an incremental estimation of the optimal visual vocabulary size, based on the DBSCAN-Martingale, which has been introduced in the context of text clustering and is able to estimate the number of clusters efficiently, even for very noisy datasets. For a sample of images, our method estimates the potential number of very dense SIFT patterns for each image in the collection. The proposed approach is evaluated in an image retrieval and in an image clustering task, by means of Mean Average Precision and Normalized Mutual Information.
Ilias Gialampoukidis
Centre for Research and Technology Hellas - Information Technologies Institute, 6th km Charilaou-
Thermi Road, 57001 Thermi-Thessaloniki, Greece, e-mail: heliasgj@iti.gr
Stefanos Vrochidis
Centre for Research and Technology Hellas - Information Technologies Institute, 6th km Charilaou-
Thermi Road, 57001 Thermi-Thessaloniki, Greece, e-mail: stefanos@iti.gr
Ioannis Kompatsiaris
Centre for Research and Technology Hellas - Information Technologies Institute, 6th km Charilaou-
Thermi Road, 57001 Thermi-Thessaloniki, Greece, e-mail: ikom@iti.gr
This is the accepted manuscript. The final published version appears at Springer in:
Proceedings of the INNS Conference on Big Data 2016
http://link.springer.com/chapter/10.1007/978-3-319-47898-2_4
1 Introduction
Image retrieval and image clustering are related tasks because both need to search efficiently and quickly for nearest neighbors in an image collection. Given that image collections are growing dramatically (e.g. Facebook, Flickr, etc.), both tasks, retrieval and clustering, become very challenging and traditional techniques show reduced functionality. Nowadays, there are many applications of image retrieval and image clustering which support image search, personal photo organization, etc.
Searching an image collection for similar images is strongly affected by the representation of all images. Spatial verification techniques for image representation, like RANSAC and pixel-to-pixel comparisons, are computationally expensive and have been outperformed by the Bag-of-Visual-Words (BoV) model, which is based on the construction of a visual vocabulary, also known as a visual codebook, using a vocabulary of visual words [16], obtained by clustering all visual features. The visual vocabulary construction is motivated by the Bag-of-Words (BoW) model for a collection of text documents. The set of all visual descriptors in an image collection is clustered using k-means clustering techniques, which are replaced by approximate k-means methods [14] in order to reduce the computational cost of visual vocabulary construction. However, both k-means techniques require as input the number of visual words k, which we shall estimate incrementally.
The visual vocabulary is usually constructed using an empirical number of visual words k, such as k = 4000 in [17]. The optimal number k is hard to tune in very large databases, and impossible to tune when ground truth does not exist. An empirical guess of k may lead to the construction of visual codebooks which are not optimal when involved in an image retrieval or image clustering task. To that end, we propose a scalable estimation of the optimal number of visual words k in an incremental way, using a recent modification of DBSCAN [3], which also has a scalable and parallel implementation [6]. In the proposed framework, the final number of visual words is incrementally estimated on a sample of images and, therefore, it can easily scale up to very large image collections, in the context of Big Data.
The most prominent visual features are SIFT descriptors [8], but several other methods have been proposed to represent an image, such as VLAD [7], GIST [11], Fisher vectors [12] or DCNN features [9]. In this work, we restrict our study to the estimation of the visual vocabulary size based on SIFT descriptors; a comparison in terms of the optimal visual features is beyond the scope of this work.
The main research contributions of this work are:
- Estimate the optimal size of a visual vocabulary
- Build the size estimation incrementally
Therefore, we are able to build efficient visual vocabularies without tuning the size or guessing a value for k. Our proposed method is a hybrid framework, which combines the recent DBSCAN-Martingale [5] and k-means clustering. The proposed hybrid framework is evaluated on the image retrieval and image clustering problems, where we initially provide an estimation of the number of visual words k, using the DBSCAN-Martingale, and then cluster all visual descriptors into k clusters, as traditionally done by k-means clustering.
In Section 2 we present the related work in visual vocabulary construction and in Section 3 we briefly present the DBSCAN-Martingale estimator of the number of clusters. In Section 4, our proposed hybrid method for the construction of visual vocabularies is described in detail, and finally, in Section 5, it is evaluated on the image retrieval and image clustering tasks.
2 Related Work
The Bag-of-Visual-Words (BoV) model initially appeared in [16], in which k-means clustering is applied for the construction of a visual vocabulary. The constructed visual vocabulary is then used for image retrieval purposes and is similar to the Bag-of-Words model, where a vocabulary of words is constructed, mainly for text retrieval, clustering and classification. In the BoV model, the image query and each image of the collection are represented as a sparse vector of term (visual word) occurrences, weighted by tf-idf scores. The similarity between the query and each image is calculated using the Mahalanobis distance or simply the Euclidean distance. However, there is no obvious value for the number of clusters k in the k-means clustering algorithm.
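As an illustration of this ranking step, a few lines of Python suffice. This is an illustrative sketch, not the implementation of [16]: it assumes sparse tf-idf vectors stored as dictionaries (visual-word id to weight) and uses the plain Euclidean distance, and the helper names are our own.

```python
def euclidean(u, v):
    """Euclidean distance between two sparse tf-idf vectors stored as
    dicts mapping visual-word id -> weight (missing entries are zero)."""
    keys = set(u) | set(v)
    return sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys) ** 0.5

def rank_images(query_vec, image_vecs):
    """Return image ids sorted by ascending distance to the query vector."""
    return sorted(image_vecs, key=lambda d: euclidean(query_vec, image_vecs[d]))
```

Replacing `euclidean` with a Mahalanobis distance only changes the distance function; the ranking logic stays the same.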
Other approaches for the construction of visual vocabularies include Approximate k-means (AKM) clustering, which offers scalable construction of visual vocabularies. Hierarchical k-means (HKM) was the first approximate method for fast and scalable construction of a visual vocabulary [10], where data points are clustered with k = 2 or k = 10 using k-means clustering and then k-means is applied to each one of the newly generated clusters, using the same number of clusters k. After n steps (levels), the result is k^n clusters. HKM has been outperformed by AKM [14], where a forest of 8 randomized k-d trees provides approximate nearest neighbor search between points and the approximately nearest cluster centers. The use of 8 randomized k-d trees with skewed splits has recently been proposed in the special case of SIFT descriptors [4]. However, all AKM clustering methods require as input the number of clusters k, so an efficient estimation of k is necessary.
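The recursive structure of HKM can be sketched in a few lines. This is a toy illustration under simplifying assumptions, not the implementation of [10]: descriptors are low-dimensional tuples, the `kmeans` helper is a minimal Lloyd's iteration with random initialization, and both function names are our own.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    dist2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                   # keep old centroid if cluster empties
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids, labels

def hkm_vocabulary(points, k, levels):
    """HKM sketch: split recursively with branching factor k for `levels`
    levels, yielding up to k**levels leaf centroids (the visual words)."""
    if levels == 0 or len(points) < k:
        return [tuple(sum(xs) / len(points) for xs in zip(*points))]
    _, labels = kmeans(points, k)
    leaves = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            leaves.extend(hkm_vocabulary(members, k, levels - 1))
    return leaves
```

Each level only runs k-means with a small k on progressively smaller subsets, which is what makes HKM cheaper than a single flat k-means with k^n centers.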
The need to estimate the number of visual words emerges from the computational cost of the k-means algorithm, either in exact or approximate k-means clustering [18]. Apart from being a time-consuming process, tuning the number of clusters k may significantly affect the performance of the image retrieval task [14]. Some studies assume a fixed value of k, such as k = 4000 in [17], but in general the choice of k varies from 10^3 up to 10^7, as stated in [11]. In another approach, 10 clusters are extracted using k-means for each one of the considered classes (categories), which are then concatenated in order to form a global visual vocabulary [19]. In contrast, we shall estimate the number of clusters using the DBSCAN-Martingale [5], which automatically estimates the number of clusters, based on an extension of DBSCAN [3], without a priori knowledge of the density parameter minPts of DBSCAN. DBSCAN-Martingale generates a probability distribution over the number of clusters and has been applied to news clustering, in combination with LDA [5].
3 The DBSCAN-Martingale estimation of the number of clusters
In this section, we briefly describe the DBSCAN-Martingale, which has been introduced for the estimation of the number of clusters in a collection of text documents. DBSCAN [3] uses two parameters, ε and minPts, to cluster the points of a dataset without knowing the number of clusters. DBSCAN-Martingale overcomes the tuning of the parameter ε and shows robustness to the variation of the parameter minPts [5].
Fig. 1: The number of clusters as estimated by the DBSCAN-Martingale on an illustrative dataset. The generated probability distribution states that it is more likely to have 5 clusters, although they appear at different density levels and there are points which do not belong to any of the clusters.
The estimation of the number of clusters is a probabilistic method and assigns a probability distribution over the number of clusters, so as to extract all clusters for all density levels. For each randomly generated density level ε, density-based clusters are extracted using the DBSCAN algorithm. The density levels ε_t, t = 1, 2, ..., T are generated from the uniform distribution in the interval [0, ε_max] and sorted in increasing order.
Each density level ε_t provides one partitioning of the dataset, which then formulates an N × 1 clustering vector, namely C_DBSCAN(ε_t), for all stages t = 1, 2, ..., T, where N is the number of points to cluster. The clustering vector takes as value the cluster ID κ of the j-th point, i.e. C_DBSCAN(ε_t)[j] = κ.
In the beginning of the algorithm, there are no clusters detected. In the first stage (t = 1), all clusters detected by C_DBSCAN(ε_1) are kept, corresponding to the lowest density level ε_1. In the second stage (t = 2), some of the clusters detected by C_DBSCAN(ε_2) are new and some of them have also been detected at the previous stage (t = 1). DBSCAN-Martingale keeps only the newly detected clusters of the second stage (t = 2), by grouping the points with the same cluster ID into clusters of size greater than minPts. After T stages, we have progressively gained knowledge about the final number of clusters k̂, since all clusters have been extracted with high probability.
The estimated number of clusters k̂ is a random variable, because of the randomness of the generated density levels ε_t, t = 1, 2, ..., T. Each realization of the DBSCAN-Martingale generates one estimate k̂, and the final estimation of the number of clusters has been proposed [5] as the majority vote over 10 realizations of the DBSCAN-Martingale. The percentage of realizations in which the DBSCAN-Martingale outputs exactly k̂ clusters forms a probability distribution, like the one shown in Fig. 1.
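To make one realization of the process concrete, the following Python sketch reimplements it under simplifying assumptions: a naive O(n^2) DBSCAN and a simplified rule for "newly detected" clusters (a cluster counts as new if its still-unassigned part has at least minPts points). It is illustrative only, not the authors' R implementation.

```python
import random

def dbscan(points, eps, min_pts):
    """Naive O(n^2) DBSCAN; returns one cluster label per point (0 = noise)."""
    n = len(points)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels, visited, cluster_id = [0] * n, [False] * n, 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue                      # noise, unless claimed as a border point
        cluster_id += 1
        labels[i] = cluster_id
        seeds = list(neighbors[i])
        while seeds:                      # expand the cluster from core points
            j = seeds.pop()
            if labels[j] == 0:
                labels[j] = cluster_id
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    seeds.extend(neighbors[j])
    return labels

def dbscan_martingale(points, eps_max, T, min_pts, seed=0):
    """One realization: sample T density levels uniformly in [0, eps_max],
    sort them increasingly, keep only the clusters newly detected at each
    level, and return the count of kept clusters as the estimate of k."""
    rng = random.Random(seed)
    k, assigned = 0, [0] * len(points)    # final clustering vector
    for eps in sorted(rng.uniform(0, eps_max) for _ in range(T)):
        labels = dbscan(points, eps, min_pts)
        for cid in set(labels) - {0}:
            fresh = [i for i, c in enumerate(labels)
                     if c == cid and assigned[i] == 0]
            if len(fresh) >= min_pts:     # a genuinely new dense cluster
                k += 1
                for i in fresh:
                    assigned[i] = k
    return k
```

The majority vote of [5] would simply repeat `dbscan_martingale` with different seeds and report the most frequent output.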
4 Estimation of the visual vocabulary size using the
DBSCAN-Martingale
Motivated by the DBSCAN-Martingale, which has been applied to several collections of text documents in the context of news clustering [5], we propose an estimation of the total number of visual words in an image collection, as shown in Fig. 2. The proposed method is incremental, since the final number of visual words is progressively estimated and updated when a new image is added to the collection.
Starting from the first image, keypoints are detected and SIFT descriptors [8] are extracted. Each visual feature is represented as a 128-dimensional vector, hence the whole image i is a matrix M_i with 128 columns, whose number of rows depends on the number of detected keypoints. On each matrix M_i, the 128-dimensional vectors are clustered using the DBSCAN-Martingale, which outputs the number of dense patterns in the set of visual features, as provided by several density levels. Assuming that the application of 100 realizations of the DBSCAN-Martingale has output k_1 for the first image, k_2 for the second image and k_l for the l-th image (Fig. 2), the proposed optimal size of the visual vocabulary is:

k = ∑_{i=1}^{l} k_i    (1)
DBSCAN-Martingale extracts clusters sequentially, combines them into one single clustering vector and outputs the most updated estimation of the number of clusters in each realization. The DBSCAN-Martingale requires T iterations of the DBSCAN algorithm, which runs in O(n log n) when kd-tree data structures are employed for fast nearest neighbor search, and in O(n^2) without tree-based spatial indexing [1]. We adopt the implementation of DBSCAN-Martingale in R¹, which is available on GitHub², because the R script utilizes the dbscan³ package, which runs DBSCAN in O(n log n). Thus, the overall complexity of the DBSCAN-Martingale is O(T n log n), where n is the number of visual descriptors per image. Assuming r realizations of the DBSCAN-Martingale per image and given an image collection of l images, the overall estimation of the size of a visual vocabulary costs O(l r T n log n).

Fig. 2: The estimation of the number of visual words in an image collection. Each image i contributes k_i visual words to the overall estimation of the visual vocabulary size.
In order to reduce the complexity, we sample l_0 out of l images to get an average number of visual words per image. The final number of visual words is estimated from a sample of images S = {i_1, i_2, ..., i_{l_0}} of size l_0, so the overall complexity becomes O(l_0 r T n log n). The final estimation of the number of visual words k̂ of Eq. (1) becomes:

k̂ = (l / l_0) ∑_{i ∈ S} k_i    (2)
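In code, Eq. (2) is a one-line scaling step; the function name and the rounding to an integer are our own illustrative choices.

```python
def estimate_vocab_size(sample_counts, total_images):
    """Eq. (2) sketch: scale the per-image cluster counts k_i, obtained on
    a sample of l0 images, up to the full collection of l images."""
    l0 = len(sample_counts)                       # sample size l0
    return round(total_images / l0 * sum(sample_counts))
```

For example, per-image counts of 10, 12 and 8 on a 3-image sample scale to 300 visual words for a 30-image collection.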
We utilize the estimation k̂, provided by Eq. (2), in order to cluster all visual features into k̂ clusters using k-means clustering. Therefore, a visual vocabulary of size k̂ is constructed. After the construction of the visual vocabulary, as shown in Fig. 3, images are represented using term-frequency scores with inverse document frequency weighting (tf-idf) [16]:

tfidf_{id} = (n_{id} / n_d) · log(D / n_i)    (3)
¹ https://www.r-project.org/
² https://github.com/MKLab-ITI/topic-detection/blob/master/DBSCAN_Martingale.r
³ https://cran.r-project.org/web/packages/dbscan/index.html
where n_{id} is the number of occurrences of visual word i in image d, n_d is the number of visual words in image d, n_i is the number of occurrences of visual word i in the whole image collection and D is the total number of images in the database.
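A minimal sketch of this weighting for a toy Bag-of-Visual-Words collection; the helper name and the input layout (each image given as the list of visual-word ids assigned to its descriptors) are our own, and, following Eq. (3), n_i counts occurrences in the whole collection rather than document frequency.

```python
import math
from collections import Counter

def bov_tfidf(images_as_words):
    """Tf-idf weights per Eq. (3); `images_as_words` maps an image id
    to the list of visual-word ids assigned to its descriptors."""
    D = len(images_as_words)
    # n_i: occurrences of visual word i in the whole collection
    n = Counter(w for words in images_as_words.values() for w in words)
    vectors = {}
    for d, words in images_as_words.items():
        n_d = len(words)                  # total visual words in image d
        counts = Counter(words)           # n_id for each visual word i
        vectors[d] = {i: (n_id / n_d) * math.log(D / n[i])
                      for i, n_id in counts.items()}
    return vectors
```

The resulting sparse vectors are exactly the representation used for the distance computations in the retrieval step.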
Fig. 3: The hybrid visual vocabulary construction framework, using the DBSCAN-Martingale for the estimation of k and either exact or approximate k-means clustering with k clusters. After the visual vocabulary is constructed, the collection of images is efficiently represented for any application, such as image retrieval or clustering.
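The Image Representation step of the framework amounts to hard-assigning each local descriptor to its nearest visual word and counting occurrences; a small illustrative sketch (the helper names are ours):

```python
def nearest_word(desc, centroids):
    """Index of the closest visual word (centroid) for one descriptor."""
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda c: d2(desc, centroids[c]))

def bov_histogram(descriptors, centroids):
    """Hard-assign each local descriptor to a visual word and count."""
    hist = [0] * len(centroids)
    for desc in descriptors:
        hist[nearest_word(desc, centroids)] += 1
    return hist
```

These raw counts are the n_{id} terms of Eq. (3), which the tf-idf weighting then turns into the final image vectors.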
In the following section, we test our hybrid visual vocabulary construction in the
image retrieval and image clustering problems.
5 Experiments
We evaluate our method on the image retrieval and image clustering tasks, in which nearest neighbor search is performed in an unsupervised way. The datasets we have selected are the WANG⁴ 1K dataset and the Caltech⁵ 2.5K dataset with 2,516 images, with queries as described in [4] for the image retrieval task. The number of extracted visual descriptors (SIFT) is 505,834 and 769,546 128-dimensional vectors in the WANG 1K and Caltech 2.5K datasets, respectively. The number of topics is 10 for the WANG dataset and 21 for the Caltech dataset, allowing also image clustering experiments on the considered datasets. We selected these datasets because they are appropriate for performing both image retrieval and image clustering experiments, and tuning the number of visual words k can be done in reasonable processing time, so as to evaluate the visual vocabulary construction in terms of Mean Average Precision (MAP) and Normalized Mutual Information (NMI).
⁴ http://wang.ist.psu.edu/docs/related/
⁵ http://www.vision.caltech.edu/Image_Datasets/Caltech101/
Fig. 4: Evaluation using MAP and NMI in image retrieval and image clustering tasks for the WANG and Caltech datasets: (a) MAP for the WANG dataset, (b) NMI for the WANG dataset, (c) MAP for the Caltech dataset, (d) NMI for the Caltech dataset. The MAP and NMI scores obtained by our k̂ estimation are shown as the straight red line.
Keypoints are detected and SIFT descriptors are extracted using the LIP-VIREO toolkit⁶. For the implementation of DBSCAN-Martingale we used the R script, which is available on GitHub⁷, with ε_max = 200 and 100 realizations. We build one visual vocabulary for several numbers of visual words k, tuned over k ∈ {100, 200, 300, ..., 4000}. The parameter minPts is tuned from 5 to 30, and the final number of clusters per image is the number which is most robust to the variations of minPts. In k-means clustering, we allow a maximum of 20 iterations with 5 random initial starts.
⁶ http://pami.xmu.edu.cn/~wlzhao/lip-vireo.htm
⁷ https://github.com/MKLab-ITI/topic-detection/blob/master/DBSCAN_Martingale.r
Our estimations of the number of visual words are k̂ = 2180 and k̂ = 1840 for the WANG and Caltech datasets, respectively, given a sample of 200 images. The corresponding MAP and NMI are compared to the best MAP and NMI scores for k ∈ {100, 200, 300, ..., 4000}. The results are reported in Table 1, where, apart from the best MAP and NMI scores, we also present the ratio of the MAP (NMI) provided by our k̂ estimation to the maximum observed MAP (NMI) score, denoted by r_MAP (r_NMI). In particular, in the WANG dataset, MAP is 96.42% of the best MAP observed and NMI is 94.91% of the best NMI. Similar behavior is observed in the Caltech dataset, where NMI reaches 95.36% and MAP 80.06% of the best scores, respectively. In Fig. 4 we observe that our incremental estimation method, when combined with k-means, approaches the highest observed MAP and NMI scores in all cases examined.
Table 1: Evaluation in image retrieval and image clustering tasks.

Dataset  | k visual words        | MAP    | r_MAP  | NMI    | r_NMI
WANG     | best k                | 0.2040 | --     | 0.3241 | --
WANG     | DBSCAN-Martingale k̂  | 0.1967 | 0.9642 | 0.3076 | 0.9491
Caltech  | best k                | 0.1560 | --     | 0.4439 | --
Caltech  | DBSCAN-Martingale k̂  | 0.1249 | 0.8006 | 0.4233 | 0.9536
6 Conclusion
We presented an incremental estimation of the optimal size of the visual vocabulary, which efficiently estimates the number of visual words, and evaluated the performance of the constructed visual vocabulary in an image retrieval and an image clustering task. The proposed hybrid framework utilizes the output of the DBSCAN-Martingale on a sample of SIFT descriptors as input to any exact or approximate k-means clustering for the construction of a visual vocabulary. The estimation is incremental, i.e. the final number of visual words is updated when a new sample image is used. A potential limitation of our approach could appear in the case where an image exists more than once in the image collection and therefore needlessly contributes extra visual words to the final estimation. However, if the sample of images on which the DBSCAN-Martingale is applied does not contain duplicate images, the overall estimation will not be affected. In the future, we plan to test our method using other visual features and in the context of multimedia retrieval, where multiple modalities are employed.
Acknowledgements This work was supported by the projects MULTISENSOR (FP7-610411)
and KRISTINA (H2020-645012), funded by the European Commission.
References
1. Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999, June). OPTICS: ordering
points to identify the clustering structure. In ACM Sigmod Record (Vol. 28, No. 2, pp. 49-60).
2. Devroye, L. (1986, December). Sample-based non-uniform random variate generation. In Pro-
ceedings of the 18th conference on Winter simulation (pp. 260-265). ACM.
3. Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). A density-based algorithm
for discovering clusters in large spatial databases with noise. In Kdd (Vol. 96, No. 34, pp.
226-231).
4. Gialampoukidis, I., Vrochidis, S., & Kompatsiaris, I. (2016, January). Fast visual vocabulary
construction for image retrieval using skewed-split kd trees. In MultiMedia Modeling (pp.
466-477). Springer International Publishing.
5. Gialampoukidis, I., Vrochidis, S., & Kompatsiaris, I. (2016, July). A hybrid framework for news clustering based on the DBSCAN-Martingale and LDA. In Machine Learning and Data Mining, New York, USA, accepted for publication.
6. He, Y., Tan, H., Luo, W., Feng, S., & Fan, J. (2014). MR-DBSCAN: a scalable MapReduce-
based DBSCAN algorithm for heavily skewed data. Frontiers of Computer Science, 8(1), 83-
99.
7. Jégou, H., Douze, M., Schmid, C., & Pérez, P. (2010, June). Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3304-3311). IEEE.
8. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International
journal of computer vision, 60(2), 91-110.
9. Markatopoulou, F., Mezaris, V., & Patras, I. (2015, September). Cascade of classifiers based
on binary, non-binary and deep convolutional network descriptors for video concept detection.
In Image Processing (ICIP), 2015 IEEE International Conference on (pp. 1786-1790). IEEE.
10. Mikolajczyk, K., Leibe, B., & Schiele, B. (2006, June). Multiple object class detection with a
generative model. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society
Conference on (Vol. 1, pp. 26-36). IEEE.
11. Mikulik, A., Chum, O., & Matas, J. (2013). Image retrieval for online browsing in large image
collections. In Similarity Search and Applications (pp. 3-15). Springer Berlin Heidelberg.
12. Perronnin, F., Sanchez, J., & Mensink, T. (2010). Improving the fisher kernel for large-scale
image classification. In Computer Vision-ECCV 2010 (pp. 143-156). Springer Heidelberg.
13. Philbin, J. (2010). Scalable object retrieval in very large image collections (Doctoral disserta-
tion, Oxford University).
14. Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007, June). Object retrieval with
large vocabularies and fast spatial matching. In Computer Vision and Pattern Recognition,
2007. CVPR’07. IEEE Conference on (pp. 1-8). IEEE.
15. Rawlings, J. O., Pantula, S. G., & Dickey, D. A. (1998). Applied regression analysis: a re-
search tool. Springer Science & Business Media.
16. Sivic, J., & Zisserman, A. (2003, October). Video Google: A text retrieval approach to ob-
ject matching in videos. In Computer Vision, 2003. Proceedings. Ninth IEEE International
Conference on (pp. 1470-1477). IEEE.
17. Van De Sande, K. E., Gevers, T., & Snoek, C. G. (2010). Evaluating color descriptors for object
and scene recognition. In Pattern Analysis and Machine Intelligence, IEEE Transactions on,
32(9), 1582-1596.
18. Wang, J., Wang, J., Ke, Q., Zeng, G., & Li, S. (2015). Fast approximate k-means via clus-
ter closures. In Multimedia Data Mining and Analytics (pp. 373-395). Springer International
Publishing.
19. Zhang, J., Marszalek, M., Lazebnik, S., & Schmid, C. (2007). Local features and kernels for
classification of texture and object categories: A comprehensive study. International journal
of computer vision, 73(2), 213-238.