Fast Image Classification for Monument Recognition
GIUSEPPE AMATO, FABRIZIO FALCHI, and CLAUDIO GENNARO, ISTI-CNR
Content-based image classification is a broad research field that also addresses the landmark recognition problem. Among the many classification techniques proposed, the k-nearest neighbor (kNN) is one of the most simple and widely used methods. In
this paper, we use kNN classification and landmark recognition techniques to address the problem of monument recognition in
images. We propose two novel approaches that exploit kNN classification technique in conjunction with local visual descriptors.
The first approach is based on a relaxed definition of the local feature based image to image similarity and allows standard
kNN classification to be efficiently executed with the support of access methods for similarity search.
The second approach uses kNN classification to classify local features, rather than images. An image is classified evaluating
the consensus among the classification of its local features. Also in this case access methods for similarity search can be used
to make the classification approach efficient.
The proposed strategies were extensively tested and compared against other state-of-the-art alternatives in a monument and cultural heritage landmark recognition setting. The results proved the superiority of our approaches.
An additional relevant contribution of this paper is the exhaustive comparison of various types of local features and image
matching solutions for recognition of monuments and cultural heritage related landmarks.
Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing methods; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Retrieval Models
General Terms: landmark recognition
ACM Reference Format:
G. Amato, F. Falchi and C. Gennaro. 2014. K-Nearest Neighbor Classification Algorithms for Landmark Recognition. ACM J. Comput. Cult. Herit. 8, 4, Article 18 (August 2015), 25 pages.
DOI:http://dx.doi.org/10.1145/0000000.0000000
1. INTRODUCTION
Perhaps the easiest way to obtain information about something is to use a picture of the object of
interest as a query. Consider, for instance, a cultural tourist who is in front of a monument and wants
to have information about it. A very easy and intuitive action can be that of pointing the monument
with a smartphone and obtaining pertinent and contextual information.
This work was supported by: the VISITO Tuscany project, funded by Regione Toscana; the Europeana network of Ancient Greek and Latin Epigraphy (EAGLE, grant agreement: 325122), co-funded by the European Commission; and the Mobility and Tourism in Urban Scenarios (MOTUS) project, co-funded by the Italian government.
The aim of this paper is to discuss, propose and compare image recognition techniques that can be used to support the scenario described above. The proposed techniques have been thoroughly tested and compared in a cultural heritage domain.
A commonly-used approach to identify an object contained in a query image is to use the k-nearest-
neighbor (kNN) classification algorithm [Cover and Hart 1967]. At the most abstract level, a kNN
classifier executes the following steps. Given a query image, the kNN algorithm scans a training set to
retrieve the best matching images. The most represented class (if any), among the retrieved images,
determines the class of the object contained in the query image.
A promising technique, increasingly applied with success in recent years in image–matching tasks,
is to compare images in terms of their local features. Local features (or descriptors) are visual descrip-
tors of selected interest points, or key points, occurring in images [Lowe 2004; Bay et al. 2006; Rublee
et al. 2011]. The comparison of two images, in terms of their local features, involves two steps: the
detection of pairs of matching key points in the two images, and a geometric consistency check of the
position of these matching key points. Determining the pairs of matching key points in two images in-
volves finding pairs of local features whose mutual similarity is much higher than their similarity with
other local features [Lowe 2004]. Checking the geometric consistency of the identified matching pairs
implies finding a reasonable geometric transformation that maps the position of most of the match-
ing key points of the first image to the position of the corresponding key points of the second image
[Fischler and Bolles 1981]. Using local feature matching and geometric consistency check strategies,
it is possible to rank images of a training set according to the degree with which they match the query
image, and then execute the kNN classification algorithm.
Other descriptors, such as MSER (Maximally Stable Extremal Region) [Matas et al. 2004] and LBP
(Local Binary Pattern) [Ojala et al. 2002] can be used for image–matching tasks. However, according to
the reported results in [Mikolajczyk and Schmid 2005; Mikolajczyk et al. 2005] local features similar
to the SIFT descriptor generally perform best on object recognition problems. Moreover, methods such as SIFT (Scale Invariant Feature Transform), SURF (Speeded Up Robust Features), and ORB
(Oriented FAST and Rotated BRIEF) provide both an interest point detector and a feature descriptor
implementation.
The idea of applying the kNN classification in combination with the geometric consistency technique
is very effective for tasks where only a few objects need to be recognized and the training sets are
small. The drawback of this approach is that it is not scalable when the number of training images
used to describe the objects is very large. The execution of the kNN classification algorithm requires
that the query image be sequentially compared with all the images of the training set. In order to
compare the query image with a single image of the training set, all local features of the query image
must be compared with all local features of the training set image. Considering that each image is
typically described by thousands of local features, this means that a single image comparison requires
something like 1,000 × 1,000 local feature comparisons. This has to be repeated for all the images of
the training set, every time a new query image is processed.
For example, in the experiments that will be described in this paper, the size of the training set is
some orders of magnitude larger than the number of objects (monuments in our case) to be recognized.
In fact, a query image related to a monument, for instance a church or a tower, might be taken from an
arbitrary position from anywhere around the monument, capturing just portions of the monument and
of the landmark in which it is situated. Consequently, for a single monument, we might need hundreds of training images, depicting it from various points of view and perspectives, in order to obtain a high recognition quality. The recognition of small objects poses fewer problems given that, in many cases, such objects are entirely contained in the query image. In these cases, typically, just the orientation of the objects changes.
In order to reduce the cost of finding the best matches for local image features, some years ago the
bag of visual words (BoW) [Sivic and Zisserman 2003] paradigm was introduced. With this technique,
sometimes called bag of features, groups of very similar local features, taken from the entire training
set, are clustered together and represented by their centroid (a representative feature for the entire cluster, denoted a visual word). The set of centroids is called the visual word vocabulary. An image is
then represented by quantizing each feature to its nearest visual word. In order to decide whether two
local features belonging to two different images match, it is sufficient to check whether they belong to
the same cluster, or in other words, are represented by the same visual word.
The kNN classification technique can be successfully applied directly to the BoW representation.
However, this approach still presents some scalability and effectiveness problems. Even with the use of
inverted files to maintain relationships among features and images, “a fundamental difference between
an image query (e.g. 1,500 visual terms) and a text query (e.g. 3 terms) is largely ignored in existing
index design. This difference makes the inverted list inappropriate to index images” [Zhang et al.
2009]. In addition, the use of the BoW approach makes it difficult to efficiently perform a geometric
consistency check, and the approximation introduced by the quantization of the local features reduces
the effectiveness.
The approaches presented in this work lie in between these two extremes (direct use of local features,
on one side, and BoW on the other). We still exploit the effectiveness of local features and geometric
consistency but we rely on the use of access methods for local image features [Zezula et al. 2006; Samet
2005] in order to scale to a large number of classes and training images. These strategies have been
tested and compared against other state-of-the-art approaches in the context of landmark recognition
for cultural heritage.
2. CONTRIBUTION OF THIS PAPER
In this paper, we compare several strategies to recognize the content of digital pictures against two
novel proposed approaches. We particularly focus on discussing and evaluating how the various op-
tions and techniques perform in the applicative scenario of monument and cultural heritage related
landmark recognition. The two new proposed approaches are based on image kNN classification tech-
niques.
The first approach exploits kNN classification to classify images and relies on a relaxed definition of the local feature based image-to-image similarity, which allows efficient indexes for similarity search to be used. Surprisingly, we show that in addition to increasing efficiency and scalability, this
approach also increases effectiveness.
The second approach that we propose, called Local Features Based Image Classifier, uses kNN clas-
sification to classify individual local features of an image, rather than the entire image. It consists of a
two–step classification process: 1) kNN classification of individual local features, and 2) classification
of whole images evaluating the consensus among the classes and the confidences assigned to each lo-
cal feature in step 1). This approach also makes it possible to use efficient indexes for similarity search in order to offer high efficiency and scalability, without penalizing effectiveness. Tests were ex-
ecuted using various types of local features, and also applying geometric consistency check techniques.
An additional significant contribution of this paper is the comparison between various types of local
feature and image matching solutions in a monument and cultural heritage related landmark recog-
nition scenario. As far as we know, no such complete and extensive comparisons have been performed
previously in such a consistent and specific scenario.
A preliminary version of the approaches presented in this paper was presented in [Amato et al.
2011]. The novel contribution here, with respect to previous work, can be summarized as follows. We extensively investigated and experimented with different kNN classification approaches for landmark
recognition, and in particular, we introduced the novel concept of “local feature based image classifica-
tion”. We compared the proposed approaches using ORB and BRISK features, in addition to SIFT and
SURF features. We also compared our results against the BoW approach. Finally, we introduced the
use of a geometric constraint check in combination with the local feature based image classifier. More
experiments and analyses were also carried out.
The paper is organized as follows. Section 3 presents related work. Section 4 provides the background for the remainder of the paper. Section 5 introduces the pairwise distance criterion used in classification algorithms. Section 6 contains the details of our proposed approaches and Section 7 validates our proposed techniques. A concluding summary is given in Section 8.
3. RELATED WORK
In this paper, we address the problem of landmark recognition and visual categorization with special
focus on kNN classification and local image features. In [Chen et al. 2009] a survey of the literature
on mobile landmark recognition for information retrieval is given. The classification methods reported
include SVM, AdaBoost, Bayesian models, HMM, and GMM. However, the survey does not report the kNN
classification technique, which is the main focus of this paper.
In [Zheng et al. 2009], Google presented its approach to building a web-scale landmark recognition
engine. Most of the work reported was used to implement the Google Goggles service [goo 2010]. The
image recognition is based on a kNN classifier using local feature matching. According to the authors,
the recognition performance on over 5,000 landmarks reaches an accuracy of 80.8%.
Popescu et al. [Popescu and Moëllic 2009] used a geo-referenced collection of 5,000 landmarks worldwide to automatically annotate landmark images. They organized the landmarks spatially and classified the images using spatial distance together with kNN classification. The images to label are indexed using only the BoW approach.
A mobile landmark recognition system called Snap2Tell was developed in [Chevallet et al. 2007].
However, the authors use a simple matching technique based on color histograms and a 1NN classifier,
combined with localization information. For the task of image-based geolocation, a similar approach
has been exploited in [Hays and Efros 2008].
In [Labbé 2014] a tutorial on how a system for object recognition can also be used for place recognition
is given. The system uses local features to execute the recognition task.
In [Fagni et al. 2010], various MPEG-7 global descriptors have been used to build kNN classifier
committees. However, local features were not taken into consideration.
Boiman et al. [Boiman et al. 2008] propose an approach to 1NN image classification that uses a kd-tree structure for efficiency and is very similar in spirit to one of the approaches presented in this paper. This work also introduced a novel, non-parametric approach for image classification, the Naive Bayes Nearest Neighbor classifier (NBNN), which was further generalized by Timofte et al. [Timofte et al. 2013] by replacing the nearest neighbor part with more elaborate and robust (sparse) representations (kNN, Iterative Nearest Neighbors (INN), Local Linear Embedding (LLE), etc.). Bosch et al. [Bosch et al. 2008] also use a kNN classifier in combination with probabilistic Latent Semantic Analysis for scene classification purposes. However, no access methods were used to handle efficiency issues in the case of high-dimensional problems.
kNN classifiers are also suitable for real-time learning applications such as 3D object tracking. In
[Hinterstoisser et al. 2011], the authors exploit a simple nearest neighbors classification using a set
of “mean patches” that encode the average of the keypoints appearing over a limited set of poses.
However, learning approaches do not scale very well with respect to the size of the keypoint database [Lourenço 2011].
[Johns and Yang 2011] addresses the problem of recognizing a place depicted in an image by clus-
tering similar database images to represent distinct scenes, and tracking local features that are con-
sistently detected to form a set of real-world landmarks. In this work, features are first quantized and
images are described as a BoW, allowing a more efficient means of computing image similarities. The
closest k database images to the query image are then passed on to the second stage. Here, geometric verification prunes out false positive feature matches from the first stage.
The idea of applying the BoW technique to transform images described by local features in vectors
to exploit kNN classification is also used in [Mejdoub and Ben Amar 2011]. In this study, the authors
propose a new categorization tree based on the kNN algorithm. The proposed categorization tree com-
bines both unsupervised and supervised classification of local feature vectors. The advantage of this
tree is that it achieves a trade-off between accuracy and speed-up of categorization. The proposed tech-
nique, however, involves several complex steps: a hierarchical lattice vector quantization algorithm,
and a supervised step based on both feature vector labeling and a supervised feature selection method.
In this respect, similar approaches in which high dimensional descriptors based on local features, such
as Vector of Locally Aggregated Descriptors (VLAD) [Jegou et al. 2010] and Locality-constrained Linear Coding (LLC) [Wang et al. 2010], are employed have become a topic of considerable interest in the
development of classification systems (see for instance [Su et al. 2013; Amato et al. 2013; Perronnin
and Dance 2007]).
In [Haase and Denzler 2011], state-of-the-art CBIR methods were tested in order to recognize landmarks in a large-scale scenario. The image dataset consists of 900 landmarks from 449 cities and 228 countries. BoW and visual phrase approaches were tested in combination with SVM and kNN classifiers. The best results were obtained by using a kNN classifier in combination with the BoW description.
Some approaches exploit a metric learning phase to improve the performance of metric-based kNN
classification algorithms. Although these methods are reported to be effective, most of the existing ap-
plications are still limited to vector space models in which there is no connection to local features. For
a recent survey on metric learning, see [Bellet et al. 2013]. Within this topic, there is increased in-
terest in local distance functions for nearest neighbor classification on local image patches [Mahamud
and Hebert 2003] or geometric blur features [Frome et al. 2007; Malisiewicz and Efros 2008; Zhang
et al. 2006; Zhang et al. 2011]. Note that such approaches, however, often map local features to multi-
resolution histograms and compute a weighted histogram intersection; approximate correspondence
can be captured by a pyramid vector representation [Grauman and Darrell 2007].
Weighted voting is another common approach for improving kNN classifiers. Weights are usually
either based on the position of an element in the kNN list or its distance to the observed data point [Zuo
et al. 2008]. However, the hubness weighting scheme, which was first proposed for high-dimensional data in [Radovanović], is slightly more flexible; each point in the training set has a unique associated weight, with which it votes whenever it appears in some kNN list, regardless of its position in the list. This idea was recently generalized into fuzzy kNN for local features [Tomašev and Radovanović]. This technique still relies on a vector representation and therefore is only suitable for high-dimensional data such as codebooks of the most representative SIFT features (BoW).
Finally, comparatively few papers have proposed the use of boosting techniques for kNN classifica-
tion. Boosting methods adaptively change the distribution of the training set based on the performance
of the previous classifiers [García-Pedrajas and Ortiz-Boyer 2009]. Unfortunately, to the best of our
knowledge, all boosting techniques for kNN classification rely on a pairwise distance between objects
to be classified. A good survey of kNN classification boosting can be found in [Piro et al. 2013].
4. OBJECT RECOGNITION
In this section, we provide preliminaries and give a brief overview of the local features that we have
used.
4.1 Notation and Preliminaries
Throughout the paper, we represent each image I by a set of n local features l, i.e. I = {l_1, . . . , l_n}. With a slight abuse of notation, we use the general notation d(·) to denote the distance functions used for comparing images or local features.
Let S be a database of objects x and d a distance function for the objects; the k-th nearest neighbor of an object q can then be recursively defined as:
\[ NN_k(q, S) = \begin{cases} x \in S \mid \forall y \in S,\; d(q, x) \le d(q, y) & \text{if } k = 1\\ NN_{k-1}(q, S \setminus \{NN_{k-1}(q, S)\}) & \text{if } k > 1 \end{cases} \tag{1} \]
The set of the first k nearest neighbors is defined as:
\[ kNN(q, S) = \{NN_{\hat{k}}(q, S) \mid \hat{k} = 1..k\} \tag{2} \]
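As a minimal illustration of definitions (1) and (2), the following Python sketch (our own illustrative code, not part of the original system) computes kNN(q, S) for an arbitrary distance function d:

```python
# Minimal sketch of Eqs. (1)-(2): the k nearest neighbors of a query q in a
# database S under a distance function d. Names are illustrative.
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def knn(q: T, S: Sequence[T], k: int, d: Callable[[T, T], float]) -> List[T]:
    # Sorting by d(q, .) and taking the first k elements realizes the
    # recursive definition of NN_k without rescanning the database k times.
    return sorted(S, key=lambda x: d(q, x))[:k]

# Toy usage with scalars and the absolute difference as the metric.
print(knn(5.0, [1.0, 4.0, 6.0, 9.0], k=2, d=lambda a, b: abs(a - b)))  # [4.0, 6.0]
```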
4.2 Local Features
In the last decade, the introduction of local features to describe image visual content, along with local
feature matching and geometric consistency check approaches has significantly advanced the perfor-
mance of image content and object recognition techniques. In the following, we introduce these two
strategies, which are at the basis of the classification techniques that we use to perform the recogni-
tion of monuments in images.
Local feature descriptors describe selected individual points or areas in an image. The extraction
is executed in two steps. First, a set of keypoints in the image is detected. Second, the area around
the selected keypoints is analyzed to extract a visual description. Keypoint selection strategies are
appropriately designed to guarantee invariance to scale changes and to ensure that the same points are selected under different views of the same object. Local feature descriptors contain information that allows local
feature matching, i.e. deciding that two local features from two different images represent the same
point. Standard information on the position in the image, the orientation and size of the region are
typically associated with the visual information that depends on the particular local features. Various
local features have been proposed. In this work, we tested SIFT, SURF, ORB, and BRISK.
4.2.1 SIFT. The Scale Invariant Feature Transformation (SIFT) [Lowe 2004] is a representation
of low level image content that is based on a transformation of the image data into scale-invariant
coordinates relative to local features. Local features are low level descriptions of keypoints in an image.
Keypoints are interest points in an image that are invariant to scale and orientation. Keypoints are
selected by choosing the most stable points from a set of candidate locations. Each keypoint in an
image is associated with one or more orientations, based on local image gradients. Image matching is
performed by comparing descriptions of the keypoints in the images.
This extraction scheme has been used by many other local features including the following ones.
In particular, SIFT selects keypoints using a difference-of-Gaussians approach that can be seen as an approximation of the Laplacian and results in detecting blobs. The description of each keypoint and its neighbors (i.e., the blob) is based on a histogram of orientation gradients, normalized with respect
to the dominant orientations in order to be rotation invariant. We used publicly available software
developed by David Lowe [sif 2005] to both detect keypoints and extract the SIFT features.
4.2.2 SURF. The basic idea of Speeded Up Robust Features (SURF) [Bay et al. 2006] is quite sim-
ilar to SIFT. SURF detects some keypoints in an image and describes them using orientation informa-
tion. However, the SURF definition uses a new method for both the detection of keypoints and their
description that is much faster while still guaranteeing a performance comparable to or even better
than SIFT. Specifically, keypoint detection relies on a technique based on an approximation of the Hes-
sian Matrix. The descriptor of a keypoint is built considering the distribution of Haar-wavelet responses
around the keypoint itself. We used the publicly available noncommercial software developed by the
authors [sur 2006] to both detect the keypoints and to extract the SURF features.
4.2.3 ORB. ORB [Rublee et al. 2011] stands for Oriented FAST and Rotated BRIEF. It is a very
fast and effective local feature descriptor that selects keypoints using the FAST detector and builds
features with an improved version of the BRIEF descriptors that offer rotational invariance. It is
very fast in both the feature extraction and matching phases, and can be used for real-time applications even on low-power devices and without GPU acceleration. The descriptor has a binary
format and the simple Hamming distance is used for comparing local features.
4.2.4 BRISK. Similarly to ORB, BRISK [Leutenegger et al. 2011] is also a binary local feature
descriptor. It uses a FAST based keypoint detector and generates a bit-string descriptor from intensity
comparisons retrieved by dedicated sampling of the keypoint neighborhood. BRISK also uses the Hamming
distance to compare local features. A comparison of ORB and BRISK together with BRIEF has been
presented in [Heinly et al. 2012].
4.3 Local Features Matching
Local features l automatically extracted from an image I are used to identify, in two distinct images I_i and I_j, pairs of matching descriptors (l_i, l_j) where l_i ∈ I_i and l_j ∈ I_j. Identifying matches requires: comparing local descriptors using a distance function d; identifying a candidate match l_j ∈ I_j for any l_i ∈ I_i; and filtering out matches with a high probability of being incorrect.
For SIFT and SURF the Euclidean distance is used, while the Hamming distance is the obvious
choice for binary features such as ORB and BRISK.
The candidate match for l_i is typically the nearest local descriptor in I_j, i.e. NN_1(l_i, I_j).
Filtering incorrect matches is the most difficult task. Lowe showed in [Lowe 2004] that the distance d(l_i, NN_1(l_i, I_j)) is not a good measure of the quality of matches. Instead, he proposed to consider the ratio between the distances from l_i to its first and second nearest neighbors in I_j, i.e.:
\[ \sigma(l_i, I_j) = \frac{d(l_i, NN_1(l_i, I_j))}{d(l_i, NN_2(l_i, I_j))} \tag{3} \]
Any matching pair of descriptors ⟨l_i, NN_1(l_i, I_j)⟩, l_i ∈ I_i, for which σ(l_i, I_j) > c, where c is a predefined threshold, is discarded. Thus, the set of candidate feature matches between images I_i and I_j is:
\[ M_\sigma(I_i, I_j) = \{\langle l_i, NN_1(l_i, I_j)\rangle \mid \sigma(l_i, I_j) < c,\; l_i \in I_i\} \tag{4} \]
In [Lowe 2004] it was reported that c = 0.8 allows us to eliminate 90% of the false matches while discarding less than 5% of the correct matches when using SIFT. In [Amato and Falchi 2010], an experimental evaluation of classification effectiveness varying c for both SIFT and SURF confirmed the results obtained by Lowe. In the following, we will use c = 0.8 for both SIFT and SURF; we used c = 0.9 for the ORB and BRISK binary local features because it gave better performance.
We refer to the set of matches M_σ defined above as the plain distance ratio matches. In the following we will also define additional strategies to find sets of matches, some of which are obtained starting from M_σ itself.
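As a concrete sketch of the matching strategy described above, the following Python fragment implements the distance ratio test with OpenCV, assuming a build that ships the SIFT detector; the function name and the brute-force matcher are our own illustrative choices, not the paper's implementation:

```python
# Sketch of the plain distance ratio matching M_sigma (Eqs. 3-4) with OpenCV.
# Assumes cv2.SIFT_create is available; c = 0.8 as suggested by Lowe for SIFT.
import cv2

def ratio_matches(img_i, img_j, c=0.8):
    sift = cv2.SIFT_create()
    kps_i, des_i = sift.detectAndCompute(img_i, None)
    kps_j, des_j = sift.detectAndCompute(img_j, None)
    # For each local feature of I_i, retrieve its two nearest neighbors in I_j.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(des_i, des_j, k=2)
    # Keep <l_i, NN_1(l_i, I_j)> only if sigma = d(NN_1)/d(NN_2) < c.
    good = [m for m, n in pairs if m.distance < c * n.distance]
    return kps_i, kps_j, good
```

For ORB or BRISK, the detector and the norm (cv2.NORM_HAMMING) would change accordingly.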
5. PAIRWISE IMAGE DISTANCE
Central to the concept of the kNN classifier is the definition of a pairwise image distance d between two images, which is based on how many features match and how close the matches are. We define a distance function based on the plain distance ratio matches (Section 5.1) and on the BoW quantization approach (Section 5.2); finally, we extend the distance functions to also handle geometric consistency checks (Sections 5.3 and 5.4).
5.1 Local Feature Matching
The pairwise matching between two images is based on how many of their feature descriptors match. Given a set M_σ(I_i, I_j) of candidate local feature matches (see Section 4.3) between two images I_i and I_j, we define the distance as:
\[ d_\sigma(I_i, I_j) = 1 - \frac{|M_\sigma(I_i, I_j)|}{|I_i|} \tag{5} \]
Note that the proposed measure is not actually a distance since it is not symmetric: d_σ(I_i, I_j) ≠ d_σ(I_j, I_i). Moreover, since 0 ≤ d_σ ≤ 1, it is sometimes more convenient to use the corresponding similarity s_σ = 1 − d_σ.
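Continuing the sketch above, the distance of Eq. (5) follows directly from the match count; here |I_i| is taken to be the number of detected features of the query image (illustrative code):

```python
# Sketch of Eq. (5): the (asymmetric) distance between I_i and I_j is one
# minus the fraction of features of I_i that found a match in I_j.
def d_sigma(n_matches: int, n_features_i: int) -> float:
    return 1.0 - n_matches / n_features_i

# e.g.: kps_i, kps_j, good = ratio_matches(img_i, img_j)
#       dist = d_sigma(len(good), len(kps_i))
```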
5.2 Bag of Words Matching
The traditional BoW model used for text has been applied to images by treating image features as words. As for text documents, a BoW description is a sparse vector of occurrence counts of visual words, taken from a predefined vocabulary. The assumption is that two features match if they have been assigned to the very same visual word. Thus, the BoW approach can also be used for efficient feature matching (see [Philbin et al. 2007; Philbin 2010]).
The first step in describing images using visual words is to select some local features to create the visual vocabulary. The visual vocabulary is typically built by grouping the local descriptors of the dataset using a clustering algorithm such as k-means. The second step consists of describing each image using the words of the vocabulary that occur in it.
At the end of the process, each image is described as a set of visual words. More formally, the BoW framework consists of a group of cluster centers, referred to as visual words, W = {w_1, w_2, . . . , w_k} [Turcot and Lowe 2009]. Let b_W be the function that assigns a visual word to each local descriptor l_i of an image I_i, as follows:
\[ b_W(l_i) = NN_1(l_i, W) \tag{6} \]
Let B_W(I_i) be the set of visual words corresponding to the local features of the image I_i, i.e.:
\[ B_W(I_i) = \{b_W(l_i) \mid l_i \in I_i\} \tag{7} \]
We are then able to convert images into vectors of visual word occurrences, as in the standard full-text retrieval term frequency (TF) approach:
\[ tf_j(I_i) = |\{w_j\} \cap B_W(I_i)| \tag{8} \]
where tf_m(I_i) is the m-th element of the vector of visual words and corresponds to the number of occurrences of w_m in the set B_W(I_i). In order to compare image word occurrences, the cosine distance can be used:
\[ d_w(I_i, I_j) = 1 - \frac{\sum_{m=1}^{n} tf_m(I_i)\, tf_m(I_j)}{\sqrt{\sum_{m=1}^{n} tf_m(I_i)^2}\; \sqrt{\sum_{m=1}^{n} tf_m(I_j)^2}} \tag{9} \]
More advanced weighting schemes based on Information Retrieval technology, such as TF-IDF, can also be used (e.g. [Tirilly et al. 2010]). Using these similarity functions, traditional inverted files can be used to search for nearest neighbor images.
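The following sketch illustrates the BoW pipeline of Eqs. (6)-(9) using scikit-learn's k-means; the vocabulary size and the input arrays are our own illustrative assumptions:

```python
# Sketch of the BoW pipeline: build a visual vocabulary with k-means (the
# cluster centers are the visual words), quantize each descriptor to its
# nearest word (Eq. 6), build the TF vector (Eq. 8), and compare TF vectors
# with the cosine distance (Eq. 9).
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors: np.ndarray, n_words: int = 1000) -> KMeans:
    return KMeans(n_clusters=n_words, n_init=4).fit(all_descriptors)

def tf_vector(descriptors: np.ndarray, vocab: KMeans) -> np.ndarray:
    words = vocab.predict(descriptors)  # b_W: each descriptor -> visual word id
    return np.bincount(words, minlength=vocab.n_clusters).astype(float)

def d_w(tf_i: np.ndarray, tf_j: np.ndarray) -> float:
    return 1.0 - tf_i @ tf_j / (np.linalg.norm(tf_i) * np.linalg.norm(tf_j))
```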
5.3 Geometric Consistency Constraints
In order to further improve the effectiveness of the pairwise image matching described above, geomet-
ric consistency constraints can be exploited. The problem is to determine a transformation that maps
the positions of the keypoints in the first image to the positions of the corresponding keypoints of the
second image. Only matches consistent with this transformation are retained. As discussed previously,
the coordinates of the keypoints, together with the size and orientation of the region, are associated
with each local descriptor.
The algorithms typically used to estimate such a transformation are the Random Sample Consensus (RANSAC) [Fischler and Bolles 1981] and Least Median of Squares. However, fitting methods such as RANSAC and Least Median of Squares perform poorly when the percentage of correct matches falls much below 50%. Fortunately, much better performance can be obtained by clustering features in the scale and orientation space using the Hough transform, as suggested in [Lowe 2004].
Estimating a transformation using RANSAC involves: 1) randomly selecting the number of matches required for the given transformation estimation; 2) evaluating the transformation itself; and 3) selecting the matches that are consistent with it.
A geometric transformation maps a point $\vec{p} = (p_x, p_y)$ to a second point $\vec{p}\,' = (p'_x, p'_y)$. In the following, we report the most common types of transformations that can be searched for.
Each of the following transformations can be used as a filter for a set of candidate matches M. In fact, the subset of matches that are consistent with the evaluated transformation is presumed to be a more reliable set of candidate matches than the original M.
5.3.1 Hough Transform (F_HOU). The Hough Transform is used to cluster matches into groups that agree upon a particular model pose (intuitively, the same point-of-view description of an object). The Hough Transform identifies clusters of features with a consistent interpretation by using each feature to vote for all object poses that are consistent with the feature [Lowe 2004]. When clusters of features are found that vote for the same pose of an object, the probability of the interpretation being correct is much higher than for any single feature. In our experiments, we create a Hough transform entry predicting the model orientation and scale from the match hypothesis. A pseudo-random hash function is used to insert votes into a one-dimensional hash table in which collisions are easily detected. The Hough transform is typically used to increase the percentage of inliers before estimating a transformation (typically using RANSAC). However, the largest cluster can itself be considered the subset of most relevant matches.
Therefore, we define F_HOU(M) as the subset of the candidate matches M that belongs to the largest cluster obtained with the Hough transform. For our experiments, we used the same parameters proposed in [Lowe 2004], i.e. a bin size of 30 degrees for orientation, a factor of 2 for scale, and 0.25 times the maximum model dimension for location.
Considering the clusters of matches created by the Hough transform, it is possible to estimate a
transformation that can map the points of one image onto another.
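A minimal sketch of the F_HOU filter follows, reusing the OpenCV match and keypoint objects of the earlier matching sketch; for brevity it bins only orientation (30-degree bins) and scale (factor-of-2 bins), omitting the location bins and the hashing details used in the paper:

```python
# Sketch of F_HOU: cluster candidate matches by the pose they vote for and
# keep the largest cluster. Matches are cv2.DMatch objects; kps_i and kps_j
# are the keypoint lists of the two images.
import math
from collections import defaultdict

def hough_filter(matches, kps_i, kps_j):
    if not matches:
        return []
    bins = defaultdict(list)
    for m in matches:
        ki, kj = kps_i[m.queryIdx], kps_j[m.trainIdx]
        d_theta = (kj.angle - ki.angle) % 360.0   # relative orientation
        log_scale = math.log2(kj.size / ki.size)  # relative scale
        key = (int(d_theta // 30.0), math.floor(log_scale))
        bins[key].append(m)
    return max(bins.values(), key=len)            # the largest pose cluster
```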
5.3.2 RST (F_RST). The Rotation, Scale and Translation transformation can be formalized as follows:
\[ \begin{pmatrix} p'_x \\ p'_y \end{pmatrix} = \begin{pmatrix} s\cos\theta & -s\sin\theta \\ s\sin\theta & s\cos\theta \end{pmatrix} \begin{pmatrix} p_x \\ p_y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix} \tag{10} \]
where θ is the angle of the counterclockwise rotation, s is the scaling factor and $\vec{t}$ is the translation. Estimating this transformation requires two pairs of matching points ($\vec{p}$ and $\vec{p}\,'$).
5.3.3 Affine (F_AFF). An affine transformation is a linear transformation (rotation, scaling, reflection and shear) followed by a translation:
\[ \begin{pmatrix} p'_x \\ p'_y \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} p_x \\ p_y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix} \tag{11} \]
Note that an RST transformation is a special case of a general affine transformation. An affine transformation allows shear mapping and/or reflection in addition to translation, rotation and scaling [Prince 2012]. A shear transformation leaves all points on one axis fixed, while the other points are shifted parallel to that axis by a distance proportional to their perpendicular distance from it. Estimating an affine transformation requires three pairs of matching points.
5.3.4 Homography (F_HMG). A homography is an invertible projective transformation from the real projective plane to the projective plane that maps straight lines to straight lines. Any two images of the same planar surface in space are related by a homography:
\[ w \begin{pmatrix} p'_x \\ p'_y \\ 1 \end{pmatrix} = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix} \begin{pmatrix} p_x \\ p_y \\ 1 \end{pmatrix} \tag{12} \]
where w is a scale parameter. Note that an affine transformation is a special type of general homography whose last row is fixed to h_31 = 0, h_32 = 0, h_33 = 1. Estimating this transformation requires four pairs of matching points.
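As an illustrative sketch, the F_HMG filter can be realized with OpenCV's RANSAC-based homography estimator; the reprojection threshold is our own assumption:

```python
# Sketch of F_HMG: fit the homography of Eq. (12) to the candidate matches
# with RANSAC and keep only the inlier matches.
import cv2
import numpy as np

def homography_filter(matches, kps_i, kps_j, reproj_thresh=5.0):
    if len(matches) < 4:  # a homography needs at least four point pairs
        return []
    src = np.float32([kps_i[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kps_j[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, reproj_thresh)
    if mask is None:
        return []
    return [m for m, keep in zip(matches, mask.ravel()) if keep]
```

Analogous filters for the RST and affine cases can be built on cv2.estimateAffinePartial2D and cv2.estimateAffine2D, which also return an inlier mask.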
5.3.5 Isotropic scaling. Typically, the coordinates of the points returned by the local feature extraction algorithms are reported in pixels of the image. However, a normalization can improve the effectiveness of the transformation estimation. In this work, we use an isotropic scaling [Hartley 1995] that scales and translates the pixel coordinates so as to bring the centroid of the set to the origin, and the average distance from the centroid to √2.
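A minimal sketch of this normalization, under our reading of the [Hartley 1995] scheme:

```python
# Isotropic normalization: translate the point set so that its centroid is at
# the origin, then scale it so that the average distance from the origin is
# sqrt(2).
import numpy as np

def normalize_points(pts: np.ndarray) -> np.ndarray:
    centered = pts - pts.mean(axis=0)
    mean_dist = np.linalg.norm(centered, axis=1).mean()
    return centered * (np.sqrt(2.0) / mean_dist)
```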
5.4 Enhancing Pairwise Image Matching with Geometric Consistency Constraints
Geometric consistency checks can be used when comparing two images by still using the image distance defined in Eq. (5) and replacing M_σ with the matches remaining after the geometric filtering described above.
In the following we give five options for defining the set of candidate matches M, using M_σ defined in Section 4.3 and the filtering criteria defined in Section 5.3. Used in conjunction with Eq. (5), these options result in five similarity functions. Specifically, the five different instantiations for M are:
—Plain distance ratio matches M_σ
—Hough matches M_HOU = F_HOU(M_σ)
—RST matches M_RST = F_RST(M_σ)
—Affine matches M_AFF = F_AFF(M_σ)
—Homography matches M_HMG = F_HMG(M_σ)
The performance of these similarity functions will also be compared in Section 7 below.
The BoW approach can also be exploited to define a set of candidate matches to be used as a basis for the geometric consistency checks. In this scenario, we do not use the cosine distance to calculate similarities between vectors; we directly match the BoW components:
\[ \dot{M}(I_i, I_j) = \{\langle l_i, l_j \rangle \mid b_W(l_i) = b_W(l_j)\} \tag{13} \]
In this case, Ṁ is used in place of M_σ, employed in the previous section. As before, we define four sets of candidate matches that are obtained by filtering the set Ṁ produced by the BoW approach.
Thus, in our experiments, in addition to cosine TF and cosine TF-IDF, we test the following BoW based approaches:
—Hough matches Ṁ_HOU = F_HOU(Ṁ)
—RST matches Ṁ_RST = F_RST(Ṁ)
—Affine matches Ṁ_AFF = F_AFF(Ṁ)
—Homography matches Ṁ_HMG = F_HMG(Ṁ)
6. KNN IMAGE CLASSIFICATION
Document classification has two flavours: single-label and multi-label. In single-label classification, documents may belong to only one class, while in multi-label classification documents may belong to more than one class [Korde and Mahender 2012]. In this paper we only consider single-label document classification.
Let S be a database of objects x, d the distance function for the objects, and C = {c_1, . . . , c_m} a predefined set of classes (also known as labels, or categories). Single-label document classification [Dudani 1975] is the task of automatically approximating or estimating, by means of a function $\hat{\Phi} : S \to C$, called the classifier, an unknown target function $\Phi : S \to C$ that defines how documents ought to be classified.
6.1 The Single Label kNN Classification
The single-label distance-weighted kNN classifier is one of the most simple and widely used methods for supervised learning. It first executes a kNN search between the objects of the training set T. The training set is the subset T ⊆ S of data used to fit the classifier, and for which we know the target function Φ. The result of this operation is the set kNN(x, T) of labeled documents belonging to the training set, ordered with respect to increasing values of the distance function d. The label assigned to the object x by the classifier is the class c_j ∈ C that maximizes the sum of the similarities between x and the documents labeled c_j in the ranked list kNN(x, T). The similarity between two objects can be calculated as s = 1 − d since, without loss of generality, we assume that 0 ≤ d ≤ 1 always holds. The classification task starts by computing the score $z^k_s(x, c_j)$ for each label c_j ∈ C:
\[ z^k_s(x, c_j) = \sum_{y \in kNN(x, T)\,:\, \Phi(y) = c_j} s(x, y) \tag{14} \]
The class that obtains the maximum score is then chosen:
\[ \hat{\Phi}^k_s(x, T) = \arg\max_{c_j \in C} z^k_s(x, c_j) \tag{15} \]
where $\hat{\Phi}^k_s(x, T)$ is the classification function. All the kNN classifier algorithms that we present in the next sections share the same basic principle and hence the same classification function. They differ in the way the confidence of the classification is computed, which can be used to decide whether the predicted label has a high probability of being correct. A special case is the local feature based classification, which uses kNN classification of the individual local features to estimate the confidence of the whole image classification.
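The following sketch implements the classification function of Eqs. (14)-(15), together with the second-best/best confidence ratio used throughout this section; the input format is our own illustrative choice:

```python
# Sketch of the single-label distance-weighted kNN classifier. knn_list holds
# the k nearest training objects as (label, similarity) pairs, with s = 1 - d.
from collections import defaultdict

def knn_classify(knn_list):
    z = defaultdict(float)
    for label, sim in knn_list:
        z[label] += sim                                 # score z_s^k (Eq. 14)
    ranked = sorted(z.items(), key=lambda kv: -kv[1])
    best_label, best_score = ranked[0]                  # arg max (Eq. 15)
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_label, 1.0 - second_score / best_score  # confidence nu
```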
6.2 Similarity Based kNN Image Classification
This section discusses how to classify images using a kNN classifier relying on the pairwise image distances introduced in Section 5. The technique defined in this section is the baseline approach to image classification using local features.
This approach, along with the BoW-based approach, will be compared against our proposed methods discussed in Sections 6.3 and 6.4.
Given a set of images $\mathcal{I}$ and a predefined set of classes C = {c_1, . . . , c_m}, the kNN classification can be obtained with the function $\hat{\Phi}^k_s(I_i, \mathcal{I})$ defined in Eq. (15), in which, in place of the similarity s = 1 − d, we can exploit one of the pairwise image distances defined in Section 5. A typical way of evaluating the confidence is one minus the ratio between the score obtained by the second-best label and that of the best label, i.e.:
\[ \nu_{doc}(\hat{\Phi}^k_s, I_i) = 1 - \frac{\max_{c_j \in C \setminus \{\hat{\Phi}^k_s(I_i)\}} z^k(I_i, c_j)}{\max_{c_j \in C} z^k(I_i, c_j)} \]
This classification confidence can be used to decide whether the predicted label has a high probability of being correct. A value of ν close to one denotes a high confidence.
When the number of classes, and consequently the number of training images, is large, the use of the simple distance based on local feature matching presented in Section 5.1 becomes prohibitive in terms of efficiency, since it implies a sequential scan of the entire training set. In this case, it is useful to introduce some approximations that allow the problem to be managed more efficiently. A widely used solution is the BoW approach presented in Section 5.2.
As already stated, this is just a baseline method for image classification. In the next section we propose a new solution that selects just the most promising local feature pairs in the images to be matched. This approach can use indexes for efficient similarity search to speed up the classification process. Surprisingly, this method is both more efficient and more effective, even if just a subset of the local features in the images is matched. We call this approach Dataset Matching, since the set of promising pairs is selected by submitting similarity queries on the entire database of local features.
6.3 Dataset Matching
The distance measure defined in Section 5.1 is a direct application of the techniques developed by the Computer Vision community and requires the direct comparison of each pair of images. In fact, the distances are neither metric nor symmetric, and the complexity of the distance evaluation prevents the use of any sort of indexing. Therefore, searching for the k nearest images to a given query image requires a complete sequential scan of the archive.
On the other hand, the BoW approach makes the search process much faster than executing a sequential scan of the training set. However, the quantization introduced by the visual vocabulary reduces the effectiveness of the method.
In this respect, we propose an efficient pairwise image matching that relies on access methods for metric spaces [Zezula et al. 2006] and, at the same time, increases the effectiveness, even with respect to the approach discussed in Section 5.1.
Let I_i be the image that we want to classify and S = {I_1, . . . , I_M} the entire training set of images of size M. We propose to retrieve, for every local feature l_i of I_i, the k closest local features from the union of all the local features in all the images of the training set, $\Omega = \bigcup_i I_i$. We denote the k closest local features to l_i as kNN(l_i, Ω), and we call it the set of candidate matches.
Since the distance functions d for comparing local features are metric distances (SIFT and SURF use the Euclidean distance, ORB and BRISK use the Hamming distance), metric [Zezula et al. 2006] or spatial [Samet 2005] access methods can be used to efficiently execute kNN(l_i, Ω).
We define the matches between an image I_i and any I_j ∈ S as:
\[ \bar{M}(I_i, I_j) = \{\langle l_i, l_j \rangle \mid l_i \in I_i \wedge l_j = NN_1(l_i, kNN(l_i, \Omega) \cap I_j)\} \tag{16} \]
In short, we select the matching local features in two images by considering just the candidate matching local features obtained by executing the nearest neighbor similarity search query kNN(l_i, Ω). Note that kNN(l_i, Ω) needs to be executed just once for every local feature l_i of I_i, independently of the size M of the training set.
In this scenario, $\bar{M}$ is used in place of the candidate set of matches M_σ, employed in Section 5.1, for evaluating the distance between two images:
\[ d_S(I_i, I_j) = 1 - \frac{|\bar{M}(I_i, I_j)|}{|I_i|} \tag{17} \]
The matching function $\bar{M}$ defined in this section can also be enhanced by the use of the four geometric filtering criteria defined in Section 5.4. Thus, in Section 7 we also test:
—Plain nearest neighbor matches $\bar{M}$
—Hough matches $\bar{M}_{HOU} = F_{HOU}(\bar{M})$
—RST matches $\bar{M}_{RST} = F_{RST}(\bar{M})$
—Affine matches $\bar{M}_{AFF} = F_{AFF}(\bar{M})$
—Homography matches $\bar{M}_{HMG} = F_{HMG}(\bar{M})$
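A minimal sketch of Dataset Matching follows; for illustration, the metric access method is replaced by a brute-force scikit-learn index, and the input layout (a flat array of all training features Ω plus a parallel list of their image IDs) is our own assumption:

```python
# Sketch of Eqs. (16)-(17): one kNN query per query feature against the pool
# Omega of ALL training features; each query feature contributes at most one
# match to each training image appearing among its k candidates.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dataset_matching_distances(query_feats, omega_feats, omega_image_ids, k=10):
    index = NearestNeighbors(n_neighbors=k).fit(omega_feats)
    _, nn_idx = index.kneighbors(query_feats)  # kNN(l_i, Omega) for every l_i
    match_counts = {}
    for row in nn_idx:                         # row is ordered by distance
        seen = set()
        for j in row:
            img = omega_image_ids[j]           # first hit per image realizes
            if img not in seen:                # NN_1(l_i, kNN(l_i, Omega) ∩ I_j)
                seen.add(img)
                match_counts[img] = match_counts.get(img, 0) + 1
    n = len(query_feats)
    return {img: 1.0 - c / n for img, c in match_counts.items()}  # Eq. (17)
```

At scale, the brute-force index would be replaced by a metric or spatial access method, as discussed above.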
6.4 Local Features Based kNN Classifier
In the previous section, we considered the classification of an image I_i as a process of retrieving the most similar images in the training set T_l and then applying a kNN classification technique in order to predict the class of I_i.
In this section, we propose a different approach that classifies an image I_i in two steps:
(1) each local feature l_i ∈ I_i is first individually classified considering the local features of all the images in the training set T_l;
(2) the whole image is classified considering the class assigned to each local feature and the confidence of the classification evaluated in step 1.
Note that by classifying the local features individually, i.e. before assigning a label to an image, we might lose the implicit mutual information between the interest points of an image. However, surprisingly, we will see that this method gives a better performance than the other approaches.
In the next sections, we will define four distinct algorithms for local feature classification, i.e. step 1. All the proposed algorithms require searching for similar local features for each of the local features belonging to the image.
6.4.1 Step 1: Local Feature Classification. All the following kNN Local Feature Classifiers are applications of the single-label distance-weighted kNN discussed in Section 6.1. They make use of a similarity function that can be obtained from the distance measure d between local features by applying the well-known transformation s = 1 − d/d_MAX.
1NN LF Classifier ($\hat{\Phi}^f$). The simplest way to classify a local feature is to consider the label of its closest neighbor in T_l. The 1NN Local Features Classifier $\hat{\Phi}^1_s(l_x)$ assigns to a local feature l_x the label of its closest neighbor in T_l. The confidence of the classification is the similarity between l_x and its nearest neighbor. Formally:
\[ \nu(\hat{\Phi}^1_s, l_x) = s(l_x, NN_1(l_x, T_l)) \]
Note that this classifier does not require any parameter to be set. Moreover, the similarity search over the local features training set is a simple 1NN search.
Weighted kNN LF Classifier ($\hat{\Phi}^k$). The Weighted kNN LF Classifier is the natural application of the kNN classification function $\hat{\Phi}^k_s(l_x, T_l)$ to local features. The confidence is similarly based on the ratio between the second-best and the best class, as follows:
\[ \nu(\hat{\Phi}^k_s, l_x) = 1 - \frac{\max_{c_j \in C \setminus \{\hat{\Phi}^k_s(l_x, T_l)\}} z^k_s(l_x, c_j)}{\max_{c_i \in C} z^k_s(l_x, c_i)} \]
Note that for k = 1, this degenerates to the 1NN LF classifier case, although the measure of confidence is different. In fact, with k = 1 the Weighted kNN LF Classifier always assigns 1 as the confidence, while the 1NN LF Classifier considers the first nearest neighbor similarity as its measure of confidence. This classifier requires the parameter k to be chosen.
LF Matching Classifier ($\hat{\Phi}^m$). The LF Matching Classifier decides the candidate label as the 1NN LF Classifier does, i.e. $\hat{\Phi}^1_s(l_x, T_l)$, while the confidence value of the selected label is evaluated using the idea of the distance ratio discussed in Section 4.3:
\[ \nu(\hat{\Phi}^1_s, l_x) = \begin{cases} 1 & \text{if } \dot{\sigma}(l_x, T_l) < c\\ 0 & \text{otherwise} \end{cases} \]
The distance ratio $\dot{\sigma}$ is computed considering the nearest local feature to l_x and the closest local feature that has a label different from that of the nearest local feature. Following the idea of Lowe explained in Section 4.3, we define the distance ratio $\dot{\sigma}$ as:
\[ \dot{\sigma}(l_x, T_l) = \frac{d(l_x, NN_1(l_x, T_l))}{d(l_x, NN^*_2(l_x, T_l))} \]
where $NN^*_2(l_x, T_l)$ is the closest neighbor that is known to be labeled differently from the first.
Note that searching for $NN^*_2(l_x, T_l)$ cannot be directly translated into a standard kNN search. However, kNN search in metric spaces is generally implemented starting with an infinite range that is reduced during the evaluation, considering at any time the current $NN_k$. The same approach can be used when searching for $NN^*_2(l_x, T_l)$. In fact, while k is not known in advance, the current $NN^*_2$ found during the similarity search can be used to reduce the range of the query. Thus, the similarity search needed for the evaluation of $\dot{\sigma}(l_x, T_l)$ can be implemented by slightly modifying the standard algorithms developed for metric spaces (see [Zezula et al. 2006]).
The parameter c used in the definition of the confidence is equivalent to that used in [Lowe 2004] and [Bay et al. 2006]. We will see in Section 7.3 that the value c = 0.8 proposed by Lowe in [Lowe 2004] guarantees good performance. It is worth noting that c is the only parameter to be set for this classifier, considering that the similarity search performed over the local features in T_l does not require a parameter k to be set.
Weighted LF Distance Ratio Classifier ($\hat{\Phi}^w$). The Weighted LF Distance Ratio Classifier is an extension of the LF Matching Classifier defined in the previous section. However, the confidence here is not binary but a fuzzy measure derived from the distance ratio. Given that the greater the confidence the better the matching, we define the confidence for the assigned label as:
\[ \nu(\hat{\Phi}^1_s, l_x) = (1 - \dot{\sigma}(l_x, T_l))^2 \]
The intuition is that it could be preferable not to filter non-matching features on the basis of the distance ratio, but to adopt $1 - \dot{\sigma}(l_x, T_l)$ as a measure of confidence for the classification of the whole image. The value is then squared to emphasize the relative importance of greater distance ratios.
Note that for this classifier we do not have to specify either a distance ratio threshold c or a value of k; thus, this classifier has no parameters.
Weighted LF Distance Ratio with Geometric Constraints ($\hat{\Phi}^w_g$). It is also possible to combine the classification approach of the Weighted LF Distance Ratio Classifier with the geometric consistency filtering presented in Section 5.3.
First, we perform a nearest neighbor search for each local feature of the image I_i to be classified. At the end of this process, for each local feature l_i ∈ I_i, we apply geometric consistency filtering to obtain sets of candidate matches for the ⟨I_i, I_j⟩ image pairs. Finally, we merge the local feature pairs ⟨l_i, l_j⟩ in all the filtered matches $\bar{M}_g$, i.e.:
\[ M_g = \bigcup_j \bar{M}_g(I_i, I_j) \]
where g stands for HOU (Hough), RST (Rotation, Scale and Translation), AFF (Affine) or HMG (Homography), as explained in Section 5.3. Note that the I_j are at most all the images having at least one feature in kNN(l_i, T_l) for some l_i ∈ I_i. Furthermore, given that each geometric consistency filter requires a minimum number of points to be applied, the number of such images is typically much smaller.
The classification process is then performed as for the Weighted LF Distance Ratio Classifier, just considering the filtered set of local features M_g obtained with one of the specific filters (HOU, RST, AFF, or HMG), i.e. $\hat{\Phi}^1_s(l_x, M_g)$, with the following confidence:
\[ \nu(\hat{\Phi}^1_s, l_x) = (1 - \dot{\sigma}(l_x, M_g))^2 \]
6.4.2 Step 2: Image Classification. In the following, we assume that the label of each local feature
lx, belonging to images in the training set Tl, is the label assigned to the image to which it belongs (i.e.,
Ix):
lxIx,IxT , Φ(lx) = Φ(Ix)(18)
In other words, we assume that the local features generated over interest points of images in the
training set can be labeled as the image to which they belong. Note that the local features classifier
can manage the noise introduced by this label propagation from the whole image to the local features.
In fact, we will see that when very similar training local features are assigned to different classes, a
ACM Journal on Computing and Cultural Heritage, Vol. 8, No. 4, Article 18, Publication date: August 2015.
18:16 G. Amato, F. Falchi and C. Gennaro
local feature close to them is classified with a low confidence. The experimental evaluation reported in
Section 7.3 confirms the validity of this assumption.
As already stated, given lxIx, the classifier ˆ
Φof step 1 returns both a class ˆ
Φ(lx) = ciCto lxand
a numerical value ν(ˆ
Φ, lx)that represents the confidence that ˆ
Φhas in this decision.
The whole image is classified, given the label ˆ
Φ(lx)and the confidence ν(ˆ
Φ, lx)assigned to its local
features lxIxduring the first phase, using a confidence-rated majority vote approach. We first com-
pute a score z(lx, ci)for each label ciC. The score is the sum of the confidences obtained for the local
features predicted as ci. Formally,
z(Ix, ci) = X
lxIx,ˆ
Φ(lx)=ci
ν(ˆ
Φ, lx).
The label that obtains the maximum score is then chosen:
ˆ
Φ(Ix) = arg max
cjCz(Ix, cj).
As a measure of confidence for the classification of the whole image, we use one minus the ratio between the score of the second best class and that of the predicted class:

$$\nu_{img}(\hat{\Phi}, I_x) = 1 - \frac{\displaystyle\max_{c_j \in C,\; c_j \neq \hat{\Phi}(I_x)} z(I_x, c_j)}{\displaystyle\max_{c_i \in C} z(I_x, c_i)}.$$
This whole-image classification confidence can be used to decide whether the predicted label has a high probability of being correct.
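A compact sketch of this second step, assuming the step-1 classifier has already produced a (label, confidence) pair for each local feature of the image:

```python
from collections import defaultdict

def classify_image(feature_predictions):
    """feature_predictions: (label, confidence) pairs, one per local feature
    of the image, produced by the step-1 classifier. Returns the image label
    and the whole-image confidence 1 - z(second best) / z(best)."""
    scores = defaultdict(float)
    for label, conf in feature_predictions:
        scores[label] += conf                  # confidence-rated majority vote
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0
    img_conf = 1.0 - second_score / best_score if best_score > 0 else 0.0
    return best_label, img_conf
```

The returned confidence is 1 when all confident features agree on a single class and approaches 0 as the second best class catches up with the winner.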
7. EXPERIMENTAL EVALUATION
The aim of this performance analysis is to evaluate the classification effectiveness of the different
strategies of kNN classification combined with various types of local features with and without geo-
metric consistency checks.
7.1 Dataset, Ground Truth, and Experiment Settings
The dataset that we used for our tests is publicly available and composed of 1,227 photos of 12 monuments or cultural heritage related landmarks located in Pisa.
It was created during the VISITO Tuscany project¹ and was also used in [Amato et al. 2010; Amato and Falchi 2010; 2011]. The photos were crawled from Flickr, the well-known online photo service. The IDs of the photos used for these experiments, together with the assigned labels and extracted features, can be downloaded from [pis 2011]. In the following, we list the classes that we used and the number of photos belonging to each class. In Figure 1 we report an example for each class, in the same order as they appear in the list below:
(1) Battistero (104 photos) – the baptistery of St. John
(2) Camposanto Monumentale (exterior) (46 photos)
(3) Camposanto Monumentale (portico) (138 photos)
(4) Camposanto Monumentale (field) (113 photos)
(5) Certosa (53 photos) – the charterhouse
¹ http://www.visitotuscany.it/
Fig. 1. Example images taken from the Pisa dataset (images available from Flickr under Creative Commons license agreements).
(6) Chiesa della Spina (112 photos) – Gothic church
(7) Guelph tower (71 photos)
(8) Duomo (130 photos) – the cathedral of St. Mary
(9) Palazzo dell’Orologio (92 photos) – building
(10) Basilica of San Piero (48 photos) – church of St. Peter
(11) Palazzo della Carovana (101 photos) – building
(12) Leaning Tower (119 photos) – leaning campanile
In order to build and evaluate a classifier for these classes, we divided the dataset into a training set ($\mathcal{T}_l$) consisting of 226 photos (approximately 20% of the dataset) and a test set consisting of 921 photos (approximately 80% of the dataset). The image resolution used for feature extraction is the standard resolution used by Flickr (a maximum of 500 pixels for either the height or the width).
The total numbers of local features extracted by the SIFT and SURF detectors were about 1,000,000 and 500,000, respectively. The number of local features per image varies between 113 and 2,816 for SIFT, and between 50 and 904 for SURF. ORB was tested setting the feature extractor to identify both 500 and 1,000 local features. The number of local features detected by BRISK was less than 500.
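For reference, local features of these types can be extracted with a recent OpenCV build along the following lines; the file name is hypothetical and, except for the ORB feature counts, the detector settings shown are library defaults rather than the exact configuration used here:

```python
import cv2

img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical image file

detectors = {
    "SIFT": cv2.SIFT_create(),
    "ORB500": cv2.ORB_create(nfeatures=500),
    "ORB1000": cv2.ORB_create(nfeatures=1000),
    "BRISK": cv2.BRISK_create(),
    # SURF is patented and only available in opencv-contrib builds:
    # cv2.xfeatures2d.SURF_create()
}
for name, det in detectors.items():
    keypoints, descriptors = det.detectAndCompute(img, None)
    print(name, len(keypoints), "local features")
```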
Various classifiers were created using the local features taken into consideration and the definitions given in Section 6.
7.2 Performance Measures
In order to evaluate the effectiveness of the classifiers on the test set, we use the micro-averaged accuracy and the micro- and macro-averaged precision, recall, and F1.
In macro-averaging, the performance metrics are calculated for each class and then averaged over all classes. In micro-averaging, the average is calculated across all the individual classification decisions made by a system [Chau and Chen 2008].
Precision is defined as the ratio between the number of correctly predicted documents and the overall number of predicted documents for a specific class. Recall is the ratio between the number of correctly predicted documents and the overall number of documents of a specific class. F1 is the harmonic mean of precision and recall.
Note that for the single-label classification task, micro-averaged accuracy is defined as the number of documents correctly classified divided by the total number of documents in the test set, and it is equivalent to the micro-averaged precision, recall, and F1 scores. Therefore, in the tables discussed in the following, we just report the values of accuracy and F1 Macro.
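A short sketch of how these measures can be computed with scikit-learn, on hypothetical label vectors:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["Duomo", "Battistero", "Duomo", "Certosa"]  # hypothetical labels
y_pred = ["Duomo", "Duomo", "Duomo", "Certosa"]

# Micro-averaging pools all individual decisions; for single-label
# classification it coincides with accuracy.
acc = accuracy_score(y_true, y_pred)
p_macro, r_macro, f1_macro, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.2f}  F1 Macro={f1_macro:.2f}")
```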
7.3 Similarity Based Image kNN Classification Results
In Figure 2, we report the results obtained by both the local feature matching (see Section 5.1) and dataset matching (see Section 6.3) approaches. Given that the kNN classifier requires the parameter $k$, we report the results obtained for $k = 1$ (row labeled k=1), the best results obtained varying $k \in [1, 100]$ (row labeled Best), and the value of $k$ at which the best result was obtained (row labeled Best k). Figure 3 reports the results obtained by using the BoW approach (see Section 5.2). The details are discussed in the following sections.
7.3.1 Local Feature Matching. Comparing the results obtained by the various similarity functions for the local feature matching approach, we can see that geometric consistency checks are able to significantly improve the quality of the classification process. The best performance was generally obtained using the distance function that makes use of the affine geometric constraint. Only ORB with 500 local features and BRISK sometimes obtain a better F1 Macro when using the Rotation, Scale and Translation (RST) geometric constraint checks. Overall, SIFT provides the highest effectiveness, achieving an accuracy and F1 Macro of 0.94 with $k = 3$. However, ORB and BRISK have a more compact size and are easier to manage, given that they are binary features. Thus, even if their effectiveness is a little lower, their usage can be justified in terms of efficiency.
7.3.2 Dataset Matching. In Figure 2, we also report the results obtained by the dataset matching approach using $\bar{k} = 10$, i.e., performing a 10-nearest-neighbor search for each local feature in the query over the local features in the training set. In our experiments, we also tested $\bar{k} = 30$, $50$, and $100$, obtaining comparable but worse results. We note that a peculiar feature of this approach is that it can rely on spatial or metric access methods for similarity searching to significantly improve the efficiency of classification with very large training sets. Notably, this approach often performs better than the local feature matching approach. The intuition justifying this behavior is that the $\bar{k}$-nearest-neighbor search performed over all the local features in the training set is able to reduce the number of false matches. The best results are obtained using the Hough and RST geometric constraint checks, with an accuracy and F1 Macro of 0.95 and 0.94, respectively, using SIFT. Hough obtains its best performance with $k = 1$, while RST needs $k = 9$. Since executing the Hough transformation costs much less than RST, Hough is the best choice in this case. As before, SIFT is the local feature that offers the best results.
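The search step of dataset matching can be sketched as follows; scikit-learn's `NearestNeighbors` is used here as a simple stand-in for the metric or spatial access methods mentioned above, and all data is synthetic:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
train_descs = rng.random((5000, 128)).astype(np.float32)  # all training descriptors
train_image_ids = rng.integers(0, 226, size=5000)         # owning training image

index = NearestNeighbors(n_neighbors=10).fit(train_descs)  # k-bar = 10

def dataset_matches(query_descs):
    """Group the 10 nearest training features of every query feature by
    the training image they belong to (candidate matches per image pair)."""
    _, nn_idx = index.kneighbors(query_descs)
    candidates = {}
    for q, row in enumerate(nn_idx):
        for t in row:
            candidates.setdefault(int(train_image_ids[t]), []).append((q, int(t)))
    return candidates

matches = dataset_matches(rng.random((300, 128)).astype(np.float32))
```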
7.3.3 Bag of Words. In Figure 3, we report the results obtained by the BoW approach described in Section 5.2, using a vocabulary of 100k visual words selected using the k-means algorithm. As established in the literature, typically, the more words, the better the results. In our experiments, we are dealing with a dataset of about one million features. Thus, 100k visual words is the highest value for which it makes sense to run a clustering algorithm.
[Figure 2 (table): accuracy and F1 Macro of the similarity based image kNN classifiers (rows k=1, Best, and Best k, with $k \in [1, 100]$) for SIFT, SURF, ORB, and BRISK, under the local feature matching and dataset matching ($\bar{k} = 10$) approaches, without and with the Hough, RST, Affine, and Homography geometric consistency checks.]
Fig. 2. Similarity based image kNN classification results using the local feature and dataset matches for $\bar{k} = 10$.
The results in this case are worse than those obtained with our proposed approaches discussed in Sections 7.3.1 and 7.3.2. In fact, both accuracy and F1 Macro never exceed 0.9. Moreover, the geometric consistency checks do not significantly improve performance; this is particularly true for F1. The intuition is that the candidate matches found using the BoW approach are much too noisy. Standard cosine and TF-IDF similarity measures are more suitable for this scenario. It is worth noting that the k-means algorithm for selecting the 100k words was executed over the whole dataset, while it would have been more correct to only consider the training images. In fact, the test images should not be used during any training phase. However, we preferred to compare our approach in this scenario, even if the BoW performance is actually overestimated.
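For illustration, a rough sketch of such a BoW pipeline on synthetic data (with a much smaller vocabulary than the 100k words used here, and a generic TF-IDF/cosine scorer):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
train_pool = rng.random((20_000, 128)).astype(np.float32)  # stand-in descriptors

# 1. Visual vocabulary: cluster training descriptors into visual words.
vocab = MiniBatchKMeans(n_clusters=200).fit(train_pool)

def bow(descs):
    """Quantize an image's local descriptors into a word histogram."""
    return np.bincount(vocab.predict(descs), minlength=vocab.n_clusters)

# 2. TF-IDF weighting and cosine similarity against the training images.
train_bows = np.stack([bow(rng.random((300, 128)).astype(np.float32))
                       for _ in range(50)])                # stand-in images
tfidf = TfidfTransformer().fit(train_bows)
query_vec = tfidf.transform(bow(rng.random((300, 128)).astype(np.float32))[None, :])
sims = cosine_similarity(query_vec, tfidf.transform(train_bows)).ravel()
```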
7.4 Local Features Based Image Classifier Results
Figure 4 reports the results obtained with the local feature based classifier (see Section 6.4). Similarly to the approach based on image to local feature matching, this approach also allows us to significantly improve efficiency, relying on metric or spatial access methods for similarity searching.
[Figure 3 (table): accuracy and F1 Macro of the BoW classifiers (rows k=1, Best, and Best k) for SIFT, SURF, ORB, and BRISK, using cosine and TF-IDF similarity, without and with the Hough, RST, Affine, and Homography geometric consistency checks.]
Fig. 3. Classification results using the BoW approach with a vocabulary of 100k features.
In fact, local features can be classified using distance functions that can be easily indexed with these access methods.
Experiments show that very good results are obtained even without geometric constraint checks. For instance, using SIFT, we obtained accuracy and F1 Macro values of 0.95 simply with the weighted local feature classifier. In just a few cases, the Hough transformation slightly improves the performance. However, the improvement obtained does not justify the additional cost involved. For instance, for all binary local features, improvements range from 0.01 to 0.02 in both accuracy and F1 Macro using Hough with respect to the simple weighted local feature classifier.
Overall, the performance of this approach is comparable to the best results obtained by the approach
based on image to local feature matching. However, its efficiency is higher as it does not require geo-
metric constraint checks to be performed.
[Figure 4 (table): accuracy and F1 Macro of the local features based classifier variants of Section 6.4 at $k = 10$ and $k = 100$ for SIFT, SURF, ORB, and BRISK, without and with the Hough, RST, Affine, and Homography geometric consistency checks.]
Fig. 4. Local features based classifier results.
In order to compare per-class results, in Figures 5, 6, 7, and 8 we report the confusion matrices for the most relevant classifiers tested, according to the results reported in the previous sections. All the results were obtained using SIFT in order to be comparable. We report actual classes by column and assigned classes by row. Each monument is indicated by the number used in Section 7.1 to describe the dataset. The last rows show the per-monument recall and F1, while in the last column we report the precision. Overall, the most difficult monuments to recognize turned out to be Camposanto Monumentale (2), Certosa (5), and the Guelph tower (7). Comparing the matrices, we see major variations in the relative and absolute performance obtained by the various approaches on (2) and (7). For instance, dataset matching (Figure 7) and the Weighted LF Distance Ratio Classifier $\hat{\Phi}^w$ (Figure 8) have overall similar performance, but they obtained significantly different results on these two classes.
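Per-class precision, recall, and F1 can be read off a confusion matrix laid out as in these figures; a small sketch on hypothetical label vectors, transposing scikit-learn's convention (actual by row) to match the figures (assigned by row):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 2, 2, 2]    # actual monument ids (hypothetical)
y_pred = [0, 1, 1, 2, 2, 0]    # assigned monument ids

# Transpose so rows are assigned classes and columns are actual ones,
# matching the layout of Figures 5-8.
cm = confusion_matrix(y_true, y_pred).T
precision = np.diag(cm) / cm.sum(axis=1)   # last column of the figures
recall = np.diag(cm) / cm.sum(axis=0)      # next-to-last row
f1 = 2 * precision * recall / (precision + recall)
```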
8. CONCLUSIONS
In this paper, we have developed several strategies for efficient landmark recognition, which combine two different approaches to k-nearest neighbor classification with different methods for matching local descriptors.
The results of the experiments conducted in a cultural heritage scenario revealed that the proposed approaches perform better than other state-of-the-art approaches.
Among the techniques that we proposed, the local feature based classifier gave the best performance. With this classifier, we can improve efficiency by using metric or spatial access methods. In addition, the effectiveness provided is generally equal to or better than that of the other methods. The great advantage of this method is that it offers high performance even without geometric consistency checks, thus further raising efficiency. Comparisons were executed using various types of local features. The best performance was always obtained using SIFT. Although binary features (ORB and BRISK) were generally slightly worse, they can further boost efficiency, given their compactness and convenience for mobile applications.
A system built with the proposed image recognition approach is mainly intended to be used by visitors (tourists) of cities with cultural heritage related landmarks, for instance using a smartphone to recognize and get information on the monuments that they see. Clearly, these techniques can also be used to build systems that allow researchers to retrieve information on artworks that are mainly described by their visual appearance. In this respect, we are using these techniques to provide access to databases of ancient inscriptions and epigraphy in the EU-funded EAGLE project [eag 2014].
Fig. 5. Confusion matrix obtained by the local feature matching approach with affine geometric consistency check and $k = 3$. Overall acc = 0.94 and F1 Macro = 0.93. [Matrix omitted: assigned monuments by row, actual monuments by column, with per-class precision, recall, and F1.]
Fig. 6. Confusion matrix obtained by the BoW approach using cosine and TF-IDF and $k = 8$. Overall acc = 0.90 and F1 Macro = 0.87. [Matrix omitted.]
Fig. 7. Confusion matrix obtained by the dataset matching approach with Hough geometric consistency check and $k = 1$. Overall acc = 0.95 and F1 Macro = 0.95. [Matrix omitted.]
Fig. 8. Confusion matrix obtained by the Weighted LF Distance Ratio Classifier $\hat{\Phi}^w$. Overall acc = 0.95 and F1 Macro = 0.95. [Matrix omitted.]
The traditional way of retrieving information from an epigraphic database is, for instance, to submit text queries related to the place where the item was found, or where it is currently stored. Using our techniques, it is possible to retrieve information by simply using a picture of the epigraph as a query.
REFERENCES
2005. SIFT Keypoint Detector. http://www.cs.ubc.ca/~lowe/keypoints/. (2005). Last accessed on 12 November 2014.
2006. SURF Detector. http://www.vision.ee.ethz.ch/~surf/. (2006). Last accessed on 12 November 2014.
2010. Google Goggles. http://www.google.com/mobile/goggles/. (2010). Last accessed on 12 November 2014.
2011. Pisa Landmarks Dataset. http://www.fabriziofalchi.it/pisaDataset/. (2011). Last accessed on 12 November 2014.
2014. EAGLE. http://www.eagle-network.eu/. (2014). Last accessed on 13 November 2014.
G. Amato, P. Bolettieri, F. Falchi, and C. Gennaro. 2013. Large Scale Image Retrieval Using Vector of Locally Aggregated
Descriptors. In Similarity Search and Applications. LNCS, Vol. 8199. Springer Berlin Heidelberg, 245–256.
G. Amato and F. Falchi. 2010. kNN based image classification relying on local feature similarity. In SISAP ’10: Proceedings of
the Third International Conference on SImilarity Search and APplications. ACM, New York, NY, USA, 101–108.
G. Amato and F. Falchi. 2011. Local Feature based Image Similarity Functions for kNN Classification. In Proceedings of the 3rd International Conference on Agents and Artificial Intelligence (ICAART 2011), Vol. 1. SciTePress, 157–166.
G. Amato, F. Falchi, and P. Bolettieri. 2010. Recognizing Landmarks Using Automated Classification Techniques: an Evaluation of Various Visual Features. In Proceedings of the Second International Conference on Advances in Multimedia (MMEDIA 2010). IEEE Computer Society, 78–83.
Giuseppe Amato, Fabrizio Falchi, and Claudio Gennaro. 2011. Geometric Consistency Checks for kNN Based Image Classifica-
tion Relying on Local Features. In Proceedings of the Fourth International Conference on SImilarity Search and APplications
(SISAP ’11). ACM, New York, NY, USA, 81–88. DOI:http://dx.doi.org/10.1145/1995412.1995428
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. Surf: Speeded up robust features. In ECCV. 404–417.
Aurélien Bellet, Amaury Habrard, and Marc Sebban. 2013. A Survey on Metric Learning for Feature Vectors and Structured Data. arXiv preprint arXiv:1306.6709 (2013).
Oren Boiman, Eli Shechtman, and Michal Irani. 2008. In defense of Nearest-Neighbor based image classification. In CVPR.
A. Bosch, A. Zisserman, and X. Munoz. 2008. Scene Classification using a Hybrid Generative/Discriminative Approach. IEEE
Transactions on Pattern Analysis and Machine Intelligence 30, 4 (2008).
Michael Chau and Hsinchun Chen. 2008. A machine learning approach to web page filtering using content and structure
analysis. Decision Support Systems 44, 2 (2008), 482 – 494.
Tao Chen, Kui Wu, Kim-Hui Yap, Zhen Li, and Flora S. Tsai. 2009. A Survey on Mobile Landmark Recognition for Information
Retrieval. In MDM ’09. IEEE Computer Society, 625–630.
Jean-Pierre Chevallet, Joo-Hwee Lim, and Mun-Kew Leong. 2007. Object identification and retrieval from efficient image
matching. Snap2Tell with the STOIC dataset. Information processing & management 43, 2 (2007), 515–530.
T. Cover and P. Hart. 1967. Nearest neighbor pattern classification. Information Theory, IEEE Transactions on 13, 1 (1967),
21–27.
S. Dudani. 1975. The Distance-Weighted K-Nearest-Neighbour Rule. IEEE Transactions on Systems, Man and Cybernetics
SMC-6(4) (1975), 325–327.
T. Fagni, F. Falchi, and F. Sebastiani. 2010. Image classification via adaptive ensembles of descriptor-specific classifiers. Pattern
Recognition and Image Analysis 20 (2010), 21–28. Issue 1.
Martin A. Fischler and Robert C. Bolles. 1981. Random Sample Consensus: A Paradigm for Model Fitting with Applications to
Image Analysis and Automated Cartography. Commun. ACM 24, 6 (1981), 381–395.
Andrea Frome, Yoram Singer, and Jitendra Malik. 2007. Image Retrieval and Classification Using Local Distance Functions. In
Advances in Neural Information Processing Systems: Proceedings of the 2006 Conference, Vol. 19. The MIT Press, 417.
Nicolás García-Pedrajas and Domingo Ortiz-Boyer. 2009. Boosting k-nearest neighbor classifier by means of input space projection. Expert Systems with Applications 36, 7 (2009), 10570–10582.
Kristen Grauman and Trevor Darrell. 2007. The Pyramid Match Kernel: Efficient Learning with Sets of Features. J. Mach.
Learn. Res. 8 (May 2007), 725–760.
Daniel Haase and Joachim Denzler. 2011. Comparative Evaluation of Human and Active Appearance Model Based Tracking
Performance of Anatomical Landmarks in Locomotion Analysis. In Proceedings of the 8th Open German-Russian Workshop
Pattern Recognition and Image Understanding (OGRW). 96–99.
R. I. Hartley. 1995. In defence of the 8-point algorithm. In Proceedings of the Fifth International Conference on Computer Vision (ICCV '95). IEEE Computer Society, Washington, DC, USA, 1064–1070.
J. Hays and A.A. Efros. 2008. IM2GPS: estimating geographic information from a single image. In Computer Vision and Pattern
Recognition, 2008. CVPR 2008. IEEE Conference on. 1–8. DOI:http://dx.doi.org/10.1109/CVPR.2008.4587784
Jared Heinly, Enrique Dunn, and Jan-Michael Frahm. 2012. Comparative Evaluation of Binary Features. In Computer Vision - ECCV 2012. 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid (Eds.). Springer Berlin Heidelberg, 759–773. DOI:http://dx.doi.org/10.1007/978-3-642-33709-3_54
Stefan Hinterstoisser, Vincent Lepetit, Selim Benhimane, Pascal Fua, and Nassir Navab. 2011. Learning Real-Time Perspective
Patch Rectification. International Journal of Computer Vision 91, 1 (2011), 107–130.
H. Jegou, M. Douze, C. Schmid, and P. Perez. 2010. Aggregating local descriptors into a compact image representation. In
Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. 3304–3311.
E. Johns and Guang-Zhong Yang. 2011. From images to scenes: Compressing an image cluster into a single scene model for
place recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on. 874–881.
Vandana Korde and C Namrata Mahender. 2012. Text Classification and Classifiers: A Survey. International Journal of Artificial
Intelligence & Applications (IJAIA) 3, 2 (2012), 85–99.
Mathieu Labbé. 2014. Find-Object. https://code.google.com/p/find-object/. (2014). Last accessed on 13 November 2014.
Stefan Leutenegger, Margarita Chli, and Roland Y Siegwart. 2011. BRISK: Binary robust invariant scalable keypoints. In
Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2548–2555.
Miguel Lourenço. 2011. Local Invariant Features. http://arthronav.isr.uc.pt/~mlourenco/files/tutorial.pdf. (2011).
David G. Lowe. 2004. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60,
2 (2004), 91–110.
Shyjan Mahamud and Martial Hebert. 2003. The optimal distance measure for object detection. In Computer Vision and Pattern
Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, Vol. 1. IEEE, I–248.
T. Malisiewicz and A.A. Efros. 2008. Recognition by association via learning per-exemplar distances. In Computer Vision and
Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. 1–8.
Jiri Matas, Ondrej Chum, Martin Urban, and Tomás Pajdla. 2004. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing 22, 10 (2004), 761–767.
Mahmoud Mejdoub and Chokri Ben Amar. 2011. Classification improvement of local feature vectors over the KNN algorithm. Multimedia Tools and Applications (2011), 1–22. http://dx.doi.org/10.1007/s11042-011-0900-4
Krystian Mikolajczyk, Bastian Leibe, and Bernt Schiele. 2005. Local features for object class recognition. In Computer Vision,
2005. ICCV 2005. Tenth IEEE International Conference on, Vol. 2. IEEE, 1792–1799.
Krystian Mikolajczyk and Cordelia Schmid. 2005. A performance evaluation of local descriptors. Pattern Analysis and Machine
Intelligence, IEEE Transactions on 27, 10 (2005), 1615–1630.
Timo Ojala, Matti Pietikainen, and Topi Maenpaa. 2002. Multiresolution gray-scale and rotation invariant texture classification
with local binary patterns. Pattern Analysis and Machine Intelligence, IEEE Transactions on 24, 7 (2002), 971–987.
F. Perronnin and C. Dance. 2007. Fisher Kernels on Visual Vocabularies for Image Categorization. In Computer Vision and
Pattern Recognition, 2007. CVPR ’07. IEEE Conference on. 1–8.
J. Philbin. 2010. Scalable Object Retrieval in Very Large Image Collections. Ph.D. Dissertation. University of Oxford.
J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. 2007. Object Retrieval with Large Vocabularies and Fast Spatial
Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
P. Piro, Richard Nock, Wafa Bel Haj Ali, Frank Nielsen, and Michel Barlaud. 2013. Boosting k-Nearest Neighbors Classification.
In Advanced Topics in Computer Vision. Springer London, 341–375.
A. Popescu and P. A. Moëllic. 2009. MonuAnno: Automatic Annotation of Georeferenced Landmarks Images. In Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR '09). ACM, New York, NY, USA, Article 11, 8 pages.
Simon J. D. Prince. 2012. Computer Vision: Models, Learning, and Inference (1st ed.). Cambridge University Press, New York,
NY, USA.
M. Radovanović, A. Nanopoulos, and M. Ivanović. 2009. Nearest neighbors in high-dimensional data: The emergence and influence of hubs. In Proceedings of the 26th International Conference on Machine Learning (ICML '09). 865–872.
Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: an efficient alternative to SIFT or SURF. In
Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2564–2571.
Hanan Samet. 2005. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA.
Josef Sivic and Andrew Zisserman. 2003. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2 (ICCV '03). IEEE Computer Society, Washington, DC, USA, 1470–1477.
Yu-Chuan Su, Tzu-Hsuan Chiu, Guan-Long Wu, Chun-Yen Yeh, Felix Wu, and Winston Hsu. 2013. Flickr-tag Prediction Using
Multi-modal Fusion and Meta Information. In Proceedings of the 21st ACM International Conference on Multimedia (MM ’13).
ACM, 353–356.
Radu Timofte, Tinne Tuytelaars, and Luc Van Gool. 2013. Naive Bayes Image Classification: Beyond Nearest Neighbors. In Computer Vision – ACCV 2012. Lecture Notes in Computer Science, Vol. 7724. Springer Berlin Heidelberg, 689–703. http://dx.doi.org/10.1007/978-3-642-37331-2_52
Pierre Tirilly, Vincent Claveau, and Patrick Gros. 2010. Distances and weighting schemes for bag of visual words image retrieval.
In Proceedings of the international conference on Multimedia information retrieval (MIR ’10). ACM, 323–332.
N. Tomašev and M. Radovanović. 2011. Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification. In Machine Learning and Data Mining in Pattern Recognition (MLDM 2011). Springer Berlin Heidelberg, 16–30.
P. Turcot and D G Lowe. 2009. Better matching with fewer features: The selection of useful features in large database recognition
problems. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on. IEEE, 2109–2116.
Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, T. Huang, and Yihong Gong. 2010. Locality-constrained Linear Coding for
image classification. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. 3360–3367.
Pavel Zezula, G. Amato, Vlastislav Dohnal, and Michal Batko. 2006. Similarity Search: The Metric Space Approach. Advances
in Database Systems, Vol. 32. Springer-Verlag. 220 pages.
H Zhang, A.C. Berg, M. Maire, and J. Malik. 2006. SVM-KNN: Discriminative Nearest Neighbor Classification for Visual
Category Recognition. In Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, Vol. 2. 2126–2136.
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum. 2009. Efficient indexing for large scale visual search.
In Computer Vision, 2009 IEEE 12th International Conference on. 1103–1110.
Ziming Zhang, Jiawei Huang, and Ze-Nian Li. 2011. Learning Sparse Features On-Line for Image Classification. In Image
Analysis and Recognition. LNCS, Vol. 6753. Springer Berlin Heidelberg, 122–131.
Yantao Zheng, Ming Zhao, Yang Song, Hartwig Adam, Ulrich Buddemeier, Alessandro Bissacco, Fernando Brucher, Tat-Seng
Chua, and Hartmut Neven. 2009. Tour the world: Building a web-scale landmark recognition engine. In CVPR. 1085–1092.
Wangmeng Zuo, David Zhang, and Kuanquan Wang. 2008. On kernel difference-weighted k-nearest neighbor classification.
Pattern Analysis and Applications 11, 3-4 (2008), 247–257.