Geometric consistency checks for kNN based
image classification relying on local features
Giuseppe Amato
ISTI-CNR
via G. Moruzzi, 1
Pisa, Italy
g.amato@isti.cnr.it
Fabrizio Falchi
ISTI-CNR
via G. Moruzzi, 1
Pisa, Italy
f.falchi@isti.cnr.it
Claudio Gennaro
ISTI-CNR
via G. Moruzzi, 1
Pisa, Italy
c.gennaro@isti.cnr.it
ABSTRACT
Applications of image content recognition, as for instance landmark recognition, can be obtained by using techniques of kNN classification based on the use of local image features, such as SIFT or SURF. The quality of image classification can be improved by defining geometric consistency check rules based on spatial transformations of the scene depicted in the images. However, this prevents the use of state-of-the-art access methods for similarity searching, and a sequential scan of the images in the training set has to be executed in order to perform classification. In this paper we propose a technique that allows one to use access methods for similarity searching, such as those exploiting metric space properties, in order to perform kNN classification with geometric consistency checks. We will see that the proposed approach, in addition to offering an obvious efficiency improvement, surprisingly also offers an improvement of the effectiveness of the classification.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
Keywords
Image indexing, image classification, recognition, landmarks,
local features
1. INTRODUCTION
An emerging challenge that has recently been attracting attention in the field of multimedia information retrieval is landmark recognition [22]. It consists of automatically recognizing the landmark (a building, a square, a statue, a monument, etc.) appearing in a non-annotated picture. Landmark recognition is particularly appealing, for instance, in applications for mobile devices, where one wants to obtain information on monuments by simply taking a picture,
or for automatic annotation of media published on social network services.

The problem of landmark recognition is typically addressed by leveraging techniques of automatic classification, as for instance kNN classification [11], applied to image local features, such as SIFT [17] and SURF [8].
In Computer Vision, an interesting problem that scientists address using local features is that of automatically locating an object in a test image containing many other objects. To this end, a transformation able to map the model image onto the test image is evaluated on the basis of candidate matches obtained by comparing the local features, using various transformation estimation algorithms.

These transformation estimation techniques can also be used with kNN classification to perform a geometric consistency check with the purpose of improving the quality of the image classification. Given an image to be classified, a kNN classifier compares it against the images of a training set, in order to identify the most similar images and consequently the correct class. Geometric consistency checks, as discussed in the rest of the paper, can be used to create image similarity functions that are more effective in deciding whether the same landmark is contained in two images. We will see, in fact, that geometric consistency checks offer a performance boost with respect to classification based solely on the presence of interest points in images.
The problem with similarity functions based on geometric consistency checks is that classification of an image has to be performed exhaustively, by comparing the image to be classified with all the images in the training set. This is due to the fact that similarity functions based on geometric consistency checks do not offer convenient properties, such as the metric properties, that access methods could exploit.

In this paper we will show that kNN classification with geometric consistency checks can be reformulated as a problem of similarity searching executed at the level of the individual local features, rather than entire images. Similarity functions between individual local features are generally metric functions and in most cases are also defined as Euclidean distances. This makes it possible to capitalize on the research and the results obtained in the field of similarity searching in metric spaces [21] to make kNN classification with geometric consistency checks efficient and scalable.

We will also see that the reformulation we propose in this paper, in addition to offering higher efficiency and scalability, surprisingly also offers an improvement of effectiveness over the exhaustive kNN classifier.
The structure of the paper is as follows. Section 2 presents some related work. Local features are introduced in Section 3, and the kNN classifier in Section 4. In Sections 5, 6, and 7 various approaches and similarity measures are presented. Sections 8 and 9 present the experimental settings and results.
2. RELATED WORK
In the last few years the problem of recognizing landmarks has received growing attention from the research community. In [18] a method for placing photos uploaded to Flickr on the world map was presented. In the proposed approach the images were represented by vectors of features of the tags, and by visual keywords derived from a vector quantization of the SIFT descriptors. In [22], Google presented its approach to building a web-scale landmark recognition engine. Most of the work reported was used to implement the Google Goggles service [1]. The approach makes use of the SIFT feature. The recognition is based on best-matching image search, while our novel approach is based on local feature classification.
In [12], various MPEG-7 descriptors have been used to build kNN classifier committees; however, local features were not considered. In [10] a survey on mobile landmark recognition for information retrieval is given. Classification methods reported as previously presented in the literature include SVM, AdaBoost, Bayesian models, HMM, and GMM. The kNN based approach, which is the main focus of this paper, is not reported in that survey.

In [9] the effectiveness of NN image classifiers has been demonstrated, and an innovative approach based on Image-to-Class distance that is similar in spirit to our approach has been proposed.
The bag of visual words model was initially proposed in [19]. In [20] the use of term weighting techniques and classical distances from text retrieval in the case of images has been explored. The experiments show that the effectiveness of a given weighting scheme or distance is strongly linked to the dataset used. In the case of large and varied image collections, the noise in descriptor assignment and the need to use larger vocabularies tend to make all distances and weights equivalent.
An alternative approach to RANSAC for geometric consistency checks based on interest point positions has been presented in [16, 15]. Basically, a proximity-based order-respecting intersection is performed after searching, in the whole set of local features, for those most similar to the ones extracted from the query.
In [5], we presented four image classification algorithms based on local features. These algorithms classify an image in two steps: first each local feature is classified considering the local features of a training set; second the whole image is classified considering the labels assigned to the individual local features and the confidence of these classifications. In this paper we will not consider this approach, because it is very difficult to define any geometric consistency check algorithm on top of it. However, a direct comparison with the results obtained in this paper is given in the Experimental Results section.
3. LOCAL FEATURES
The approach described in this paper focuses on the use
of image local features. Specifically, we performed our tests
using the SIFT [17] and SURF [8] local features. In this
section, we briefly describe both of them.
The Scale Invariant Feature Transformation (SIFT) [17] is a representation of low-level image content that is based on a transformation of the image data into scale-invariant coordinates relative to local features. Local features are low-level descriptions of keypoints in an image. Keypoints are interest points in an image that are invariant to scale and orientation. They are selected by choosing the most stable points from a set of candidate locations. Each keypoint in an image is associated with one or more orientations, based on local image gradients. Image matching is performed by comparing the descriptions of the keypoints in the images. For both detecting keypoints and extracting the SIFT features we used the publicly available software developed by David Lowe [3].
The basic idea of Speeded Up Robust Features (SURF) [8] is quite similar to SIFT. SURF detects keypoints in an image and describes them using orientation information. However, SURF uses a new method for both the detection of keypoints and their description that is much faster, while still guaranteeing performance comparable to or even better than SIFT. Specifically, keypoint detection relies on a technique based on an approximation of the Hessian matrix. The descriptor of a keypoint is built considering the distribution of Haar-wavelet responses around the keypoint itself. For both detecting interest points and extracting the SURF features, we used the publicly available noncommercial software developed by the authors [4].

For both SIFT and SURF the Euclidean distance is typically used as a measure of dissimilarity between two features [17, 8].
3.1 Local Features Matching
A useful concept that is often used when dealing with local features is that of local feature matching. In [17], a distance-ratio matching scheme was proposed that has also been adopted in [8] and many other works.
Let us consider a local feature $f_i$ belonging to an image $d_i$ (i.e., $f_i \in d_i$) and an image $d_j$. First, the feature $f_j \in d_j$ that best matches $f_i$, based on a distance $\delta$, is referred to as the first nearest neighbor (in the remainder $NN_1(f_i, d_j)$) and is selected as candidate match. Then, the distance ratio $\sigma(f_i, d_j) \in [0, 1]$ between the second-closest and closest neighbors of $f_i$ in $d_j$ is considered. The distance ratio is defined as:

$$\sigma(f_i, d_j) = \frac{\delta(f_i, NN_1(f_i, d_j))}{\delta(f_i, NN_2(f_i, d_j))} \qquad (1)$$

Finally, $f_i$ and $NN_1(f_i, d_j)$ are considered matching if the distance ratio $\sigma(f_i, d_j)$ is smaller than a given threshold. Thus, the set of candidate local feature matches between images $d_i$ and $d_j$ is:

$$C^D_{d_i,d_j} = \{(f_i, f_j) \mid f_i \in d_i,\; f_j \in d_j,\; \sigma(f_i, d_j) < c\} \qquad (2)$$
In [17], $c = 0.8$ was proposed, reporting that this threshold allows eliminating 90% of the false matches while discarding less than 5% of the correct matches. In [5] an experimental evaluation of classification effectiveness varying $c$, which confirms the results obtained by Lowe, is reported. In the following we will use $c = 0.8$ for both SURF and SIFT. Please note that this parameter will be used in defining the image-to-image similarity measures of Section 5, while it is not necessary for the similarity search approach presented in Section 6.
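To make the matching scheme concrete, the sketch below implements the distance-ratio test of Equations 1 and 2 in Python/NumPy; the function name, the array layout, and the use of brute-force nearest-neighbor search are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

def distance_ratio_matches(feats_i, feats_j, c=0.8):
    """Candidate matches C^D between two images (Equation 2).

    feats_i, feats_j: (n, d) arrays of local feature descriptors
    (e.g. 128-d SIFT vectors), one row per keypoint.
    Returns a list of (index_in_i, index_in_j) candidate matches.
    """
    matches = []
    for a, f in enumerate(feats_i):
        # Euclidean distances from f to every feature of the other image.
        dists = np.linalg.norm(feats_j - f, axis=1)
        nn1, nn2 = np.argsort(dists)[:2]
        # Distance ratio sigma(f_i, d_j) of Equation 1.
        sigma = dists[nn1] / dists[nn2]
        if sigma < c:
            matches.append((a, nn1))
    return matches
```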
Figure 1: Results of searching for a match of a portion of Image 2 on Image 1 using various types of transformations (Rotation, Scale and Translation; Affine; Homography) found relying on local features and RANSAC.
3.1.1 Geometric Consistency Checks
Each local feature is extracted considering a point of interest in the image and a region around it. The coordinates of the point of interest are associated with the description, together with the scale and orientation of the region in the image. In fact, the description of the region itself is scale and orientation invariant, because it has been defined for finding similar regions despite changes in scale and/or orientation.
The coordinates of the interest point and the scale and orientation information related to the region can be used to perform consistency checks on the candidate matches. Moreover, this information can be used for estimating the transformation able to map one image on top of the other (e.g., for image stitching).

The algorithms typically used to estimate such a transformation are Random Sample Consensus (RANSAC) [13] and Least Median of Squares. However, fitting methods such as RANSAC or Least Median of Squares perform poorly when the percentage of correct matches falls much below 50%. Fortunately, much better performance can be obtained by clustering features in scale and orientation space using the Hough transform.
The Hough transform is used to cluster matches in groups that agree upon a particular model pose. It identifies clusters of features by using each feature to vote for all object poses that are consistent with it. When clusters of features are found that vote for the same pose of an object, the probability of the interpretation being correct is much higher than for any single feature. In our experiments, we create a Hough transform entry predicting the model orientation and scale from each match hypothesis. A pseudo-random hash function is used to insert votes into a one-dimensional hash table in which collisions are easily detected. The Hough transform is typically used for increasing the percentage of inliers before estimating a transformation (typically using RANSAC). However, the number of matches in the largest cluster can also be considered as an estimate of the number of actual matches.
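The following sketch illustrates this pose-voting step, assuming each candidate match carries the orientation and scale of the matched query and training keypoints; for brevity it bins only orientation and scale (the location bins mentioned in Section 5 are omitted), and a Python dictionary plays the role of the hash table.

```python
import math
from collections import defaultdict

def hough_cluster(matches, ori_bin=math.radians(30), scale_bin=1.0):
    """Cluster candidate matches by the pose they vote for.

    matches: list of ((ori_q, scale_q), (ori_t, scale_t)) pairs, i.e. the
    orientation and scale of the query and training keypoints of each match.
    Returns the largest cluster of matches (the most voted pose).
    """
    bins = defaultdict(list)
    for m in matches:
        (ori_q, scale_q), (ori_t, scale_t) = m
        d_ori = (ori_t - ori_q) % (2 * math.pi)   # rotation predicted by the match
        d_scale = math.log2(scale_t / scale_q)     # log of the predicted scale ratio
        # Each match votes for its pose bin (location bins omitted here).
        key = (int(d_ori // ori_bin), int(d_scale // scale_bin))
        bins[key].append(m)
    # The size of the largest cluster estimates the number of true matches.
    return max(bins.values(), key=len)
```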
Considering the clusters of matches created by the Hough transform, it is possible to estimate a transformation able to map the points of one image onto the other. Estimating a transformation using RANSAC is a process of: randomly selecting the number of matches required by the given transformation estimation; evaluating the transformation itself; and selecting the matches that are consistent with it.
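A generic version of this loop is sketched below; the estimation and projection functions are passed in as parameters, the iteration count is an illustrative assumption, and the inlier threshold follows the value $e = 0.1$ used in Section 5.

```python
import random
import numpy as np

def ransac(matches, estimate, project, n_samples, e=0.1, iters=1000):
    """Generic RANSAC loop over candidate matches.

    matches:   list of (p, p1) matching point pairs in normalized coordinates
    estimate:  function fitting a transformation from n_samples pairs
    project:   function applying the transformation to a point
    e:         inlier threshold on the distance between the expected
               (given the transformation) and the actual match
    Returns the transformation with the largest consensus set of matches.
    """
    best_T, best_inliers = None, []
    for _ in range(iters):
        sample = random.sample(matches, n_samples)
        T = estimate(sample)
        if T is None:                    # degenerate sample, try again
            continue
        inliers = [(p, p1) for (p, p1) in matches
                   if np.linalg.norm(project(T, p) - np.asarray(p1)) < e]
        if len(inliers) > len(best_inliers):
            best_T, best_inliers = T, inliers
    return best_T, best_inliers
```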
In the following we report the most common types of transformations that can be searched for. In Figure 1 we report the results of transformation estimation for the various types of transformations on a pair of photos of the cathedral of St. Mary in Pisa.
A Rotation, Scale and Translation (RST) transformation can be formalized as follows:

$$\begin{pmatrix} p'_x \\ p'_y \end{pmatrix} = \begin{pmatrix} s\cos(\sigma) & -s\sin(\sigma) \\ s\sin(\sigma) & s\cos(\sigma) \end{pmatrix} \begin{pmatrix} p_x \\ p_y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix} \qquad (3)$$

where $\sigma$ is the angle of the counterclockwise rotation, $s$ is the scaling and $\vec{t}$ is the translation. Estimating this transformation requires two couples of matching points ($\vec{p}$ and $\vec{p}\,'$).
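To see why two couples suffice, note that with $a = s\cos(\sigma)$ and $b = s\sin(\sigma)$ the RST unknowns $(a, b, t_x, t_y)$ are linear in the point coordinates; the following illustrative solver (our own sketch, not code from the paper) sets up the resulting $4 \times 4$ system.

```python
import numpy as np

def estimate_rst(pairs):
    """Estimate an RST transformation from two matching point pairs.

    With a = s*cos(sigma) and b = s*sin(sigma), Equation 3 becomes linear:
        p'_x = a*p_x - b*p_y + t_x
        p'_y = b*p_x + a*p_y + t_y
    pairs: [((px, py), (qx, qy)), ...] with exactly two matching pairs.
    Returns (a, b, tx, ty), or None if the system is degenerate.
    """
    A, rhs = [], []
    for (px, py), (qx, qy) in pairs:
        A.append([px, -py, 1, 0]); rhs.append(qx)
        A.append([py,  px, 0, 1]); rhs.append(qy)
    try:
        return tuple(np.linalg.solve(np.array(A, float), np.array(rhs, float)))
    except np.linalg.LinAlgError:
        return None
```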
An Affine transformation is a linear transformation (rotation, scaling and shear) followed by a translation:

$$\begin{pmatrix} p'_x \\ p'_y \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} p_x \\ p_y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix} \qquad (4)$$
Please note that an RST transformation is a special type of general affine transformation. An affine transformation also allows shearing, which leaves all points on one axis fixed, while other points are shifted parallel to that axis by a distance proportional to their perpendicular distance from it. Estimating this transformation requires three couples of matching points.
A Homography is an invertible projective transformation from the real projective plane to the projective plane that maps straight lines to straight lines. Any two images of the same planar surface in space are related by a homography:

$$\begin{pmatrix} w p'_x \\ w p'_y \\ w \end{pmatrix} = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix} \begin{pmatrix} p_x \\ p_y \\ 1 \end{pmatrix} \qquad (5)$$

where $w$ is a scale parameter. Please note that an affine transformation is a special type of general homography whose last row is fixed to $h_{31} = 0$, $h_{32} = 0$, $h_{33} = 1$. Estimating this transformation requires four couples of matching points.
3.1.2 Isotropic scaling
Typically, the coordinates of the points reported by local feature extraction software refer to the pixels of the original image. However, a normalization is not only useful but can improve the effectiveness of transformation estimation. The most used normalization is the isotropic scaling [14], in which the set of points belonging to an image is translated so as to bring the centroid of the set to the origin, and the coordinates are also scaled so that on average a point lies at a distance $\sqrt{2}$ from the origin.
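A minimal sketch of this normalization, assuming the points of an image are given as an $(n, 2)$ NumPy array:

```python
import numpy as np

def isotropic_scaling(points):
    """Translate points so their centroid is at the origin and scale them
    so that the average distance from the origin is sqrt(2) [14]."""
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    mean_dist = np.linalg.norm(centered, axis=1).mean()
    return centered * (np.sqrt(2) / mean_dist)
```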
4. KNN CLASSIFIER
Given a set of documents $D$ and a predefined set of classes (also known as labels, or categories) $C = \{c_1, \ldots, c_m\}$, single-label document classification (SLC) [11] is the task of automatically approximating, or estimating, an unknown target function $\Phi : D \to C$, that describes how documents ought to be classified, by means of a function $\hat{\Phi} : D \to C$, called the classifier, such that $\hat{\Phi}$ is an approximation of $\Phi$.
A popular SLC technique is the single-label distance-weighted kNN. Given a training set $Tr$ containing various examples for each class $c$, it assigns a label to a document in two steps. Given a document $d_i$ (an image, for example) to be classified, it first executes a kNN search among the objects of the training set. The result of this operation is a list $kNN(d_i)$ of labeled documents $d_j$ belonging to the training set, ordered with respect to decreasing values of the similarity $s(d_i, d_j)$ between $d_i$ and $d_j$. The label assigned to the document $d_i$ by the classifier is the class $c_y \in C$ that maximizes the sum of the similarities between $d_i$ and the documents labeled $c_y$ in the kNN result list $kNN(d_i)$.

Therefore, first a score $z(d_i, c_j)$ is computed for each label $c_j \in C$:

$$z(d_i, c_j) = \sum_{d_y \in kNN(d_i)\,:\,\Phi(d_y) = c_j} s(d_i, d_y).$$

Then, the class that obtains the maximum score is chosen:

$$\hat{\Phi}_s(d_i) = \arg\max_{c_j \in C} z(d_i, c_j).$$
It is also convenient to express a degree of confidence in the answer of the classifier. For the single-label distance-weighted kNN classifier described here we define the confidence as 1 minus the ratio between the score obtained by the second-best label and that obtained by the best label, i.e.,

$$\nu_{doc}(\hat{\Phi}_s, d_i) = 1 - \frac{\max_{c_j \in C \setminus \{\hat{\Phi}_s(d_i)\}} z(d_i, c_j)}{\max_{c_j \in C} z(d_i, c_j)}.$$

This classification confidence can be used to decide whether or not the predicted label has a high probability of being correct.
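As a minimal sketch of these two formulas, the following fragment computes the scores $z(d_i, c_j)$, the predicted label, and the confidence $\nu_{doc}$; the representation of the kNN result list as (label, similarity) pairs is our own illustrative assumption.

```python
from collections import defaultdict

def knn_classify(neighbors):
    """Single-label distance-weighted kNN classification.

    neighbors: list of (label, similarity) pairs for the k nearest
    training documents of d_i, i.e. the result list kNN(d_i).
    Returns (predicted label, confidence nu_doc).
    """
    z = defaultdict(float)
    for label, sim in neighbors:
        z[label] += sim                      # score z(d_i, c_j)
    ranked = sorted(z.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0
    confidence = 1.0 - second_score / best_score
    return best_label, confidence
```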
5. IMAGE TO IMAGE COMPARISON
In order for the kNN search step to be executed, a similarity function between images has to be defined. Global features are generally defined along with a similarity (or a distance) function, and the similarity between images is therefore computed as the similarity between the corresponding global features. On the other hand, a single image has several local features. Therefore, computing the similarity between two images requires somehow combining the similarities between their numerous local features.
Local features have been used in Computer Vision to identify the same points in two distinct photos of the same object, even when the point of view changes. Thus, the similarity measure between two images can be easily defined as the percentage of local features in one image that have a match in the other one. Given a set of candidate matches $C_{d_i,d_j}$ between two images, we define the similarity as:

$$s(d_i, d_j) = \frac{|C_{d_i,d_j}|}{|d_i|} \qquad (6)$$

In the following we define five matching criteria ($C^D$, $C^H$, $C^R$, $C^A$, $C^{Hom}$) that, used in conjunction with Equation 6, result in five similarity measures.
Distance ratio matches – $C^D$
This set, defined in Equation 2, is used in most of the literature as the first candidate set of matches, evaluated on the basis of the local feature similarities.
Hough transform matches – $C^H$
As mentioned in Section 3.1.1, a Hough transform is often used to search for keys that agree upon a particular model pose. We define $C^H$ as the subset of matches in $C^D$ related to the most voted pose in terms of orientation and scale. For the experiments, we used the same parameters proposed in [17], i.e., a bin size of 30 degrees for orientation, a factor of 2 for scale, and 0.25 times the maximum model dimension for location.
As described in Section 3.1.1, various transformations can be estimated using RANSAC on the clusters of matches identified by the Hough transform. Once a transformation has been estimated, the matches that are not consistent with it are rejected. Typically, a threshold $e$ on the distance between the expected match (given the transformation) and the actual match is used to tell inliers from outliers. Given the normalized coordinate space mentioned in Section 3.1.2, we set $e = 0.1$. Among all the estimated transformations, the one having the greatest number of consistent matches is retained.
RST Transform Matches – $C^R$
are the matches in $C^D$ that are consistent with the estimated RST transformation (Equation 3).

Affine Transform Matches – $C^A$
are the matches in $C^D$ that are consistent with the estimated Affine transformation (Equation 4).

Homography Transform Matches – $C^{Hom}$
are the matches in $C^D$ that are consistent with the estimated Homography transformation (Equation 5).
6. A SIMILARITY SEARCH APPROACH
The similarity measures defined in Section 5, which are a direct application of the techniques developed by the Computer Vision community, require the direct comparison of each pair of images. In fact, these image distances are not metric, not even symmetric, and the complexity of their evaluation does not allow any sort of indexing. Thus, given a query, searching for the $k$ nearest images to the query image requires a complete sequential scan of the archive. The first step of all these distances, when comparing two images, is selecting the candidate matches by searching, for each local feature in one image, for its 2NN in the other one. The candidate matches are then pruned considering the distance ratio defined in Equation 1.
In this section we propose to identify the candidate matches by searching for the $\bar{k}NN$ among all the local features of all the images in the dataset ($D$). Please note that the $\bar{k}$ used for this NN search is different from the $k$ eventually used for the whole-image kNN search. At the end of this process we have, for each local feature $f_q$ in the query image $d_q$, a list of candidate matches $\bar{k}NN(f_q, D)$. Please note that the local features in $\bar{k}NN(f_q, D)$ can belong to distinct images and that the same $f_q$ could have more than one match in the same image. However, having only the best match for each couple of $f_q$ and image $d_i$ is preferable. Thus, we can define the candidate matches between the query image $d_q$ and any $d_i \in D$ as:

$$\bar{C}^D_{d_q,d_i} = \{(f_q, f_i) \mid f_q \in d_q,\; f_i \in \bar{k}NN(f_q, D) \cap d_i,\; \delta(f_q, f_i) \le \delta(f_q, f_j)\;\; \forall f_j \in \bar{k}NN(f_q, D) \cap d_i\} \qquad (7)$$

Please note that $\bar{C}^D$ is equivalent, in this scenario, to the $C^D$ of Equation 2. Thus, starting from $\bar{C}^D$, it is possible to redefine the five matching criteria of Section 5 that, in conjunction with Equation 6, result in five new similarity measures.
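A minimal sketch of how $\bar{C}^D$ can be built from the $\bar{k}NN$ result lists follows; the representation of the result lists as (image id, feature id, distance) triples is our own assumption.

```python
def candidate_matches(knn_lists):
    """Build the candidate match sets of Equation 7 for every image d_i.

    knn_lists: dict mapping each query feature id f_q to its kNN result
    list [(image_id, feature_id, distance), ...] over the whole dataset D.
    Keeps, for each (f_q, d_i) pair, only the closest matching feature.
    Returns {image_id: [(f_q, f_i), ...]}.
    """
    best = {}                                 # (f_q, image_id) -> (dist, f_i)
    for fq, results in knn_lists.items():
        for image_id, fi, dist in results:
            key = (fq, image_id)
            if key not in best or dist < best[key][0]:
                best[key] = (dist, fi)
    by_image = {}
    for (fq, image_id), (dist, fi) in best.items():
        by_image.setdefault(image_id, []).append((fq, fi))
    return by_image
```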
7. THE BAG OF FEATURES APPROACH
In the last few years, several object and image retrieval systems that directly take a text-based approach to the problem of local feature matching have been proposed. Starting from [19], the "visual word" paradigm was introduced, which is based on assigning each local feature to a visual word of a predefined vocabulary. At search time, two local features assigned to the same visual word are considered as matching. The first step to describe images using visual words is to select the visual words, creating a vocabulary. The visual vocabulary is typically built by grouping local descriptors of the dataset using a clustering algorithm such as k-means. The second step is to describe each image using the words of the vocabulary that occur in it.

At the end of the process, each image is described as a set of visual words. Thus, standard text retrieval approaches can be used. In particular, the cosine similarity and TF-IDF approaches have been used (e.g., [20]). Using these similarity functions, traditional inverted files can be used for efficiently searching for nearest neighbor images.
The bag of features approach can also be used to define a set of candidate matches to be used as a basis for the geometric consistency checks described in Section 3.1.1:

$$\dot{C}^D_{d_i,d_j} = \{(f_i, f_j) \mid f_i \in d_i,\; f_j \in d_j,\; bag(f_i) = bag(f_j)\} \qquad (8)$$

$\dot{C}^D$ is equivalent, in this scenario, to the $C^D$ of Equation 2 and the $\bar{C}^D$ of Equation 7. Thus, starting from $\dot{C}^D$, it is possible to redefine the five matching criteria of Section 5 that, in conjunction with Equation 6, result in five new similarity measures. Please note that in this case it is not possible to avoid multiple matches of the same $f_i$.
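The sketch below illustrates this pipeline under the above definitions, using scikit-learn's k-means to build the vocabulary; the vocabulary size and names are illustrative, and the brute-force match enumeration is only meant to mirror Equation 8, not an inverted-file implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, n_words=1000):
    """Cluster local descriptors into a visual-word vocabulary."""
    return KMeans(n_clusters=n_words).fit(descriptors)

def bag(vocabulary, descriptors):
    """Assign each descriptor to its visual word."""
    return vocabulary.predict(descriptors)

def bof_matches(words_i, words_j):
    """Candidate matches of Equation 8: features sharing a visual word.
    Multiple matches of the same feature cannot be avoided here."""
    return [(a, b) for a, wa in enumerate(words_i)
                   for b, wb in enumerate(words_j) if wa == wb]
```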
8. EXPERIMENTAL SETTINGS
8.1 The Dataset
The dataset that we used for our tests is publicly available and composed of 1,227 photos of 12 landmarks located in Pisa; it was also used in [7, 5, 6]. The photos have been crawled from Flickr, the well-known online photo service. The IDs of the photos used for these experiments, together with the assigned labels and the extracted features, can be downloaded from [2].

In order to build and evaluate a classifier for these classes, we divided the dataset into a training set ($Tr$) consisting of 226 photos (approximately 20% of the dataset) and a test set ($Te$) consisting of 921 photos (approximately 80% of the dataset).
The image resolution used for feature extraction is the standard resolution used by Flickr, i.e., the maximum between width and height is equal to 500 pixels. In other words, the uploaded photos were all originally bigger than 500 pixels on the maximum side and they have all been resized to 500.

The total number of local features extracted by the SIFT and SURF detectors was about 1,000,000 and 500,000, respectively. The number of local features per image varies between 113 and 2,816 for SIFT and between 50 and 904 for SURF.
8.2 Performance Measures
For evaluating the effectiveness of the classifiers in classifying the documents of the test set we use the micro-averaged accuracy and the micro- and macro-averaged precision, recall and $F_1$.

Micro-averaged values are calculated by constructing a global contingency table and then calculating the measures from these global sums. In contrast, macro-averaged scores are calculated by first calculating each measure for each category and then taking the average of these. In most cases we report the micro-averaged values of each measure.

Precision is defined as the ratio between the correctly predicted documents and the overall predicted documents for a specific class. Recall is the ratio between the correctly predicted documents and the overall actual documents of a specific class. $F_1$ is the harmonic mean of precision and recall.

Note that for the single-label classification task, micro-averaged accuracy is defined as the number of documents correctly classified divided by the total number of documents in the test set, and it is equivalent to the micro-averaged precision, recall and $F_1$ scores.
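These measures can be computed, for instance, with scikit-learn; the following sketch assumes the true and predicted labels of the test documents are available as lists.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Micro-averaged accuracy plus micro- and macro-averaged P, R and F1."""
    accuracy = accuracy_score(y_true, y_pred)
    micro = precision_recall_fscore_support(y_true, y_pred, average='micro')
    macro = precision_recall_fscore_support(y_true, y_pred, average='macro')
    # For single-label classification, accuracy equals the micro-averaged
    # precision, recall and F1 scores.
    return accuracy, micro[:3], macro[:3]
```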
                          Image-to-image comparison               Similarity search approach
                       d-ratio  Hough   RST    Affine  Hom.    d-ratio  Hough   RST    Affine  Hom.
k=1   Accuracy   SIFT   0.877   0.912   0.931  0.939   0.922    0.880   0.948   0.941  0.943   0.931
                 SURF   0.807   0.870   0.907  0.920   0.905    0.859   0.909   0.928  0.935   0.893
      F1 macro   SIFT   0.864   0.899   0.924  0.935   0.843    0.876   0.946   0.938  0.868   0.865
                 SURF   0.788   0.845   0.902  0.917   0.835    0.842   0.903   0.924  0.856   0.836
k=10  Accuracy   SIFT   0.864   0.901   0.936  0.929   0.923    0.875   0.939   0.940  0.946   0.933
                 SURF   0.841   0.871   0.899  0.911   0.908    0.850   0.903   0.928  0.928   0.898
      F1 macro   SIFT   0.843   0.879   0.929  0.923   0.837    0.854   0.932   0.937  0.868   0.866
                 SURF   0.818   0.849   0.888  0.899   0.841    0.823   0.892   0.922  0.846   0.839
Best  Accuracy   SIFT   0.877   0.916   0.936  0.941   0.923    0.877   0.948   0.943  0.947   0.936
                 SURF   0.851   0.887   0.910  0.920   0.912    0.861   0.905   0.928  0.936   0.898
      F1 macro   SIFT   0.864   0.904   0.929  0.937   0.843    0.868   0.946   0.939  0.868   0.868
                 SURF   0.828   0.867   0.904  0.917   0.845    0.844   0.897   0.924  0.858   0.840

Figure 2: Image similarity based classification results using the image-to-image comparison and similarity search approaches for $\bar{k} = 10$. Each group of columns reports the distance-ratio matches and the Hough, RST, Affine and Homography geometric consistency checks; "Best" rows report the best result for $k \in [1, 100]$.
9. EXPERIMENTAL RESULTS
In Figure 2 we report the results obtained by both the image-to-image comparison and similarity search approaches. Accuracy and macro-averaged $F_1$ are reported for both SIFT and SURF. Given that kNN requires a parameter $k$, we report the results obtained for $k = 1, 10$ and the best results obtained for $k \in [1, 100]$.
Comparing the results obtained by the various similarity functions for the image-to-image comparison approach, we can see that geometric consistency checks are able to significantly improve the quality of the classification process. The best results are obtained by searching for an affine transformation. This is consistent with the fact that both SIFT and SURF are affine invariant. It is worth noting that the benefits of consistency checks are more relevant for SURF, even if its overall performance remains below the one obtained using SIFT. Regarding the local features used and the computational cost, note that the number of local features detected by the SIFT extractor is twice the number detected by SURF. Thus, on the one hand SIFT has better performance, while on the other hand SURF is more efficient.
In Figure 2 we also report the results obtained by the similarity search approach using $\bar{k} = 10$, i.e., performing a 10NN search for each local feature of the query over the local features in the training set. In our experiments we also tested $\bar{k} = 30, 50, 100$, obtaining comparable but worse results. Surprisingly, the similarity search approach often performs better than the image-to-image comparison. The intuition is that the $\bar{k}NN$ search performed over all the local features in the training set is able to reduce the number of false matches. Please note that this approach is also more efficient, because the local features compared using the Euclidean distance are indexable, while the whole images are not.
In the case of the similarity search approach, the choice of the geometric consistency check is more problematic. In particular, both Homography and Affine reveal a big loss in $F_1$, while Hough performs significantly better because the first-step matches are less noisy. However, the overall best is RST. The intuition is that we have fewer first-step matches but also less noise, resulting in better results with a geometric consistency check that only requires two matches for the transformation estimation.
In Figure 3 we report the results obtained by the bag of features approach, described in Section 7, using a vocabulary of 100k features selected using the k-means algorithm. As known from the literature, typically the more words, the better the results. In our experiments we are dealing with a dataset of about 1 million features. Thus, 100k visual words is the highest value for which it makes sense to perform a clustering algorithm. The results are worse than the ones obtained before. Moreover, the geometric consistency checks do not bring significant gains in performance, especially considering $F_1$. The intuition is that the candidate matches found using the bag of features approach are too noisy. The standard cosine and TF-IDF similarity measures are more suitable for this scenario. It is worth noting that the k-means algorithm for selecting the 100k words was performed over the whole dataset, while it would have been more correct to consider only the training images. In fact, the test images should not be used during any training phase. However, we preferred to compare our approach
                         cosine  TF-IDF  Hough   RST    Affine  Hom.
k=1     Accuracy  SIFT    0.863   0.875   0.877  0.878   0.812   0.882
                  SURF    0.853   0.845   0.849  0.851   0.820   0.857
        F1 macro  SIFT    0.879   0.869   0.869  0.870   0.805   0.800
                  SURF    0.839   0.829   0.750  0.750   0.725   0.787
k=10    Accuracy  SIFT    0.856   0.888   0.866  0.868   0.817   0.887
                  SURF    0.851   0.862   0.872  0.868   0.857   0.855
        F1 macro  SIFT    0.845   0.849   0.852  0.854   0.804   0.804
                  SURF    0.835   0.848   0.778  0.769   0.754   0.785
Best    Accuracy  SIFT    0.885   0.896   0.878  0.882   0.832   0.891
                  SURF    0.858   0.863   0.876  0.878   0.828   0.859
        F1 macro  SIFT    0.871   0.873   0.869  0.873   0.821   0.811
                  SURF    0.843   0.849   0.779  0.783   0.758   0.789
Best k  Accuracy  SIFT    7       8       2      4       4       6
                  SURF    3       9       15     7       13      2
        F1 macro  SIFT    3       3       1      2       4       6
                  SURF    7       9       15     7       13      2

Figure 3: Classification results using the bag of features approach with a vocabulary of 100k features; the Hough, RST, Affine and Homography columns report the geometric consistency checks applied to the bag-of-features matches.
in this scenario, even if the bag of features performance is actually overestimated.
In [5], we presented four image classification algorithms based on local features that classify an image in two steps: first each local feature is classified considering the local features of a training set; second the whole image is classified considering the labels assigned to the individual local features and the confidence of these classifications. The results obtained there are very similar to the best results obtained here. In particular, the Weighted LF Distance Ratio Classifier, which is the best performing algorithm in [5], obtained 0.928 in accuracy and 0.922 in $F_1$ using SURF, which are slightly worse than the values obtained by the similarity search approach proposed in this paper considering the RST consistency check. Regarding SIFT, the results obtained by the Weighted LF Distance Ratio Classifier were 0.952 in accuracy and 0.947 in $F_1$, which are slightly better than the measures obtained by the similarity search approach with RST. It is worth noting that, even if the results are very similar, the geometric consistency check also results in a transformation estimation that could be necessary in some scenarios as, for instance, in augmented reality. It would be interesting to add geometric consistency checks to the algorithms proposed in [5], but their local feature classification approach makes the definition of a geometric consistency check very difficult.
10. CONCLUSIONS
In this paper we have presented a technique that allows performing kNN classification of images while also performing geometric consistency checks on the scenes appearing in the images. The proposed approach allows executing classification efficiently, relying on the use of access methods for similarity searching, such as those exploiting metric space properties. We have performed an extensive experimentation of the proposed approach and we have shown that it offers higher effectiveness than basic kNN classification that simply uses the percentage of matches. In the tests we have compared various solutions for geometric consistency checks, both by exhaustively scanning all images in the training set and by using our method relying on similarity searching. From these tests we have also, surprisingly, observed that the proposed approach, based on similarity search, offers better effectiveness than the exhaustive and non-scalable approach.
11. ACKNOWLEDGMENTS
This work was partially supported by the VISITO Tuscany project, funded by Regione Toscana in the POR FESR 2007-2013 program, action line 1.1.d, and by the MOTUS project, funded by the Industria 2015 program.
12. REFERENCES
[1] Google Goggles. http://www.google.com/mobile/goggles/. Last accessed on 30 March 2010.
[2] Pisa landmarks dataset. http://www.fabriziofalchi.it/pisaDataset/. Last accessed on 3 March 2011.
[3] SIFT keypoint detector. http://people.cs.ubc.ca/~lowe/. Last accessed on 3 March 2011.
[4] SURF detector. http://www.vision.ee.ethz.ch/~surf/. Last accessed on 3 March 2011.
[5] G. Amato and F. Falchi. kNN based image
classification relying on local feature similarity. In
SISAP ’10: Proceedings of the Third International
Conference on SImilarity Search and APplications,
pages 101–108, New York, NY, USA, 2010. ACM.
[6] G. Amato and F. Falchi. Local feature based image similarity functions for kNN classification. In Proceedings of the 3rd International Conference on Agents and Artificial Intelligence (ICAART 2011), pages 157–166. SciTePress, 2011. Vol. 1.
[7] G. Amato, F. Falchi, and P. Bolettieri. Recognizing landmarks using automated classification techniques: an evaluation of various visual features. In Proceedings of the Second International Conference on Advances in Multimedia (MMEDIA 2010), pages 78–83. IEEE Computer Society, 2010.
[8] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In ECCV, pages 404–417, 2006.
[9] O. Boiman, E. Shechtman, and M. Irani. In defense of
nearest-neighbor based image classification. In CVPR.
IEEE Computer Society, 2008.
[10] T. Chen, K. Wu, K.-H. Yap, Z. Li, and F. S. Tsai. A
survey on mobile landmark recognition for information
retrieval. In MDM ’09, pages 625–630. IEEE
Computer Society, 2009.
[11] S. Dudani. The distance-weighted k-nearest-neighbour rule. IEEE Transactions on Systems, Man and Cybernetics, SMC-6(4):325–327, 1976.
[12] T. Fagni, F. Falchi, and F. Sebastiani. Image
classification via adaptive ensembles of
descriptor-specific classifiers. Pattern Recognition and
Image Analysis, 20:21–28, 2010.
[13] M. A. Fischler and R. C. Bolles. Random sample
consensus: A paradigm for model fitting with
applications to image analysis and automated
cartography. Commun. ACM, 24(6):381–395, 1981.
[14] R. I. Hartley. In defence of the 8-point algorithm. In Proceedings of the Fifth International Conference on Computer Vision, ICCV '95, pages 1064–1070, Washington, DC, USA, 1995. IEEE Computer Society.
[15] T. Homola, V. Dohnal, and P. Zezula. Proximity-based order-respecting intersection for searching in image databases. In Proceedings of the 8th International Workshop on Adaptive Multimedia Retrieval (AMR 2010), 2010.
[16] T. Homola, V. Dohnal, and P. Zezula. Sub-image
searching through intersection of local descriptors. In
Proceedings of the Third International Conference on
SImilarity Search and APplications, SISAP ’10, pages
127–128, New York, NY, USA, 2010. ACM.
[17] D. G. Lowe. Distinctive image features from
scale-invariant keypoints. International Journal of
Computer Vision, 60(2):91–110, 2004.
[18] P. Serdyukov, V. Murdock, and R. van Zwol. Placing
flickr photos on a map. In SIGIR ’09: Proceedings of
the 32nd international ACM SIGIR conference on
Research and development in information retrieval,
pages 484–491, New York, NY, USA, 2009. ACM.
[19] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2, ICCV '03, pages 1470–1477, Washington, DC, USA, 2003. IEEE Computer Society.
[20] P. Tirilly, V. Claveau, and P. Gros. Distances and
weighting schemes for bag of visual words image
retrieval. In Proceedings of the international
conference on Multimedia information retrieval, MIR
’10, pages 323–332, New York, NY, USA, 2010. ACM.
[21] P. Zezula, G. Amato, V. Dohnal, and M. Batko.
Similarity Search: The Metric Space Approach,
volume 32 of Advances in Database Systems.
Springer-Verlag, 2006.
[22] Y. Zheng, M. Zhao, Y. Song, H. Adam, U. Buddemeier, A. Bissacco, F. Brucher, T.-S. Chua, and H. Neven. Tour the world: Building a web-scale landmark recognition engine. In CVPR, pages 1085–1092. IEEE, 2009.