Multi-cue Discriminative Place Recognition*
Li Xing and Andrzej Pronobis
Centre for Autonomous Systems, The Royal Institute of Technology
SE-100 44 Stockholm, Sweden
Abstract. In this paper we report on our successful participation in the RobotVision challenge in the ImageCLEF 2009 campaign. We present a place recognition system that employs four different discriminative models trained on different global and local visual cues. In order to provide robust recognition, the outputs generated by the models are combined using a discriminative accumulation method. Moreover, the system is able to provide an indication of the confidence of its decision. We analyse the properties and performance of the system on the training and validation data and report the final score obtained on the test run which ranked first in the obligatory track of the RobotVision task.
1 Introduction

This paper presents the place recognition algorithm based on multiple visual cues that was applied to the RobotVision task of the ImageCLEF 2009 campaign. The task addressed the problem of visual indoor place recognition applied to robot topological localization. Participants were given training, validation and test sequences capturing the appearance of an office environment under various conditions. The task was to build a system able to answer the question "where are you?" (I am in the kitchen, in the corridor, etc.) when presented with a test sequence imaging rooms seen during training, or additional rooms that were not imaged in the training sequence. The results could be submitted for two separate tracks: (a) obligatory, in which each single image had to be classified independently; (b) optional, where the temporal continuity of the sequences could be exploited to improve the robustness of the system. For more information about the task and the dataset used for the challenge, we refer the reader to the RobotVision@ImageCLEF'09 overview paper.
The visual place recognition system presented in this paper obtained the highest score in the obligatory track and constituted a basis for our approach used in the optional track. The system relies on four discriminative models trained on different visual cues that capture both the global and local appearance of a scene. In order to increase the robustness of the system, the cues are integrated efficiently using a high-level accumulation scheme that operates on the separate models adapted to the properties of each cue. Additionally, in the optional track, we used a simple temporal accumulation technique which exploits the continuity of the image sequences to refine the results. Since misclassifications were penalized in the competition, we experimented with an ignorance detection technique relying on the estimated confidence of the decision.

* This work was supported by the EU FP7 integrated project ICT-215181-CogX. The support is gratefully acknowledged.
Visual place recognition is an extensively researched topic in the robotics and computer vision communities, and several different approaches have been proposed to the problem considered in the competition. The main differences between the approaches relate to the way the scene is perceived and thus the visual cues extracted from the input images. There are two main groups of approaches, using either global or local image features. Typically, SIFT and SURF are applied as local features, either using a matching strategy [5,6] or the bag-of-words approach [7,8]. Global features are also commonly used for place recognition, and representations such as the gist of a scene, CRFH, or PACT have been proposed. Recently, several authors observed that the robustness and efficiency of the recognition system can be improved by combining information provided by both types of cues (global and local) [5,12]. Our approach belongs to this group, and four different types of features previously used in the domain of place recognition have been employed in the presented system.
The rest of the paper gives a description of the structure and components of our place recognition system (Section 2). Then, we describe the initial experiments performed on the training and validation data (Section 3). We explain the procedure applied for parameter selection and study the properties of the cue integration and confidence estimation algorithms. Finally, we present the results obtained on the test sequence and our ranking in the competition (Section 4). The paper concludes with a summary and possible avenues for future research.
2 The Visual Place Recognition System
This section describes our approach to visual place classification. Our method is fully supervised and assumes that during training, each place (room) is represented by a collection of labeled data which captures its intrinsic visual properties under various viewpoints, at a fixed time and illumination setting. During testing, the algorithm is presented with data samples acquired under different conditions and after some time. The goal is to correctly recognize each single data sample provided to the system. The rest of the section describes the structure and components of the system.
2.1 Architecture of the System

The architecture of the system is illustrated in Fig. 1. We use four different cues extracted independently from the visual input, with a separate processing path for each cue. Every path consists of two main building blocks: a feature extractor and a classifier. Thus, separate decisions can be obtained for every cue. The outputs encoding the confidence of the single-cue classifiers are combined using a discriminative accumulation scheme.
Fig. 1. Structure of the multi-cue visual place recognition system.
2.2 Visual Cues

The system relies on visual cues based on global and local image features. Global features are derived from the whole image and thus can capture general properties of the whole scene. In contrast, local features are computed locally, from distinct parts of an image. This makes them much more robust to occlusions and viewpoint variations. In order to capture different aspects of the environment, we combine cues produced by four different feature extractors.
Composed Receptive Field Histograms (CRFH). CRFH is a multi-dimensional statistical representation (a histogram) of the occurrence of responses of several image descriptors applied to the whole image. Each dimension corresponds to one descriptor, and the cells of the histogram count the pixels sharing similar responses of all descriptors. This approach makes it possible to capture various properties of the image as well as the relations that occur between them. On the basis of a previous evaluation, we build the histograms from second-order Gaussian derivative filters applied to the illumination channel at two scales.
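As an illustration of the construction, the following is a minimal sketch of a composed histogram built from four filter responses (two second-order derivatives at two scales). The bin count, clipping and normalization here are illustrative assumptions, and a practical implementation would use a sparse representation to handle the high-dimensional histogram efficiently.

```python
import numpy as np
from scipy import ndimage

def crfh(image, scales=(1.0, 4.0), bins=16, clip=3.0):
    """Sketch of a composed receptive field histogram (CRFH).

    Builds a joint histogram over second-order Gaussian derivative
    responses (Lxx, Lyy) computed at two scales, so that each histogram
    dimension corresponds to one filter response.
    """
    responses = []
    for s in scales:
        # Second-order derivatives of the Gaussian scale-space representation.
        responses.append(ndimage.gaussian_filter(image, s, order=(0, 2)))  # Lxx
        responses.append(ndimage.gaussian_filter(image, s, order=(2, 0)))  # Lyy
    # Clip and quantize each response into [0, bins-1].
    coords = []
    for r in responses:
        r = np.clip(r / (clip * r.std() + 1e-9), -1.0, 1.0)
        coords.append(((r + 1.0) / 2.0 * (bins - 1)).astype(int).ravel())
    # One histogram cell per combination of bin indices across all responses.
    idx = np.ravel_multi_index(coords, dims=[bins] * len(responses))
    hist = np.bincount(idx, minlength=bins ** len(responses)).astype(float)
    return hist / hist.sum()  # L1-normalized multi-dimensional histogram
```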
PCA of Census Transform Histograms (PACT). The Census Transform (CT) is a non-parametric local transform designed for establishing correspondence between local patches. The Census Transform compares the intensity value of a pixel with those of its eight neighboring pixels, as illustrated in Fig. 2. A histogram of the CT values encodes both local and global information of the image. PACT is a global representation that extracts the CT histograms for several image patches organized in a grid and applies Principal Component Analysis (PCA) to the concatenated histograms.
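To make the transform concrete, here is a minimal sketch of the Census Transform; the border handling and the bit ordering of the neighbour comparisons are illustrative assumptions.

```python
import numpy as np

def census_transform(gray):
    """Sketch of the Census Transform (CT).

    Each interior pixel is compared with its eight neighbours; every
    comparison contributes one bit, yielding an 8-bit CT value per pixel.
    """
    h, w = gray.shape
    ct = np.zeros((h - 2, w - 2), dtype=np.uint8)
    centre = gray[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        ct |= ((centre >= neighbour).astype(np.uint8) << bit)
    return ct

# A PACT-style descriptor would then histogram the CT values (256 bins)
# over image patches on a grid, concatenate the histograms and project
# the resulting vector with PCA.
```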
Scale Invariant Feature Transform (SIFT). As one of the local representations, we used a combination of the SIFT descriptor and the scale-, rotation- and translation-invariant Harris-Laplace corner detector. The SIFT descriptor represents local image patches around interest points, characterized by coordinates in scale space, in the form of histograms of gradient directions.
Fig. 2. Illustration of the Census Transform.
Speeded-Up Robust Features (SURF). SURF is a scale- and rotation-invariant local detector and descriptor designed to approximate the performance of previously proposed schemes while being much more computationally efficient. This is obtained by using integral images, a Hessian matrix-based measure for the detector, and a distribution of Haar-wavelet responses for the descriptor.
2.3 Place Models
Based on its state-of-the-art performance in several visual recognition domains [15,16], we used the Support Vector Machine (SVM) classifier to build the models of places for each cue. The choice of the kernel function is a key ingredient for the good performance of SVMs, and we selected specialized kernels for each cue. Based on results reported in the literature, we chose the χ2 kernel for CRFH, the Gaussian (RBF) kernel for PACT, and the match kernel for both local features. In order to extend the binary SVM to multiple classes, we used the one-against-all strategy, in which one SVM is trained for each class, separating that class from all other classes.
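As an example of one of these choices, the χ2 kernel for histogram features can be sketched as follows. The exponential form with a γ parameter chosen by cross validation follows common usage; the helper name and default value are our assumptions.

```python
import numpy as np

def chi2_kernel(X, Y, gamma=1.0):
    """Sketch of the exponential chi-square kernel for histogram features:
    K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)).
    """
    K = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            denom = x + y
            mask = denom > 0  # skip empty bins to avoid division by zero
            K[i, j] = np.exp(-gamma * np.sum((x[mask] - y[mask]) ** 2 / denom[mask]))
    return K

# With a precomputed kernel matrix, an off-the-shelf SVM can be trained
# one-against-all, e.g. sklearn.svm.SVC(kernel='precomputed').
```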
SVMs do not provide any out-of-the-box solution for estimating the confidence of a decision; however, it is possible to derive confidence information and hypotheses ranking from the distances between the samples and the hyperplanes. In this work, we experimented with distance-based methods which define confidence as a measure of the unambiguity of the final decision.
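A minimal sketch of one such distance-based measure is given below. Taking the gap between the two best hypotheses is only one of several possible definitions of unambiguity and should be read as an assumption rather than the exact measure used in the system.

```python
import numpy as np

def confidence(outputs):
    """Sketch of a distance-based confidence estimate.

    'outputs' holds one signed distance to the hyperplane per class
    (one-against-all SVM). Confidence is taken as the gap between the
    best and the second-best hypotheses.
    """
    ranked = np.sort(outputs)[::-1]
    best_class = int(np.argmax(outputs))
    conf = ranked[0] - ranked[1]  # small gap -> ambiguous decision
    return best_class, conf

# Thresholding conf allows the system to abstain ("I do not know")
# instead of risking a penalized misclassification.
```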
2.4 Cue Integration and Temporal Accumulation
As indicated in previous work, different properties of visual cues result in different performance and error patterns on the place classification task. The role of the cue integration scheme is to exploit this fact in order to increase the overall performance. Our place recognition system uses the Discriminative Accumulation Scheme (DAS), which was proposed for the place classification problem. It accumulates multiple cues by turning classifiers into experts. The basic idea is to consider the real-valued outputs of a multi-class discriminative classifier as an indication of a soft decision for each class. Then, all of the outputs obtained from the various cues are summed together, i.e. linearly accumulated. In the presented system, this can be expressed by the equation

O_Σ = a · O_CRFH + b · O_PACT + c · O_SIFT + d · O_SURF,

where a, b, c, d are the weights assigned to each cue and a + b + c + d = 1. The vectors O represent the outputs of the multi-class classifiers for each cue.
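In code, the accumulation step amounts to a weighted sum of the per-class output vectors; the dictionary layout and cue names below are illustrative assumptions.

```python
import numpy as np

def accumulate_cues(outputs, weights):
    """Sketch of the linear discriminative accumulation step.

    'outputs' maps cue name -> vector of per-class classifier outputs;
    'weights' maps cue name -> scalar weight (the weights sum to one).
    """
    cues = list(outputs)
    assert abs(sum(weights[c] for c in cues) - 1.0) < 1e-6
    acc = sum(weights[c] * np.asarray(outputs[c]) for c in cues)
    return int(np.argmax(acc)), acc  # predicted class, accumulated outputs

# Example with hypothetical per-class outputs for a 3-room problem:
# accumulate_cues({'crfh': [0.2, 0.9, -0.3], 'sift': [0.5, 0.4, -0.1]},
#                 {'crfh': 0.4, 'sift': 0.6})
```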
We used a very similar scheme to improve the robustness of the system operating on image sequences. For this, we exploited the continuity of the sequences and accumulated the outputs (of a single cue or of the integrated cues) for the current sample and the N previously classified samples. The result of the accumulation was then used as the final decision of the system.
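A minimal sketch of this temporal scheme, assuming uniform weighting over a sliding window of the N+1 most recent outputs:

```python
from collections import deque
import numpy as np

class TemporalAccumulator:
    """Sketch of temporal accumulation over an image sequence.

    Keeps the per-class outputs of the last N+1 samples and sums them;
    uniform weighting over the window is an illustrative assumption.
    """
    def __init__(self, n_previous=5):
        self.window = deque(maxlen=n_previous + 1)

    def decide(self, per_class_outputs):
        self.window.append(np.asarray(per_class_outputs, dtype=float))
        summed = np.sum(self.window, axis=0)
        return int(np.argmax(summed))  # final decision for the current frame
```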
3 Experiments on the Training and Validation Data
We conducted several series of experiments on the training and validation data
in order to analyze the behavior of our system and select parameters. We present
the analysis and results in successive subsections.
3.1 Selection of the Model Parameters
The first set of experiments was aimed at finding the values of the parameters of the place models, i.e. the SVM error penalty C and the kernel parameters. The experiments were performed separately for each visual cue (CRFH, PACT, SIFT and SURF). To find the parameters, we performed cross validation on the training and validation data. For every training set, we selected the parameters that resulted in the highest classification rate on all available test sets acquired under different conditions. The classification rate was calculated in a similar way as the final score used in the competition, i.e. as the percentage of correctly classified images in the whole testing sequence.
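A sketch of this selection procedure for a single cue, assuming precomputed kernel matrices and scikit-learn's SVM with a precomputed kernel (the grids and the splitting scheme are illustrative assumptions):

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def select_parameters(K_by_gamma, labels, C_grid, splits):
    """Sketch of the parameter search for one cue.

    K_by_gamma maps a kernel parameter value to a precomputed kernel
    matrix over all samples; 'splits' yields (train_idx, test_idx)
    pairs built from the training and validation sequences.
    """
    best = (None, None, -1.0)
    for gamma, C in itertools.product(K_by_gamma, C_grid):
        K = K_by_gamma[gamma]
        rates = []
        for tr, te in splits:
            svm = SVC(C=C, kernel='precomputed')
            svm.fit(K[np.ix_(tr, tr)], labels[tr])
            pred = svm.predict(K[np.ix_(te, tr)])
            rates.append(np.mean(pred == labels[te]))
        if np.mean(rates) > best[2]:
            best = (gamma, C, float(np.mean(rates)))
    return best  # (kernel parameter, C, mean classification rate)
```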
Figure 3 presents the results obtained for the experiments with the dum-night3 training set, which was selected for the final run of the competition. It is apparent that the model based on the SIFT features provides the highest recognition rate on average. However, we can also see that different cues have different characteristics, as their performance changes according to different patterns. This suggests that the overall performance of the system could be increased by integrating the outputs of the models.
3.2 Cue Integration and Temporal Accumulation
The next step was to integrate the outputs of the models and choose the proper values of the DAS weights for each model. We performed an exhaustive search for the best combination of weights.
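A minimal sketch of such an exhaustive search over the weight simplex, with the grid step and the data layout as illustrative assumptions:

```python
import itertools
import numpy as np

def search_weights(outputs_by_cue, labels, step=0.1):
    """Sketch of the exhaustive search over DAS weights.

    Enumerates weight vectors on a grid over the simplex (non-negative
    weights summing to one) and keeps the combination with the highest
    classification rate on validation data. Each entry of
    'outputs_by_cue' has shape (n_samples, n_classes).
    """
    cues = list(outputs_by_cue)
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best_w, best_rate = None, -1.0
    for w in itertools.product(grid, repeat=len(cues)):
        if abs(sum(w) - 1.0) > 1e-9:
            continue  # keep only points on the simplex
        acc = sum(wi * np.asarray(outputs_by_cue[c]) for wi, c in zip(w, cues))
        rate = np.mean(np.argmax(acc, axis=1) == labels)
        if rate > best_rate:
            best_w, best_rate = dict(zip(cues, w)), rate
    return best_w, best_rate
```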
Fig. 3. Classification rates for the best model parameters and the dum-night3 training set. Results are given separately for each test set as well as averaged over all sets.