Article

On Robot Indoor Scene Classification Based on Descriptor Quality and Efficiency


Abstract

Indoor scene classification is usually approached from a computer vision perspective. However, in some fields, like robotics, additional constraints must be taken into account. Specifically, in systems with low resources, state-of-the-art techniques such as CNNs cannot be successfully deployed. In this paper, we try to close this gap between theoretical approaches and real-world solutions by performing an in-depth study of the factors that influence classifier performance, namely descriptor size and quality. To this end, we perform a thorough evaluation of the visual and depth data obtained with an RGB-D sensor and propose techniques to build robust descriptors that enable real-time indoor scene classification. Those descriptors are obtained by properly selecting and combining visual and depth information sources.
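As an illustration of the descriptor-combination idea the abstract describes, the following is a minimal sketch, not the paper's actual pipeline: it concatenates L2-normalized visual and depth descriptors and trains a linear SVM, whose cheap prediction step is what makes real-time classification on low-resource robots plausible. All array shapes and data are placeholder assumptions.

    import numpy as np
    from sklearn.preprocessing import normalize
    from sklearn.svm import LinearSVC

    def combine_descriptors(visual_desc, depth_desc):
        # Concatenate L2-normalized visual and depth descriptors so that
        # neither modality dominates the combined representation.
        v = normalize(visual_desc.reshape(1, -1))
        d = normalize(depth_desc.reshape(1, -1))
        return np.hstack([v, d]).ravel()

    rng = np.random.default_rng(0)
    X_visual = rng.random((100, 128))  # placeholder per-frame visual descriptors
    X_depth = rng.random((100, 64))    # placeholder per-frame depth descriptors
    y = rng.integers(0, 5, 100)        # placeholder room labels

    X = np.array([combine_descriptors(v, d) for v, d in zip(X_visual, X_depth)])
    clf = LinearSVC().fit(X, y)        # a linear SVM keeps prediction time low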


... In the approaches presented in [5] and [17], the authors modeled places using hierarchical representations, obtaining a multiversal semantic map of the environment. Robust scene descriptors generated by merging 2D and 3D descriptors were tested on classification problems in [15]. These descriptors focus on representing the environment of the desired area and are applied in a traditional image classification pipeline. ...
... In order to test the adaptability of the proposed methodology, we selected the ImageClef2012 dataset for a second set of experiments. We compare the results obtained on this dataset after applying our methodology with those obtained with the methodology proposed in [15]. This dataset is composed of three training sequences: training1, training2 and training3. ...
... The images in the dataset belong to one of the 9 available categories. The authors in [15] selected all the training sequences and merged them into a single training dataset, and then trained a Support Vector Machine (SVM) classifier. This classifier is evaluated using test sequences 1 and 2. ...
Article
Full-text available
The capacity of a robot to automatically adapt to new environments is crucial, especially in social robotics. Often, when these robots are deployed in home or office environments, they tend to fail because they lack the ability to adapt to new and continuously changing scenarios. In order to accomplish this task, robots must obtain new information from the environment, and then add it to their already learned knowledge. Deep learning techniques are often used to tackle this problem successfully. However, these approaches, complete retraining of the models, which is highly time-consuming. In this work, several strategies are tested to find the best way to include new knowledge in an already learned model in a deep learning pipeline, putting the spotlight on the time spent for this training. We tackle the localization problem in the long term with a deep learning approach and testing several retraining strategies. The results of the experiments are discussed and, finally, the best approach is deployed on a Pepper robot.
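One of the cheaper retraining strategies this abstract alludes to is fine-tuning only the classification head while freezing the pretrained backbone. The sketch below shows that strategy in PyTorch; the model choice (ResNet-18) and the number of place categories are illustrative assumptions, not the paper's setup.

    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    for p in model.parameters():       # freeze all pretrained weights
        p.requires_grad = False

    num_new_classes = 8                # hypothetical number of place categories
    model.fc = nn.Linear(model.fc.in_features, num_new_classes)  # fresh head

    # Only the new head's parameters are updated, which is far faster than
    # retraining the whole network from scratch.
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)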
... Since the transformation relationship of the entire camera system is considered, the relative relationship R_rgb, T_rgb of the color camera and the relative relationship R_ir, T_ir of the depth camera need to be subjected to a rigid body transformation. The transformation process is shown in formula (16) and formula (17) [25]: ...
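The cited formulas (16) and (17) are not reproduced in this snippet, but the underlying relation is the standard stereo-calibration identity: given extrinsics (R_rgb, T_rgb) and (R_ir, T_ir) of the color and depth cameras with respect to a common calibration target, the depth-to-color transform is R = R_rgb R_ir^T and T = T_rgb - R T_ir. A worked numeric sketch, with placeholder calibration values:

    import numpy as np

    R_rgb = np.eye(3)                         # placeholder color-camera rotation
    T_rgb = np.array([[0.0], [0.0], [0.0]])
    R_ir = np.eye(3)                          # placeholder depth-camera rotation
    T_ir = np.array([[0.025], [0.0], [0.0]])  # ~2.5 cm baseline, illustrative

    # Relative rigid-body transform from the depth frame to the color frame.
    R = R_rgb @ R_ir.T
    T = T_rgb - R @ T_ir

    def depth_point_to_rgb(p_ir):
        # Map a 3D point from the depth-camera frame to the color-camera frame.
        return R @ p_ir.reshape(3, 1) + T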
Article
Full-text available
Under the background of the information and data age, intelligent technology has had a large and profound impact on the production field of every industry, and the field of news dissemination has inevitably become one of them. With the rapid development of new media technologies, the traditional news production model has been disrupted by the development of the times. Driven by the needs of the industry's transformation and upgrading and by the further development of laser-sensor smart technology, a manuscript-writing robot based on laser sensors came into being. The manuscript-writing robot marks the first time in the history of news dissemination that the main producer of news manuscripts has changed from a natural person to a machine, transforming the news production process from semi-fixed to fully machine-automated. This paper first briefly introduces the manuscript-writing robot's working principle and characteristics and analyzes its influence on other aspects, then models the machine's hardware system, and finally conducts a quality comparison test between the manuscript-writing robot and traditional human news production. The analysis of the experimental data shows that the language organization ability of the manuscript-writing robot is generally inferior to that of human writers. However, the data quality of the manuscript-writing robot is 1.1 times that of traditional human writing, and its speed of producing news is 1.3 times faster. In terms of the depth and breadth of news coverage, the overall average value of writing robots is 0.475 higher than that of traditional human writing. Generally speaking, manuscript-writing robots have clear advantages over traditional manual labor in news production and considerable room for development.
... Classification of rooms is a challenging problem, as there are significant variations in layouts and objects present in each type of room [6]. Classification of scenes can be accomplished using (a) high-level features of the scenes, such as detected objects [7], (b) global image features, or (c) local image features. ...
Article
Full-text available
Classification of indoor environments is a challenging problem. The availability of low-cost depth sensors has opened up a new research area of using depth information in addition to color image (RGB) data for scene understanding. Transfer learning of deep convolutional networks with pairs of RGB and depth (RGB-D) images has to deal with integrating these two modalities. Single-channel depth images are often converted to three-channel images by extracting horizontal disparity, height above ground, and the angle of the pixel’s local surface normal (HHA) to apply transfer learning using networks trained on the Places365 dataset. The high computational cost of HHA encoding can be a major disadvantage for the real-time prediction of scenes, although this may be less important during the training phase. We propose a new, computationally efficient encoding method that can be integrated with any convolutional neural network. We show that our encoding approach performs equally well or better in a multimodal transfer learning setup for scene classification. Our encoding is implemented in a customized and pretrained VGG16 Net. We address the class imbalance problem seen in the image dataset using a method based on the synthetic minority oversampling technique (SMOTE) at the feature level. With appropriate image augmentation and fine-tuning, our network achieves scene classification accuracy comparable to that of other state-of-the-art architectures.
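The class-imbalance remedy described above, SMOTE applied at the feature level rather than on raw images, can be sketched with imbalanced-learn as below. The feature arrays are placeholders standing in for CNN activations; this is not the paper's exact configuration.

    import numpy as np
    from imblearn.over_sampling import SMOTE

    rng = np.random.default_rng(0)
    features = rng.random((300, 512))            # stand-in CNN feature vectors
    labels = rng.choice([0, 0, 0, 1], size=300)  # imbalanced class distribution

    X_res, y_res = SMOTE(random_state=0).fit_resample(features, labels)
    # X_res now also contains synthetic minority-class feature vectors,
    # interpolated between real minority samples and their nearest neighbors.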
... Such research efforts look at locating robots in indoor environments through semantic localization. The ultimate goal of such works is to enable the autonomous navigation of robots by classifying scenes pertaining to the robot trajectories (Romero-González, Martínez-Gómez, García-Varea, & Rodriguez-Ruiz, 2017). This work aims not only to identify elements of interest in the image, but also to locate them. ...
Article
Full-text available
The study of artificial learning processes in the area of computer vision has mainly focused on achieving a fixed output target rather than on identifying the underlying processes as a means to develop solutions capable of performing as well as or better than the human brain. This work reviews the well-known segmentation efforts in computer vision. However, our primary focus is on the quantitative evaluation of the amount of contextual information provided to the neural network; in particular, the information used to mimic the tacit information that a human is capable of using, like a sense of unambiguous order and the capability of improving an estimation by complementing already learned information. Our results show that, after a set of pre- and post-processing methods applied to both the training data and the neural network architecture, the predictions made were drastically closer to the expected output in comparison to the cases where no contextual additions were provided. Our results provide evidence that learning systems strongly rely on contextual information for the identification task.
... In [18], the authors develop a study for building robust scene descriptors based on the combination of visual and depth data. The approach was tested on classification problems. ...
Chapter
In social robotics, it is important that a mobile robot knows where it is because it provides a starting point for other activities such as moving from one room to another. As a contribution to solving this problem in the field of the semantic location of mobile robots, we propose to implement a methodology for scene recognition and learning in a real domestic environment. For this purpose, we used images from five different residences to create a dataset with which the base model was trained. The effectiveness of the implemented base model is evaluated in different scenarios. When the accuracy of the site identification decreases, the user provides feedback to the robot so that it can process the information collected from the new environment and re-identify the current location. The results obtained reinforce the need to acquire more knowledge when the environment is not recognizable by the pre-trained model.
... One of the most recent works also uses depth images along with RGB in order to retrieve richer information about a scene. The authors in [15] take advantage of depth images but do not use CNN-based techniques, due to the unavailability of pre-trained models for scene recognition and the high computing power required for training. Instead, SIFT-based feature extraction and an Online Independent SVM are used. ...
... This representation could be produced by using different kinds of descriptors, such as 3D and 2D ones. The work put forward in [16] develops a study for building robust scene descriptors based on the combination of visual and depth data. The approach was tested on classification problems. ...
Conference Paper
Semantic localization for mobile robots involves accurately determining the kind of place where the robot is located. Therefore, the representation of the knowledge of this place is crucial for the robot. In this paper we present a study for finding a robust model for the scene classification procedure of a mobile robot. The proposed system uses CNN descriptors for representing the input perceptions of the robot. First, we carry out comparative experiments in order to find such a model. The experiments include the evaluation of several pre-trained CNN models and the training of a classifier with different classification algorithms. These experiments were carried out using the ViDRILO dataset and compared with the benchmark provided by its authors. The results demonstrate the suitability of CNN descriptors for semantic classification.
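The study design above, a fixed pretrained CNN as descriptor extractor plus a comparison of classification algorithms, can be sketched as follows. The backbone (ResNet-18) and the classifier shortlist are illustrative assumptions; the paper evaluates its own selection of pre-trained models and algorithms.

    import torch
    from torchvision import models
    from sklearn.svm import LinearSVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier

    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()   # expose the 512-d global descriptor
    backbone.eval()

    @torch.no_grad()
    def cnn_descriptor(batch):          # batch: (N, 3, 224, 224) float tensor
        return backbone(batch).numpy()

    # Candidate classifiers to compare on the extracted descriptors.
    classifiers = {
        "svm": LinearSVC(),
        "knn": KNeighborsClassifier(),
        "rf": RandomForestClassifier(),
    }
    # for name, clf in classifiers.items(): clf.fit(train_desc, train_labels)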
Article
In the unstructured family environment, robots are expected to provide various services to improve the quality of human life, based on the performance of specific action sequences generated by service planning. This paper focuses on one of the greatest challenges in service planning: accomplishing service tasks by generating appropriate object sequences that guide the robot to search for the corresponding target objects efficiently and reasonably. A well-structured knowledge-based framework for object search is proposed in our approach, taking into account multi-domain knowledge about objects, scenes, and services in its design. In order to improve searching efficiency and reasonability, an ontology-based hierarchical and interrelated knowledge structure is formed to support the implementation of complicated service planning with either single or multiple tasks. The proposed framework is tested in comprehensive experiments, and its performance is compared with other mainstream methods in both simulation and real-world environments. The experimental results demonstrate the feasibility and effectiveness of applying this knowledge-based framework to efficient object search in service planning.
Article
Indoor place recognition is a challenging problem because complicated intra-class variations and inter-class similarities are hard to represent. This paper presents a new indoor place recognition scheme using a deep neural network. Traditional representations of indoor places mostly utilize image features to retain the spatial structure, without considering the semantic characteristics of objects. However, we argue that the attributes, state and relationships of objects are much more helpful in indoor place recognition. In particular, we improve the recognition framework by utilizing Place Descriptors (PDs) in text form to connect different types of place information with their categories. Meanwhile, we analyse the ability of Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models for classification in natural language, and use them to process the indoor place descriptions. In addition, we improve the robustness of the designed deep neural network by combining a number of effective strategies, i.e. L2-regularization, data normalization, and proper calibration of key parameters. Compared with the existing state of the art, the proposed approach achieves a good performance of 70.73%, 70.08% and 70.16% in accuracy, precision and recall on the Visual Genome database, respectively. Moreover, the accuracy reaches 98.6% after adding a voting mechanism.
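A minimal sketch of classifying textual place descriptors with an L2-regularized LSTM, in the spirit of the approach above. The vocabulary size, layer widths, and regularization weight are assumptions, not values from the paper.

    import tensorflow as tf

    vocab_size, num_places = 10000, 10
    reg = tf.keras.regularizers.l2(1e-4)   # the L2 regularization noted above

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 128),
        tf.keras.layers.LSTM(64, kernel_regularizer=reg),
        tf.keras.layers.Dense(num_places, activation="softmax",
                              kernel_regularizer=reg),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])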
Conference Paper
Full-text available
This paper addresses the problem of classifying places in the environment of a mobile robot into semantic categories. We believe that semantic information about the type of place improves the capabilities of a mobile robot in various domains including localization, path-planning, or human-robot interaction. Our approach uses AdaBoost, a supervised learning algorithm, to train a set of classifiers for place recognition based on laser range data. In this paper we describe how this approach can be applied to distinguish between rooms, corridors, doorways, and hallways. Experimental results obtained in simulation and with real robots demonstrate the effectiveness of our approach in various environments.
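A hedged sketch of the AdaBoost place classifier described above: simple geometric features are computed from each laser scan and fed to a boosted classifier. The four features here are illustrative stand-ins, not the paper's feature set.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def scan_features(ranges):
        # A few simple geometric features of one laser scan (assumed features).
        return np.array([
            ranges.mean(),               # average free space around the robot
            ranges.std(),                # irregularity of the boundary
            ranges.min(),                # distance to the closest obstacle
            ranges.max() - ranges.min()  # spread, typically large in corridors
        ])

    rng = np.random.default_rng(0)
    scans = rng.uniform(0.2, 10.0, size=(500, 360))  # placeholder 360-beam scans
    labels = rng.integers(0, 4, 500)  # room / corridor / doorway / hallway

    X = np.array([scan_features(s) for s in scans])
    clf = AdaBoostClassifier(n_estimators=100).fit(X, labels)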
Article
Full-text available
The ability to represent knowledge about space and its position therein is crucial for a mobile robot. To this end, topological and semantic descriptions are gaining popularity for augmenting purely metric space representations. In this paper we present a multi-modal place classification system that allows a mobile robot to identify places and recognize semantic categories in an indoor environment. The system effectively utilizes information from different robotic sensors by fusing multiple visual cues and laser range data. This is achieved using a high-level cue integration scheme based on a Support Vector Machine (SVM) that learns how to optimally combine and weight each cue. Our multi-modal place classification approach can be used to obtain a real-time semantic space labeling system which integrates information over time and space. We perform an extensive experimental evaluation of the method for two different platforms and environments, on a realistic off-line database and in a live experiment on an autonomous robot. The results clearly demonstrate the effectiveness of our cue integration scheme and its value for robust place classification under varying conditions.
Conference Paper
Full-text available
In the framework of indoor mobile robotics, place recognition is a challenging task, where it is crucial that self-localization be enforced precisely, notwithstanding the changing conditions of illumination, objects being shifted around and/or people affecting the appearance of the scene. In this scenario online learning seems the main way out, thanks to the possibility of adapting to changes in a smart and flexible way. Nevertheless, standard machine learning approaches usually suffer when confronted with massive amounts of data and when asked to work online. Online learning requires a high training and testing speed, all the more in place recognition, where a continuous flow of data comes from one or more cameras. In this paper we follow the Support Vector Machines-based approach of Pronobis et al., proposing an improvement that we call Online Independent Support Vector Machines. This technique exploits linear independence in the image feature space to incrementally keep the size of the learning machine remarkably small while retaining the accuracy of a standard machine. Since the training and testing time crucially depend on the size of the machine, this solves the above stated problems. Our experimental results prove the effectiveness of the approach.
Conference Paper
Full-text available
In this paper we describe the problem of visual place categorization (VPC) for mobile robotics, which involves predicting the semantic category of a place from image measurements acquired from an autonomous platform. For example, a robot in an unfamiliar home environment should be able to recognize the functionality of the rooms it visits, such as kitchen, living room, etc. We describe an approach to VPC based on sequential processing of images acquired with a conventional video camera. We identify two key challenges: Dealing with non-characteristic views and integrating restricted-FOV imagery into a holistic prediction. We present a solution to VPC based upon a recently-developed visual feature known as CENTRIST (census transform histogram). We describe a new dataset for VPC which we have recently collected and are making publicly available. We believe this is the first significant, realistic dataset for the VPC problem. It contains the interiors of six different homes with ground truth labels. We use this dataset to validate our solution approach, achieving promising results.
Conference Paper
Full-text available
Indoor scene recognition is a challenging open problem in high level vision. Most scene recognition models that work well for outdoor scenes perform poorly in the indoor domain. The main difficulty is that while some indoor scenes (e.g. corridors) can be well characterized by global spatial properties, others (e.g. bookstores) are better characterized by the objects they contain. More generally, to address the indoor scene recognition problem we need a model that can exploit local and global discriminative information. In this paper we propose a prototype based model that can successfully combine both sources of information. To test our approach we created a dataset of 67 indoor scene categories (the largest available) covering a wide range of domains. The results show that our approach can significantly outperform a state of the art classifier for the task.
Conference Paper
Full-text available
There has been much interest in unsupervised learning of hierarchical generative models such as deep belief networks. Scaling such models to full-sized, high-dimensional images remains a difficult problem. To address this problem, we present the convolutional deep belief network, a hierarchical generative model which scales to realistic image sizes. This model is translation-invariant and supports efficient bottom-up and top-down probabilistic inference. Key to our approach is probabilistic max-pooling, a novel technique which shrinks the representations of higher layers in a probabilistically sound way. Our experiments show that the algorithm learns useful high-level visual features, such as object parts, from unlabeled images of objects and natural scenes. We demonstrate excellent performance on several visual recognition tasks and show that our model can perform hierarchical (bottom-up and top-down) inference over full-sized images.
Conference Paper
Full-text available
A key ingredient in the design of visual object classification systems is the identification of relevant class-specific aspects while being robust to intra-class variations. While this is a necessity in order to generalize beyond a given set of training images, it is also a very difficult problem due to the high variability of visual appearance within each class. In the last years substantial performance gains on challenging benchmark datasets have been reported in the literature. This progress can be attributed to two developments: the design of highly discriminative and robust image features, and the combination of multiple complementary features based on different aspects such as shape, color or texture. In this paper we study several models that aim at learning the correct weighting of different features from training data. These include multiple kernel learning as well as simple baseline methods. Furthermore we derive ensemble methods inspired by Boosting which are easily extendable to several multiclass settings. All methods are thoroughly evaluated on object classification datasets using a multitude of feature descriptors. The key results are that even very simple baseline methods, which are orders of magnitude faster than learning techniques, are highly competitive with multiple kernel learning. Furthermore the Boosting-type methods are found to produce consistently better results in all experiments. We provide insight into when combination methods can be expected to work and how the benefit of complementary features can be exploited most efficiently.
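One of the "simple baseline methods" this abstract refers to is combining complementary features by uniformly averaging their kernels and training a single SVM on the precomputed combination. A minimal sketch of that baseline follows; the three feature matrices are random placeholders standing in for shape, color, and texture descriptors.

    import numpy as np
    from sklearn.metrics.pairwise import chi2_kernel
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    y = rng.integers(0, 3, 80)
    feats = [rng.random((80, 64)) for _ in range(3)]  # stand-in feature channels

    # Uniform kernel average: the baseline that competes with learned weights.
    K = np.mean([chi2_kernel(f) for f in feats], axis=0)
    clf = SVC(kernel="precomputed").fit(K, y)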
Article
Full-text available
The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.
Conference Paper
Full-text available
Effective methods for recognising objects or spatio-temporal events can be constructed based on receptive field responses summarised into histograms or other histogram-like image descriptors. This work presents a set of composed histogram features of higher dimensionality, which give significantly better recognition performance compared to the histogram descriptors of lower dimensionality that were used in the original papers by Swain & Ballard (1991) or Schiele & Crowley (2000). The use of histograms of higher dimensionality is made possible by a sparse representation for efficient computation and handling of higher-dimensional histograms. Results of extensive experiments are reported, showing how the performance of histogram-based recognition schemes depends upon different combinations of cues, in terms of Gaussian derivatives or differential invariants applied to either intensity information, chromatic information or both. It is shown that there exist composed higher-dimensional histogram descriptors with much better performance for recognising known objects than previously used histogram features. Experiments are also reported on classifying unknown objects into visual categories.
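The sparse representation that makes these higher-dimensional histograms tractable can be sketched as follows: only non-empty bins are stored, keyed by the quantized tuple of per-pixel cue responses. The bin count and value range are assumptions for illustration.

    from collections import Counter
    import numpy as np

    def sparse_histogram(responses, bins_per_dim=16, lo=-1.0, hi=1.0):
        # responses: (n_pixels, n_cues) array of receptive-field responses.
        q = np.clip(((responses - lo) / (hi - lo) * bins_per_dim).astype(int),
                    0, bins_per_dim - 1)
        hist = Counter(map(tuple, q))   # dict: quantized bin tuple -> count
        total = sum(hist.values())
        return {k: v / total for k, v in hist.items()}

    # A 5-cue histogram has 16**5 (about a million) potential bins, but only as
    # many stored entries as there are distinct response tuples in the image.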
Article
This article describes the Robot Vision challenge, a competition that evaluates solutions for the visual place classification problem. Since its origin, this challenge has been proposed as a common benchmark where worldwide proposals are measured using a common overall score. Each new edition of the competition introduced novelties, both in the type of input data and in the sub-objectives of the challenge. All the techniques used by the participants have been gathered and published to make them accessible for future developments. The legacy of the Robot Vision challenge includes datasets, benchmarking techniques, and a wide experience in place classification research that is reflected in this article.
Article
In this article we describe a semantic localization dataset for indoor environments named ViDRILO. The dataset provides five sequences of frames acquired with a mobile robot in two similar office buildings under different lighting conditions. Each frame consists of a point cloud representation of the scene and a perspective image. The frames in the dataset are annotated with the semantic category of the scene, but also with the presence or absence of a list of predefined objects appearing in the scene. In addition to the frames and annotations, the dataset is distributed with a set of tools for its use in both place classification and object recognition tasks. The large number of labeled frames in conjunction with the annotation scheme makes this dataset different from existing ones. The ViDRILO dataset is released for use as a benchmark for different problems such as multimodal place classification and object recognition, 3D reconstruction or point cloud data compression.
Article
Most current scene classification approaches are based on either low-level or semantic modeling strategies. However, both strategies have inherent weaknesses. The low-level modeling based approaches normally classify images into a small number of scene categories and often exhibit poor performance, while the semantic modeling strategies usually bring high computational cost and memory consumption. In this paper, we present a novel approach which retains the advantages of both the low-level and semantic modeling strategies while overcoming their weaknesses. To represent scene images more effectively, a new visual descriptor called GBPWHGO is introduced. Experimental results on six commonly used data sets demonstrate that our approach performs competitively against previous methods across all data sets, and that the GBPWHGO descriptor outperforms the SIFT, LBP and Gist descriptors in scene classification.
Article
Recently, methods based on local image features have shown promise for texture and object recognition tasks. This paper presents a large-scale evaluation of an approach that represents images as distributions (signatures or histograms) of features extracted from a sparse set of keypoint locations and learns a Support Vector Machine classifier with kernels based on two effective measures for comparing distributions, the Earth Mover's Distance and the χ² distance. We first evaluate the performance of our approach with different keypoint detectors and descriptors, as well as different kernels and classifiers. We then conduct a comparative evaluation with several state-of-the-art recognition methods on four texture and five object databases. On most of these databases, our implementation exceeds the best reported results and achieves comparable performance on the rest. Finally, we investigate the influence of background correlations on recognition performance via extensive tests on the PASCAL database, for which ground-truth object localization information is available. Our experiments demonstrate that image representations based on distributions of local features are surprisingly effective for classification of texture and object images under challenging real-world conditions, including significant intra-class variations and substantial background clutter.
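Of the two distribution comparisons this abstract names, the χ² distance is the simpler to sketch; the exponentiated form below is a common way to turn it into an SVM kernel (the EMD-based kernel is omitted, and the gamma handling is a convention, not necessarily the paper's exact choice).

    import numpy as np

    def chi2_distance(h1, h2, eps=1e-10):
        # Chi-squared distance between two (normalized) histograms.
        return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

    def chi2_gram(H, gamma=1.0):
        # Kernel matrix K[i, j] = exp(-gamma * chi2(H[i], H[j])),
        # usable with an SVM that accepts precomputed kernels.
        n = len(H)
        K = np.zeros((n, n))
        for i in range(n):
            for j in range(i, n):
                K[i, j] = K[j, i] = np.exp(-gamma * chi2_distance(H[i], H[j]))
        return K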
Conference Paper
An important competence for a mobile robot system is the ability to localize and perform context interpretation. This is required to perform basic navigation and to facilitate location-specific services. Usually localization is performed based on a purely geometric model. Through the use of vision and place recognition, a number of opportunities open up in terms of flexibility and association of semantics to the model. To achieve this, the present paper presents an appearance-based method for place recognition. The method is based on a large margin classifier in combination with a rich global image descriptor. The method is robust to variations in illumination and minor scene changes. The method is evaluated across several different cameras, changes in time-of-day and weather conditions. The results clearly demonstrate the value of the approach.
Article
CENsus TRansform hISTogram (CENTRIST), a new visual descriptor for recognizing topological places or scene categories, is introduced in this paper. We show that place and scene recognition, especially for indoor environments, require its visual descriptor to possess properties that are different from other vision domains (e.g., object recognition). CENTRIST satisfies these properties and suits the place and scene recognition task. It is a holistic representation and has strong generalizability for category recognition. CENTRIST mainly encodes the structural properties within an image and suppresses detailed textural information. Our experiments demonstrate that CENTRIST outperforms the current state of the art in several place and scene recognition data sets, compared with other descriptors such as SIFT and Gist. Besides, it is easy to implement and evaluates extremely fast.
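A minimal census transform plus histogram in the spirit of CENTRIST: each interior pixel is encoded by an 8-bit pattern of comparisons with its 8 neighbors, and the image is summarized by the 256-bin histogram of these codes. The comparison direction is one common convention; this sketch omits the descriptor's further refinements.

    import numpy as np

    def centrist(img):
        # img: 2D grayscale array. Bit set when the neighbor is <= the center.
        c = img[1:-1, 1:-1]
        code = np.zeros(c.shape, dtype=np.int32)
        shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
        h, w = img.shape
        for bit, (dy, dx) in enumerate(shifts):
            neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
            code |= (neighbor <= c).astype(np.int32) << bit
        hist = np.bincount(code.ravel(), minlength=256).astype(float)
        return hist / hist.sum()    # normalized 256-bin CENTRIST histogram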
Article
Support vector machines (SVMs) are one of the most successful algorithms for classification. However, due to their space and time requirements, they are not suitable for on-line learning, that is, when presented with an endless stream of training observations. In this paper we propose a new on-line algorithm, called on-line independent support vector machines (OISVMs), which approximately converges to the standard SVM solution each time new observations are added; the approximation is controlled via a user-defined parameter. The method employs a set of linearly independent observations and tries to project every new observation onto the set obtained so far, dramatically reducing time and space requirements at the price of a negligible loss in accuracy. As opposed to similar algorithms, the size of the solution obtained by OISVMs is always bounded, implying a bounded testing time. These statements are supported by extensive experiments on standard benchmark databases as well as on two real-world applications, namely place recognition by a mobile robot in an indoor environment and human grasping posture classification.
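The linear-independence test at the heart of OISVM can be sketched as follows: a new sample joins the basis only if the residual of its feature-space projection onto the current basis, k(x,x) - k_b^T K_bb^{-1} k_b, exceeds a tolerance eta. The RBF kernel and the eta value are assumptions; the full OISVM update is not shown.

    import numpy as np

    def rbf(a, b, gamma=0.5):
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def is_independent(x_new, basis, eta=1e-3):
        # Projection residual in feature space; > eta means (approximately)
        # linearly independent of the current basis.
        if not basis:
            return True
        K_bb = np.array([[rbf(a, b) for b in basis] for a in basis])
        k_b = np.array([rbf(x_new, b) for b in basis])
        coeff = np.linalg.solve(K_bb + 1e-8 * np.eye(len(basis)), k_b)
        residual = rbf(x_new, x_new) - k_b @ coeff
        return residual > eta

    basis = []
    for x in np.random.default_rng(0).random((50, 8)):  # stream of observations
        if is_independent(x, basis):
            basis.append(x)      # the machine's size stays bounded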
We propose a novel approach to learn and recognize natural scene categories. Unlike previous work [9,17], it does not require experts to annotate the training set. We represent the image of a scene by a collection of local regions, denoted as codewords obtained by unsupervised learning. Each region is represented as part of a "theme". In previous work, such themes were learnt from hand-annotations of experts, while our method learns the theme distributions as well as the codewords distribution over the themes without supervision. We report satisfactory categorization performances on a large set of 13 categories of complex scenes.
Scene categorization is a fundamental problem in computer vision. However, scene understanding research has been constrained by the limited scope of currently-used databases which do not capture the full variety of scene categories. Whereas standard databases for object categorization contain hundreds of different classes of objects, the largest available dataset of scene categories contains only 15 classes. In this paper we propose the extensive Scene UNderstanding (SUN) database that contains 899 categories and 130,519 images. We use 397 well-sampled categories to evaluate numerous state-of-the-art algorithms for scene recognition and establish new bounds of performance. We measure human scene classification performance on the SUN database and compare this with computational methods. Additionally, we study a finer-grained scene representation to detect scenes embedded inside of larger scenes.
Article
Local binary pattern (LBP) is an effective texture descriptor with successful applications in texture classification and face recognition. Many extensions have been made to the conventional LBP descriptor. One extension is dominant local binary patterns, which aim at extracting the dominant local structures in texture images. A second extension represents LBP descriptors in the Gabor transform domain (LGBP). A third extension is multi-resolution LBP (MLBP). Another extension is dynamic LBP for video texture extraction. In this paper, we extend the conventional local binary pattern to the pyramid transform domain (PLBP). By cascading the LBP information of hierarchical spatial pyramids, PLBP descriptors take texture resolution variations into account. PLBP descriptors show their effectiveness for texture representation. Comprehensive comparisons are made between LBP, MLBP, LGBP, and PLBP. The performance of no sampling, partial sampling and spatial pyramid sampling approaches for the construction of PLBP texture descriptors is compared. The influence of pyramid generation approaches and of pyramid levels on PLBP-based image categorization performance is discussed. Compared to the existing multi-resolution LBP descriptors, PLBP achieves satisfactory performance at low computational cost.
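The cascading idea above, an LBP histogram per pyramid level concatenated into one descriptor, can be sketched with scikit-image. The level count and LBP parameters are illustrative assumptions.

    import numpy as np
    from skimage.feature import local_binary_pattern
    from skimage.transform import pyramid_gaussian

    def plbp(img, levels=3, P=8, R=1.0):
        # Concatenate uniform-LBP histograms computed at each pyramid level.
        hists = []
        for layer in pyramid_gaussian(img, max_layer=levels - 1):
            layer8 = (layer * 255).astype(np.uint8)   # LBP expects integer images
            codes = local_binary_pattern(layer8, P, R, method="uniform")
            h, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
            hists.append(h)
        return np.concatenate(hists)   # one descriptor across all resolutions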
Conference Paper
In this paper we explore how a structured light depth sensor, in the form of the Microsoft Kinect, can assist with indoor scene segmentation. We use a CRF-based model to evaluate a range of different representations for depth information and propose a novel prior on 3D location. We introduce a new and challenging indoor scene dataset, complete with accurate depth maps and dense label coverage. Evaluating our model on this dataset reveals that the combination of depth and intensity images gives dramatic performance gains over intensity images alone. Our results clearly demonstrate the utility of structured light sensors for scene understanding.
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
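The feature-matching step this abstract describes is commonly reproduced with OpenCV's SIFT implementation and the distance-ratio test; a minimal sketch follows (the image file names are hypothetical, and the full recognition pipeline with Hough clustering and pose verification is not shown).

    import cv2

    img1 = cv2.imread("scene_a.png", cv2.IMREAD_GRAYSCALE)  # hypothetical files
    img2 = cv2.imread("scene_b.png", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < 0.8 * n.distance]   # the distance-ratio test
    print(f"{len(good)} reliable matches")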
This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba’s "gist" and Lowe’s SIFT descriptors.
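A sketch of the spatial pyramid construction over quantized local features: histograms are computed in increasingly fine grids and concatenated using the standard pyramid-match level weights. The codebook size and number of levels are assumptions.

    import numpy as np

    def spatial_pyramid(xs, ys, words, img_w, img_h, vocab=200, levels=3):
        # xs, ys: feature coordinates; words: codeword index of each feature.
        L = levels - 1
        parts = []
        for level in range(levels):
            cells = 2 ** level
            # Standard pyramid-match weights: finer levels count more.
            weight = 1.0 / 2 ** L if level == 0 else 1.0 / 2 ** (L - level + 1)
            cx = np.minimum(xs * cells // img_w, cells - 1).astype(int)
            cy = np.minimum(ys * cells // img_h, cells - 1).astype(int)
            for i in range(cells):
                for j in range(cells):
                    in_cell = (cx == i) & (cy == j)
                    h = np.bincount(words[in_cell], minlength=vocab).astype(float)
                    parts.append(weight * h)
        return np.concatenate(parts)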
Conference Paper
While navigating in an environment, a vision system has to be able to recognize where it is and what the main objects in the scene are. We present a context-based vision system for place and object recognition. The goal is to identify familiar locations (e.g., office 610, conference room 941, main street), to categorize new environments (office, corridor, street) and to use that information to provide contextual priors for object recognition (e.g., tables are more likely in an office than a street). We present a low-dimensional global image representation that provides relevant information for place recognition and categorization, and show how such contextual information introduces strong priors that simplify object recognition. We have trained the system to recognize over 60 locations (indoors and outdoors) and to suggest the presence and locations of more than 20 different object types. The algorithm has been integrated into a mobile system that provides realtime feedback to the user.
Article
We have developed a technique for place learning and place recognition in dynamic environments. Our technique associates evidence grids with places in the world and uses hill climbing to find the best alignment between current perceptions and learned evidence grids. We present results from five experiments performed using a real mobile robot in a real-world environment. These experiments measured the effects of transient and lasting changes in the environment on the robot's ability to localize. In addition, these experiments tested the robot's ability to recognize places from different viewpoints and verified the scalability of this approach to environments containing large numbers of places. Our results demonstrate that places can be recognized successfully despite significant changes in their appearance, despite the presence of moving obstacles, and despite observing these places from different viewpoints during place learning and place recognition.
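A toy sketch of the hill-climbing alignment idea above: climb over small translations to maximize the overlap between the current local occupancy grid and a stored place grid. Rotation, the wrap-around of np.roll, and the real scoring function are simplifications.

    import numpy as np

    def match_score(stored, current, dy, dx):
        # Overlap of occupied cells after shifting the current grid.
        shifted = np.roll(np.roll(current, dy, axis=0), dx, axis=1)
        return np.sum(stored * shifted)

    def hill_climb(stored, current, max_steps=50):
        dy = dx = 0
        best = match_score(stored, current, dy, dx)
        for _ in range(max_steps):
            moves = [(dy + a, dx + b) for a in (-1, 0, 1) for b in (-1, 0, 1)]
            scored = [(match_score(stored, current, a, b), a, b)
                      for a, b in moves]
            top, a, b = max(scored)
            if top <= best:
                return best, (dy, dx)    # local maximum reached
            best, dy, dx = top, a, b
        return best, (dy, dx)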