Real-time Visual-Based Localization for Mobile Robot Using Structured-View Deep Learning

Authors: Yi-Feng Hong, Yu-Ming Chang, and Chih-Hung G. Li, Member, IEEE

Abstract

This paper demonstrates a place recognition and localization method designed for automated guidance of mobile robots. Collecting and annotating sufficient images for a supervised deep learning model is often an exhausting task. Devising an effective visual detection scheme for locating a mobile robot in a feature-barren environment, such as the indoor corridors of buildings, is also quite challenging. To address these issues, a supervised deep learning model for the spatial coordinate detection of a mobile robot is proposed here. Specifically, a novel technique is introduced that structures and collages the surrounding views obtained by the on-board cameras for training data preparation. A system linking robot kinematics and image processing provides automatic data annotation, which significantly reduces the need for human work in data preparation. Experimental evidence showed that the precision and recall rates of the location coordinate detection are 0.91 and 0.85, respectively. Also, the detection appeared to be effective over a path width of 0.75 m, which is sufficient to cover possible deviations from the target path. Furthermore, each visual detection performed by an ordinary PC on board the mobile robot took 0.14 s on average; thus, real-time navigation using the proposed method is achievable.
I. INTRODUCTION
Autonomous mobility is a key capability of mobile robots, which are expected to perform real-time obstacle avoidance, path planning, and place recognition without human assistance. Determining the position at which the robot is located is essential to autonomous mobility; a
reliable and effective method for such automated localization
tasks is of great use. A common and important task of a mobile
service robot is to navigate and localize itself along the correct
path to the designated destination. For example, when a
hospital robot is dispatched to a specific station, the robot
should first determine the routes to take, then navigate along
the routes while constantly confirming its location/coordinates,
and finally reach the goal. Often the indoor routes consist of
lengthy corridors such as the one shown in Fig. 1, which lack sufficient visual features for effective visual detection. It is
also more convenient and efficient for navigation control, if
the localization system outputs precise spatial coordinates of
the robot based on the result of the visual detection.
Much work has been done on Simultaneous Localization
And Mapping (SLAM); however, an easy-to-implement
method that provides accurate and precise coordinates of the
mobile robot is yet to come. In this paper, we address the
problem of indoor localization in a feature-barren environment
such as corridors with the simple and repeating appearance
(see Fig. 1). Specifically, we devise a method which allows the
robot to remember visual features associated with each set of
strictly defined spatial coordinates. The robot can then locate
itself along the learned paths based on the views that it sees.
The idea is similar to a teacher walking through a school
campus with new students and introducing the environment to
them. If the robots (the students) can automatically learn to
remember and associate the scenes with internally constructed
spatial coordinates, the operator does not have to have any
engineering skills to manually construct the map for every
application. Common and tedious work associated with environmental setup, such as marking and taping, is also no longer needed.
Here we propose a visual localization system that
combines the superior visual feature-learning capability of
Convolutional Neural Network (ConvNet) with designed
kinematic patterns of the mobile robot to form our framework
of image data preparation and coordinate-articulate SLAM.
By taking photos following the pre-determined kinematic rules,
images are acquired at every nodal point and automatically
annotated with precision coordinates. To increase the
dimensionality of visual features, we collage four views (front, rear, left, and right) into one image, yet the computation is managed to run in real time. To further enhance the detection
accuracy, a data augmentation technique is introduced based
on a structured image collaging operation. Low-cost CMOS cameras prove to be sufficient in our proposed system; the computation speed of image processing operations such as collaging and resizing also proves high enough for real-time operation.

Yi-Feng Hong, Yu-Ming Chang, and Chih-Hung G. Li are with the Graduate Institute of Manufacturing Technology, National Taipei University of Technology, Taipei, 10608 Taiwan ROC (phone: +886-2-2771-2171 ext. 2092; fax: +886-2-2776-4889; e-mail: cL4e@ntut.edu.tw).

Figure 1. Left: The proposed visual-based localization system aims at providing real-time spatial coordinate predictions of the mobile robot in feature-barren environments such as the indoor corridors shown in the photos. Right: The mobile robot built by the authors and used in the study.
The core contribution of this work is that we introduce a
novel visual localization system that fuses deep learning with
a structured visual data preparation scheme for automatic
training data preparation and direct spatial coordinate
detection of a mobile robot (see Fig. 2). Specifically, we
introduce an innovative image structure and collage operation
prior to the ConvNet. The training and detection processes are highly automated, and even non-skilled personnel can easily
set up the system for every application.
The rest of the paper is organized as follows. Section II
describes related prior work in visual-based localization and
ConvNet. Section III details the proposed method. Section IV reports our experimental results, comparison, and discussion. The conclusion is in Section V.
II. RELATED WORK
Various methods have been adopted for mobile robot
localization, e.g., EKF-SLAM, FastSLAM, GraphSLAM,
visual-based localization [1], etc., using devices such as GPS, LIDAR, RFID, magnetic tape, QR Code, Wi-Fi, and RGB-D cameras. Visual-based localization generally uses local features or global features [2]; local features describe, at the pixel level, a local neighborhood of several points in the image [2], e.g., SIFT [3]. Global features consider the
image as a whole and produce one signature with high
dimensionality [2]; e.g., Oliva and Torralba [4] proposed a
Spatial Envelope that represents the dominant spatial structure
of a scene. It has been shown that deep Conv-Net is powerful
in learning global features into a single vector for classification
[5]. ConvNet gives the artificial neural network a higher
abstraction ability, making it possible to identify homogeneous
but highly variable signals [5]–[8]. For example, place
recognition subjected to dramatic seasonal changes was
accomplished using a ConvNet [9], [10]. Experimental
evidence also showed that ConvNet clearly out-performs SIFT
on descriptor matching [11], and other methods in visual
recognition, classification, and detection [12].
Recently, many approaches adopting ConvNet for visual
localization or place recognition have been proposed. Yang
and Scherer [13] presented a monocular SLAM that infers 3D
objects and layout planes based on a combination of ConvNet
and geometric modelling. Iscen et al. [14] introduced the
panorama-to-panorama matching process with stitched images;
in their findings, four views of a scene are enough to obtain a
recall up to 99%. Mendez et al. [15] presented a global
localization approach that relies solely on the ConvNet-based
semantic labels present in the floorplan and extracted from
RGB images. Sunderhauf et al. [16] presented an object
proposal scheme to identify potential landmarks within an
image, utilizing ConvNet features as robust landmark
descriptors. Their method goes through a sophisticated
landmark matching process and is not capable of operating in
real time (1.8 s per image). As the indoor corridors we focus on possess similar and repeating features such as walls, doors, windows, and bulletin boards, their composition is essentially very different from city views, which are full of distinguishable landmarks (see Fig. 1). Chakravarty et al. [17]
used a generative model to learn the joint distribution between
RGB and depth images; they conditioned the depth generation
on the location of the image, which is obtained using a
standard ConvNet, trained to output the camera location as one
of N topological nodes from the environment. The depth
information can be used to plan collision free paths through an
environment. To further improve the performance of place
recognition in route-based navigation, sequence matching was
proposed, in which localization was achieved by recognizing
coherent sequences of the local best matches [10], [18]. By
introducing a NetVLAD layer inside a ConvNet, better results
on the photo-querying problem were reported [19].
Albeit rather successful, ConvNet as a branch of
supervised learning faces some major problems in practice.
For one, large quantities of training data are needed; as a rule of thumb, the more the merrier. To remedy the problem of
shortage of image data, D’Innocente et al. [20] proposed a data
augmentation layer that zooms on the object of interest and
simulates the object detection outcome of a robot vision
system. Oquab et al. [21] also pointed out that detailed image
annotation, e.g., by object bounding boxes, is both expensive
and often subjective. To be relieved from heavily relying on
manually annotated data, they introduced multi-scale sliding-
window training to train from images rescaled to multiple
different sizes and to treat fully connected adaptation layers as
convolutions. Inspired by their approaches, here we propose an image structuring and collaging method to enrich the training data set needed for supervised deep learning.

Figure 2. Architecture of the proposed visual localization system. The mobile robot captures four images in real time and activates an image structuring and collaging operation. The collaged images are trained by a ConvNet for direct spatial coordinate output. In the example, 21 location coordinates along a route are set up as the output. (Pipeline: capturing four views surrounding the mobile robot, structuring and collaging of the surrounding views, then holistic deep learning and prediction of the collaged views through Conv_1, Pool_1, Conv_2, Pool_2, fully connected layers, and a 21-way output.)
As will be clearer below, our training data preparation
framework not only generates the required quantities, but also
provides coordinate-articulate image data.
III. PROPOSED METHOD
We assume the mobile robot is capable of autonomous navigation functions such as obstacle avoidance and maintaining its heading. Here, we focus on the problem of recognizing nodal
locations along a given route using nothing but visual features.
First, we divide each route into short segments, each with a length of approximately 0.1–2 m. Connecting the segments
are the spatial nodes that serve as landmarks, whose
surrounding views are to be memorized by the robot. As the
robot navigates, it will constantly check its current nodal
location by detecting the surrounding views through the
ConvNet pipeline. Thus, the robot reaches the goal by
detecting the destination node; the middle nodes assist in
recognition of the current status and confirmation of the
correct path. As shown in Fig. 3, our robot is equipped with 4
low-resolution (640480 pixels) CMOS cameras, installed
around the robot at a height of 41 cm from the ground facing
front, rear, left, and right. Typical images obtained by the four
cameras are shown in Fig. 4. To memorize the surrounding
views of each nodal location, a photo is taken by each of the
4 cameras at every node. Note that at every node, we only take
4 photos and use them as the basis of the training image data;
thus, the preparation operation for each new application is
simple. For each new route, the operator only needs to take
the robot to go through the pre-determined route once; the
program automatically collects the images at the nodal
locations sequentially based on the pre-defined kinematic rule
of the robot, e.g., setting a node every 2 m while travelling at
1 m/s and a node every 1 m while traveling at 0.5 m/s. Such
quantities of image data are insufficient for a successful
supervised deep learning model. As will be clearer below, our
proposed image structuring and collaging operation generates
thousands of training images for each nodal location based on
the four basis photos. With deliberate bounding window
arrangements, these new images simulate the views of the
observer slightly drifting away from the original robot
location; thus, the effective area of location detection can be
enlarged. A sketch of the kinematics-based capture and automatic annotation loop is given below; the image structuring and collaging process is detailed in the following subsection.
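To make the automatic annotation step concrete, the following is a minimal sketch of such a kinematics-driven capture loop; the robot interface (is_navigating, get_odometry_distance, grab_frames) is a hypothetical placeholder and not part of the paper, and the node spacing simply follows the pre-defined kinematic rule described above.

```python
import os
import cv2  # OpenCV, used here only to write the captured frames to disk


def collect_node_images(robot, node_spacing_m=2.0, out_dir="route_images"):
    """Capture the four surrounding views at every nodal location while the
    robot traverses the route once; files are named by node index so that
    no manual annotation is required (the node index maps directly to a
    path coordinate of node_idx * node_spacing_m)."""
    os.makedirs(out_dir, exist_ok=True)
    node_idx = 0
    next_capture_at = 0.0  # cumulative travel distance (m) of the next node

    while robot.is_navigating():                  # hypothetical interface
        traveled = robot.get_odometry_distance()  # cumulative distance in meters
        if traveled >= next_capture_at:
            frames = robot.grab_frames()          # dict: {"front": img, "rear": ..., "left": ..., "right": ...}
            for view, img in frames.items():
                fname = f"node{node_idx:02d}_{view}.png"
                cv2.imwrite(os.path.join(out_dir, fname), img)
            node_idx += 1
            next_capture_at += node_spacing_m
```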
A. Image Structuring and Collaging
To increase the number of training images and to enhance the success rate of place recognition, the following data preparation
and augmentation techniques are proposed. First, image
rescaling was used to generate various object sizes relative to
the image frame. Such an operation creates an effect of a
different depth distance between the observer and the scene.
By scaling up the objects in the image relative to the image
frame, a distance-shortening effect is created and vice versa.
Secondly, by sliding a bounding window horizontally, an
effect of translational movement can also be created. As an
example illustrated in Fig. 4, to create images that are viewed
at a location slightly to the left of the original camera position,
the front image is cropped to the left, the rear image is cropped
to the right, the left image is scaled up, and the right image is
scaled down. The precise region of each bounding window is defined by the horizontal coordinate x and the vertical coordinate y as

i·Δx − W_k/2 ≤ x ≤ i·Δx + W_k/2,   −W_k/2 ≤ y ≤ W_k/2,
−W/2 ≤ x ≤ W/2,   −H/2 ≤ y ≤ H/2,   (1)

where W and H are the width and the height of the basis photo; W_0 denotes the dimension of the neutral bounding window. The coordinate origin is at the center of the photo. Integers i and k are the indices in the horizontal and the depth directions:

i = −i_max, …, −1, 0, 1, …, i_max,   k = −k_max, …, −1, 0, 1, …, k_max.   (2)

A larger k indicates a smaller bounding window and results in a shorter depth distance. The dimension of each square bounding window is defined as

W_k = W_0 − k·Δd.   (3)

The horizontal and the depth increments are defined as

Δx = (W − W_0)/(2·i_max),   Δd = (W_0 − w)/k_max,   (4)

where w defines the dimension of the smallest bounding window. The relationship between the indices in each set of four images couples all four windows to one consistent virtual offset; for an offset of m horizontal steps and n depth steps,

(i, k)_front = (−m, n),   (i, k)_rear = (m, −n),   (i, k)_left = (−n, m),   (i, k)_right = (n, −m).   (5)

Figure 3. Design of the vision system of our mobile robot: four low-resolution CMOS cameras installed around the robot, facing front, rear, left, and right.

Figure 4. Bounding windows on the four basis views (I: front, II: left, III: rear, IV: right) for the original photo location and for a virtual location offset to the left.
Thus, for each nodal location, the systematic image
treatment described in (1)–(5) is performed to generate large
amounts of image sets. As will be clear below, this action not
only increases the quantities of the training data, but also
enhances the success rate of place recognition by essentially
enlarging the admitted area of the nodal location without
physically collecting more data.
Following the image structuring process described above,
the four images are then collaged into one image as shown in Fig. 4. This simple process allows us to include sufficient visual features for better detection results, yet its computational demand is much lower than that of forming panorama images.
In real-time location detection, the collaging process needs to
be performed for each detection. However, it is fast enough to
be executed by an ordinary PC on-board the robot and achieves
real-time detection.
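To illustrate the structuring and collaging operation in code, the sketch below crops one bounding window from each of the four basis photos and tiles them into a single collage; the window geometry and the index coupling follow relations (1)–(5) as reconstructed above, so the exact signs and the 2×2 tiling layout should be read as assumptions rather than the authors' exact implementation.

```python
import numpy as np
import cv2


def crop_window(photo, i, k, w0, dx, dd):
    """Crop the square bounding window with horizontal index i and depth index k
    (origin at the photo center), then resize it back to the photo size.
    Uses W_k = W_0 - k*dd and a horizontal center offset of i*dx."""
    H, W = photo.shape[:2]
    wk = int(round(w0 - k * dd))              # window dimension, cf. (3)
    cx = W // 2 + int(round(i * dx))          # window center, horizontal
    cy = H // 2                               # window center, vertical
    x0 = max(0, cx - wk // 2)
    y0 = max(0, cy - wk // 2)
    x1 = min(W, x0 + wk)
    y1 = min(H, y0 + wk)
    return cv2.resize(photo[y0:y1, x0:x1], (W, H))


def collage_views(front, rear, left, right, m, n, w0, dx, dd):
    """Build one collaged training image for a virtual offset of m horizontal
    steps and n depth steps (index coupling per the reconstructed (5))."""
    tiles = [
        crop_window(front, -m, n, w0, dx, dd),    # I: front view
        crop_window(left, -n, m, w0, dx, dd),     # II: left view
        crop_window(rear, m, -n, w0, dx, dd),     # III: rear view
        crop_window(right, n, -m, w0, dx, dd),    # IV: right view
    ]
    collage = np.vstack([np.hstack(tiles[:2]), np.hstack(tiles[2:])])
    # Grey-scale and downsize to the 64 x 64 ConvNet input described in Sec. IV.B.
    return cv2.resize(cv2.cvtColor(collage, cv2.COLOR_BGR2GRAY), (64, 64))
```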
B. Comparison of the Real and the Simulated Views
As discussed earlier, the images structured and collaged
based on (1)–(5) essentially create the views seen by the
robot slightly drifting away from the original location where
the basis photos were taken. The relationship between the
processing parameters in (1)–(5) and the moving distances
in the real environment depends on the dimensions of the
environment and specifications of the lens. For example, the
actual views that the robot sees at a location about 0.3 m ahead
of the original observation location are shown in Fig. 5 (a), in
comparison with the result of image structuring and collaging
shown in Fig. 5 (b). It is found that a 0.3 m forward offset approximately corresponds to the maximum depth index k_max. In Fig. 5 (c), the actual views at a location about 0.3 m to the right of the original observation location are captured; the image resembles the one obtained with the maximum horizontal index i_max, as shown in Fig. 5 (d). Note that i_max stands for the maximum value of the horizontal index and, in this case, indicates that the bounding window has reached the edge of the basis photo. Such results suggest that the image structuring process, in principle, expands the original observation point to an area of 0.6 × 0.6 m². These images were obtained in a corridor that is around 3.5 m wide; the original observation location is approximately 1.0 m from the right wall.

Figure 5. Comparison of the real and the simulated views: (a) the real views at 0.3 m ahead of the original observation location, (b) the structured and collaged image with the maximum depth index, (c) the real views at 0.3 m to the right of the original observation location, (d) the structured and collaged image with the maximum horizontal index, and (e) basis images taken at the original observation location in the order of front, rear, left, right, respectively.

Figure 6. Sample structured and collaged images at various nodal locations of the tested route; each image contains four views from the four cameras. The right path is the original path, along which the training images were collected; nodal location detection was carried out for the middle and the left paths as well as the right path.
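As a rough sanity check of the correspondence between index steps and physical offsets discussed in this subsection, a pinhole-camera approximation can be used; the field of view and wall distance in the example are assumed values for illustration, not specifications reported in the paper.

```latex
t \approx \frac{s\,D}{f}, \qquad f = \frac{W/2}{\tan(\theta/2)},
```

where s is the horizontal shift of the bounding window in pixels, D the distance to the (roughly fronto-parallel) wall, W the image width in pixels, and θ the horizontal field of view. With assumed values W = 640 px, θ = 60°, and D ≈ 1.75 m (half the corridor width), f ≈ 554 px, so a 0.3 m lateral offset corresponds to a window shift of roughly 95 px, which fits comfortably within the 640-pixel frame.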
IV. EXPERIMENTS
A. Setup of Robot Localization Experiment
To demonstrate the proposed method, a real-time robot
localization experiment was conducted on an indoor route approximately 40 m long along a straight corridor in the
basement of a campus building as shown in Fig. 6. 21 nodes
were defined and evenly separated by a distance of 2 m as
shown in Fig. 6. To follow a right-going mode, we
commanded the robot to navigate on a path that is around 1 m from the right wall along the 3.5 m wide corridor. One of the
purposes of the experiment was to see whether the robot can
still recognize the nodal locations while slightly drifting away
from the original path. Later, we tested the nodal location
detection accuracy by navigating the robot along the middle
and the left paths of the corridor.
B. ConvNet Training
A route of 21 nodal locations results in a ConvNet
classification of 21 categories. The architecture of the
ConvNet as shown in Fig. 2 consists of the input layer, two
convolution layers, each followed by a maximum pooling
layer, two flattened layers, and the output layer. For the input
layer, every collaged RGB image is converted to 64 64
pixels and grey scale. 16 filters of a size of 55 are used for
the first convolution layer; 36 filters of a size of 55 are used
for the second convolution layer. The total number of trainable
parameters is 1,198,137. At each nodal location, 5041 collaged
images were prepared according to (1)–(5); the collaged images represent an enlarged nodal area that covers a grid map of 71 × 71 offset locations. A total of 105,861 training images was used for the 21-label classification. The
ConvNets were trained on a 3.60 GHz Intel Core i7-7700 CPU
with a RAM of 16.0 GB and an NVIDIA GeForce GTX
1080Ti. The total time for completing the training process of
the 21-node route was approximately 20 min; a training
accuracy of approximately 99% was obtained.
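For reference, a minimal Keras sketch matching this description (64 × 64 grey-scale input, two 5 × 5 convolution layers with 16 and 36 filters, each followed by max pooling, a 128-unit fully connected layer, and a 21-way softmax output) is given below; it is an illustrative reconstruction based on the text, not the authors' released code, and hyperparameters such as the optimizer are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_node_classifier(num_nodes=21):
    """ConvNet for nodal-location classification as described in Sec. IV.B."""
    model = models.Sequential([
        layers.Input(shape=(64, 64, 1)),                                # collaged grey-scale image
        layers.Conv2D(16, (5, 5), padding="same", activation="relu"),  # Conv_1
        layers.MaxPooling2D((2, 2)),                                    # Pool_1
        layers.Conv2D(36, (5, 5), padding="same", activation="relu"),  # Conv_2
        layers.MaxPooling2D((2, 2)),                                    # Pool_2
        layers.Flatten(),                                               # 16 * 16 * 36 = 9216 features
        layers.Dense(128, activation="relu"),
        layers.Dense(num_nodes, activation="softmax"),                  # one class per nodal location
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The trainable parameter count of this sketch is on the order of 1.2 million, in line with the figure reported above.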
C. Nodal Location Detection Result
The nodal location detection was conducted along three
paths: the right (original) path, the middle path, and the left path, as shown in Fig. 6. The right path is 1 m from the right wall, the middle path is in the middle of the 3.5 m wide corridor, and the left path is 1 m from the left wall. When the operator
controlled the robot to navigate along the designated path, the
four on-board cameras captured the surrounding views at a
frame rate of 20 fps. The on-board computer processed the
images and took 0.14 s on average to complete a detection. The
robot travelled at an average speed of 1.2 m/s. In total, 13,000 collaged images were produced in the detection test, around 4,300 images per path. For every nodal location, precision and
recall were evaluated and the results are shown in Fig. 7. The
precision and recall averages of each path are summarized in
Table I. It is worth noting that the right and the middle paths resulted in comparable precision and recall rates, whereas the left path showed substantially lower values. Thus, the result of the experiment indicates that the proposed method produces an effective detection width of approximately 0.75 m (between the right and the middle paths), which is quite sufficient for our robot, which has a body width of 0.6 m.
TABLE I. AVERAGE PRECISION AND RECALL OF VARIOUS PATHS

                              Right    Middle    Left
Precision   Structured        0.91     0.89      0.65
            Non-Structured    0.92     0.83      0.61
Recall      Structured        0.85     0.89      0.67
            Non-Structured    0.87     0.87      0.62
Figure 7. Experimental results of the nodal location detection test: (a) precision of structured-and-collaged image training, (b) recall of structured-and-collaged image training, (c) precision of single image training, (d) recall of single image training. For the proposed method, the precision and recall were obtained for three different paths: the right (original), the middle, and the left. The training images were collected along the right path; the results show that the middle path also has good precision and recall rates. As a benchmark for comparison, detection performance based on single views was also obtained, in which each detector was trained and made predictions based on a single view. The single view detection performed poorly. (Horizontal axis: nodal location index 0–20; vertical axis: precision or recall, 0–1.)
Videos of the experiment can be found at: https://www.youtube.com/watch?v=-Eqi3adxoKQ.
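The per-node precision and recall plotted in Fig. 7 and averaged in Table I can be obtained from the logged predictions with a standard per-class evaluation; the sketch below uses scikit-learn, and the file names and variables are illustrative assumptions rather than artifacts of the experiment.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support


def per_node_metrics(y_true, y_pred, num_nodes=21):
    """Per-class (per-nodal-location) precision and recall plus path averages.
    y_true and y_pred are arrays of node indices in the range [0, num_nodes)."""
    precision, recall, _, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=np.arange(num_nodes), zero_division=0)
    return precision, recall, precision.mean(), recall.mean()


# Illustrative usage with hypothetical detection logs from one path:
# y_true = np.load("right_path_true.npy"); y_pred = np.load("right_path_pred.npy")
# p, r, p_avg, r_avg = per_node_metrics(y_true, y_pred)
```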
For comparison, we also tested the ConvNet performance
trained with non-structured collaged images, in which the four views composing each collaged image were randomly selected without following (5) (see Table I). By comparing the
two, one may find that the structured images enhance precision
and recall of the middle and the left paths by a margin of
approximately 5%. As a benchmark for comparison, detection
by single views was also tested, in which the ConvNet was
trained by just one of the four views without any collaging
operation. Similar rescaling and sliding window operations
were applied; 4800 training images were obtained for each
nodal location. The resulting precision and recall rates are
shown in Fig. 7 (c) & (d) based on each of the four views. It is
not surprising that the single view detection performs poorly.
Considering the minimal effort required for implementing
this method, precision and recall rates in the range of 0.85 to 0.91 are quite satisfactory. During the test, the mobile robot could not
be precisely controlled for direction and speed, and was
inevitably subject to vibration and minor view-point
differences. Thus, the result, to some extent, confirms the robustness of the method in robot applications. The detection accuracy of the ConvNet can be further raised with a larger training image set; however, this comes at the cost of human time for data collection.
V. CONCLUSION
Mobile robot localization requires an effective and easy-to-
implement method in feature-barren environments such as the indoor corridors of buildings. In this paper, we present a
visual-based deep learning method of place recognition for
real-time applications in mobile robot localization.
Specifically, we trained the robot to memorize the global
visual features associated with each nodal location. We
trained the ConvNet for nodal location classification with
deliberately structured and collaged images obtained from
only four basis images of the front, rear, left, and right views
captured at each nodal location. The training data preparation
scheme automatically structures the four views and creates
new sets of views that simulate slight offsets in location.
Thus, sufficient quantities of training data are generated, and
the detector can admit larger nodal areas for better visual
detection results. The image collaging process also effectively
increases the feature dimensionality, yet maintains the real-
time capability. In a robot localization test which involves an
indoor route of 40 m and 21 nodal locations, the average
precision and recall of nodal location detections along the
original path were 0.91 and 0.85, respectively, and maintained
0.89 and 0.89 along a path that is 0.75 m to the left. The robot
utilized an ordinary PC and completed each detection within an average of 0.14 s while travelling at a speed of 1.2 m/s.
ACKNOWLEDGMENT
This work was supported by the National Taipei University
of Technology King Mongkut’s University of Technology
Thonburi Joint Research Program: NTUT-KMUTT-107-02.
REFERENCES
[1] A. A. Panchpor, S. Shue, and J. M. Conrad, "A survey of methods for mobile robot localization and mapping in dynamic indoor environments," in 2018 Conf. Signal Process. Comm. Eng. Sys. (SPACES), pp. 138–144, 2018.
[2] N. Piasco, D. Sidibé, C. Demonceaux, and V. Gouet-Brunet, "A survey on visual-based localization: on the benefit of heterogeneous data," Pattern Recognition, vol. 74, pp. 90–109, 2018.
[3] S. Se, D. Lowe, and J. Little, "Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks," Int. J. Robotics Research, vol. 21, pp. 735–758, 2002.
[4] A. Oliva and A. Torralba, "Modeling the shape of the scene: a holistic representation of the spatial envelope," Int. J. Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Advances Neural Infor. Process. Sys. (NIPS), 2012.
[6] Y. Chen, Y. Shen, X. Liu, and B. Zhong, "3D object tracking via image sets and depth-based occlusion detection," Signal Process., vol. 112, pp. 146–153, 2015.
[7] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, 2015.
[8] J. Schmidhuber, "Deep learning in neural networks: an overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[9] N. Sünderhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford, "On the performance of ConvNet features for place recognition," in Proc. IEEE Int. Conf. Intell. Robots Systems (IROS), 2015.
[10] Y. Qiao, C. Cappelle, Y. Ruichek, and F. Dornaika, "Visual localization based on sequence matching using ConvNet features," in Proc. IECON 2016, Florence, Italy, 2016.
[11] P. Fischer, A. Dosovitskiy, and T. Brox, "Descriptor matching with convolutional neural networks: a comparison to SIFT," arXiv:1405.5769, 2014.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Computer Vision Pattern Recognition (CVPR), 2014.
[13] S. Yang and S. Scherer, "Monocular object and plane SLAM in structured environments," arXiv:1809.03415v1 [cs.RO], 2018.
[14] A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum, "Panorama to panorama matching for location recognition," arXiv:1704.06591v1 [cs.CV], 2017.
[15] O. Mendez, S. Hadfield, N. Pugeault, and R. Bowden, "SeDAR - semantic detection and ranging: humans can localise without LiDAR, can robots?," in 2018 IEEE Int. Conf. Robotics Automation (ICRA), 2018.
[16] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, "Place recognition with ConvNet landmarks: viewpoint-robust, condition-robust, training-free," in Proc. Robotics Sci. Sys. XII, 2015.
[17] P. Chakravarty, P. Narayanan, and T. Roussel, "GEN-SLAM: generative modeling for monocular simultaneous localization and mapping," arXiv:1902.02086v1 [cs.CV], 2019.
[18] M. Milford and G. Wyeth, "SeqSLAM: visual route-based navigation for sunny summer days and stormy winter nights," in Proc. IEEE Int. Conf. Robotics Automation (ICRA), 2012.
[19] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in Proc. IEEE Conf. Computer Vision Pattern Recog. (CVPR), 2016.
[20] A. D'Innocente, F. M. Carlucci, M. Colosi, and B. Caputo, "Bridging between computer and robot vision through data augmentation: a case study on object recognition," arXiv:1705.02139, 2017.
[21] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Is object localization for free? Weakly-supervised learning with convolutional neural networks," in Proc. IEEE Conf. Computer Vision Pattern Recog. (CVPR), 2015.
We are surrounded by plenty of information about our environment. From these multiple sources, numerous data could be extracted: set of images, 3D model, coloured points cloud. When classical localization devices failed (e.g. GPS sensor in cluttered environments), aforementioned data could be used within a localization framework. This is called Visual Based Localization (VBL). Due to numerous data types that can be collected from a scene, VBL encompasses a large amount of different methods. This paper presents a survey about recent methods that localize a visual acquisition system according to a known environment. We start by categorizing VBL methods into two distinct families: indirect and direct localization systems. As the localization environment is almost always dynamic, we pay special attention to methods designed to handle appearances changes occurring in a scene. Thereafter, we highlight methods exploiting heterogeneous types of data. Finally, we conclude the paper with a discussion on promising trends that could permit to a localization system to reach high precision pose estimation within an area as large as possible.