Abstract— This paper demonstrates a place recognition and
localization method designed for automated guidance of mobile
robots. Collecting and annotating sufficient images for a
supervised deep learning model is often exhausting work.
Devising an effective visual detection scheme for a mobile robot
location detection job in a feature-barren environment such as
the indoor corridors of buildings is also quite challenging. To
address these issues, a supervised deep learning model for the
spatial coordinate detection of a mobile robot is proposed here.
Specifically, a novel technique is introduced involving
structuring and collaging of the surrounding views obtained by
the on-board cameras for the training data preparation. A
system linking robot kinematics and image processing provides
automatic data annotation, which significantly reduces the need
for human work on data preparation. Experimental evidence
showed that the precision and recall rates of the location
coordinate detection are 0.91 and 0.85, respectively. Also, the
detection appeared to be effective over a path width of 0.75 m,
which is sufficient to cover the possible deviations from the
target path. Furthermore, it took an average of 0.14 s for each visual
detection performed by an ordinary PC on-board the mobile
robot; thus, real-time navigation using the proposed method is feasible.

I. INTRODUCTION

Autonomous mobility is a key capability of mobile
robots, which are expected to perform real-time obstacle
avoidance, path planning, and place recognition without
human assistance. Knowing the position where the
robot is located is essential to autonomous mobility; a
reliable and effective method for such automated localization
tasks is of great use. A common and important task of a mobile
service robot is to navigate and localize itself along the correct
path to the designated destination. For example, when a
hospital robot is dispatched to a specific station, the robot
should first determine the routes to take, then navigate along
the routes while constantly confirming its location/coordinates,
and finally reach the goal. Often the indoor routes consist of
lengthy corridors such as the one shown in Fig. 1, which lacks
sufficient visual features for effective visual detection. It is
also more convenient and efficient for navigation control if
the localization system outputs precise spatial coordinates of
the robot based on the result of the visual detection.
Much work has been done on Simultaneous Localization
And Mapping (SLAM); however, an easy-to-implement
method that provides accurate and precise coordinates of the
mobile robot is yet to come. In this paper, we address the
problem of indoor localization in a feature-barren environment
such as corridors with a simple and repeating appearance
(see Fig. 1). Specifically, we devise a method which allows the
robot to remember visual features associated with each set of
strictly defined spatial coordinates. The robot can then locate
itself along the learned paths based on the views that it sees.
The idea is similar to a teacher walking through a school
campus with new students and introducing the environment to
them. If the robots (the students) can automatically learn to
remember and associate the scenes with internally constructed
spatial coordinates, the operator does not need any
engineering skills to manually construct the map for every
application. Common and tedious work associated with
environmental setup, such as marking and taping, is also no
longer needed.
Here we propose a visual localization system that
combines the superior visual feature-learning capability of
Convolutional Neural Network (ConvNet) with designed
kinematic patterns of the mobile robot to form our framework
of image data preparation and coordinate-articulate SLAM.
By taking photos following the pre-determined kinematic rules,
images are acquired at every nodal point and automatically
annotated with precise coordinates. To increase the
dimensionality of visual features, we collage four views – front,
rear, left, and right into one image, yet computation is managed
to be able to run in real time. To further enhance the detection
accuracy, a data augmentation technique is introduced based
on a structured image collaging operation. Low-cost CMOS
cameras prove to be sufficient in our proposed system; the
computation speed of image processing operations such as collaging and
resizing also proves high enough for real-time detection.

Real-time Visual-Based Localization for Mobile Robot Using
Structured-View Deep Learning

Yi-Feng Hong, Yu-Ming Chang, and Chih-Hung G. Li*, Member, IEEE

*Yi-Feng Hong, Yu-Ming Chang, and Chih-Hung G. Li are with the
Graduate Institute of Manufacturing Technology, National Taipei University
of Technology, Taipei, 10608 Taiwan ROC (phone: +886-2-2771-2171 ext.
2092; fax: +886-2-2776-4889; e-mail: cL4e@ntut.edu.tw).

Figure 1. Left: The proposed visual-based localization system aims at
providing real-time spatial coordinate predictions of the mobile robot in
feature-barren environments such as the indoor corridors shown in the
photos. Right: The mobile robot built by the authors and used in the study.
The core contribution of this work is that we introduce a
novel visual localization system that fuses deep learning with
a structured visual data preparation scheme for automatic
training data preparation and direct spatial coordinate
detection of a mobile robot (see Fig. 2). Specifically, we
introduce an innovative image structuring and collaging operation
prior to the ConvNet. The training and detection processes are
highly automated, and even non-skilled personnel can easily
set up the system for every application.
The rest of the paper is organized as follows. Section II
describes related prior work in visual-based localization and
ConvNet. Section III details the proposed method. Section IV
reports our experimental results. Section V details comparison
and discussion. The conclusion is in Section VI.
II. RELATED WORK
Various methods have been adopted for mobile robot
localization, e.g., EKF-SLAM, FastSLAM, GraphSLAM, and
visual-based localization, using devices such as GPS,
LIDAR, RFID, magnetic tape, QR Code, Wi-Fi, and RGB-D
cameras. Visual-based localization generally uses local
features or global features; local features describe, at the
pixel level, a local neighborhood of several points in
the image, such as SIFT. Global features consider the
image as a whole and produce one signature with high
dimensionality; e.g., Oliva and Torralba [4] proposed a
Spatial Envelope that represents the dominant spatial structure
of a scene. It has been shown that deep ConvNet is powerful
in learning global features into a single vector for classification.
ConvNet gives the artificial neural network a higher
abstraction ability, making it possible to identify homogeneous
but highly variable signals. For example, place
recognition subjected to dramatic seasonal changes was
accomplished using a ConvNet. Experimental
evidence also showed that ConvNet clearly out-performs SIFT
on descriptor matching, as well as other methods in visual
recognition, classification, and detection.
Recently, many approaches adopting ConvNet for visual
localization or place recognition have been proposed. Yang
and Scherer [13] presented a monocular SLAM that infers 3D
objects and layout planes based on a combination of ConvNet
and geometric modelling. Iscen et al. [14] introduced a
panorama-to-panorama matching process with stitched images;
in their findings, four views of a scene are enough to obtain a
recall of up to 99%. Mendez et al. [15] presented a global
localization approach that relies solely on the ConvNet-based
semantic labels present in the floorplan and extracted from
RGB images. Sunderhauf et al. [16] presented an object
proposal scheme to identify potential landmarks within an
image, utilizing ConvNet features as robust landmark
descriptors. Their method goes through a sophisticated
landmark matching process and is not capable of operating in
real time (1.8 s per image). As our focus of indoor corridors
possesses similar and repeating features such as walls, doors,
windows, bulletin boards, etc., the composition is essentially
very different from city views, which are full of
distinguishable landmarks (see Fig. 1). Chakravarty et al. [17]
used a generative model to learn the joint distribution between
RGB and depth images; they conditioned the depth generation
on the location of the image, which is obtained using a
standard ConvNet, trained to output the camera location as one
of N topological nodes from the environment. The depth
information can be used to plan collision-free paths through an
environment. To further improve the performance of place
recognition in route-based navigation, sequence matching was
proposed, in which localization was achieved by recognizing
coherent sequences of the local best matches [10], [18]. By
introducing a NetVLAD layer inside a ConvNet, better results
on the photo-querying problem were reported [19].
Albeit rather successful, ConvNet, as a branch of
supervised learning, faces some major problems in practice.
For one, large quantities of training data are needed; as a rule
of thumb, the more the merrier. To remedy the
shortage of image data, D’Innocente et al. [20] proposed a data
augmentation layer that zooms in on the object of interest and
simulates the object detection outcome of a robot vision
system. Oquab et al. [21] also pointed out that detailed image
annotation, e.g., by object bounding boxes, is both expensive
and often subjective. To be relieved from heavily relying on
manually annotated data, they introduced multi-scale sliding-
window training to train from images rescaled to multiple
different sizes and to treat fully connected adaptation layers as
convolutions. Inspired by their approaches, here we propose
an image structuring and collaging method to enrich the
training data set that is needed for supervised deep learning.
As will be clearer below, our training data preparation
framework not only generates the required quantities, but also
provides coordinate-articulate image data.

Figure 2. Architecture of the proposed visual localization system. The mobile robot captures four images in real time and activates an image structuring
and collaging operation. The collaged images are trained by a ConvNet for direct spatial coordinate output. In the example, 21 location coordinates along
a route are set up as the output. (Pipeline stages: capturing four views surrounding the mobile robot; structuring and collaging of the surrounding views;
holistic deep learning and prediction of the collaged views.)
III. PROPOSED METHOD
We assume the mobile robot is capable of autonomous
navigation functions such as obstacle avoidance and heading
maintenance. Here, we focus on the problem of recognizing nodal
locations along a given route using nothing but visual features.
First, we divide each route into short segments, each of a
length of approximately 0.1 – 2 m. Connecting the segments
are the spatial nodes that serve as landmarks, whose
surrounding views are to be memorized by the robot. As the
robot navigates, it will constantly check its current nodal
location by detecting the surrounding views through the
ConvNet pipeline. Thus, the robot reaches the goal by
detecting the destination node; the middle nodes assist in
recognition of the current status and confirmation of the
correct path. As shown in Fig. 3, our robot is equipped with 4
low-resolution (640 × 480 pixels) CMOS cameras, installed
around the robot at a height of 41 cm from the ground facing
front, rear, left, and right. Typical images obtained by the four
cameras are shown in Fig. 4. To memorize the surrounding
views of each nodal location, a photo is taken by each of the
4 cameras at every node. Note that at every node, we only take
4 photos and use them as the basis of the training image data;
thus, the preparation operation for each new application is
simple. For each new route, the operator only needs to take
the robot to go through the pre-determined route once; the
program automatically collects the images at the nodal
locations sequentially based on the pre-defined kinematic rule
of the robot, e.g., setting a node every 2 m while travelling at
1 m/s, or a node every 1 m while traveling at 0.5 m/s. Such
small quantities of image data are by themselves insufficient for a successful
supervised deep learning model. As will be clearer below, our
proposed image structuring and collaging operation generates
thousands of training images for each nodal location based on
the four basis photos. With deliberate bounding window
arrangements, these new images simulate the views of the
observer slightly drifting away from the original robot
location; thus, the effective area of location detection can be
enlarged. The image structuring and collaging process is
detailed as follows.
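The kinematics-based collection rule described above amounts to a distance-triggered capture loop. Below is a minimal Python sketch of that loop; the robot interface (`odometry()`, `capture_all_views()`, `route_finished()`) is hypothetical and stands in for whatever the on-board control software actually provides.

```python
import math

def collect_node_images(robot, node_spacing=2.0):
    """Drive the pre-determined route once and capture the four basis
    photos (front, rear, left, right) every `node_spacing` meters.
    The odometry-derived node coordinates double as the training label,
    so no manual annotation is required."""
    dataset = []
    node_index = 0
    last_capture = robot.odometry()                  # (x, y) estimate
    dataset.append((node_index, last_capture, robot.capture_all_views()))
    while not robot.route_finished():
        x, y = robot.odometry()
        lx, ly = last_capture
        if math.hypot(x - lx, y - ly) >= node_spacing:
            node_index += 1
            last_capture = (x, y)
            dataset.append((node_index, (x, y), robot.capture_all_views()))
    return dataset
```

With `node_spacing = 2.0`, a 40 m route yields the 21 nodal locations used in the experiment of Section IV.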
A. Image Structuring and Collaging
To increase the number of training images and to enhance
the success rate of place recognition, the following data preparation
and augmentation techniques are proposed. First, image
rescaling was used to generate various object sizes relative to
the image frame. Such an operation creates an effect of a
different depth distance between the observer and the scene.
By scaling up the objects in the image relative to the image
frame, a distance-shortening effect is created and vice versa.
Secondly, by sliding a bounding window horizontally, an
effect of translational movement can also be created. As an
example illustrated in Fig. 4, to create images that are viewed
at a location slightly to the left of the original camera position,
the front image is cropped to the left, the rear image is cropped
to the right, the left image is scaled up, and the right image is
scaled down. The precise region of each bounding window is
defined by the horizontal and the vertical intervals

$x \in \left[\, i\,\Delta x - \tfrac{d_k}{2},\; i\,\Delta x + \tfrac{d_k}{2} \,\right]$,  (1)

$y \in \left[\, -\tfrac{d_k}{2},\; \tfrac{d_k}{2} \,\right]$,  (2)

subject to $|i\,\Delta x| + d_k/2 \le W/2$ and $d_k/2 \le H/2$, where $W$ and $H$ are the width and the height of the basis
photo and $d_0$ denotes the dimension of the neutral bounding
window. The coordinate origin is at the center of the photo.
Integers $i$ and $k$ are the indices in the horizontal and the depth
directions, respectively. A larger $k$ indicates a smaller bounding window and results in
a shorter depth distance. The dimension of each square
bounding window is defined as

$d_k = d_0 - k\,\Delta d$.  (3)
The horizontal and the depth increments are defined as

$\Delta x = \dfrac{W - d_0}{2\,i_{\max}}$,  $\Delta d = \dfrac{d_0 - w}{k_{\max}}$,  (4)

where $w$ defines the dimension of the smallest bounding
window. The relationship between the indices in each set of
the four images is

$i_{\mathrm{I}} = -\,i_{\mathrm{III}}$,  $k_{\mathrm{II}} = -\,k_{\mathrm{IV}}$,  (5)

so that, for example, a leftward offset crops the front view (I) to the left and the rear view (III) to the right while scaling up the left view (II) and scaling down the right view (IV).

Figure 4. Illustration of the image structuring and collaging process that
integrates four images into one (I: front, II: left, III: rear, IV: right; the
legend markers indicate the original photo location and the virtual location
of an offset to the left). By rescaling and sliding the bounding window,
images resembling the views at offset locations can be generated. The
current example shows an offset to the left of the original camera location.

Figure 3. Design of the vision system of our mobile robot: 4 low-
resolution CMOS cameras installed around the robot facing front, rear,
left, and right.
Thus, for each nodal location, the systematic image
treatment described in (1) – (5) is performed to generate a large
number of image sets. As will be clear below, this action not
only increases the quantities of the training data, but also
enhances the success rate of place recognition by essentially
enlarging the admitted area of the nodal location without
physically collecting more data.
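A possible implementation of the bounding-window treatment described above is sketched below. The function names are illustrative, the nearest-neighbour resize is a stand-in for a library call (e.g., OpenCV's `cv2.resize`), and the square-window arithmetic assumes the formulation in which a larger depth index shrinks the window.

```python
import numpy as np

def nearest_resize(img, size):
    """Minimal nearest-neighbour resize (a stand-in for cv2.resize)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def offset_view(photo, i, k, d0, dx, dd, out_size):
    """Crop a square bounding window of side d0 - k*dd, shifted
    horizontally by i*dx from the photo center, then resize it back to
    out_size x out_size.  A larger k shrinks the window (zooming in,
    i.e., a shorter depth distance); a nonzero i simulates a lateral
    offset of the observer."""
    H, W = photo.shape[:2]
    dk = d0 - k * dd                  # window side length
    cx = W // 2 + i * dx              # horizontally shifted center
    cy = H // 2
    half = dk // 2
    window = photo[cy - half:cy + half, cx - half:cx + half]
    return nearest_resize(window, out_size)
```

Sweeping `i` and `k` over their index ranges for all four basis photos produces the grid of offset views used as training data.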
Following the image structuring process described above,
the four images are then collaged to be one image as shown in
Fig. 4. This simple process allows us to include sufficient
visual features for better detection results, yet demands much
less computation power than forming panorama images.
In real-time location detection, the collaging process needs to
be performed for each detection. However, it is fast enough to
be executed by an ordinary PC on-board the robot and achieves
real-time performance.
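Assuming the four processed views share the same size, the collaging step itself is no more than tiling them into a 2 × 2 grid, following the I–IV arrangement of Fig. 4; the resulting collage is then converted to grey scale and downsized for the ConvNet input.

```python
import numpy as np

def collage_views(front, left, rear, right):
    """Tile four equally sized views into one 2x2 collage
    (I: front, II: left, III: rear, IV: right)."""
    top = np.hstack([front, left])        # I  | II
    bottom = np.hstack([rear, right])     # III | IV
    return np.vstack([top, bottom])
```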
B. Comparison of the Real and the Simulated Views
As discussed earlier, the images structured and collaged
based on (1) – (5) essentially recreate the views seen by the
robot when it drifts slightly away from the original location where
the basis photos were taken.
processing parameters in (1) – (5) and the moving distances
in the real environment depends on the dimensions of the
environment and specifications of the lens. For example, the
actual views that the robot sees at a location about 0.3 m ahead
of the original observation location are shown in Fig. 5 (a), in
comparison with the result of image structuring and collaging
shown in Fig. 5 (b). It is found that 0.3 m ahead
approximately corresponds to $i = 0$ and $k = k_{\max}$. In Fig. 5 (c), the actual
views at a location about 0.3 m to the right of the original
observation location are captured; it is found that the image
resembles the one obtained with $i = i_{\max}$ and $k = 0$, as shown
in Fig. 5 (d). Note that $i_{\max}$ stands for the maximum value of the
index $i$ and, in this case, indicates that the bounding window has
reached the edge of the basis photo. Such results suggest that the
image structuring process, in principle, expands the original
observation point to an area of 0.6 × 0.6 m². These images
were obtained at a corridor that is around 3.5 m wide; the
original observation location is approximately 1.0 m to the right wall.

Figure 5. Comparison of the real and the simulated views: (a) the real views at 0.3 m ahead of the original observation location, (b) the structured and collaged
image with $i = 0$ and $k = k_{\max}$, (c) the real views at 0.3 m to the right of the original observation location, (d) the structured and collaged image with
$i = i_{\max}$ and $k = 0$, and (e) basis images taken at the original observation location in the order of front, rear, left, right, respectively.

Figure 6. Sample structured and collaged images at various nodal locations of the tested route; each image contains four views from the four cameras. The
right path is the original path, along which the training images were collected; nodal location detection was carried out for the middle and the left paths as
well as the right path.

IV. EXPERIMENTAL RESULTS
A. Setup of Robot Localization Experiment
To demonstrate the proposed method, a real-time robot
localization experiment was conducted on an indoor route of
approximately 40 m long along a straight corridor in the
basement of a campus building. 21 nodes were defined and
evenly separated by a distance of 2 m as shown in Fig. 6.
To follow a right-going mode, we
commanded the robot to navigate on a path that is around 1 m
to the right wall along the 3.5 m wide corridor. One of the
purposes of the experiment was to see whether the robot can
still recognize the nodal locations while slightly drifting away
from the original path. Later, we tested the nodal location
detection accuracy by navigating the robot along the middle
and the left paths of the corridor.
B. ConvNet Training
A route of 21 nodal locations results in a ConvNet
classification of 21 categories. The architecture of the
ConvNet as shown in Fig. 2 consists of the input layer, two
convolution layers, each followed by a maximum pooling
layer, two flattened layers, and the output layer. For the input
layer, every collaged RGB image is converted to 64 × 64
pixels and grey scale. 16 filters of size 5 × 5 are used for
the first convolution layer; 36 filters of size 5 × 5 are used
for the second convolution layer. The total number of trainable
parameters is 1,198,137. At each nodal location, 5041 collaged
images were prepared according to (1) – (5); the collaged
images represent an enlarged nodal area that covers a grid map
of 71 × 71 offset locations. A total of 105,861
training images was used for the 21-label classification.
ConvNets were trained on a 3.60 GHz Intel Core i7-7700 CPU
with a RAM of 16.0 GB and an NVIDIA GeForce GTX
1080Ti. The total time for completing the training process of
the 21-node route was approximately 20 min; a training
accuracy of approximately 99% was obtained.
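The convolutional stage of the stated architecture can be checked arithmetically. The sketch below reproduces the layer shapes of Fig. 2 and the corresponding parameter counts; the widths of the fully connected (flattened) layers are not given in the text, so the remainder of the stated 1,198,137 trainable parameters is assumed to sit in those layers.

```python
def conv2d_params(in_ch, out_ch, ksize):
    """Trainable parameters of a 2-D convolution layer (weights + biases)."""
    return out_ch * (in_ch * ksize * ksize + 1)

# Input: 64 x 64 grey-scale collage (1 channel); 'same' padding assumed.
p_conv1 = conv2d_params(1, 16, 5)   # conv1: 16 filters 5x5 -> 64x64x16
                                    # 2x2 max pooling       -> 32x32x16
p_conv2 = conv2d_params(16, 36, 5)  # conv2: 36 filters 5x5 -> 32x32x36
                                    # 2x2 max pooling       -> 16x16x36
flat = 16 * 16 * 36                 # 9216 features after flattening
```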
C. Nodal Location Detection Result
The nodal location detection was conducted along three
paths – the right (original) path, the middle path, and the left
path as shown in Fig. 6. The right path is 1 m to the right wall,
the middle path is in the middle of the 3.5 m wide corridor, and
the left path is 1 m to the left wall. When the operator
controlled the robot to navigate along the designated path, the
four on-board cameras captured the surrounding views at a
frame rate of 20 fps. The on-board computer processed the
images and took an average of 0.14 s to complete a detection. The
robot travelled at an average speed of 1.2 m/s. In total, 13,000
collaged images were produced in the detection test, around
4300 images per path. For every nodal location, precision and
recall were evaluated and the results are shown in Fig. 7. The
precision and recall averages of each path are summarized in
Table I. It is worth noting that the right and the middle paths
resulted in comparable precision and recall rates, whereas the
left path showed substantially lower values. Thus, the result of
the experiment indicates that the proposed method produces an
effective detection width of approximately 0.75 m (between
the right and the middle paths), which is quite sufficient for
our robot with a body width of 0.6 m. Videos of the
experiment can be found at the website addresses:

TABLE I. AVERAGE PRECISION AND RECALL OF VARIOUS PATHS

Figure 7. Experimental results of the nodal location detection test: (a) precision of structured-and-collaged image training, (b) recall of structured-and-
collaged image training, (c) precision of single image training, (d) recall of single image training. For the proposed method, the precision and recall were
obtained for three different paths – the right (original), the middle, and the left. The training images were collected along the right path; the results show
that the middle path also has good precision and recall rates. As a benchmark for comparison, detection performance based on single views was also
obtained, in which each detector was trained and made predictions based on a single view. The single view detection performed poorly.
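The per-node precision and recall reported in Fig. 7 and Table I follow the standard per-class definitions; a minimal sketch of the evaluation over a detection log is:

```python
from collections import Counter

def per_node_precision_recall(true_nodes, pred_nodes):
    """Per-class precision and recall over paired ground-truth and
    predicted nodal locations."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_nodes, pred_nodes):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1          # p was predicted but wrong
            fn[t] += 1          # t was missed
    nodes = set(true_nodes) | set(pred_nodes)
    precision = {n: tp[n] / (tp[n] + fp[n]) if tp[n] + fp[n] else 0.0
                 for n in nodes}
    recall = {n: tp[n] / (tp[n] + fn[n]) if tp[n] + fn[n] else 0.0
              for n in nodes}
    return precision, recall
```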
V. COMPARISON AND DISCUSSION

For comparison, we also tested the ConvNet performance
trained with non-structured collaged images, in which the
images constructing each collaged image were randomly
selected without following (5) (see Table I). By comparing the
two, one may find that the structured images enhance precision
and recall of the middle and the left paths by a margin of
approximately 5%. As a benchmark for comparison, detection
by single views was also tested, in which the ConvNet was
trained by just one of the four views without any collaging
operation. Similar rescaling and sliding window operations
were applied; 4800 training images were obtained for each
nodal location. The resulting precision and recall rates are
shown in Fig. 7 (c) & (d) based on each of the four views. It is
not surprising that the single view detection performs poorly.
Considering the minimal effort required to implement
this method, precision and recall in the range of 0.85 to 0.91 are
quite satisfactory. During the test, the mobile robot could not
be precisely controlled for direction and speed, and was
inevitably subject to vibration and minor view-point
differences. Thus, the result, to some extent, confirms the
robustness of the method in robot applications. The detection accuracy of the
ConvNet can be further raised by a larger training image set;
however, this comes at the cost of human time for data collection.
VI. CONCLUSION

Mobile robot localization requires an effective and easy-to-
implement method in feature-barren environments such as the
indoor corridors in buildings. In this paper, we present a
visual-based deep learning method of place recognition for
real-time applications in mobile robot localization.
Specifically, we trained the robot to memorize the global
visual features associated with each nodal location. We
trained the ConvNet for nodal location classification with
deliberately structured and collaged images obtained from
only four basis images of the front, rear, left, and right views
captured at each nodal location. The training data preparation
scheme automatically structures the four views and creates
new sets of views that imply slight offsets in the locations.
Thus, sufficient quantities of training data are generated, and
the detector can admit larger nodal areas for better visual
detection results. The image collaging process also effectively
increases the feature dimensionality, yet maintains the real-
time capability. In a robot localization test which involves an
indoor route of 40 m and 21 nodal locations, the average
precision and recall of nodal location detections along the
original path were 0.91 and 0.85, respectively, and remained at
0.89 and 0.89 along a path 0.75 m to the left. The robot
utilized an ordinary PC and completed each detection within
an average of 0.14 s while travelling at a speed of 1.2 m/s.
ACKNOWLEDGMENT

This work was supported by the National Taipei University
of Technology – King Mongkut’s University of Technology
Thonburi Joint Research Program: NTUT-KMUTT-107-02.
REFERENCES

[1] A. A. Panchpor, S. Shue, and J. M. Conrad, “A survey of methods for mobile robot localization and mapping in dynamic indoor environments,” in 2018 Conf. Signal Process. Comm. Eng. Sys. (SPACES), pp. 138–144, 2018.
[2] N. Piasco, D. Sidibé, C. Demonceaux, and V. Gouet-Brunet, “A survey on visual-based localization: on the benefit of heterogeneous data,” Pattern Recognition, vol. 74, pp. 90–109, 2018.
[3] S. Se, D. Lowe, and J. Little, “Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks,” Int. J. Robotics Research, vol. 21, pp. 735–758, 2002.
[4] A. Oliva and A. Torralba, “Modeling the shape of the scene: a holistic representation of the spatial envelope,” Int. J. Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Advances Neural Inf. Process. Sys. (NIPS), 2012.
[6] Y. Chen, Y. Shen, X. Liu, and B. Zhong, “3D object tracking via image sets and depth-based occlusion detection,” Signal Process., vol. 112, pp. 146–153, 2015.
[7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
[8] J. Schmidhuber, “Deep learning in neural networks: an overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
[9] N. Sunderhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford, “On the performance of ConvNet features for place recognition,” in Proc. IEEE Int. Conf. Intell. Robots Systems (IROS), 2015.
[10] Y. Qiao, C. Cappelle, Y. Ruichek, and F. Dornaika, “Visual localization based on sequence matching using ConvNet features,” in Proc. IECON 2016, Florence, Italy, 2016.
[11] P. Fischer, A. Dosovitskiy, and T. Brox, “Descriptor matching with convolutional neural networks: a comparison to SIFT,” arXiv:
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Computer Vision Pattern Recognition (CVPR).
[13] S. Yang and S. Scherer, “Monocular object and plane SLAM in structured environments,” arXiv:1809.03415v1 [cs.RO], 2018.
[14] A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum, “Panorama to panorama matching for location recognition,” arXiv:1704.06591v1.
[15] O. Mendez, S. Hadfield, N. Pugeault, and R. Bowden, “SeDAR - semantic detection and ranging: humans can localise without LiDAR, can robots?,” in 2018 IEEE Int. Conf. Robotics Automation (ICRA).
[16] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with ConvNet landmarks: viewpoint-robust, condition-robust, training-free,” in Proc. Robotics Sci. Sys. XII, 2015.
[17] P. Chakravarty, P. Narayanan, and T. Roussel, “GEN-SLAM: generative modeling for monocular simultaneous localization and mapping,” arXiv:1902.02086v1 [cs.CV], 2019.
[18] M. Milford and G. Wyeth, “SeqSLAM: visual route-based navigation for sunny summer days and stormy winter nights,” in Proc. IEEE Int. Conf. Robotics Automation (ICRA), 2012.
[19] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in Proc. IEEE Conf. Computer Vision Pattern Recog.
[20] A. D’Innocente, F. M. Carlucci, M. Colosi, and B. Caputo, “Bridging between computer and robot vision through data augmentation: a case study on object recognition,” arXiv:1705.02139.
[21] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free? - weakly-supervised learning with convolutional neural networks,” in Proc. IEEE Conf. Computer Vision Pattern Recog.