Navigating the Landscape for Real-time Localisation and Mapping for Robotics and Virtual and Augmented Reality

Sajad Saeedi, Bruno Bodin⋆, Harry Wagstaff⋆, Andy Nisbet, Luigi Nardi, John Mawer, Nicolas Melot, Oscar Palomar, Emanuele Vespa, Tom Spink⋆, Cosmin Gorgovan, Andrew Webb, James Clarkson, Erik Tomusk⋆, Thomas Debrunner, Kuba Kaszyk⋆, Pablo Gonzalez-de-Aledo, Andrey Rodchenko, Graham Riley, Christos Kotselidis, Björn Franke⋆, Michael F. P. O’Boyle⋆, Andrew J. Davison, Paul H. J. Kelly, Mikel Luján, and Steve Furber
Abstract—Visual understanding of 3D environments in real-time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, and virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting applications specialists in selecting and configuring the appropriate algorithm, hardware, and compilation pathway to meet their performance, accuracy, and energy consumption goals. The major contributions we present are (1) tools and methodology for systematic quantitative evaluation of SLAM algorithms, (2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives, (3) end-to-end simulation tools to enable optimisation of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches, and (4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.

Index Terms—SLAM, automatic performance tuning, hardware simulation, scheduling
Programming increasingly heterogeneous systems for
emerging application domains is an urgent challenge. One
particular domain with massive potential is real-time 3D scene
understanding, poised to effect a radical transformation in the
engagement between digital devices and the physical human
world. In particular, visual Simultaneous Localisation and
Mapping (SLAM), defined as determining the position and
orientation of a moving camera in an unknown environment
by processing image frames in real-time, has emerged as an enabling technology for robotics and virtual/augmented reality.
The objective of this work is to build the tools to enable
the computer vision pipeline architecture to be designed so
that SLAM requirements are aligned with hardware capability.
Since SLAM is computationally very demanding, several
subgoals are defined: developing systems with 1) power and
energy efficiency, 2) speed and runtime improvement, and
3) improved results in terms of accuracy and robustness.
Fig. 1 presents an overview of the directions explored. At the first stage, we consider different layers of the system including architecture, compiler and runtime, and computer vision algorithms.

Department of Computing, Imperial College London, UK
⋆School of Informatics, University of Edinburgh, UK
School of Computer Science, University of Manchester, UK
Electrical Engineering - Computer Systems, Stanford University, USA

Fig. 1: The objective of the paper is to create a pipeline that aligns computer vision requirements with hardware capabilities. The paper’s focus is on three layers: algorithms, compiler and runtime, and architecture. The goal is to develop a system that allows us to achieve power and energy efficiency, speed and runtime improvement, and accuracy/robustness at each layer and also holistically through design space exploration and machine learning techniques.

Several distinct contributions have been
presented in these three layers, explained throughout the
paper. These contributions include novel benchmarking frame-
works for SLAM algorithms, various scheduling techniques
for software performance improvement, and ‘functional’ and ‘detailed’ hardware simulation frameworks. Additionally, we present holistic optimisation techniques, such as Design Space Exploration (DSE), that allow us to take all these layers into account and optimise the system holistically to achieve
the desired performance metrics.
The major contributions we present are:
• tools and methodology for systematic quantitative evaluation of SLAM algorithms,
• automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives,
• end-to-end simulation tools to enable optimisation of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches, and
• tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.
This article is an overview of a large body of work unified
by these common objectives — to apply software synthesis and automatic performance tuning in the context of compilers and library generators, performance engineering, program
generation, languages, and hardware synthesis. We specifically
target mobile, embedded, and wearable contexts, where trading
off quality-of-result against energy consumption is of critical
importance. The key significance of the work lies, we believe,
in showing the importance and the feasibility of extending
these ideas across the full stack, incorporating algorithm selection and configuration into the design space along with code generation and hardware levels of the system.
(arXiv:1808.06352v1 [cs.CV] 20 Aug 2018)
A. Background
Based on the structure shown in Fig. 1, this section presents background material, very briefly, on the following topics:
• computer vision,
• system software,
• computer architecture, and
• model-based design space exploration.
1) Computer Vision: SLAM is a well-known problem in the computer vision and robotics communities. Using SLAM, a sensor, such as a camera, incrementally builds a map of an unknown environment while simultaneously localising itself within that map. Various methods have been proposed to solve the SLAM problem, but robustness and real-time performance are still challenging [1].
From the mid 1990s onwards, a strong return has been made to
a model-based paradigm enabled primarily by the adoption of
probabilistic algorithms [2] which are able to cope with the uncertainty in all real sensor measurements [3]. A breakthrough came when SLAM was shown to be feasible using computer vision applied to commodity camera hardware. The MonoSLAM
system offered real-time 3D tracking of the position of a hand-
held or robot-mounted camera while reconstructing a sparse
point cloud of scene landmarks [4]. Increasing computer power
has since enabled previously “off-line” vision techniques to
be brought into the real-time domain; Parallel Tracking and
Mapping (PTAM) made use of classical bundle adjustment
within a real-time loop [5]. Then live dense reconstruction
methods, Dense Tracking and Mapping (DTAM) using a
standard single camera [6] and KinectFusion using a Microsoft
Kinect depth camera [7], showed that surface reconstruction
can be a standard part of a live SLAM pipeline, making use
of GPU-accelerated techniques for rapid dense reconstruction
and tracking.
KinectFusion is an important research contribution and has
been used throughout this paper in several sections, including
in SLAMBench benchmarking (Section II-A), in improved
mapping and path planning in robotic applications (Sec-
tion II-B), in Diplomat static scheduling (Section III-A2), in
Tornado and MaxineVM dynamic scheduling (Sections III-B1
and III-B2), in MaxSim hardware profiling (Section IV-B2),
and various design space exploration and crowdsourcing meth-
ods (Section V).
KinectFusion models only the occupied space and encodes nothing about the free space, which is vital for robot navigation. In this paper, we present a method to extend KinectFusion to model free space as well (Section II-B). Additionally, we
introduce two benchmarking frameworks, SLAMBench and
SLAMBench2 (Section II-A). These frameworks allow us to
study various SLAM algorithms, including KinectFusion, un-
der different hardware and software configurations. Moreover,
a new sensor technology, focal-plane sensor-processor arrays,
is used to develop scene understanding algorithms, operating
at very high frame rates with very low power consumption
(Section II-C).
2) System Software: Smart scheduling strategies can bring
significant performance improvement regarding execution
time [8] or energy consumption [9], [10], [11] by breaking
an algorithm into smaller units, distributing the units between
the cores or Intellectual Property (IP) blocks available, and adjusting
the voltage and frequency of the cores. Scheduling can be
done either statically or dynamically. Static scheduling re-
quires extended knowledge about the application, i.e., how an
algorithm can be broken into units, and how these units behave
in different settings. Decomposing an algorithm this way
impacts a static scheduler’s choice in allocating and mapping
resources to computation units, and therefore it needs to be
optimised. In this paper, two static scheduling techniques are introduced (Section III-A): idiom-based compilation and Diplomat, a task-graph framework that exploits static dataflow analysis to perform CPU/GPU mapping.
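As a concrete illustration, static mapping of this kind can be sketched as a greedy list scheduler that assigns each task of a pipeline to the device minimising its finish time, using profiled per-device costs. This is a toy sketch with hypothetical task names and costs, not the Diplomat framework's actual interface:

```python
# Toy static list-scheduler for a heterogeneous CPU/GPU system.
# Tasks form a DAG; each task has a (hypothetical) profiled cost per device.
# Illustrative only -- not the Diplomat framework's actual interface.

def static_schedule(tasks, deps, cost, devices):
    """Greedily assign each task (in topological order) to the device
    that minimises its finish time, honouring dependencies."""
    ready_at = {d: 0.0 for d in devices}   # when each device is next free
    finish = {}                            # finish time of each scheduled task
    placement = {}
    for t in tasks:                        # tasks assumed topologically sorted
        dep_done = max((finish[d] for d in deps.get(t, ())), default=0.0)
        best = min(devices,
                   key=lambda dev: max(ready_at[dev], dep_done) + cost[t][dev])
        start = max(ready_at[best], dep_done)
        finish[t] = start + cost[t][best]
        ready_at[best] = finish[t]
        placement[t] = best
    return placement, max(finish.values())

# A KinectFusion-like four-stage pipeline with made-up per-device costs (ms).
tasks = ["preprocess", "track", "integrate", "raycast"]
deps = {"track": ["preprocess"], "integrate": ["track"], "raycast": ["integrate"]}
cost = {
    "preprocess": {"cpu": 4.0,  "gpu": 1.0},
    "track":      {"cpu": 9.0,  "gpu": 2.0},
    "integrate":  {"cpu": 12.0, "gpu": 1.5},
    "raycast":    {"cpu": 10.0, "gpu": 2.0},
}
placement, makespan = static_schedule(tasks, deps, cost, ["cpu", "gpu"])
```

With these costs the scheduler keeps the whole chain on the GPU; changing the cost table (e.g. adding transfer penalties) shifts the mapping, which is exactly the decision space a static scheduler must optimise.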
Since static schedulers do not operate online, optimisation
time is not a primary concern. However, important optimisa-
tion opportunities may depend on the data being processed;
therefore, dynamic schedulers have a better chance of obtaining the best performance. In this paper, two novel dynamic scheduling techniques are introduced: MaxineVM, a research platform for managed runtime languages executing on ARMv7, and Tornado, a task-parallel programming framework designed for heterogeneous systems where the specific configurations of CPUs, GPGPUs, FPGAs, DSPs, etc. are not known until runtime (Section III-B).
In contrast to static schedulers, dynamic schedulers cannot spend too much processing time finding good solutions, as the performance penalty may outweigh the benefits they bring. Quasi-static scheduling is a compromise approach that statically computes a good schedule and further improves it online depending on runtime conditions [12]. A hybrid scheduling technique
is introduced called power-aware code generation, which is a
compiler-based approach to runtime power management for
heterogeneous cores (Section III-C).
3) Computer Architecture: It has been shown that moving
to a dynamic heterogeneous model, where the use of hardware
resources and the capabilities of those resources are adjusted
at run-time, allows far more flexible optimisation of system
performance and efficiency [13], [14]. Simulation methods,
such as memory and instruction set simulation, are powerful
tools to design and evaluate such systems. A large number of
simulation tools are available [15]; in this paper we further improve upon current tools by introducing novel ‘functional’ and ‘detailed’ hardware simulation packages that can simulate individual cores and also complete CPU/GPU systems (Section IV-A). Novel profiling (Section IV-B) and specialisation (Section IV-C) techniques are also introduced, which allow us to custom-design chips for SLAM and computer vision applications.
4) Model-based Design Space Exploration: Machine learn-
ing has rapidly emerged as a viable means to automate sequen-
tial optimising compiler construction. Rather than hand-craft a
set of optimisation heuristics based on compiler expert insight,
learning techniques automatically determine how to apply
optimisations based on statistical modelling and learning. Its
great advantage is that it can adapt to changing platforms as it has no a priori assumptions about their behaviour.

Fig. 2: Outline of the paper. The contributions of the paper have been organised under four sections, shown with solid blocks. These blocks cover algorithmic, software, architecture, and holistic optimisation domains. Power efficiency, runtime speed, and quality of results are the subgoals of the project. The latter includes metrics such as accuracy of model reconstruction, accuracy of trajectory, and robustness.
There are many studies showing it outperforms human-based
approaches [16], [17], [18], and [19].
Recent work shows that machine learning can automatically
port across architecture spaces with no additional learning
time, and can find different, appropriate, ways of mapping
program parallelism for different parallel platforms [20], [21].
There is now ample evidence from previous research that
design space exploration based on machine learning provides
a powerful tool for optimising the configuration of complex
systems both statically and dynamically. It has been used
from the perspective of single-core processor design [22], the
modelling and prediction of processor performance [23], the
dynamic reconfiguration of memory systems for energy effi-
ciency [24], the design of SoC interconnect architectures [25],
and power management [24]. The DSE methodology will
address this paper’s goals from the perspective of future many-
core systems, extending beyond compilers and architecture to
elements of the system stack including application choices and
run-time policies. In this paper, several DSE-related works are introduced. Multi-domain DSE performs exploration on
hardware, software, and algorithmic choices (Section V-A1).
With multi-domain DSE, it is possible to compromise be-
tween metrics such as runtime speed, power consumption, and
SLAM accuracy. In Motion-aware DSE (Section V-A2), we
develop a comprehensive DSE that also takes into account the
complexity of the environment being modelled, including the
photometric texture of the environment, the geometric shape
of the environment, and the speed of the camera in the envi-
ronment. DSE works allow us to design applications that can
optimally choose a set of hardware, software, and algorithmic
parameters meeting certain desired performance metrics. One
example application is active SLAM (Section V-A2a).
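At its core, multi-objective DSE of this kind searches a configuration space and retains only the Pareto-optimal trade-offs between objectives such as runtime, power, and trajectory error (all minimised). A brute-force sketch with hypothetical design points and costs, not the actual exploration tooling:

```python
def pareto_front(points):
    """Keep only points not dominated by any other point. A point dominates
    another if it is no worse in every objective and strictly better in at
    least one (all objectives minimised)."""
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (runtime s, power W, ATE m) for four design points.
configs = {
    "small-volume-cpu": (2.0, 3.0, 0.05),
    "small-volume-gpu": (0.5, 9.0, 0.05),
    "large-volume-gpu": (0.9, 9.5, 0.02),
    "large-volume-cpu": (4.0, 3.5, 0.06),  # dominated by small-volume-cpu
}
front = pareto_front(list(configs.values()))
```

In a real exploration the evaluation of each point requires running (or simulating) the full SLAM pipeline, which is why machine-learning-guided search rather than exhaustive enumeration is used to decide which points to evaluate.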
B. Outline
Real-time 3D scene understanding is the main driving force
behind this work. 3D scene understanding has various applications in wearable devices, mobile systems, personal assistant devices, Internet of Things, and many more.

Fig. 3: Algorithmic contributions include benchmarking tools, advanced sensors, and improved probabilistic mapping.

Throughout this
paper, we aim to answer the following questions: 1) How can we improve 3D scene understanding (especially SLAM) algorithms? 2) How can we improve power performance for heterogeneous systems? 3) How can we reduce the development
complexity of hardware and software? As shown in Fig. 2,
we focus on four design domains: computer vision algorithms,
software, hardware, and holistic optimisation methods. Several
novel improvements have been introduced, organised as shown
in Fig. 2.
• Section II (Algorithm) explains the algorithmic contributions, such as using novel sensors, improving dense mapping, and novel benchmarking methods.
• Section III (Software) introduces software techniques for improving system performance, including various types of scheduling.
• Section IV (Architecture) presents hardware developments, including simulation, specialisation, and profiling.
• Section V (Holistic Optimisation) introduces holistic optimisation approaches, such as design space exploration and crowdsourcing.
• Section VI summarises the work.
Computer vision algorithms are the main motivation of
the paper. We focus mainly on SLAM. Within the past few
decades, researchers have developed various SLAM algo-
rithms, but few tools are available to compare and bench-
mark these algorithms and evaluate their performance on the
available diverse hardware platforms. Moreover, the general
research direction is also moving towards making the current algorithms more robust, to eventually make them available in industry and everyday life. Additionally, as sensing technologies progress, the pool of SLAM algorithms becomes more diverse, and fundamentally new approaches need to be developed.
This section presents algorithmic contributions from three
different aspects. As shown in Fig. 3, three main topics are
covered: 1) benchmarking tools to compare the performance
of the SLAM algorithms, 2) improved probabilistic mapping,
and 3) new sensor technologies for scene understanding.
Fig. 4: SLAMBench enables benchmarking of the KinectFusion algorithm on various
types of platforms by providing different implementations such as C++, OpenMP, CUDA,
and OpenCL.
A. Benchmarking: Evaluation of SLAM Algorithms
Real-time computer vision and SLAM offer great poten-
tial for a new level of scene modelling, tracking, and real
environmental interaction for many types of robots, but their
high computational requirements mean that implementation on
mass market embedded platforms is challenging. Meanwhile,
trends in low-cost, low-power processing are towards massive
parallelism and heterogeneity, making it difficult for robotics
and vision researchers to implement their algorithms in a
performance-portable way.
To tackle the aforementioned challenges, in this section, two
computer vision benchmarking frameworks are introduced:
SLAMBench and SLAMBench2. Benchmarking is a scientific
method to compare the performance of different hardware and
software systems. Both benchmarking frameworks share com-
mon functionalities, but their objectives are different. While
SLAMBench provides a framework that is able to benchmark
various implementations of KinectFusion, SLAMBench2 pro-
vides a framework that is able to benchmark various different
SLAM algorithms in their original implementations.
Additionally, to systematically choose the proper datasets to evaluate the SLAM algorithms, we introduce a dataset
complexity scoring method. All these projects allow us to
optimise power, speed, and accuracy.
1) SLAMBench: As a first approach to investigate SLAM
algorithms, we introduced SLAMBench [26], a publicly avail-
able software framework which represents a starting point
for quantitative, comparable, and validatable experimental
research to investigate trade-offs in performance, accuracy,
and energy consumption of a dense RGB-D SLAM system.
SLAMBench provides a KinectFusion [7] implementation,
inspired by the open-source KFusion implementation [27].
SLAMBench provides the same KinectFusion in the C++,
OpenMP, CUDA, and OpenCL variants, and harnesses the
ICL-NUIM synthetic RGB-D dataset [28] with trajectory
and scene ground truth for reliable accuracy comparison of different implementations and algorithms. The overall vision of the SLAMBench framework is shown in Fig. 4; refer to [26] for more information.
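The accuracy side of such a comparison rests on trajectory error against the dataset's ground truth. The core of that evaluation can be sketched as an absolute trajectory error computation; this is a simplified sketch with hypothetical positions (RMSE over time-aligned positions, omitting the rigid alignment step a full evaluation would perform), not SLAMBench's actual code:

```python
import math

def absolute_trajectory_error(estimated, ground_truth):
    """RMSE of Euclidean distances between time-aligned estimated and
    ground-truth camera positions (no SE(3) alignment step)."""
    assert len(estimated) == len(ground_truth)
    sq = [sum((e - g) ** 2 for e, g in zip(p, q))
          for p, q in zip(estimated, ground_truth)]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical positions in metres: the estimate drifts slightly sideways.
gt  = [(0.0, 0.0, 0.0), (0.1, 0.00, 0.0), (0.2, 0.00, 0.0)]
est = [(0.0, 0.0, 0.0), (0.1, 0.01, 0.0), (0.2, 0.02, 0.0)]
ate = absolute_trajectory_error(est, gt)
```

Because the ICL-NUIM dataset is synthetic, both trajectory and scene ground truth are exact, which is what makes this kind of per-frame accuracy comparison reliable across implementations.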
Fig. 5: SLAMBench2 allows multiple algorithms (and implementations) to be combined
with a wide array of datasets. A simple API and dataset make it easy to interface with
new algorithms.
Algorithm Type Implementations
ElasticFusion [33] Dense CUDA
InfiniTAM [34] Dense C++, OpenMP, CUDA
KinectFusion [7] Dense C++, OpenMP, OpenCL, CUDA
LSD-SLAM [35] Semi-Dense C++, PThread
ORB-SLAM2 [36] Sparse C++
MonoSLAM [37] Sparse C++, OpenCL
OKVIS [38] Sparse C++
PTAM [5] Sparse C++
SVO [39] Sparse C++
TABLE I: List of SLAM algorithms currently integrated in SLAMBench2. These
algorithms provide either dense, semi-dense, or sparse reconstructions [32].
Third parties have provided implementations of SLAMBench in additional emerging languages: the C++ SYCL for OpenCL Khronos Group standard [29], and the platform-neutral compute intermediate language for accelerator programming, PENCIL [30]; the PENCIL SLAMBench implementation can be found in [31].
As demonstrated in Fig. 2, SLAMBench has enabled us
to do more research in algorithmic, software, and archi-
tecture domains, explained throughout the paper. Examples
include Diplomat static scheduling (Section III-A2), Tornado dynamic scheduling (Section III-B1), MaxSim hardware profiling (Section IV-B2), multi-domain design space exploration
(Section V-A1), comparative design space exploration (Sec-
tion V-A3), and crowdsourcing (Section V-B).
2) SLAMBench2: SLAMBench has had substantial success
within both the compiler and architecture realms of academia
and industry. The SLAMBench performance evaluation frame-
work is tailored for the KinectFusion algorithm and the ICL-
NUIM input dataset. However, in SLAMBench 2.0, we re-
engineered SLAMBench to have more modularity by integrat-
ing two major features [32]. Firstly, a SLAM API has been
defined, which provides an easy interface to integrate any
new SLAM algorithms into the framework. Secondly, there
is now an I/O system in SLAMBench2 which enables the
easy integration of new datasets and new sensors (see Fig. 5).
Additionally, SLAMBench2 features a new set of algorithms and datasets from among the most popular in the computer vision community; Table I summarises these algorithms.
The works in [40] and [41] present benchmarking results,
comparing several SLAM algorithms on various hardware
platforms; however, SLAMBench2 provides a framework that
researchers can easily integrate and use to explore various
SLAM algorithms.
Dataset Trajectory Max Mean Variance
lr kt0 0.0250 0.0026 0.0014
lr kt1 0.0183 0.0026 0.0012
lr kt2 0.0427 0.0032 0.0023
lr kt3 0.0352 0.0032 0.0023
TABLE II: Complexity level metrics using information divergence [44].
3) Datasets: Research papers on SLAM often report per-
formance metrics such as pose estimation accuracy, scene
reconstruction error, or energy consumption. The reported performance metrics may not be representative of how well an algorithm will work in real-world applications. Additionally,
as the diversity of the datasets is growing, it becomes a
challenging issue to decide which and how many datasets
should be used to compare the results. To address this concern, we have not only categorised datasets according to their complexity in terms of trajectory and environment, but also proposed new synthetic datasets with highly detailed scenes and realistic trajectories [42], [43].
In general, datasets do not come with a measure of com-
plexity level, and thus the comparisons may not reveal all
strengths or weaknesses of a new SLAM algorithm. In [44], we
proposed to use frame-by-frame Kullback-Leibler divergence
as a simple and fast metric to measure the complexity of a
dataset. Across all frames in a dataset, mean divergence and
the variance of divergence were used to assess the complex-
ity. Table II shows some of these statistics for ICL-NUIM
sequences for intensity divergence. Based on the reported
trajectory error metrics of the ElasticFusion algorithm [33],
datasets lr kt2 and lr kt3 are more difficult than lr kt0 and
lr kt1. Using the proposed statistical divergence, these difficult
trajectories have a higher complexity score as well.
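The metric of [44] can be sketched as follows: form normalised intensity histograms of consecutive frames, accumulate the frame-to-frame Kullback-Leibler divergence, and report its mean and variance. This is an illustrative sketch; the binning and smoothing choices here are assumptions, not the exact implementation of [44]:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """Kullback-Leibler divergence D(P || Q) between normalised histograms."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def histogram(frame, bins=8, max_val=256):
    """Normalised intensity histogram of a flat list of pixel values."""
    h = [0] * bins
    for v in frame:
        h[v * bins // max_val] += 1
    return [c / len(frame) for c in h]

def complexity(frames):
    """Mean and variance of the frame-to-frame intensity divergence."""
    ds = [kl_divergence(histogram(a), histogram(b))
          for a, b in zip(frames, frames[1:])]
    mean = sum(ds) / len(ds)
    var = sum((d - mean) ** 2 for d in ds) / len(ds)
    return mean, var

# A toy sequence: two static frames, then an abrupt change in brightness.
mean_d, var_d = complexity([[10] * 100, [10] * 100, [200] * 100])
```

A sequence of near-identical frames yields a mean divergence close to zero, while fast camera motion or large appearance changes push both the mean and the variance up, matching the intuition behind Table II.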
B. OFusion: Probabilistic Mapping
Modern dense volumetric methods based on signed distance
functions such as DTAM [6] or explicit point clouds, such
as ElasticFusion [33], are able to recover high quality geo-
metric information in real-time. However, they do not explic-
itly encode information about empty space which essentially
becomes equivalent to unmapped space. In various robotic
applications this could be a problem as many navigation
algorithms require explicit and persistent knowledge about
the mapped empty space. Such information is instead well
encoded in classic occupancy grids, which, on the other hand,
lack the ability to faithfully represent the surface boundaries.
Loop et al. [45] proposed a novel probabilistic fusion framework aiming to close this information gap by employing a continuous occupancy map representation in which the surface boundaries are well-defined. Targeting real-time robotics applications, we have extended this framework to make it suitable for the incremental tracking and mapping typical of an exploratory SLAM system. The new formulation, denoted
as OFusion [46], allows robots to seamlessly perform camera
tracking, occupancy grid mapping and surface reconstruction
at the same time. As shown in Table III, OFusion not only encodes the free space, but also performs at the same level as, or better than, state-of-the-art volumetric SLAM pipelines such as KinectFusion [7] and InfiniTAM [34] in terms of mean Absolute Trajectory Error (ATE).

Trajectory TSDF OFusion InfiniTAM
ICL-NUIM lr kt0 0.0113 0.2289 0.3052
ICL-NUIM lr kt1 0.0117 0.0170 0.0214
ICL-NUIM lr kt2 0.0040 0.0055 0.1725
ICL-NUIM lr kt3 0.7582 0.0904 0.4858
TUM fr1 xyz 0.0295 0.0322 0.0273
TUM fr1 floor × × ×
TUM fr1 plant × × ×
TUM fr1 desk 0.1030 0.0918 0.0647
TUM fr2 desk 0.0641 0.0724 0.0598
TUM fr3 office 0.0686 0.0531 0.0996
TABLE III: Absolute Trajectory Error (ATE), in metres, comparison between KinectFusion (TSDF), occupancy mapping (OFusion), and InfiniTAM across sequences from the ICL-NUIM and TUM RGB-D datasets. Cross signs indicate tracking failure.

To demonstrate the
effectiveness of our approach we implemented a simple path
planning application on top of our mapping pipeline. We
used Informed RRT* [47] to generate a collision-free 3-metre-long trajectory between two obstructed start-goal endpoints,
showing the feasibility to achieve tracking, mapping and
planning in a single integrated control loop in real-time.
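The distinguishing feature of occupancy mapping is that it updates beliefs for free space as well as occupied space. A textbook-style log-odds occupancy update along a sensor ray can be sketched as follows; this is a generic discrete occupancy-grid illustration with a hypothetical inverse sensor model, not the OFusion formulation, which uses a continuous occupancy representation:

```python
import math

def logodds(p):
    return math.log(p / (1.0 - p))

# Hypothetical inverse sensor model: cells in front of the measured depth
# are probably free; the cell at the measured depth is probably occupied.
L_FREE, L_OCC = logodds(0.3), logodds(0.7)

def integrate_ray(grid, cells_before_hit, hit_cell):
    """Update per-cell log-odds for one depth measurement along a ray."""
    for c in cells_before_hit:
        grid[c] = grid.get(c, 0.0) + L_FREE
    grid[hit_cell] = grid.get(hit_cell, 0.0) + L_OCC

def occupancy(grid, cell):
    """Convert a cell's log-odds back to an occupancy probability.
    Unobserved cells stay at log-odds 0, i.e. probability 0.5."""
    return 1.0 / (1.0 + math.exp(-grid.get(cell, 0.0)))

grid = {}
for _ in range(5):  # five consistent measurements along the same ray
    integrate_ray(grid, cells_before_hit=[(0, 0, z) for z in range(3)],
                  hit_cell=(0, 0, 3))
```

After a few consistent measurements the traversed cells converge towards "free" and the hit cell towards "occupied", while unobserved cells remain at 0.5; it is precisely this explicit free/unknown distinction that a TSDF does not provide and that path planners need.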
C. Advanced Sensors
Mobile robotics and various applications of SLAM, Convolutional Neural Networks (CNNs), and VR/AR are constrained by power resources and low frame rates. These applications would benefit not only from higher frame rates, but could also save resources by consuming less energy.
Monocular cameras have been used in many scene under-
standing and SLAM algorithms [37]. Passive stereo cameras
(e.g. Bumblebee2, 48 FPS @ 2.5 W [48]), structured light cameras (e.g. Kinect, 30 FPS @ 2.25 W [49]), and time-of-flight cameras (e.g. Kinect One, 30 FPS @ 15 W [49]) additionally provide metric depth measurements; however, these cameras are limited by low frame rates and have relatively demanding power budgets for mobile devices; problems that modern bio-inspired and analogue methods are trying to address.
The Dynamic Vision Sensor (DVS), also known as the event
camera, is a novel bio-inspired imaging technology, which
has the potential to address some of the key limitations
of conventional imaging systems. Instead of capturing and
sending a full frame, an event camera captures and sends a set
of sparse events, generated by changes in intensity. They
are low-power and are able to detect changes very quickly.
Event cameras have been used in camera tracking [50], optical flow estimation [51], and pose estimation [52], [53], [54]. The very high dynamic range of the DVS makes it suitable for real-world applications.
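The event-generation principle behind such cameras can be illustrated in a few lines: a pixel emits an event whenever its log-intensity changes by more than a contrast threshold since the last event at that pixel. This is a simplified frame-based model of a DVS, ignoring asynchronous timing, noise, and refractory effects:

```python
import math

def generate_events(ref_log, frame, contrast=0.2, eps=1e-3):
    """Emit (x, y, polarity) events wherever log-intensity has changed by
    more than `contrast` since the last event; update the reference level."""
    events = []
    for y, row in enumerate(frame):
        for x, intensity in enumerate(row):
            l = math.log(intensity + eps)
            diff = l - ref_log[y][x]
            if abs(diff) >= contrast:
                events.append((x, y, 1 if diff > 0 else -1))
                ref_log[y][x] = l
    return events

# Two 2x2 frames: only the top-left pixel brightens significantly.
f0 = [[10.0, 50.0], [50.0, 50.0]]
f1 = [[20.0, 50.0], [50.0, 51.0]]
ref = [[math.log(v + 1e-3) for v in row] for row in f0]
events = generate_events(ref, f1)
```

Only the changed pixel produces an event; static regions produce no output at all, which is where the power and bandwidth savings of event cameras come from.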
Cellular vision chips, such as the ACE400 [55],
ACE16K [56], MIPA4K [57], and Focal-plane Sensor-
Processor Arrays (FPSPs) [58], [59], [60], integrate sensing
and processing in the focal plane. FPSPs are massively parallel
processing systems on a single chip. By eliminating the need
for data transmission, not only is the effective frame rate increased, but the power consumption is also reduced significantly. The individual processing elements are small general-purpose analogue processors with a reduced instruction set and memory. Fig. 6 shows a concept diagram of an FPSP, where each pixel not only has a light-sensitive sensor, but also a simple processing element. The main advantages of FPSPs are the high effective frame rates at lower clock frequencies, which in turn reduce power consumption compared to conventional sensing and processing systems [61]. However, with the limited instruction sets and local memory [60], developing
new applications for FPSPs, such as image filtering or camera
tracking, is a challenging problem.
In the past, several interesting works have been presented
using FPSPs, including high-dynamic range imaging [62].
New directions are being followed to explore the performance
of FPSPs in real-world robotic and virtual reality applications.
These directions include 1) 4-DOF camera tracking [63], and
2) automatic filter kernel code generation as well as Viola-
Jones [64] face detection [65]. The key concept behind these works is that the FPSP is able to report the sum of the intensity values of all (or a selection of) pixels in just one clock cycle. This ability allows us to develop kernel code generation and to develop and verify motion hypotheses for visual odometry and camera tracking applications. The results of these works demonstrate that FPSPs not only consume much less power compared to conventional cameras, but can also be operated at very high frame rates, such as 10,000 FPS.
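The hypothesis-testing idea can be sketched as follows; this is a pure-Python stand-in with made-up image data, where `global_sum_abs_diff` plays the role of the FPSP's single-cycle global readout:

```python
# Hedged sketch of the idea behind FPSP-based tracking: the chip can return
# the global sum of a (processed) image in one clock cycle, so a motion
# hypothesis can be scored by shifting the previous frame, differencing it
# with the current frame, and reading back one scalar.

def shift(img, dx, dy):
    """Shift an image by (dx, dy), filling exposed pixels with 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out

def global_sum_abs_diff(a, b):
    """On an FPSP this reduction would be a single-cycle global readout."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def best_translation(prev, curr, candidates):
    """Pick the candidate (dx, dy) minimising the global difference."""
    return min(candidates, key=lambda d: global_sum_abs_diff(shift(prev, *d), curr))

prev = [[0, 9, 0], [0, 9, 0], [0, 0, 0]]
curr = [[0, 0, 9], [0, 0, 9], [0, 0, 0]]  # previous frame moved right by 1
hyps = [(0, 0), (1, 0), (-1, 0), (0, 1)]
print(best_translation(prev, curr, hyps))  # (1, 0)
```

Because each hypothesis costs only a shift and one global readout on the chip, many candidates can be scored per frame, which is what makes frame rates in the thousands feasible.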
Fig. 7 demonstrates execution times for common convolution filters on various CPUs and GPUs, compared with an implementation of an FPSP known as SCAMP [60]. The code for the FPSP was automatically generated as explained in [65]. The parallel nature of the FPSP allows it to perform all of the tested filter kernels, shown on the x-axis, in a fraction of the time needed by the other devices, shown on the y-axis. This is a direct consequence of having a dedicated processing element available for every pixel, building up the filter on the whole image at the same time. As for the other devices, we see that for dense kernels (Gauss, Box), GPUs usually perform better than CPUs, whereas for sparse kernels (Sobel, Laplacian, Sharpen), CPUs seem to have an advantage. An outlier is the 7×7 box filter, for which only the most powerful graphics card manages to achieve a result comparable to the CPUs. It is assumed that the CPU implementation follows a more suitable algorithm than the GPU implementation, even though both implementations are based on their vendors' performance libraries (Intel IPP, NVIDIA NPP). Another reason could be that the GTX680 and GTX780 are based on a hardware architecture that is less suitable for this type of filter than the TITAN X's architecture. While Fig. 7 shows that there is a significant reduction in execution time, the SCAMP chip consumes only 1.23 W under full load. Compared to the CPU and GPU systems tested, this is at least 20 times less power. Clearly, a more specialised image-processing pipeline architecture can be more energy-efficient than these fully programmable architectures. There is scope for further research to map the space of alternative designs, including specialised heterogeneous multicore vision processing accelerators such as the Myriad-2 Vision Processing Unit [66].
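The dense/sparse distinction above can be made concrete with a toy direct convolution that skips zero-valued taps (an illustrative sketch, not the OpenCV, IPP, NPP, or SCAMP implementation):

```python
# Illustrative sketch: a direct 2D convolution that skips zero taps, showing
# why "sparse" kernels such as Sobel need far fewer multiply-accumulates
# than a dense 7x7 box filter (6 vs 49 per output pixel here).

def convolve2d(img, kernel):
    """Valid-mode 2D convolution (kernel applied without flipping)."""
    kh, kw = len(kernel), len(kernel[0])
    taps = [(j, i, kernel[j][i]) for j in range(kh) for i in range(kw)
            if kernel[j][i] != 0]  # precompute non-zero taps only
    h, w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(k * img[y + j][x + i] for j, i, k in taps)
             for x in range(w)] for y in range(h)]

sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # 6 non-zero taps
box7 = [[1] * 7 for _ in range(7)]                # 49 non-zero taps

img = [[x for x in range(8)] for _ in range(8)]  # horizontal intensity ramp
edges = convolve2d(img, sobel_x)
print(edges[0][0])  # 8: constant gradient of 1 per pixel across the kernel
```

A GPU's wide SIMD units amortise well over the 49 uniform taps of the dense box filter, whereas the irregular zero pattern of sparse kernels leaves lanes idle, which is one plausible reading of the CPU advantage observed above.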
Fig. 6: Focal-plane Sensor-Processor Arrays (FPSPs) are parallel processing systems,
where each pixel has a processing element.
Fig. 7: Time for a single filter application (in µs) of several well-known filters (Gauss 3×3 and 5×5; Box 3×3, 5×5, and 7×7; Sobel; Laplacian; Sharpen) on CPU, GPU, and SCAMP FPSP hardware. The FPSP code was generated by the method explained in [65]; the CPU and GPU code are based on OpenCV 3.3.0.
Fig. 8: Static, dynamic, and hybrid scheduling are the software optimisation methods
presented for power efficiency and speed improvement.
In this section, we investigate how software optimisations, mainly implemented as a collection of compiler and runtime techniques, can be used to deliver potential improvements in power consumption and speed trade-offs.
The optimisations must determine how to efficiently map and
schedule program parallelism onto multi-core, heterogeneous
processor architectures. This section presents the novel static,
dynamic, and hybrid approaches used to specialise computer
vision applications for execution on energy efficient runtimes
and hardware (Fig. 8).
A. Static Scheduling and Code Transformation
In this section, we focus on static techniques applied when building an optimised executable. Static schedulers and optimisers can rely only on performance models of the underlying architectures or of the code to optimise, which limits opportunities. However, they do not require additional code to execute, which reduces runtime overhead. We first introduce in III-A1 an
idiom-based heterogeneous compilation methodology which
given the source code of a program, can automatically identify
and transform portions of code in order to be accelerated using
many-core CPUs or GPUs. Then in III-A2, we propose a dif-
ferent methodology used to determine which resources should
be used to execute those portions of code. This methodology
takes a specialised direction, where applications need to be
expressed using a particular model in order to be scheduled.
1) Idiom-based heterogeneous compilation: A wide variety
of high-performance accelerators now exist, ranging from em-
bedded DSPs, to GPUs, to highly specialised devices such as
the Tensor Processing Unit [67] and the Vision Processing Unit [66].
These devices have the capacity to deliver high performance
and energy efficiency, but these improvements come at a cost:
to obtain peak performance, the target application or kernel
often needs to be rewritten or heavily modified. Although
high-level abstractions can reduce the cost and difficulty of
these modifications, they make it more difficult to obtain peak performance. In order to extract the maximum performance
from a particular accelerator, an application must be aware of
its exact hardware parameters (number of processors, mem-
ory sizes, bus speed, Network-on-Chip (NoC) routers, etc.),
and this often requires low level programming and tuning.
Optimised numeric libraries and Domain Specific Languages
(DSLs) have been proposed as a means of reconciling pro-
grammer ease and hardware performance. However, they still
require significant legacy code modification and increase the
number of languages programmers need to master.
Ideally, the compiler should be able to automatically take
advantage of these accelerators, by identifying opportunities
for their use, and then automatically calling into the appropri-
ate libraries or DSLs. However, in practice, compilers struggle
to identify such opportunities due to the complex and expen-
sive analysis required. Additionally, when such opportunities
are found, they are frequently at too small a scale to obtain any real benefit, with the cost of setting up the accelerator (i.e. data movement, Remote Procedure Call (RPC) costs, etc.) being
much greater than the improvement in execution time or power
efficiency. Larger scale opportunities are difficult to identify
due to the complexity of analysis, which often requires inter-
procedural analyses, loop invariant detection, pointer and alias
analyses, etc., which are complex to implement in the compiler
and expensive to compute. On the other hand, when humans
attempt to use these accelerators, they often lack the detailed
knowledge of the compiler, and resort to “hunches” or ad-hoc
methods, leading to sub-optimal performance.
In [68], we develop a novel approach to automatically detect
and exploit opportunities to take advantage of accelerators and
DSLs. We call these opportunities “idioms”. By expressing
these idioms as constraint problems, we can take advantage
Fig. 9: An overview of the Diplomat framework. The user provides (1) the task
implementations in various languages and (2) the dependencies between the tasks. Then
in (3) Diplomat performs timing analysis on the target platform and in (4) abstracts the
task-graph as a static dataflow model. Finally, a dataflow model analysis step is performed
in (5), and in (6) the Diplomat compiler performs the code generation.
of constraint solving techniques (in our case a Satisfiability
Modulo Theories (SMT) solver). Our technique converts the
constraint problem which describes each idiom into an LLVM
compiler pass. When running on LLVM IR (Intermediate
Representation), these passes identify and report instances of
each idiom. This technique is further strengthened by the use
of Symbolic Execution and Static Analysis techniques, so that
formally proved transformations can be automatically applied
when idioms are detected.
We have described idioms for sparse and dense linear
algebra, and stencils and reductions, and written transforma-
tions from these idioms to the established cuSPARSE and
clSPARSE libraries, as well as a data-parallel, functional DSL
which can be used to generate high performance platform
specific OpenCL code. We have then evaluated this tech-
nique on the NAS, Parboil, and Rodinia sequential C/C++
benchmarks, where we detect 55 instances of our described
idioms. The NAS, Parboil, and Rodinia benchmarks include
several key and frequently used computer vision and SLAM
related tasks such as convolution filtering, particle filtering,
backpropagation, k-means clustering, breadth-first search, and
other fundamental computational building blocks. In the cases
where these idioms form a significant part of the sequential
execution time, we are able to transform the program to obtain
performance improvements ranging from 1.24x to over 20x on
integrated and discrete GPUs, contributing to the fast execution
time objective.
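As a toy illustration of the idea (the real system expresses idioms as SMT constraint problems over LLVM IR, not a Python AST), the sketch below pattern-matches a reduction idiom in a tiny loop representation and flags it as an offload candidate:

```python
# Toy illustration of idiom detection: a loop whose body is a single
# self-accumulating update `acc = acc + expr(i)` is recognised as a
# reduction, i.e. a candidate for rewriting to a parallel-sum library call.

def is_reduction(loop):
    """Check whether a loop body is a single self-accumulating update."""
    body = loop["body"]
    return (len(body) == 1
            and body[0]["op"] == "assign"
            and body[0]["target"] == body[0]["rhs"].get("lhs")
            and body[0]["rhs"]["op"] == "+")

# for i in 0..n: acc = acc + a[i]  -- a classic reduction idiom.
loop = {
    "var": "i", "range": ("0", "n"),
    "body": [{"op": "assign", "target": "acc",
              "rhs": {"op": "+", "lhs": "acc", "rhs": "a[i]"}}],
}
print(is_reduction(loop))  # True: candidate for a parallel-sum offload
```

The constraint-based formulation in [68] generalises this kind of check: each idiom is a set of constraints over IR, and a solver, rather than hand-written matching code, finds instances.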
2) Diplomat: Static mapping of multi-kernel applications
on heterogeneous platforms: We propose a novel approach
to heterogeneous embedded systems programmability using
a task-graph based DSL called Diplomat [69]. Diplomat is
a task-graph framework that exploits the potential of static
dataflow modelling and analysis to deliver performance estimation and CPU/GPU mapping.
Fig. 10: Evaluation of the best result obtained with Diplomat for CPU and GPU configurations, compared with handwritten solutions (OpenMP, OpenCL) and automatic heuristics (Partitioning, Speed-up mapping) for KinectFusion on the Arndale platform. The numbers on the x-axis denote different KinectFusion algorithmic parameter configurations; the percentages above the Diplomat bars give the speedup over the manual implementation.
An application has to be
specified once, and then the framework can automatically
propose good mappings. This work aims at improving runtime speed as well as performance robustness.
The Diplomat front-end is embedded in the Python pro-
gramming language and it allows the framework to gather
fundamental information about the application: the different
possible implementations of the tasks, their expected input and
output data sizes, and the existing data dependencies between
each of them.
At compile-time, the framework performs static analysis.
In order to benefit from existing dataflow analysis techniques,
the initial task-graph needs to be turned into a dataflow model.
As the dataflow graph will not be used to generate the code, the representation of the application does not need to be precise, but it needs to model the application's behaviour closely enough to obtain good performance estimations. Diplomat performs
the following steps. First, the initial task-graph is abstracted
into a static dataflow formalism. This includes a timing profil-
ing step to estimate task durations and communication delays.
Then, by using static analysis techniques [70], a throughput
evaluation and a mapping of the application are performed.
Once a potential mapping has been selected, an executable
C++ code is automatically generated. This generated im-
plementation takes advantage of task-parallelism and data-
parallelism. It can use OpenMP and OpenCL and it may apply
partitioning between CPU and GPU when it is beneficial. This
overview is summarised in Fig. 9.
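To illustrate the flavour of this flow, the sketch below (invented API and made-up task timings, not Diplomat's actual Python front-end) maps a linear task chain onto a CPU/GPU pair using profiled durations and a fixed transfer penalty:

```python
# Hedged sketch of task-graph mapping: per-device timing profiles drive a
# greedy device choice per task, with a (made-up) transfer cost charged when
# consecutive tasks land on different devices.

TRANSFER_COST = 2.0  # assumed host<->device copy cost, in ms

def map_tasks(chain, profiles):
    """Greedy mapping of a linear task chain onto {'cpu', 'gpu'}."""
    mapping, total, prev = [], 0.0, None
    for task in chain:
        dev = min(profiles[task], key=profiles[task].get)
        total += profiles[task][dev] + (TRANSFER_COST if prev and dev != prev else 0)
        mapping.append((task, dev))
        prev = dev
    return mapping, total

# Hypothetical profiled durations (ms) for a KinectFusion-like pipeline.
profiles = {
    "preprocess": {"cpu": 4.0, "gpu": 1.0},
    "track":      {"cpu": 9.0, "gpu": 2.5},
    "integrate":  {"cpu": 6.0, "gpu": 1.5},
    "raycast":    {"cpu": 8.0, "gpu": 2.0},
}
mapping, total = map_tasks(["preprocess", "track", "integrate", "raycast"], profiles)
print(mapping)  # every stage is cheaper on the GPU in this made-up profile
print(total)    # 7.0 ms; no transfer penalty since one device runs everything
```

Diplomat's static dataflow analysis solves a richer version of this problem, accounting for throughput, task parallelism, and CPU/GPU partitioning rather than a single greedy pass.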
We evaluate Diplomat with KinectFusion on two embed-
ded platforms, Odroid-XU3 and Arndale, with four different
configurations for algorithmic parameters, chosen manually.
Fig. 10 shows the results for Arndale for four different config-
urations, marked as ARN0...3. Using Diplomat, we observed
a 16% speed improvement on average and up to a 30% im-
provement over the best existing hand-coded implementation.
This is an improvement on runtime speed, one of the goals
outlined earlier.
B. Dynamic Scheduling
Dynamic scheduling takes place while the optimised pro-
gram runs with actual data. Because dynamic schedulers can
monitor actual performance, they can compensate for perfor-
mance skews due to data-dependent control-flow and computation that static schedulers cannot accurately capture and
model. Dynamic schedulers can therefore exploit additional
dynamic run-time information to enable more optimisation
opportunities. However, they also require the execution of
additional profiling and monitoring code, which can create
performance penalties.
Tornado and MaxineVM runtime scheduling are research
prototype systems that we are using to explore and investigate
dynamic scheduling opportunities. Tornado is a framework
(prototyped on top of Java) using dynamic scheduling for
transparent exploitation of task-level parallelism on hetero-
geneous systems having multicore CPUs, and accelerators
such as GPUs, DSPs and FPGAs. MaxineVM is a research
Java Virtual Machine (JVM) that we are initially using to
investigate dynamic heterogeneous multicore scheduling for
application and JVM service threads in order to better meet
the changing power and performance objectives of a system
under dynamically varying battery life and application service demands.
1) Tornado: Tornado is a heterogeneous programming
framework that has been designed for programming sys-
tems that have a higher-degree of heterogeneity than existing
GPGPU accelerated systems and where system configurations
are unknown until runtime. The current Tornado prototype [71]
superseding JACC, described in [72], can dynamically offload code to big.LITTLE cores and GPUs with its OpenCL backend, which supports the widest possible set of accelerators.
Tornado can also be used to generate OpenCL code that is
suitable for high-level synthesis tools in order to produce
FPGA accelerators, although it is not practical to do this unless
the relatively long place and route times of FPGA vendor
tools can be amortised by application run-time overheads. The
main benefit of Tornado is that it allows portable dynamic
exploration of how heterogeneous scheduling decisions for
task-parallel frameworks will lead to improvements in power-
performance trade-offs without rewriting the application level
code, and also where knowledge of the heterogeneous config-
uration of a system is delayed until runtime.
The Tornado API cleanly separates computation logic from
co-ordination logic that is expressed using a task-based pro-
gramming model. Currently, data parallelisation is expressed
using standard Java support for annotations [71]. Applications
remain architecture-neutral, and as the current implementation
of Tornado is based on the Java managed language, we are
able to dynamically generate code for heterogeneous execution
without recompilation of the Java source, and without manu-
ally generating new optimised routines for any accelerators
that may become available. Applications need only to be
configured at runtime for execution on the available hardware.
Tornado currently uses an OpenCL driver for maximum device
coverage: this includes mature support for multi-core CPUs and GPGPUs, and maturing support for Xeon Phi coprocessors/accelerators. The current dynamic compiler technology of
Tornado is built upon JVMCI and GRAAL APIs for Java 8 and
above. The sequential Java and C++ versions of KinectFusion
in SLAMBench both perform at under 3 FPS (C++: 2.72 FPS; Java: 0.81 FPS), with the C++ version being 3.4x faster than Java. This improvement of runtime speed is shown in Fig. 11.
Fig. 11: Execution performance of KinectFusion (in FPS) over time using Tornado (Java/OpenCL) vs. baseline Java and C++.
By accelerating KinectFusion through GPGPU execution using Tornado, we manage
to achieve a constant rate of over 30 FPS (33.13 FPS) across
all frames (882) from the ICL-NUIM dataset with room 2
configuration [28]. To achieve 30 FPS, all kernels have been
accelerated by up to 821.20x with an average of 47.84x across
the whole application [71], [73]. Tornado is an attractive
framework for the development of portable computer vision
applications as its dynamic JIT compilation for traditional
CPU cores and OpenCL compute devices such as GPUs
enables real-time performance constraints to be met whilst
eliminating the need to rewrite and optimise code for different
GPU devices.
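The dynamic-scheduling policy can be illustrated with a minimal sketch; Tornado itself is Java-based and its API differs, so the class, devices, and timings below are purely illustrative:

```python
# Conceptual sketch of dynamic scheduling: run a kernel once on each
# available device to profile it, then route subsequent frames to the
# fastest device observed so far.

class DynamicScheduler:
    def __init__(self, devices):
        self.devices = devices          # name -> callable(kernel, frame)
        self.timings = {}               # name -> last observed runtime

    def run(self, kernel, frame, clock):
        untried = [d for d in self.devices if d not in self.timings]
        name = untried[0] if untried else min(self.timings, key=self.timings.get)
        start = clock()
        result = self.devices[name](kernel, frame)
        self.timings[name] = clock() - start
        return name, result

# Fake devices and a fake clock keep the example deterministic:
# the "cpu" call spans 10 ticks, each "gpu" call spans 1 tick.
times = iter([0, 10, 10, 11, 11, 12])
devices = {"cpu": lambda k, f: k(f), "gpu": lambda k, f: k(f)}
sched = DynamicScheduler(devices)
picks = [sched.run(lambda f: f * 2, 21, lambda: next(times))[0] for _ in range(3)]
print(picks)  # ['cpu', 'gpu', 'gpu']: after profiling both, the GPU wins
```

The key property mirrored from Tornado is that the application code (the kernel) never changes; only the runtime's routing decision does, and it can be revised as observed timings change.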
2) MaxineVM: The main contribution of MaxineVM is to
provide a research infrastructure for managed runtime systems
that can execute on top of modern Instruction Set Architectures
(ISAs) supplied by both Intel and ARM. This is especially
relevant because ARM is the dominant ISA in mobile and
embedded platforms. MaxineVM has been released as open-
source software [74].
Heterogeneous multicore systems comprised of CPUs hav-
ing the same ISA but different power/performance design
point characteristics create a significant challenge for virtual
machines that are typically agnostic to CPU core heterogene-
ity when undertaking thread-scheduling decisions. Further,
heterogeneous CPU core clusters are typically attached to
NUMA-like memory system designs, consequently thread
scheduling policies need to be adjusted to make appropriate
decisions that do not adversely affect the performance and
power consumption of managed applications.
In MaxineVM, we are using the Java managed runtime
environment to optimise thread scheduling for heterogeneous
architectures. Consequently, we have chosen to use and extend
the Oracle Labs research project software for MaxineVM [75]
that provided a state-of-the-art research VM for x86-64. We
have developed a robust port of MaxineVM to ARMv7 [71],
[76] (an AArch64 port is also in progress) ISA processors that
can run important Java and SLAM benchmarks, including a
Java version of KinectFusion. MaxineVM has been designed
for maximum flexibility; this sacrifices some performance, but it is trivially possible to replace the public implementation of an interface or scheme, such as monitor or garbage collection, with simple command-line switches to the command that generates a MaxineVM executable image.
C. Hybrid Scheduling
Hybrid scheduling considers dynamic techniques that take advantage of both static and dynamic data. A schedule can be statically optimised for a target architecture and application (e.g. using machine learning), and a dynamic scheduler can then further adjust this schedule to optimise actual code executions. Since it can rely on a statically optimised schedule,
the dynamic scheduler can save a significant amount of work
and therefore lower its negative impact on performance.
1) Power-aware Code Generation: Power is an important
constraint in modern multi-core processor design. We have
shown that power across heterogeneous cores varies consider-
ably [77]. This work develops a compiler-based approach to
runtime power management for heterogeneous cores. Given
an externally determined power budget, it generates parallel
parameterised partitioned code that attempts to give the best
performance within that power budget. It uses the compiler in-
frastructure developed in [78]. The hybrid scheduling has been
tested on standard benchmarks such as DSPstone, UTSDP, and
Polybench. These benchmarks provide an in-depth comparison
with other methods and include key building blocks of many
SLAM and computer vision tasks such as matrix multipli-
cation, edge detection, and image histogram. We applied
this technique to embedded parallel OpenMP benchmarks on
the TI OMAP4 platform for a range of power budgets. On
average we obtain a 1.37x speed-up over dynamic voltage
and frequency scaling (DVFS). For low power budgets, we
see a 2x speed-up. SLAM systems, and vision applications in general, are composed of different phases. An adaptive power budget for each phase positively impacts frame rate and power consumption.
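The underlying selection problem can be sketched as follows, with hypothetical profiled operating points (the actual compiler-based approach generates parameterised partitioned code rather than consulting a lookup table):

```python
# Hedged sketch of the power-budget idea: from a table of profiled
# (cores, frequency) -> (throughput, power) operating points, pick the
# fastest configuration whose power fits the externally given budget.

# Hypothetical profiled operating points: (cores, MHz) -> (items/s, watts).
POINTS = {
    (1, 600):  (100, 0.8),
    (2, 600):  (180, 1.4),
    (4, 600):  (320, 2.6),
    (2, 1200): (340, 3.1),
    (4, 1200): (600, 5.8),
}

def best_config(budget_watts):
    """Fastest operating point within the power budget, or None."""
    feasible = [(perf, cfg) for cfg, (perf, power) in POINTS.items()
                if power <= budget_watts]
    return max(feasible)[1] if feasible else None

print(best_config(3.0))  # (4, 600): 320 items/s is the best fit under 3 W
print(best_config(1.0))  # (1, 600): only the single-core point fits
```

A per-phase budget, as suggested above, would simply re-run this selection whenever the application enters a phase with different demands.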
The designers of heterogeneous Multiprocessor System-
on-Chip (MPSoC) are faced with an enormous task when
attempting to design a system that is co-optimised to deliver
power-performance efficiency under a wide range of dynamic
operating conditions concerning the available power stored in
a battery, and the current application performance demands.
In this paper, a variety of simulation tools and technologies
have been presented to assist designers in their evaluations
of how performance, energy, and power consumption trade-
offs are affected by computer vision algorithm parameters
and computational characteristics of specific implementations
on different heterogeneous processors and accelerators. Tools
have been developed that focus on the evaluation of native
and managed runtime systems, that execute on ARM and x86-
64 processor instruction set architectures in conjunction with
GPU and custom accelerator intellectual property.
The contributions of this section have been organised under
three main topics: simulation, profiling, and specialisation.
Under each topic, several novel tools and methods are pre-
sented. The main objective in developing these tools and
Fig. 12: Hardware development tasks comprise simulation, profiling, and specialisation tools, each with its own goals. With these three tasks, it is possible to develop customised hardware for computer vision applications.
methods is to reduce development complexity and increase
reproducibility for system analysis. Fig. 12 presents a graph
where all simulation, profiling, and specialisation tools are shown.
A. Fast Simulation
Simulators have become an essential tool for hardware
design. They allow designers to prototype different systems
before committing to a silicon design, and save enormous
amounts of money and time. They allow embedded systems
engineers to develop the driver and compiler stack, before the
system is available, and be able to verify their results. Even
after releasing the hardware, software engineers can make
use of simulators to prototype their programs in a virtual
environment, without the latency of flashing the software onto
the hardware, or even without access to the hardware.
These different use cases require very different simulation
technologies. Prototyping hardware typically requires ‘de-
tailed’ performance modelling simulation to be performed,
which comes with a significant slowdown compared to real
hardware. On the other hand, software development often does
not require such detailed simulation, and so faster ‘functional’
simulators can be used. This has led to the development of
multiple simulation systems within this work, with the GenSim
system being used for ‘functional’ simulation and APTsim
being used for more detailed simulation.
In this section, three novel system simulation works are
presented. These works are: GenSim, CPU/GPU simulation,
and APTsim.
1) The GenSim Architecture Description Language: Mod-
ern CPU architectures often have a large number of extensions
and versions. At the same time, simulation technologies have
improved, making simulators both faster and more accurate.
However, this has made the creation of a simulator for a mod-
ern architecture much more complex. Architecture Description
Languages (ADLs) seek to solve this problem by decoupling
the details of the simulated architecture from the tool used to
simulate it.
We have developed the GenSim simulation infrastructure,
which includes an ADL toolchain (see Fig. 13). This ADL is
designed to enable the rapid development of fast functional
simulation tools [79], and the prototyping of architectural
extensions (and potentially full instruction set architectures).
This infrastructure is used in the CPU/GPU simulation work
Fig. 13: Diagram showing the general flow of the GenSim ADL toolchain.
(Section IV-A2). The GenSim infrastructure is described in
a number of publications [80], [81], [82]. GenSim is avail-
able under a permissive open-source license, and is available
at [83].
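The decoupling an ADL provides can be illustrated with a toy example; GenSim's actual description language and generated simulators are far richer, so the format below is invented purely for illustration:

```python
# Toy illustration of the ADL idea: the instruction set is described as
# data, and a generic interpreter is derived from that description instead
# of being hand-written per architecture.

ISA = {  # opcode -> (mnemonic, semantics over the register file)
    0x0: ("ldi", lambda r, a, b, d: r.__setitem__(d, a)),        # r[d] = imm
    0x1: ("add", lambda r, a, b, d: r.__setitem__(d, r[a] + r[b])),
    0x2: ("sub", lambda r, a, b, d: r.__setitem__(d, r[a] - r[b])),
}

def execute(program, num_regs=4):
    """Interpret (opcode, src_a, src_b, dest) tuples against a register file."""
    regs = [0] * num_regs
    for opcode, a, b, d in program:
        _, semantics = ISA[opcode]
        semantics(regs, a, b, d)
    return regs

regs = execute([
    (0x0, 5, 0, 0),  # ldi r0, 5
    (0x0, 3, 0, 1),  # ldi r1, 3
    (0x1, 0, 1, 2),  # add r2, r0, r1
    (0x2, 0, 1, 3),  # sub r3, r0, r1
])
print(regs)  # [5, 3, 8, 2]
```

Extending the architecture then means adding entries to the description, not modifying the simulator core, which is the property that makes prototyping architectural extensions rapid.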
2) Full-system simulation for CPU/GPU: Graphics pro-
cessing units are highly-specialized processors that were origi-
nally designed to process large graphics workloads effectively,
however they have been influential in many industries, includ-
ing in executing computer vision tasks. Simulators for parallel
architectures, including GPUs, have not reached the same level
of maturity as simulators for CPUs, both due to the secrecy of
leading GPU vendors, and the problems arising from mapping parallel architectures onto scalar ones, or onto different parallel architectures.
At the moment, GPU simulators that have been presented in
the literature have limitations, resulting from a lack of verification, poor accuracy, poor speed, and limited observability due to incomplete modelling of certain hardware features. As they do not accurately model the full native software stack, they
are unable to execute realistic GPU workloads, which rely on
extensive interaction with user and system runtime libraries.
In this work, we propose a full-system methodology for
GPU simulation, where rather than simulating the GPU as
an independent unit, we simulate it as a component of a
larger system, comprising a CPU simulator with supporting
devices, operating system, and a native, unmodified driver
stack. This faithful modelling results in a simulation platform
indistinguishable from real hardware.
We have been focusing our efforts on simulation of the
ARM Mali GPU, and have built a substantial amount of
surrounding infrastructure. We have seen promising results
in simulation of compute applications, most notably SLAMBench.
The work directly contributed to full system simulation,
by implementing the ARMv7 MMU, ARMv7 and Thumb-2
Instruction Sets, and a number of devices needed to commu-
nicate with the GPU. To connect the GPU model realistically,
we have implemented an ARM CPU GPU interface containing
an ARM Device on the CPU side [84].
The implementation of the Mali GPU simulator comprises:
- an implementation of the Job Manager, a hardware resource for controlling jobs on the GPU side;
- the Shader Core Infrastructure, which allows for retrieving the important context needed to execute shader programs;
- the Shader Program Decoder, which allows us to interpret Mali shader binary programs;
- the Shader Program Execution Engine, which allows us to simulate the behaviour of Mali programs.
Future plans for simulation include extending the infrastruc-
ture to support real time graphics simulation, increasing GPU
Simulation performance using Dynamic Binary Translation
(DBT) [79], [82], [85] techniques, and extending the Mali
Model to support performance modelling. We have also con-
tinued to investigate new techniques for full-system dynamic
binary translation (such as exploiting hardware features on the
host to further accelerate simulation performance), as well as
new methodologies for accelerating the implementation and
verification of full system instruction set simulators. Fast full
system simulation presents a large number of unique chal-
lenges and difficulties and in addressing and overcoming these
difficulties, we expect to be able to produce a significant body
of novel research. Taken as a whole, these tools will directly
allow us to explore next-generation many-core applications,
and design hardware that is characterised by high performance
and low power.
3) APTSim - simulation and prototyping platform: APTSim
(Fig. 14) is intended as a fast simulator allowing rapid simu-
lation of microprocessor architectures and microarchitectures
as well as the prototyping of accelerators. The system runs
on a platform consisting of a processor, for functional sim-
ulation, and an FPGA for implementing architecture timing
models and prototypes. Currently the Xilinx Zynq family
is used as the host platform. APTSim performs dynamic
binary instrumentation using MAMBO (see Section IV-B1) to instrument a running executable, together with the MAST co-design library, described below.
plugins allow specific instructions, such as load/store or PC
changing events to be sent to MAST hardware models, such
as memory systems or processor pipeline. From a simulation
perspective the hardware models are for timing and gathering
statistics and do not perform functional simulation, which is
carried out on the host processor as native execution; so, for example, if we send a request to a cache system, the model will tell us at which memory level the result is present and the response time, while the actual data will be returned from the processor's own memory. This separation allows smaller, less complicated hardware models to gather statistics whilst the processor executes the benchmark natively and the MAMBO plugins capture the necessary events with low overhead.
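This split between timing and function can be sketched in a few lines; the cache parameters and latencies below are made up, and in APTSim the real models run on the FPGA rather than in software:

```python
# Hedged sketch of the APTSim split: the timing model only classifies each
# memory access and returns a latency; the actual data always comes from
# the host processor's own memory.

class DirectMappedCacheModel:
    """Timing-only direct-mapped cache: tracks tags, never stores data."""
    def __init__(self, num_lines=4, line_size=16):
        self.num_lines, self.line_size = num_lines, line_size
        self.tags = [None] * num_lines
        self.hits = self.misses = 0

    def access(self, addr):
        line = (addr // self.line_size) % self.num_lines
        tag = addr // (self.line_size * self.num_lines)
        if self.tags[line] == tag:
            self.hits += 1
            return 1          # hit latency (cycles), made-up figure
        self.tags[line] = tag
        self.misses += 1
        return 20             # miss latency (cycles), made-up figure

model = DirectMappedCacheModel()
memory = list(range(256))                    # "native" data lives here
cycles = sum(model.access(a) for a in [0, 4, 32, 0, 32])
value = memory[4]                            # data path bypasses the model
print((model.hits, model.misses, cycles, value))  # (3, 2, 43, 4)
```

Note that `memory` is read directly, independently of the model: the model only accounts for time, which is why it can stay small enough to fit comfortably on the FPGA alongside other models.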
The MAST library provides a framework for easily inte-
grating many hardware IP blocks, implemented on FPGA,
with a Linux-based application running on a host processor.
MAST consists of two principal parts: a software compo-
nent and a hardware library. The software component allows
the discovery and management of hardware IP blocks and
the management of memory used by the hardware blocks;
critically this allows new hardware blocks to be configured
and used at runtime using only user space software. The
hardware library, written in Bluespec, contains parametrised IP
blocks including architecture models such as cache systems or
pipeline models and accelerator modules for computer vision,
such as filters or feature detectors.
Fig. 14: APTSim, an FPGA-accelerated simulation and prototyping platform, currently implemented on a Zynq SoC.
The hardware models can
either be masters or slaves, from a memory perspective. As
masters, models can directly access processor memory leaving
the processor to execute code whilst the hardware is analysing
the execution of the last code block.
APTSim also allows us to evaluate prototype hardware,
for example we evaluated multiple branch predictors by im-
plementing them in Bluespec and using a MAST compliant
interface. This allows us to execute our benchmark code once
on the CPU and offload to multiple candidate implementations
to rapidly explore the design space.
In [86] we show that on the Xilinx Zynq 7000 FPGA board, coupled with a relatively slow 666 MHz ARM Cortex-A9 processor, the slowdown of APTSim is 400x in comparison to native execution on the same processor. While a significant slowdown over native execution is unavoidable to implement fine-grained performance monitoring, the slowdown of APTSim is about half that of GEM5 running at 3.2 GHz on an Intel Xeon E3 to simulate the same ARM system. Note that, contrary to APTSim, GEM5 on the Xeon does not benefit from any FPGA acceleration. This demonstrates the value of APTSim's use of FPGA acceleration to implement fast Register Transfer Level (RTL) simulation and monitor its performance, while hiding the complexity of FPGA programming from the user.
B. Profiling
Profiling is the process of analysing the runtime behaviour
of a program in order to measure aspects of its performance; for example, to determine which
parts of the program take the most time to execute. This infor-
mation can then be used to improve software (for example, by
using a more optimised implementation of frequently executed
functions) or to improve hardware (by including hardware structures or instructions which provide better performance for frequently executed functions).
Fig. 15: MaxSim overview of ZSim- and MaxineVM-based profiling.
Profiling of native applications
is typically performed via dynamic binary instrumentation.
However, when a managed runtime environment is used,
the runtime environment can often perform the necessary
instrumentation. In this section, we explore both of these
possibilities, with MAMBO being used for native profiling,
and MaxSim being used for the profiling of Java applications.
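As a minimal, language-level illustration of the idea (using Python's built-in cProfile, not the tools described below), a profiler attributes call counts and time to program locations so the hot function can be identified:

```python
# Sketch: locating the hottest function in a toy workload with Python's
# built-in deterministic profiler. Native profiling (as with MAMBO) works
# at machine-code level, but the principle of attributing cost to code
# locations is the same.
import cProfile
import pstats

def hot():                       # called many times: dominates runtime
    return sum(i * i for i in range(1000))

def cold():                      # called once, calls hot() once more
    return hot()

def workload():
    for _ in range(200):
        hot()
    cold()

prof = cProfile.Profile()
prof.enable()
workload()
prof.disable()

# stats maps (file, line, funcname) -> (primitive calls, total calls, ...)
stats = pstats.Stats(prof)
calls = {key[2]: value[1] for key, value in stats.stats.items()}
print(calls["hot"])   # 201: 200 direct calls plus 1 via cold()
```

A report like this is what motivates optimising `hot` first, exactly the decision the tools below support for native and managed code.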
1) MAMBO: instruction level profiling: Dynamic Binary
Instrumentation (DBI) is a technique for instrumenting ap-
plications transparently while they are executed, working at
the level of machine code. As the ARM architecture expands
beyond its traditional embedded domain, there is a growing
interest in DBI systems for the general-purpose multicore
processors that are part of the ARM family. DBI systems
introduce a performance overhead and reducing it is an active
area of research; however, most efforts have focused on the
x86 architecture.
MAMBO is a low overhead DBI framework for 32-bit
(AArch32) and 64-bit ARM (AArch64) [87]. MAMBO is
open-source [88]. MAMBO provides an event-driven plugin
API for the implementation of instrumentation tools with mini-
mal complexity. The API allows the enumeration, analysis and
instrumentation of the application code ahead of execution, as
well as tracking and control of events such as system calls.
Furthermore, the MAMBO API provides a number of high
level facilities for developing portable instrumentation, i.e.
plugins which can execute efficiently both on AArch32 and
AArch64, while being implemented using mostly high level
architecture-agnostic code.
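MAMBO's actual plugin API is a C interface over ARM machine code; purely to illustrate the event-driven plugin pattern it exposes, the toy model below registers a pre-instruction callback with a miniature "binary" interpreter and counts memory operations. All names here are invented for the sketch.

```python
# Sketch: the shape of an event-driven instrumentation API, in miniature.
# A toy "binary" is a list of (opcode, operand) pairs; plugins register
# callbacks that fire before each instruction is executed, without the
# instrumented program being aware of them.

class Instrumenter:
    def __init__(self):
        self.pre_insn_hooks = []
    def register_pre_insn(self, hook):
        self.pre_insn_hooks.append(hook)
    def run(self, code):
        regs = {"r0": 0}
        for insn in code:
            for hook in self.pre_insn_hooks:
                hook(insn)               # instrumentation runs first
            op, arg = insn
            if op == "add":
                regs["r0"] += arg
            # loads/stores are modelled as no-ops in this sketch
        return regs

counts = {"ldr": 0, "str": 0}
def count_mem_ops(insn):
    if insn[0] in counts:
        counts[insn[0]] += 1

vm = Instrumenter()
vm.register_pre_insn(count_mem_ops)
regs = vm.run([("ldr", 0), ("add", 5), ("str", 0), ("add", 2), ("ldr", 0)])
print(counts, regs["r0"])   # {'ldr': 2, 'str': 1} 7
```

A real DBI system does the same thing by rewriting basic blocks in a code cache so the hooks execute inline with the translated machine code.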
MAMBO incorporates a number of novel optimisations,
specifically designed for the ARM architecture, which allow
it to minimise its performance overhead. The geometric mean
runtime overhead of MAMBO running SPEC CPU2006 with
no instrumentation is as low as 12% (on an APM X-C1
system), compared to DynamoRIO [89], a state-of-the-art DBI
system, which has an overhead of 34% under the same test
conditions.
Fig. 16: Performance of MaxSim on KinectFusion: (a) heap space saving using tagged
pointers, (b) relative reduction in execution time, and (c) relative reduction in DRAM
dynamic energy.
2) MaxSim: profiling and prototyping hardware-software
co-design for managed runtime systems: Managed applica-
tions, written in programming languages such as Java, C# and
others, represent a significant share of workloads in the mo-
bile, desktop, and server domains. Microarchitectural timing
simulation of such workloads is useful for characterisation
and performance analysis, of both hardware and software,
as well as for research and development of novel hardware
extensions. MaxSim [90] (see Fig. 15), is a simulation platform
based on the MaxineVM [75] (explained in Section III-B2),
the ZSim [91] simulator, and the McPAT [92] modelling
framework. MaxSim can perform fast and accurate simulation
of managed runtime workloads running on top of Maxine
VM [74]. MaxSim’s capabilities include: 1) low-intrusive
microarchitectural profiling via pointer tagging on x86-64
platforms, 2) modelling of hardware extensions related, but
not limited to, tagged pointers, and 3) modelling of complex
software changes via address-space morphing.
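The pointer-tagging idea can be shown in a few lines. On x86-64, virtual addresses use 48 bits, so (ignoring the canonical-address sign-extension rule, which a real implementation must handle) the top 16 bits of a 64-bit pointer can carry metadata such as a class id. The sketch below is an illustration of the concept, not MaxSim's implementation:

```python
# Sketch: pointer tagging as used for low-intrusive profiling. The top
# 16 bits of a 64-bit pointer hold a class id; the tag is stripped to
# recover the address actually used by loads and stores. Canonical-address
# handling on real x86-64 hardware is deliberately omitted here.
ADDR_BITS = 48
ADDR_MASK = (1 << ADDR_BITS) - 1

def tag_pointer(ptr, class_id):
    assert 0 <= class_id < (1 << 16)
    return (class_id << ADDR_BITS) | (ptr & ADDR_MASK)

def untag(ptr):
    return ptr & ADDR_MASK       # the address used for the memory access

def tag_of(ptr):
    return ptr >> ADDR_BITS      # profiling metadata, e.g. a class id

p = 0x7F3A00001040
tp = tag_pointer(p, 42)
print(hex(untag(tp)), tag_of(tp))   # 0x7f3a00001040 42
```

Because the tag travels with the pointer through the cache hierarchy, a simulator can attribute each hardware event (e.g. a cache miss) to the type or allocation site encoded in the tag at essentially no extra cost.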
Low-intrusive microarchitectural profiling is achieved by
utilising tagged pointers to collect type- and allocation-site re-
lated hardware events. Furthermore, MaxSim allows, through
a novel technique called address space morphing, the easy
modelling of complex object layout transformations. Finally,
through the co-designed capabilities of MaxSim, novel
hardware extensions can be implemented and evaluated. We
showcase MaxSim's capabilities by simulating the whole set of
the DaCapo-9.12-bach benchmarks in less than a day while
performing an up-to-date microarchitectural power and per-
formance characterisation [90]. Furthermore, we demonstrate
a hardware/software co-designed optimisation that performs
dynamic load elimination for array length retrieval, achieving
up to a 14% reduction in L1 data cache loads and up to a 4%
reduction in dynamic energy. In [93] we present results for
MaxineVM with MaxSim. We use SLAMBench to experiment
with KinectFusion on a 4-core Nehalem system, using 1 and 4
cores (denoted by 1C and 4C, respectively). We use MaxSim’s
extensions for the Address Generation Unit (AGU) (denoted
by 1CA and 4CA) and Load-Store Unit (LSU) extension
(shown by 1CAL and 4CAL). Fig. 16-a shows heap savings of
more than 30% on SLAMBench thanks to CIP (Class Informa-
tion Pointer) elimination. Fig. 16-b demonstrates the relative
reduction in execution time using the proposed framework.
In this figure, EA refers to a machine configuration with
CIP elimination and a 16-bit CID (Class Information), and
EAL refers to a variant with CIP elimination, a 4-bit CID, and
the AGU and LSU extensions. B stands for the standard baseline
MaxSim virtual machine, and C is B with object compression.
Fig. 16-b shows up to 6% execution time benefit from CIP
elimination over MaxSim with none of our extensions,
whether using 4 cores (4CA-EA/4CA-B) or 1 core (1CA-
EA/1C-B). Finally, Fig. 16-c shows the relative reduction in
DRAM dynamic energy for the cases mentioned above: an
18% to 28% reduction. These reductions contribute to the
objective of improved quality of results. MaxSim is open-
source [74].
C. Specialisation
Recent developments in computer vision and machine learn-
ing have challenged hardware and circuit designers to design
faster and more efficient systems for these tasks [94]. Google's
Tensor Processing Unit (TPU) [67], Intel Movidius's Vision
Processing Unit (VPU) [66], and Graphcore's Intelligent
Processing Unit (IPU) [95] are examples of substantially
re-engineered hardware designs that deliver
outstanding performance. While the development of custom
hardware can be appealing due to the possible significant
benefits, it can lead to extremely high design, development,
and verification costs, and a very long time to market. One
method of avoiding these costs while still obtaining many
of the benefits of custom hardware is to specialise exist-
ing hardware. We have explored several possible paths to
specialisation, including specialised memory architectures for
GPGPU computations (which are frequently used in computer
vision algorithm implementations), the use of single-ISA het-
erogeneity (as seen in ARM’s big.LITTLE platforms), and the
potential for power and area savings by replacing hardware
structures with software.
1) Memory Architectures for GPGPU Computation: Cur-
rent GPUs are no longer perceived as accelerators solely for
graphic workloads, and now cater to a much broader spectrum
of applications. In a short time, GPUs have proven to be
of substantial significance in the world of general-purpose
computing, playing a pivotal role in Scientific and High
Performance Computing (HPC). The rise of general-purpose
computing on GPUs has contributed to the introduction of
on-chip cache hierarchies in those systems. Additionally,
SLAM algorithms frequently reuse previously processed data,
for example in bundle adjustment, loop detection, and loop
closure, and it has been shown that efficient memory use can
improve the runtime speed of the algorithm. For instance, the
Distributed Particle (DP) filter optimises memory
requirements using an efficient data structure for maintaining
the map [96].
Fig. 17: Speed-up of Instructions Per Cycle (IPC) with varying remote L1 access
We have carried out a workload characterisation of GPU
architectures on general-purpose workloads, to assess the
efficiency of their memory hierarchies [97] and proposed a
novel cache optimisation to resolve some of the memory
performance bottlenecks in GPGPU systems [98].
In our workload characterisation study (overview on Fig. 17)
we saw that, in general, high level-1 (L1) data cache miss rates
place high demands on the available level-2 (L2) bandwidth
that is shared by the large number of cores in typical GPUs.
In particular, Fig. 17 represents bandwidth as the number of
Instructions Per Cycle (IPC). Furthermore, the high demand for
L2 bandwidth leads to extensive congestion in the L2 access
path, and in turn this leads to high memory latencies. Al-
though GPUs are heavily multi-threaded, in memory intensive
applications the memory latency becomes exposed due to a
shortage of active compute threads, reducing the ability of the
multi-threaded GPU to hide memory latency (Exposed latency
range on Fig. 17). Our study also quantified congestion in
the memory system, at each level of the memory hierarchy,
and characterised the implications of high latencies due to
congestion. We identified architectural parameters that play a
pivotal role in memory system congestion, and explored the
design space of architectural options to mitigate the bandwidth
bottleneck. We showed that the improvement in performance
achieved by mitigating the bandwidth bottleneck in the cache
hierarchy can exceed the speedup obtained by simply in-
creasing the on-chip DRAM bandwidth. We also showed that
addressing the bandwidth bottleneck in isolation at specific
levels can be suboptimal and can even be counter-productive.
In summary, we showed that it is imperative to resolve the
bandwidth bottleneck synergistically across all levels of the
memory hierarchy. The second part of our work in this area
aimed to reduce the pressure on the shared L2 bandwidth. One
of the key factors we have observed is that there is significant
replication of data among private L1 caches, presenting an
opportunity to reuse data among the L1s. We have proposed
a Cooperative Caching Network (CCN), which exploits reuse
by connecting the L1 caches with a lightweight ring network
to facilitate inter-core communication of shared data. When
measured on a selection of GPGPU benchmarks, this approach
delivers a performance improvement of 14.7% for applications
that exhibit reuse.
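The CCN idea can be captured in a toy functional model (this is an illustration of the policy, not the microarchitecture): on an L1 miss, a core probes its peers' L1 caches over the ring before falling back to L2, and every remote hit is an L2 access avoided.

```python
# Sketch: the Cooperative Caching Network policy as a toy model. On an
# L1 miss, a core probes peer L1s around a ring before going to the
# shared L2; each remote hit saves one L2 access. Caches are plain sets.

def access(core, line, l1s, stats):
    if line in l1s[core]:
        stats["l1_hit"] += 1
        return
    n = len(l1s)
    for step in range(1, n):          # walk the ring of peer L1s
        peer = (core + step) % n
        if line in l1s[peer]:
            stats["remote_hit"] += 1  # served by a peer, L2 untouched
            l1s[core].add(line)
            return
    stats["l2_access"] += 1           # true miss: go to the shared L2
    l1s[core].add(line)

l1s = [set(), set(), set(), set()]
stats = {"l1_hit": 0, "remote_hit": 0, "l2_access": 0}
# core 0 faults line "A" in from L2; cores 1-3 then reuse core 0's copy
for core in range(4):
    access(core, "A", l1s, stats)
print(stats)   # {'l1_hit': 0, 'remote_hit': 3, 'l2_access': 1}
```

Replication across L1s, normally pure overhead, becomes an asset: three of the four fills are served without consuming shared L2 bandwidth.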
2) Evaluation of single-ISA heterogeneity: We have in-
vestigated the design of heterogeneous processors sharing
Fig. 18: Example of a Baseline selection, and 2- and 8-Core selections for a specific
benchmark application.
a common ISA. The underlying motivation for single-ISA
heterogeneity is that a diverse set of cores can enable runtime
flexibility. We argue that selecting a diverse set of hetero-
geneous cores to enable flexible operation at runtime is a
non-trivial problem due to diversity in program behaviour.
We further show that common evaluation methods lead to
false conclusions about diversity. We suggest the Kolmogorov–
Smirnov (KS) statistical test as an evaluation metric.
The KS test is the first step towards a heterogeneous design
methodology that optimises for runtime flexibility [99], [100].
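The two-sample KS statistic itself is simple: it is the largest gap between the empirical CDFs of two samples. The stdlib sketch below compares per-program performance distributions of hypothetical cores (the paper's full methodology is of course richer than this):

```python
# Sketch: the two-sample Kolmogorov-Smirnov statistic, the measure
# suggested for comparing the per-program performance distributions of
# two cores. D is the largest gap between the two empirical CDFs.
def ks_statistic(a, b):
    pts = sorted(set(a) | set(b))
    def ecdf(xs, t):
        return sum(1 for x in xs if x <= t) / len(xs)
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in pts)

core_a = [1.0, 1.1, 1.2, 1.3]   # e.g. normalised runtimes per program
core_b = [1.0, 1.1, 1.2, 1.3]   # identical behaviour: D = 0
core_c = [2.0, 2.1, 2.2, 2.3]   # fully disjoint behaviour: D = 1
print(ks_statistic(core_a, core_b), ks_statistic(core_a, core_c))  # 0.0 1.0
```

Two cores with a small D behave alike across the workload set and add little diversity; a large D indicates genuinely complementary behaviour.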
A major roadblock to the further development of heteroge-
neous processors is the lack of appropriate evaluation metrics.
Existing metrics can be used to evaluate individual cores,
but to evaluate a heterogeneous processor, the cores must be
considered as a collective. Without appropriate metrics, it is
impossible to establish design goals for processors, and it is
difficult to accurately compare two different heterogeneous
processors. We present four new metrics to evaluate user-
oriented aspects of sets of heterogeneous cores: localized
non-uniformity, gap overhead, set overhead, and generality.
The metrics consider sets rather than individual cores. We
use examples to demonstrate each metric, and show that the
metrics can be used to quantify intuitions about heterogeneous
cores [101].
For a heterogeneous processor to be effective, it must
contain a diverse set of cores to match a range of runtime
requirements and program behaviours. Selecting a diverse set
of cores is, however, a non-trivial problem. We present a
method of core selection that chooses cores at a range of
power-performance points. For example, we see in Fig. 18
that for a normalised power budget of 1.3 (1.3 times higher
than the most power-efficient alternative), the best possible
normalised time using the baseline selection is 1.75 (1.75
times the fastest execution time), whereas an 8 core selection
can lower this ratio to 1.4 without exceeding the normalised
power budget, i.e., our method brings a 20% speedup. Our
algorithm is based on the observation that it is not necessary
for a core to consistently have high performance or low power;
one type of core can fulfil different roles for different types
of programs. Given a power budget, cores selected with our
method provide an average speedup of 7% on EEMBC mobile
benchmarks, and 23% on SPECint 2006 benchmarks, over
the state-of-the-art core selection method [102].
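The way a selection is read off a plot like Fig. 18 can be sketched directly, using the illustrative numbers quoted above (all core points below are invented; only the budget-1.3 / time-1.75-vs-1.4 relationship mirrors the text):

```python
# Sketch: reading a core selection as in Fig. 18. Candidate cores are
# (normalised_power, normalised_time) points; the best achievable time
# under a power budget is that of the fastest core that fits. The core
# points are illustrative, not measured data.
def best_time(cores, power_budget):
    feasible = [t for p, t in cores if p <= power_budget]
    return min(feasible) if feasible else None

baseline = [(1.0, 2.1), (1.25, 1.75), (2.0, 1.0)]
eight_core = baseline + [(1.1, 1.9), (1.3, 1.4), (1.6, 1.2)]

budget = 1.3
t_base = best_time(baseline, budget)      # 1.75
t_eight = best_time(eight_core, budget)   # 1.4
print(t_base, t_eight, round(1 - t_eight / t_base, 2))   # 1.75 1.4 0.2
```

A richer candidate set fills the gaps between the baseline's power-performance points, which is exactly what yields the 20% speedup at this budget.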
Fig. 19: Holistic optimisation methods explore all domains of the real-time 3D scene
understanding, including hardware, software, and computer vision algorithms. Two
holistic works presented here: Design Space Exploration and Crowdsourcing.
In this section, we introduce holistic optimisation methods
that combine developments from multiple domains, i.e. hard-
ware, software, and algorithm, to develop efficient end-to-end
solutions. The design space exploration work presents the idea
of exploring many sets of possible parameters to properly
exploit them at different situations. The crowdsourcing further
tests the DSE idea on a massive number of devices. Fig. 19
summarises their goals and contributions.
A. Design Space Exploration
Design space exploration is the exploration of various pos-
sible design choices before running the system [103]. In scene
understanding algorithms, the possible space of the design
choices is very large and spans from high-level algorithmic
choices down to parametric choices within an algorithm. For
instance, Zhang et al. [104] explore the algorithmic and
parametric choices of a visual-inertial pipeline on an ARM CPU, as
well as on a Xilinx Kintex-7 XC7K355T FPGA. In this section,
we introduce two DSE algorithms: The first one called multi-
domain DSE explores algorithmic, compiler and hardware pa-
rameters. The second one, coined motion-aware DSE, further
adds the complexity of the motion and the environment to the
exploration space. The latter work is extended to develop an
active SLAM algorithm.
1) Multi-domain DSE: Until now, resource-intensive scene
understanding algorithms, such as KinectFusion, could only
run in real-time on powerful desktop GPUs. In [105] we
examine how it can be mapped to power constrained em-
bedded systems and we introduce HyperMapper, a tool for
multi-objective DSE. HyperMapper was demonstrated in a
variety of applications ranging from computer vision and
robotics to compilers [105], [106], [44], [107]. Key to our
approach is the idea of incremental co-design exploration,
where optimisation choices that concern the domain layer are
incrementally explored together with low-level compiler and
architecture choices (See Fig. 21, dashed boxes). The goal of
this exploration is to reduce execution time while minimising
power and meeting our quality of result objective. Fig. 20
shows an example with KinectFusion, in which each point
corresponds to one set of parameters, evaluated on two
metrics: maximum ATE and runtime speed.
Fig. 20: This plot illustrates the result of HyperMapper on the Design Space Exploration
of the KinectFusion algorithmic parameters considering accuracy and frame rate metrics.
We can see the result of random sampling (red) as well as the improvement of solutions
after active learning (black).
As the design space is too large
to exhaustively evaluate, we use active learning based on a
random forest predictor to find good designs. We show that
our approach can, for the first time, achieve dense 3D mapping
and tracking in the real-time range within a 1W power budget
on a popular embedded device. This is a 4.8x execution time
improvement and a 2.8x power reduction compared to the
default configuration.
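HyperMapper's random-forest-driven search is out of scope for a short sketch, but its overall shape, a random sampling phase followed by an "active" phase that concentrates the remaining budget around the best feasible design, can be shown on a synthetic cost model. Everything below (the cost model, parameter ranges, and the "default" configuration) is invented for illustration:

```python
# Sketch: the shape of active-learning design space exploration. A random
# phase samples configurations; an active phase refines the best feasible
# one. The cost model stands in for running SLAMBench; HyperMapper itself
# uses random-forest surrogates rather than this simple local refinement.
import random

def evaluate(cfg):
    """Synthetic stand-in for benchmarking one configuration."""
    volume, mu = cfg                     # e.g. volume resolution, ICP threshold
    runtime = 0.02 * volume + 0.1 / mu   # seconds per frame (made up)
    ate = 0.01 + 0.5 * mu / volume       # max ATE in metres (made up)
    return runtime, ate

def explore(budget, accuracy_limit=0.05, seed=1):
    rng = random.Random(seed)
    sample = lambda: (rng.uniform(1, 16), rng.uniform(0.01, 1.0))
    default = (8.0, 0.5)                 # known-good starting configuration
    tried = [(evaluate(default), default)]
    tried += [(evaluate(c), c) for c in (sample() for _ in range(budget // 2))]
    best = min(t for t in tried if t[0][1] <= accuracy_limit)  # fastest feasible
    for _ in range(budget - budget // 2):  # active phase: refine the incumbent
        c = tuple(max(0.02, v * rng.uniform(0.8, 1.2)) for v in best[1])
        t = (evaluate(c), c)
        if t[0][1] <= accuracy_limit and t[0][0] < best[0][0]:
            best = t
    return best

(runtime, ate), cfg = explore(200)
print(round(runtime, 3), round(ate, 3))  # a feasible, fast configuration
```

The returned design always respects the accuracy limit and is never slower than the default, mirroring how the black (active learning) points in Fig. 20 improve on random sampling.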
2) Motion and Structure-aware DSE: In Multi-domain
DSE, when tuning software and hardware parameters, we also
need to take into account the structure of the environment and
the motion of the camera. In the Motion and Structure-aware
Design Space Exploration (MS-DSE) work [44], we deter-
mine the complexity of the structure and motion with a few
parameters calculated using information theory. Depending on
this complexity and the desired performance metrics, suitable
parameters are explored and determined. The hypothesis of
MS-DSE is that we can use a small set of parameters as a very
useful proxy for a full description of the setting and motion of
a SLAM application. We call these Motion and Structure (MS)
parameters, and define them based on information divergence
metric. Fig. 21 demonstrates the set of all design spaces.
MS-DSE presents a comprehensive parametrisation of 3D
understanding scene algorithms, and thus based on this new
parameterisation, many new concepts and applications can be
developed. One of these applications, active SLAM, is outlined
here. For more applications, please see [105], [106], [44].
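One plausible Motion and Structure parameter in this spirit is an information divergence between consecutive frames; the sketch below uses the KL divergence of normalised intensity histograms (the paper's exact divergence metric may differ, and the frames here are tiny synthetic examples):

```python
# Sketch: a "motion" parameter in the MS-DSE spirit, computed as the KL
# divergence between intensity histograms of consecutive frames. High
# divergence suggests abrupt motion that is likely to break tracking.
import math

def kl_divergence(p, q, eps=1e-9):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def histogram(frame, bins=4, max_val=256):
    h = [0] * bins
    for v in frame:
        h[v * bins // max_val] += 1
    n = len(frame)
    return [c / n for c in h]

still = [10, 12, 11, 200, 210, 205]          # frame t
small_motion = [11, 13, 12, 199, 212, 204]   # frame t+1, slight change
large_motion = [120, 130, 125, 90, 100, 95]  # frame t+1, abrupt change

p = histogram(still)
d_small = kl_divergence(p, histogram(small_motion))
d_large = kl_divergence(p, histogram(large_motion))
print(round(d_small, 3), round(d_large, 1))  # 0.0 20.0
```

A planner can then penalise candidate motions whose predicted divergence is high, which is exactly the behaviour exploited by the active SLAM application described next.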
a) Active SLAM: Active SLAM is the method for choos-
ing the optimal camera trajectory, in order to maximise the
camera pose estimation, the accuracy of the reconstruction,
or the coverage of the environment. In [44], it is shown that
MS-DSE can be utilised to optimise not only fixed system
parameters, but also to guide a robotic platform to maintain
a good performance for localisation and mapping. As shown
in Fig. 21, a Pareto front holds all optimal parameters. The
front has been prepared in advance by exploring the set of all
parameters. When the system is operating, optimal parameters
Fig. 21: Motion and structure aware active SLAM design space exploration.
Fig. 22: Success vs. failure rate when mapping the same environment with different
motion planning algorithms: active SLAM and random walk.
are chosen given the desired performance metrics. Then these
parameters are used to initialise the system. Using MS param-
eters, the objective is to avoid motions that cause very high
statistical divergence between two consecutive frames. This
way, we can provide a robust SLAM algorithm by allowing
tracking to keep working at all times. Fig. 22 compares the active
SLAM with a random walk algorithm. The experiments were
done in four different environments. In each environment, each
algorithm was run 10 times. Repeated experiments serve as a
measure of the robustness of the algorithm in dealing with
uncertainties rising from minor changes in illumination, or
inaccuracies of the response of the controller or actuator to
the commands. The consistency of the generated map was
evaluated manually as either a success or failure of SLAM.
If duplicates of one object were present in the map, it was
considered a failure. This experiment shows a success rate of
more than 50% when employing the proposed active
SLAM algorithm [44], an improvement in the robustness of
SLAM algorithms achieved through design space exploration.
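The precomputed-Pareto-front lookup at the heart of this scheme can be sketched in a few lines. Offline, dominated design points are discarded; online, the system picks the fastest remaining point that meets its current error target. The design points below are purely illustrative:

```python
# Sketch: Pareto-front preparation and runtime lookup for active SLAM.
# Points are (execution_time, trajectory_error, params); the front is
# computed offline and queried when the system starts operating.

def pareto_front(points):
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)

def pick(front, max_error):
    feasible = [p for p in front if p[1] <= max_error]
    return min(feasible) if feasible else None   # fastest feasible point

designs = [
    (0.10, 0.08, "low-res"),
    (0.12, 0.09, "dominated"),   # worse than low-res on both metrics
    (0.20, 0.04, "mid-res"),
    (0.35, 0.02, "high-res"),
]
front = pareto_front(designs)
print([p[2] for p in front])         # ['low-res', 'mid-res', 'high-res']
print(pick(front, max_error=0.05))   # (0.2, 0.04, 'mid-res')
```

Changing the desired performance metric at runtime amounts to a new `pick` call; no re-exploration of the design space is needed.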
3) Comparative DSE of Dense vs Semi-dense SLAM: Another
direction in DSE work is performance exploration across
multiple algorithms. While Multi-domain DSE explores
different parameters of a given algorithm, comparative DSE,
presented in [108], explores the performance of two different
algorithms under different parametric settings.
In comparative DSE, two state-of-the-art SLAM algorithms,
KinectFusion and LSD-SLAM, are compared on multiple
datasets. Using SLAMBench benchmarking capabilities, a
Fig. 23: Distribution of Absolute Trajectory Error (ATE, in cm, against % of time
recorded) using KinectFusion and LSD-SLAM, run with default parameters on Desktop.
The mean absolute error has been highlighted. (a) Real scene: TUM RGB-D fr2 xyz.
(b) Synthetic scene: ICL-NUIM lr kt2.
full design space exploration is performed over algorithmic
parameters, compilation flags and multiple architectures. Such
thorough parameter space exploration gives us key insights
into the behaviour of each algorithm under different operating
conditions, and into the relationships between distinct, yet
correlated, blocks of parameters.
As an example, in Fig. 23 we show the result of comparative
DSE between LSD-SLAM and KinectFusion in terms of their
ATE distribution across two scenes of two different datasets.
The histograms display the error distribution across the entire
sequence, from which we can get a sense of how well the
algorithms are performing for the whole trajectory. We hope
that these analyses enable researchers to develop more robust
algorithms. Without the holistic approach enabled by SLAM-
Bench such insights would have been much harder to obtain.
This sort of information is invaluable for a wide range of
SLAM practitioners, from VR/AR designers to roboticists who
want to select or modify the best algorithm for their particular
use case.
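The per-frame statistic behind plots like Fig. 23 is straightforward to compute. The sketch below assumes the estimated and ground-truth trajectories are already time-associated and aligned (real ATE evaluation first aligns them, e.g. with a rigid-body fit), and uses invented coordinates:

```python
# Sketch: the Absolute Trajectory Error (ATE) statistics behind Fig. 23.
# Trajectories are lists of (x, y, z) positions, assumed already aligned
# and time-associated; per-frame error is the Euclidean distance between
# estimated and ground-truth positions.
import math

def ate_per_frame(estimated, ground_truth):
    return [math.dist(e, g) for e, g in zip(estimated, ground_truth)]

gt  = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
est = [(0, 0, 0), (1, 0.3, 0), (2, 0.4, 0), (3, 0, 0)]

errors = ate_per_frame(est, gt)
mean_ate = sum(errors) / len(errors)
print([round(e, 3) for e in errors], round(mean_ate, 3), round(max(errors), 3))
```

Histogramming `errors` over a whole sequence yields exactly the per-algorithm distributions compared in the figure, which is more informative than the mean alone.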
B. Crowdsourcing
The SLAMBench framework, and more specifically its various
KinectFusion implementations, has been ported to Android.
More than 2000 downloads have been made since its official
release on the Google Play store. We have received numerous
positive feedback reports, and the application has generated
a great deal of interest in the community and with industrial
partners.
This level of uptake allowed us to collect data from more
than 100 different mobile phones. Fig. 24 shows the speed-up
across many models of Android devices that we have
experimented with. Clearly it is possible to achieve more
than twice runtime speed by tuning the system parameters
using the tools introduced in the paper. We plan to use these
data to analyse the performance of KinectFusion on those
platforms, and to provide techniques to optimise KinectFusion
performance depending on the targeted platform. This work
will apply transfer-learning methodology. We believe that by
combining design space exploration [106] and the collected
data, we can train a decision machine to select code variants
and configurations for diverse mobile platforms automatically.
In this paper we focused on SLAM, which is an enabling
technology in many fields including virtual reality, augmented
reality, and robotics. The paper presented several contributions
Fig. 24: By combining design space exploration and crowdsourcing, we checked that
design space exploration efficiently works on various types of platforms. This figure
demonstrates the speed-up of the KinectFusion algorithm on various different types of
Android devices. Each bar represents the speed-up for one type (model) of Android
device. The models are not shown for the sake of clarity of the figure.
across hardware architecture, compiler and runtime software
systems, and computer vision algorithmic layers of the SLAM
pipeline. We proposed not only contributions at each layer,
but also holistic methods that optimise the system as a whole.
In computer vision and applications, we presented bench-
marking tools that allow us to select a proper dataset and use it
to evaluate different SLAM algorithms. SLAMBench is used
to evaluate the KinectFusion algorithm on various different
hardware platforms. SLAMBench2 is used to compare various
SLAM algorithms very efficiently. We also extended the
KinectFusion algorithm, such that it can be used in robotic
path planning and navigation algorithms by mapping both
occupied and free space of the environment. Moreover, we
explored new sensing technologies such as focal-plane sensor-
processor arrays, which have low power consumption and high
effective frame rate.
The software layer of this project demonstrated that soft-
ware optimisation can be used to deliver significant improve-
ments in power consumption and speed trade-off when spe-
cialised for computer vision applications. We explored static,
dynamic, and hybrid approaches and focused their application
on the KinectFusion algorithm. Being able to select and
deploy optimisations adaptively is particularly beneficial in the
context of dynamic runtime environment where application-
specific details can strongly improve the result of JIT compi-
lation and thus the speed of the program.
The project has made a range of contributions across the
hardware design and development field. Profiling tools have
been developed in order to locate and evaluate performance
bottlenecks in both native and managed applications. These
bottlenecks could then be addressed by a range of special-
isation techniques, and the specialised hardware evaluated
using the presented simulation techniques. This represents a
full workflow for creating new hardware for computer vision
applications which might be used in future platforms.
Finally, we report on holistic methods that exploit our ability
to explore the design space at every level in a holistic fashion.
We demonstrated several design space exploration methods
where we showed that it is possible to fine-tune the system
such that we can meet desired performance metrics. It is also
shown that we can increase public engagement in accelerating
the design space exploration by crowdsourcing.
In future work, two main directions will be followed: The
first is exploiting our knowledge from all domains of this
paper to select a SLAM algorithm and design a chip that
is customised to efficiently implement the algorithm. This
approach will utilise data from SLAMBench2 and real-world
experiments to drive the design of a specialised vision proces-
sor. The second direction is utilising the tools and techniques
presented here to develop a standardised method that takes the
high-level scene understanding functionalities and develops the
optimal code that maps the functionalities to the heterogeneous
resources available, optimising for the desired performance
metrics.
This research is supported by the Engineering and Physical
Sciences Research Council (EPSRC), grant reference
EP/K008730/1, PAMELA project.
[1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira,
I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous
localization and mapping: Toward the robust-perception age,” IEEE
Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[2] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics (Intelligent
Robotics and Autonomous Agents). The MIT Press, 2005.
[3] H. Durrant-Whyte and T. Bailey, “Simultaneous localization and map-
ping: part I,” IEEE Robotics Automation Magazine, vol. 13, no. 2, pp.
99–110, 2006.
[4] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM:
Real-time single camera SLAM,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.
[5] G. Klein and D. Murray, “Parallel tracking and mapping on a camera
phone,” in Proceedings of IEEE and ACM International Symposium on
Mixed and Augmented Reality (ISMAR), 2009.
[6] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: Dense
tracking and mapping in real-time,” in Proceedings of International
Conference on Computer Vision (ICCV), 2011, pp. 2320–2327.
[7] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J.
Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, “KinectFu-
sion: Real-time dense surface mapping and tracking,” in Proceedings
of IEEE International Symposium on Mixed and Augmented Reality
(ISMAR), 2011, pp. 127–136.
[8] L. Fan, F. Zhang, G. Wang, and Z. Liu, “An effective approximation
algorithm for the malleable parallel task scheduling problem,” Journal
of Parallel and Distributed Computing, vol. 72, no. 5, pp. 693–704,
[9] N. Melot, C. Kessler, J. Keller, and P. Eitschberger, “Fast crown
scheduling heuristics for energy-efficient mapping and scaling of
moldable streaming tasks on manycore systems,” ACM Transactions
on Architecture and Code Optimization (TACO), vol. 11, no. 4, pp.
62:1–62:24, 2015.
[10] N. Melot, C. Kessler, and J. Keller, “Improving energy-efficiency of
static schedules by core consolidation and switching off unused cores.”
in Proceedings of International Conference on Parallel Computing
(ParCo), 2015, pp. 285 – 294.
[11] H. Xu, F. Kong, and Q. Deng, “Energy minimizing for parallel real-
time tasks based on level-packing,” in IEEE International Conference
on Embedded and Real-Time Computing Systems and Applications
(RTCSA), 2012, pp. 98–103.
[12] T. Schwarzer, J. Falk, M. Glaß, J. Teich, C. Zebelein, and C. Haubelt,
“Throughput-optimizing compilation of dataflow applications for multi-
cores using quasi-static scheduling,” in Proceedings of ACM Interna-
tional Workshop on Software and Compilers for Embedded Systems,
2015, pp. 68–75.
[13] U. Dastgeer and C. Kessler, “Performance-aware composition frame-
work for GPU-based systems,” The Journal of Supercomputing, vol. 71,
no. 12, pp. 4646–4662, 2015.
[14] ——, “Smart containers and skeleton programming for GPU-based
systems,” International Journal of Parallel Programming, vol. 44,
no. 3, pp. 506–530, 2016.
[15] I. Böhm, T. J. Edler von Koch, S. C. Kyle, B. Franke, and N. Topham,
“Generalized just-in-time trace compilation using a parallel task farm
in a dynamic binary translator,” The ACM Special Interest Group on
Programming Languages (SIGPLAN) Notices, vol. 46, no. 6, pp. 74–
85, 2011.
[16] K. D. Cooper, A. Grosul, T. J. Harvey, S. Reeves, D. Subramanian,
L. Torczon, and T. Waterman, “Adaptive compilation made efficient,”
The ACM Special Interest Group on Programming Languages (SIG-
PLAN) Notices, vol. 40, no. 7, pp. 69–77, 2005.
[17] G. Fursin, Y. Kashnikov, A. W. Memon, Z. Chamski, O. Temam,
M. Namolaru, E. Yom-Tov, B. Mendelson, A. Zaks, E. Courtois,
F. Bodin, P. Barnard, E. Ashton, E. Bonilla, J. Thomson, C. K. I.
Williams, and M. O’Boyle, “Milepost GCC: Machine learning enabled
self-tuning compiler,” International Journal of Parallel Programming,
vol. 39, no. 3, pp. 296–327, 2011.
[18] Q. Wang, S. Kulkarni, J. Cavazos, and M. Spear, “A transactional
memory with automatic performance tuning,” ACM Transactions on
Architecture and Code Optimization (TACO), vol. 8, no. 4, p. 54, 2012.
[19] S. Kulkarni and J. Cavazos, “Mitigating the compiler optimization
phase-ordering problem using machine learning,” The ACM Special In-
terest Group on Programming Languages (SIGPLAN) Notices, vol. 47,
no. 10, pp. 147–162, 2012.
[20] H. Leather, E. Bonilla, and M. O’Boyle, “Automatic feature generation
for machine learning based optimizing compilation,” in Proceedings of
Annual IEEE/ACM International Symposium on Code Generation and
Optimization, 2009, pp. 81–91.
[21] G. Tournavitis, Z. Wang, B. Franke, and M. F. O’Boyle, “Towards
a holistic approach to auto-parallelization: Integrating profile-driven
parallelism detection and machine-learning based mapping,” in Pro-
ceedings of ACM SIGPLAN Conference on Programming Language
Design and Implementation, 2009, pp. 177–187.
[22] M. Zuluaga, E. Bonilla, and N. Topham, “Predicting best design trade-
offs: A case study in processor customization,” in Design, Automation
Test in Europe Conference Exhibition (DATE), 2012, pp. 1030–1035.
[23] I. Böhm, B. Franke, and N. Topham, “Cycle-accurate performance
modelling in an ultra-fast just-in-time dynamic binary translation
instruction set simulator,” in International Conference on Embedded
Computer Systems: Architectures, Modeling and Simulation, 2010, pp.
[24] K. T. Sundararajan, V. Porpodas, T. M. Jones, N. P. Topham, and
B. Franke, “Cooperative partitioning: Energy-efficient cache partition-
ing for high-performance CMPs,” in IEEE International Symposium on
High-Performance Comp Architecture, 2012, pp. 1–12.
[25] O. Almer, N. Topham, and B. Franke, “A learning-based approach
to the automated design of MPSoC networks,” in Proceedings of
International Conference on Architecture of Computing Systems, 2011,
pp. 243–258.
[26] L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. Kelly, A. J.
Davison, M. Luján, M. F. O’Boyle, G. Riley et al., “Introducing SLAM-
Bench, a Performance and Accuracy Benchmarking Methodology for
SLAM,” in IEEE International Conference on Robotics and Automation
(ICRA), 2015, pp. 5783–5790.
[27] G. Reitmayr and H. Seichter, “KFusion GitHub,”
[28] A. Handa, T. Whelan, J. McDonald, and A. Davison, “A Benchmark
for RGB-D Visual Odometry, 3D Reconstruction and SLAM,” in IEEE
International Conference on Robotics and Automation (ICRA), 2014,
pp. 1524–1531.
[29] P. Keir, “DAGR: A DSL for legacy OpenCL codes,” in 1st SYCL
Programming Workshop, 2016.
[30] R. Baghdadi, U. Beaugnon, A. Cohen, T. Grosser, M. Kruse, C. Reddy,
S. Verdoolaege, A. Betts, A. F. Donaldson, J. Ketema et al., “Pencil:
A platform-neutral compute intermediate language for accelerator pro-
gramming,” in IEEE International Conference on Parallel Architecture
and Compilation (PACT), 2015, pp. 138–149.
[31] CARP-project. PENCIL-SLAMBench GitHub.
[32] B. Bodin, H. Wagstaff, S. Saeedi, L. Nardi, E. Vespa, J. Mawer,
A. Nisbet, M. Luján, S. Furber, A. Davison, P. Kelly, and M. O’Boyle,
“SLAMBench2: Multi-objective head-to-head benchmarking for visual
SLAM,” in IEEE International Conference on Robotics and Automation
(ICRA), 2018, pp. 3637–3644.
[33] T. Whelan, S. Leutenegger, R. F. Salas-Moreno, B. Glocker, and A. J.
Davison, “ElasticFusion: Dense SLAM without a pose graph,” in RSS,
[34] O. Kähler, V. A. Prisacariu, C. Y. Ren, X. Sun, P. H. S. Torr, and D. W.
Murray, “Very high frame rate volumetric integration of depth images
on mobile devices,” IEEE Transactions on Visualization and Computer
Graphics, vol. 21, no. 11, pp. 1241–1250, 2015.
[35] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct
monocular SLAM,” in European Conference on Computer Vision
(ECCV). Springer, 2014, pp. 834–849.
[36] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source SLAM
system for monocular, stereo, and RGB-D cameras,” IEEE Transactions
on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
[37] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM:
Real-time single camera SLAM,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.
[38] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale,
“Keyframe-based visual–inertial odometry using nonlinear optimiza-
tion,” The International Journal of Robotics Research, vol. 34, no. 3,
pp. 314–334, 2015.
[39] C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: Fast semi-direct
monocular visual odometry,” in IEEE International Conference on
Robotics and Automation (ICRA), 2014, pp. 15–22.
[40] M. Abouzahir, A. Elouardi, R. Latif, S. Bouaziz, and A. Tajer,
“Embedding SLAM algorithms: Has it come of age?” Robotics and
Autonomous Systems, vol. 100, pp. 14 – 26, 2018.
[41] J. Delmerico and D. Scaramuzza, “A benchmark comparison of
monocular visual-inertial odometry algorithms for flying robots,” in
IEEE International Conference on Robotics and Automation (ICRA), 2018, pp.
[42] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye,
Y. Huang, R. Tang, and S. Leutenegger, “InteriorNet: Mega-scale multi-
sensor photo-realistic indoor scenes dataset,” in British Machine Vision
Conference (BMVC), 2018.
[43] S. Saeedi, W. Li, D. Tzoumanikas, S. Leutenegger, P. H. J. Kelly, and
A. J. Davison. (2018) Characterising localization and mapping datasets.
[44] S. Saeedi, L. Nardi, E. Johns, B. Bodin, P. Kelly, and A. Davison,
“Application-oriented design space exploration for SLAM algorithms,”
in IEEE International Conference on Robotics and Automation (ICRA),
2017, pp. 5716–5723.
[45] C. Loop, Q. Cai, S. Orts-Escolano, and P. A. Chou, “A closed-form
Bayesian fusion equation using occupancy probabilities,” in IEEE
International Conference on 3D Vision (3DV), 2016, pp. 380–388.
[46] E. Vespa, N. Nikolov, M. Grimm, L. Nardi, P. H. J. Kelly, and
S. Leutenegger, “Efficient octree-based volumetric SLAM supporting
signed-distance and occupancy mapping,” IEEE Robotics and Automa-
tion Letters, vol. 3, no. 2, pp. 1144–1151, 2018.
[47] J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, “Informed RRT*:
Optimal sampling-based path planning focused via direct sampling
of an admissible ellipsoidal heuristic,” in IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), 2014, pp. 2997–
[48] Point-Grey, “Bumblebee2 Datasheet,”
[49] P. Fankhauser, M. Bloesch, D. Rodriguez, R. Kaestner, M. Hutter, and
R. Siegwart, “Kinect v2 for mobile robot navigation: Evaluation and
modeling,” in IEEE International Conference on Advanced Robotics
(ICAR), 2015, pp. 388–394.
[50] H. Kim, A. Handa, R. Benosman, S.-H. Ieng, and A. Davison, “Simul-
taneous mosaicing and tracking with an event camera,” in Proceedings
of the British Machine Vision Conference (BMVC). BMVA Press,
[51] P. Bardow, A. J. Davison, and S. Leutenegger, “Simultaneous optical
flow and intensity estimation from an event camera,” in IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), 2016, pp.
[52] A. Censi and D. Scaramuzza, “Low-latency event-based visual odom-
etry,” in IEEE International Conference on Robotics and Automation
(ICRA), 2014, pp. 703–710.
[53] E. Mueggler, B. Huber, and D. Scaramuzza, “Event-based, 6-DOF
pose tracking for high-speed maneuvers,” in IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), 2014, pp. 2761–
[54] H. Kim, S. Leutenegger, and A. J. Davison, “Real-time 3D recon-
struction and 6-DoF tracking with an event camera,” in European
Conference on Computer Vision (ECCV). Springer International
Publishing, 2016, pp. 349–364.
[55] R. Dominguez-Castro, S. Espejo, A. Rodriguez-Vazquez, R. A. Car-
mona, P. Földesy, Á. Zarándy, P. Szolgay, T. Szirányi, and T. Roska,
“A 0.8-µm CMOS two-dimensional programmable mixed-signal
focal-plane array processor with on-chip binary imaging and instruc-
tions storage,” IEEE Journal of Solid-State Circuits, vol. 32, no. 7, pp.
1013–1026, 1997.
[56] G. Liñán, S. Espejo, R. Dominguez-Castro, and A. Rodriguez-Vazquez,
“Architectural and basic circuit considerations for a flexible 128×128
mixed-signal SIMD vision chip,” Analog Integrated Circuits and Signal
Processing, vol. 33, no. 2, pp. 179–190, 2002.
[57] J. Poikonen, M. Laiho, and A. Paasio, “MIPA4k: A 64×64 cell mixed-
mode image processor array,” in IEEE International Symposium on
Circuits and Systems (ISCAS), 2009, pp. 1927–1930.
[58] P. Dudek and P. J. Hicks, “A general-purpose processor-per-pixel
analog SIMD vision chip,” IEEE Transactions on Circuits and Systems,
vol. 52, no. 1, pp. 13–20, 2005.
[59] P. Dudek, “Implementation of SIMD vision chip with 128×128 array
of analogue processing elements,” in IEEE International Symposium
on Circuits and Systems (ISCAS), 2005, pp. 5806–5809.
[60] S. J. Carey, A. Lopich, D. R. Barr, B. Wang, and P. Dudek, “A
100,000 FPS vision sensor with embedded 535GOPS/W 256×256
SIMD processor array,” in IEEE Symposium on VLSI Circuits (VLSIC),
2013, pp. C182–C183.
[61] W. Zhang, Q. Fu, and N. J. Wu, “A programmable vision chip based
on multiple levels of parallel processors,” IEEE Journal of Solid-State
Circuits, vol. 46, no. 9, pp. 2132–2147, 2011.
[62] J. N. P. Martel, L. K. Müller, S. J. Carey, and P. Dudek, “Parallel HDR
tone mapping and auto-focus on a cellular processor array vision chip,”
in IEEE International Symposium on Circuits and Systems (ISCAS),
2016, pp. 1430–1433.
[63] L. Bose, J. Chen, S. J. Carey, P. Dudek, and W. Mayol-Cuevas, “Visual
odometry for pixel processor arrays,” in IEEE International Conference
on Computer Vision (ICCV), 2017, pp. 4614–4622.
[64] P. Viola and M. Jones, “Robust real-time object detection,” Interna-
tional Journal of Computer Vision, vol. 57, no. 2. Kluwer Academic
Publishers, 2004, pp. 137–154.
[65] T. Debrunner, S. Saeedi, and P. H. J. Kelly, “Automatic kernel code
generation for cellular processor arrays,” submitted to ACM Trans-
actions on Architecture and Code Optimization (TACO), 2018.
[66] Intel-Movidius, “Intel Movidius Myriad VPU,” https://www.movidius.
[67] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin,
C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb,
T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R.
Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey,
A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar,
S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke,
A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Na-
garajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick,
N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani,
C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing,
M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan,
R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter
performance analysis of a tensor processing unit,” in ACM International
Symposium on Computer Architecture (ISCA), 2017, pp. 1–12.
[68] P. Ginsbach, T. Remmelg, M. Steuwer, B. Bodin, C. Dubach, and
M. O’Boyle, “Automatic matching of legacy code to heterogeneous
APIs: An idiomatic approach,” in ACM International Conference on
Architectural Support for Programming Languages and Operating
Systems (ASPLOS), 2018, pp. 139–153.
[69] B. Bodin, L. Nardi, P. H. J. Kelly, and M. F. P. O’Boyle, “Diplomat:
Mapping of multi-kernel applications using a static dataflow abstrac-
tion,” in IEEE International Symposium on Modeling, Analysis and
Simulation of Computer and Telecommunication Systems (MASCOTS),
2016, pp. 241–250.
[70] B. Bodin, A. Munier-Kordon, and B. D. de Dinechin, “Optimal and fast
throughput evaluation of CSDF,” in ACM Annual Design Automation
Conference (DAC), 2016, pp. 160:1–160:6.
[71] C. Kotselidis, J. Clarkson, A. Rodchenko, A. Nisbet, J. Mawer, and
M. Luján, “Heterogeneous managed runtime systems: A computer vi-
sion case study,” in ACM SIGPLAN/SIGOPS International Conference
on Virtual Execution Environments (VEE), 2017, pp. 74–82.
[72] J. Clarkson, C. Kotselidis, G. Brown, and M. Luján, “Boosting Java
performance using GPGPUs,” in International Conference on Architec-
ture of Computing Systems (ARCS). Springer International Publishing,
2017, pp. 59–70.
[73] J. Clarkson, J. Fumero, M. Papadimitriou, M. Xekalaki, and C. Kot-
selidis, “Towards practical heterogeneous virtual machines,” in ACM
MoreVMs Workshop on Modern Language Runtimes, Ecosystems, and
VMs, 2018, pp. 46–48.
[74] Beehive Lab, Maxine/MaxSim.
[75] C. Wimmer, M. Haupt, M. L. Van De Vanter, M. Jordan, L. Daynès,
and D. Simon, “Maxine: An approachable virtual machine for, and
in, Java,” ACM Transactions on Architecture and Code Optimization
(TACO), vol. 9, no. 4, pp. 30:1–30:24, 2013.
[76] F. S. Zakkak, A. Nisbet, J. Mawer, T. Hartley, N. Foutris, O. Papadakis,
A. Andronikakis, I. Apreotesei, and C. Kotselidis, “On the future of
research VMs: A hardware/software perspective,” in ACM MoreVMs
Workshop on Modern Language Runtimes, Ecosystems, and VMs, 2018,
pp. 51–53.
[77] K. Chandramohan and M. F. O’Boyle, “Partitioning data-parallel
programs for heterogeneous MPSoCs: Time and energy design space
exploration,” in ACM SIGPLAN/SIGBED Conference on Languages,
Compilers and Tools for Embedded Systems (LCTES), 2014, pp. 73–
[78] K. Chandramohan and M. F. P. O’Boyle, “A compiler framework for
automatically mapping data parallel programs to heterogeneous MP-
SoCs,” in ACM International Conference on Compilers, Architecture
and Synthesis for Embedded Systems (CASE), 2014, pp. 9:1–9:10.
[79] T. Spink, H. Wagstaff, B. Franke, and N. Topham, “Efficient code
generation in a region-based dynamic binary translator,” in ACM
SIGPLAN/SIGBED Conference on Languages, Compilers and Tools
for Embedded Systems (LCTES), 2014, pp. 3–12.
[80] H. Wagstaff, M. Gould, B. Franke, and N. Topham, “Early partial
evaluation in a JIT-compiled, retargetable instruction set simulator
generated from a high-level architecture description,” in ACM Annual
Design Automation Conference (DAC), 2013, pp. 21:1–21:6.
[81] H. Wagstaff, T. Spink, and B. Franke, “Automated ISA branch coverage
analysis and test case generation for retargetable instruction set simu-
lators,” in IEEE International Conference on Compilers, Architecture
and Synthesis for Embedded Systems (CASES), 2014, pp. 1–10.
[82] T. Spink, H. Wagstaff, B. Franke, and N. Topham, “Efficient dual-
ISA support in a retargetable, asynchronous dynamic binary translator,”
in IEEE International Conference on Embedded Computer Systems:
Architectures, Modeling, and Simulation (SAMOS), 2015, pp. 103–112.
[83] H. Wagstaff and T. Spink. The GenSim ADL toolset. http://www.
[84] K. Kaszyk, H. Wagstaff, T. Spink, B. Franke, M. O’Boyle, and
H. Uhrenholt, “Accurate emulation of a state-of-the-art mobile CPU/GPU
platform,” in Design Automation Conference (DAC) Work-in-Progress
Poster session, 2018.
[85] T. Spink, H. Wagstaff, and B. Franke, “Efficient asynchronous interrupt
handling in a full-system instruction set simulator,” in ACM SIGPLAN
Notices, vol. 51, no. 5, 2016, pp. 1–10.
[86] J. Mawer, O. Palomar, C. Gorgovan, A. Nisbet, W. Toms, and M. Luján,
“The potential of dynamic binary modification and CPU-FPGA SoCs
for simulation,” in IEEE Annual International Symposium on Field-
Programmable Custom Computing Machines (FCCM), 2017, pp. 144–
[87] C. Gorgovan, A. d’Antras, and M. Luján, “MAMBO: A low-overhead
dynamic binary modification tool for ARM,” ACM Transactions on
Architecture and Code Optimization (TACO), vol. 13, no. 1, pp. 14:1–
14:26, 2016.
[88] C. Gorgovan. MAMBO: A low-overhead dynamic binary modification
tool for ARM.
[89] D. L. Bruening, “Efficient, transparent, and comprehensive runtime
code manipulation,” Ph.D. dissertation, Massachusetts Institute of Tech-
nology, 2004.
[90] A. Rodchenko, C. Kotselidis, A. Nisbet, A. Pop, and M. Luján,
“MaxSim: A simulation platform for managed applications,” in IEEE
International Symposium on Performance Analysis of Systems and
Software (ISPASS), 2017, pp. 141–152.
[91] D. Sanchez and C. Kozyrakis, “ZSim: Fast and accurate microar-
chitectural simulation of thousand-core systems,” in ACM Annual
International Symposium on Computer Architecture (ISCA), 2013, pp.
[92] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and
N. P. Jouppi, “McPAT: An integrated power, area, and timing modeling
framework for multicore and manycore architectures,” in IEEE/ACM
International Symposium on Microarchitecture (MICRO), 2009, pp.
[93] A. Rodchenko, C. Kotselidis, A. Nisbet, A. Pop, and M. Luján, “Type
information elimination from objects on architectures with tagged
pointers support,” IEEE Transactions on Computers, vol. 67, no. 1,
pp. 130–143, 2018.
[94] V. Sze, “Designing hardware for machine learning: The important
role played by circuit designers,” IEEE Solid-State Circuits Magazine,
vol. 9, no. 4, pp. 46–54, 2017.
[95] Graphcore,
[96] A. Eliazar and R. Parr, “DP-SLAM: Fast, robust simultaneous localiza-
tion and mapping without predetermined landmarks,” in International
Joint Conference on Artificial Intelligence (IJCAI). Morgan Kauf-
mann, 2003, pp. 1135–1142.
[97] S. Dublish, V. Nagarajan, and N. Topham, “Characterizing memory
bottlenecks in GPGPU workloads,” in IEEE International Symposium
on Workload Characterization (IISWC), 2016, pp. 1–2.
[98] ——, “Cooperative caching for GPUs,” ACM Transactions on Archi-
tecture and Code Optimization (TACO), vol. 13, no. 4, pp. 39:1–39:25,
[99] E. Tomusk, C. Dubach, and M. O’Boyle, “Measuring flexibility in
single-ISA heterogeneous processors,” in ACM International Confer-
ence on Parallel Architectures and Compilation (PACT), 2014, pp. 495–
[100] E. Tomusk and C. Dubach, “Diversity: A design goal for heterogeneous
processors,” IEEE Computer Architecture Letters, vol. 15, no. 2, pp.
81–84, 2016.
[101] E. Tomusk, C. Dubach, and M. O’Boyle, “Four metrics to evaluate
heterogeneous multicores,” ACM Transactions on Architecture and
Code Optimization (TACO), vol. 12, no. 4, pp. 37:1–37:25, 2015.
[102] ——, “Selecting heterogeneous cores for diversity,” ACM Transactions
on Architecture and Code Optimization (TACO), vol. 13, no. 4, pp.
49:1–49:25, 2016.
[103] E. Kang, E. Jackson, and W. Schulte, An Approach for Effective Design
Space Exploration. Springer Berlin Heidelberg, 2011, pp. 33–54.
[104] Z. Zhang, A. Suleiman, L. Carlone, V. Sze, and S. Karaman, “Visual-
inertial odometry on chip: An algorithm-and-hardware co-design ap-
proach,” in Robotics: Science and Systems (RSS), 2017.
[105] B. Bodin, L. Nardi, M. Z. Zia, H. Wagstaff, G. Sreekar Shenoy,
M. Emani, J. Mawer, C. Kotselidis, A. Nisbet, M. Luján, B. Franke,
P. H. Kelly, and M. O’Boyle, “Integrating Algorithmic Parameters into
Benchmarking and Design Space Exploration in 3D Scene Understand-
ing,” in ACM International Conference on Parallel Architectures and
Compilation (PACT), 2016, pp. 57–69.
[106] L. Nardi, B. Bodin, S. Saeedi, E. Vespa, A. J. Davison, and P. H. J.
Kelly, “Algorithmic Performance-Accuracy Trade-off in 3D Vision
Applications Using HyperMapper,” in International Workshop on Au-
tomatic Performance Tuning (iWAPT), hosted by IEEE International
Parallel and Distributed Processing Symposium (IEEE IPDPS), 2017.
[107] D. Koeplinger, M. Feldman, R. Prabhakar, Y. Zhang, S. Hadjis,
R. Fiszel, T. Zhao, L. Nardi, A. Pedram, C. Kozyrakis et al., “Spatial: a
language and compiler for application accelerators,” in ACM SIGPLAN
Conference on Programming Language Design and Implementation,
2018, pp. 296–311.
[108] M. Z. Zia, L. Nardi, A. Jack, E. Vespa, B. Bodin, P. H. Kelly, and
A. J. Davison, “Comparative design space exploration of dense and
semi-dense SLAM,” in IEEE International Conference on Robotics and
Automation (ICRA), 2016, pp. 1292–1299.
... One possible solution to improve the kit assembly operations is the use of Augmented Reality (AR), considered one of the nine pillars of Industry 4.0 to support operators with real-time information for faster decision-making, while improving work processes [15,24,50,54,56,62]. This technology can integrate virtual information in the operators workspace [35,42], helping them in assembly tasks [18,43,49], provide context-aware assistance [5], data visualization and interaction (acting as a Human-Machine Interface (HMI)) [16,40], indoor localization [60], maintenance applications [8,18,61], quality control [4,65], material management [16,51] or remote collaboration [7,39,66], by presenting additional layers of digital information on top of real-world environments [3,28,33,37,38,57]. Prior studies identify certain benefits of applying AR for technological industrialization, like increased work safety, effective learning and training, as well as more task effectiveness [10,12,31], as well as improved Human-Robot Interaction (HRI) [1,13,19,34]. ...
Full-text available
Augmented Reality (AR) is a pillar of the transition to Industry 4.0 and smart manufacturing. It can facilitate training, maintenance, assembly, quality control, remote collaboration and other tasks. AR has the potential to revolutionize the way information is accessed, used and exchanged, extending user's perception and improving their performance. This work proposes a Pervasive AR tool, created with partners from the industry sector, to support the training of logistics operators on industrial shop floors. A Human-Centered Design (HCD) methodology was used to identify operators difficulties, challenges, and define requirements. After initial meetings with stakeholders, two distinct methods were considered to configure and visualize AR content on the shop floor: Head-Mounted Display (HMD) and Handheld Device (HHD). A first (preliminary) user study with 26 participants was conducted to collect qualitative data regarding the use of AR in logistics, from individuals with different levels of expertise. The feedback obtained was used to improve the proposed AR application. A second user study was realized, in which 10 participants used different conditions to fulfill distinct logistics tasks: C1-paper; C2-HMD; C3-HHD. Results emphasize the potential of Pervasive AR in the operators' workspace, in particular for training of operators not familiar with the tasks. Condition C2 was preferred by all participants and considered more useful and efficient in supporting the operators activities on the shop floor.
... In addition to the benefits to communication, larger array size (high angular resolution) and larger bandwidth (high delay resolution) in high-frequency systems also enable high-accuracy localization, which has been extensively explored within multiple-input-multiple-output (MIMO) communication systems [3], [4]. It is foreseeable that the potential distance-/angle-aware applications, such as virtual reality (VR)/augmented reality (AR) [5], vehicular safety [6], global navigation satellite system (GNSS) [7], etc, will be further exploited in the future communication systems [1], [8]. ...
... A popular technique which utilises cameras and lidar sensors for perception and localisation is Simultaneous Localisation and Mapping (SLAM). SLAM is method in which an autonomous navigation system obtains 2D or 3D geometric information about its surroundings, which is usually unknown, estimates its pose within that environment, and generates a map of the area [14,15]. SLAM-based systems have been used in a wide range of applications such drones, mobile robots, virtual reality, and augmented reality [14,16]. ...
Full-text available
The recent advancements in Information and Communication Technology (ICT) as well as increasing demand for vehicular safety has led to significant progressions in Autonomous Vehicle (AV) technology. Perception and Localisation are major operations that determine the success of AV development and usage. Therefore, significant research has been carried out to provide AVs with the capabilities to not only sense and understand their surroundings efficiently, but also provide detailed information of the environment in the form of 3D maps. Visual Simultaneous Localisation and Mapping (V-SLAM) has been utilised to enable a vehicle understand its surroundings, map the environment, and identify its position within the area. This paper presents a detailed review of V-SLAM techniques implemented for AV perception and localisation. An overview of SLAM techniques is presented. In addition, an in-depth review is conducted to highlight various V-SLAM schemes, their strengths, and limitations. Challenges associated with V-SLAM deployment and future research directions are also provided in this paper.
... Augmented Reality (AR) is one of its pillars, given its ability to provide solutions for supporting operators during their daily tasks. Prior studies support the added value that AR can have in industrial scenarios, integrating digital information in the human-operators workspace [2], helping them in assembly tasks [3], context-aware assistance [4], data visualization and interaction (acting as a Human-Machine Interface (HMI)) [5], indoor localization [6], maintenance applications [3], quality control [7] or material management [5]. Literature identifies several benefits of using AR, like increased work safety, effective learning and training, as well as error and task-time reduction [8]. ...
Conference Paper
Augmented Reality (AR) has been applied in Industry 4.0 contexts for training, assistance, maintenance, assembly or quality control. This work describes the use of Pervasive AR to support human operators in a shop floor through pervasive experiences, while performing logistics operations. A Human-Centered Design methodology with partners from the industry was used to identify operators' difficulties, challenges, and define requirements, leading to the creation of a Pervasive AR prototype to support the operators' task or/and the initial training in such scenarios. An initial user study with 12 participants was conducted in a simulated environment, comparing three conditions: C1-Head-Mounted Display (HMD), C2-Handheld Device (HHD) and C3-Paper manuals. Later on, a second user study with 26 participants took place in a real shop floor, to collect preliminary feedback from individuals with different expertise. Results from both studies suggest advantages in using AR, in particular for the training of operators not familiar with the task. Condition C1 was preferred and considered more useful to support the operator's task by the majority of participants.
Full-text available
Maintaining a microbe-free environment in healthcare facilities has become increasingly crucial for minimizing virus transmission, especially in the wake of recent epidemics like COVID-19. To meet the urgent need for ongoing sterilization, autonomous ultraviolet disinfection (UV-D) robots have emerged as vital tools. These robots are gaining popularity due to their automated nature, cost advantages, and ability to instantly disinfect rooms and workspaces without relying on human labor. Integrating disinfection robots into medical facilities reduces infection risk, lowers conventional cleaning costs, and instills greater confidence in patient safety. However, UV-D robots should complement rather than replace routine manual cleaning. To optimize the functionality of UV-D robots in medical settings, additional hospital and device design modifications are necessary to address visibility challenges. Achieving seamless integration requires more technical advancements and clinical investigations across various institutions. This mini-review presents an overview of advanced applications that demand disinfection, highlighting their limitations and challenges. Despite their potential, little comprehensive research has been conducted on the sterilizing impact of disinfection robots in the dental industry. By serving as a starting point for future research, this review aims to bridge the gaps in knowledge and identify unresolved issues. Our objective is to provide an extensive guide to UV-D robots, encompassing design requirements, technological breakthroughs, and in-depth use in healthcare and dentistry facilities. Understanding the capabilities and limitations of UV-D robots will aid in harnessing their potential to revolutionize infection control practices in the medical and dental fields.
Simultaneously localization and mapping (SLAM) is a core component in many embedded domains, e.g., robots, augmented and virtual reality. Due to SLAM’s high demand on computation resources, general-purpose graphic processing units (GPGPUs) are often used as its processing engine. Meanwhile, embedded systems usually have strict power constraint. Thus, how to deliver required performance for SLAM, yet still meet the power limit, is a great challenge faced by GPGPU designer. In this work, we discover the general principles of designing energy-efficient GPGPU for SLAM as “many SMs, enough SPs and registers, small caches”, by analyzing the implication of individual design parameters on both performance and power. Then, we conduct large-scale design space exploration and fit the Pareto frontier with a two-term exponential model. Further, we construct gradient boosting decision tree (GBDT)-based design models to predict the performance and power given the design parameters. The evaluation shows that our GBDT-based models can achieve [Formula: see text]3% mean average percentage error, which significantly outperform other machine learning models. With these models, a kernel’s requirement on hardware resources can be well understood. Based on such knowledge, we introduce design model guided power management strategies, including power gating and dynamic frequency and voltage scaling (DFVS). Overall, by combining these two power management strategies, we can improve the energy delay product by 36%.
The novel coronavirus (COVID-19) pandemic has completely changed our lives and how we interact with the world. The pandemic has brought about a pressing need to have effective disinfection practices that can be incorporated into daily life. They are needed to limit the spread of infections through surfaces and air, particularly in public settings. Most of the current methods utilize chemical disinfectants, which can be laborious and time-consuming. Ultraviolet (UV) irradiation is a proven and powerful means of disinfection. There has been a rising interest in the implementation of UV disinfection robots by various public institutions, such as hospitals, long-term care homes, airports, and shopping malls. The use of UV-based disinfection robots could make the disinfection process faster and more efficient. The objective of this review is to equip readers with the necessary background on UV disinfection and provide relevant discussion on various aspects of UV robots.
Full-text available
Simultaneous localization and mapping (SLAM) is an active research topic in machine vision and robotics. It has various applications in many different fields such as mobile robots, augmented and virtual reality, medical imaging, image-guided surgery systems, and unmanned aerial vehicles (UAVs). The computational complexity of SLAM algorithms is very high. Therefore, in many applications, it is necessary to implement them in real-time on platforms with low power consumption and small sizes. This paper reviews the implementation and the performance of SLAM algorithms on various platforms. Although there are various review studies on SLAM algorithms, the studies assessing the hardware implementation of these algorithms are very limited. This study attempts to fill this gap. It is shown that using the hardware–software (HW/SW) co-design approaches over mere Software (SW) or hardware (HW) approaches is currently the primary option for implementing SLAM algorithms on hardware platforms. A combination of a hardware accelerator and a software approach increases the speed of the implementation as well as the performance and the speed of the algorithm. Also, dividing different parts of the algorithm according to the structure and the nature of the algorithm between hardware and software in the HW/SW co-design approaches reduces the resource consumption and the cost. Furthermore, the design of hardware-compatible algorithms is one of the most critical gaps in the implementation of SLAM algorithms on hardware platforms.
Visual Simultaneous Localization and Mapping (vSLAM) is the method of employing an optical sensor to map the robot’s observable surroundings while also identifying the robot’s pose in relation to that map. The accuracy and speed of vSLAM calculations can have a very significant impact on the performance and effectiveness of subsequent tasks that need to be executed by the robot, making it a key building component for current robotic designs. The application of vSLAM in the area of humanoid robotics is particularly difficult due to the robot’s unsteady locomotion. This paper introduces a pose graph optimization module based on RGB (ORB) features, as an extension of the KinectFusion pipeline (a well-known vSLAM algorithm), to assist in recovering the robot’s stance during unstable gait patterns when the KinectFusion tracking system fails. We develop and test a wide range of embedded MPSoC FPGA designs, and we investigate numerous architectural improvements, both precise and approximation, to study their impact on performance and accuracy. Extensive design space exploration reveals that properly designed approximations, which exploit domain knowledge and efficient management of CPU and FPGA fabric resources, enable real-time vSLAM at more than 30 fps in humanoid robots with high energy-efficiency and without compromising robot tracking and map construction. This is the first FPGA design to achieve robust, real-time dense SLAM operation targeting specifically humanoid robots. An open source release of our implementations and data can be found in [1].
Although artificial intelligence (AI) methods have been applied to virtual reality (VR) solutions, studies of this combination remain few in the literature. To fill this gap, we performed a systematic literature review of these methods, applying a methodology proposed in the literature that locates existing studies, selects and evaluates contributions, and analyses and synthesizes data. We searched Google Scholar and databases such as Elsevier's Scopus, the ACM Digital Library, and the IEEE Xplore Digital Library, using a set of inclusion and exclusion criteria to select documents. The results showed that when AI methods are used in VR applications, the main advantages are the high efficiency and precision of the algorithms. Moreover, we observe that machine learning is the AI technique most often applied in VR applications. In conclusion, this paper showed that the combination of AI and VR contributes new trends, opportunities, and applications for human-machine interactive devices, education, agriculture, transport, 3D image reconstruction, and health. We further concluded that using AI in VR offers potential benefits in other real-world fields such as teleconferencing, emotion interaction, tourist services, and image data extraction.
In recent years, we have witnessed an explosion in the use of virtual machines (VMs), which are now found in desktops, smartphones, and cloud deployments. These recent developments create new research opportunities in the VM domain, extending from performance to energy efficiency and scalability studies. Research in these directions requires VM research frameworks that provide full coverage of the execution domains and hardware platforms. Unfortunately, the state of the art in research VMs does not live up to such expectations and lags behind industrial-strength software, making it hard for the research community to provide valuable insights. This paper presents our work towards tackling those shortcomings by introducing Beehive, our vision of a modular and seamlessly extensible ecosystem for research on virtual machines. Beehive unifies a number of existing state-of-the-art tools and components with novel ones, providing a complete platform for hardware/software co-design of virtual machines.
SLAM is becoming a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify their interfaces or to perform a holistic comparison of their capabilities. This is a problem because different SLAM applications have different functional and non-functional requirements: a mobile phone-based AR application has a tight energy budget, for example, while a UAV navigation system usually requires high accuracy. SLAMBench2 is a benchmarking framework for evaluating existing and future SLAM systems, both open- and closed-source, over an extensible list of datasets, using a comparable and clearly specified list of performance metrics. A wide variety of existing SLAM algorithms and datasets is supported, e.g. ElasticFusion, InfiniTAM, ORB-SLAM2, and OKVIS, and integrating new ones is straightforward and clearly specified by the framework. SLAMBench2 is a publicly available software framework that represents a starting point for quantitative, comparable, and validatable experimental research into trade-offs across SLAM systems.
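One of the standard accuracy metrics used when comparing SLAM systems in this way is the absolute trajectory error (ATE). A minimal sketch of its RMSE form, assuming the estimated and ground-truth trajectories are already time-associated and rigidly aligned (the function name and shapes are illustrative, not part of the SLAMBench2 API):

```python
import numpy as np

def ate_rmse(estimated, ground_truth):
    """Root-mean-square absolute trajectory error over 3D positions.

    Assumes both trajectories are already time-associated and rigidly
    aligned (e.g. via Horn's method); each has shape (N, 3) in metres.
    """
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    per_frame = np.linalg.norm(est - gt, axis=1)  # Euclidean error per frame
    return float(np.sqrt(np.mean(per_frame ** 2)))
```

Reporting the RMSE rather than the mean penalises occasional large tracking failures, which is usually what matters for downstream robotics tasks.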
Heterogeneous computing has emerged as a means to achieve high performance and energy efficiency. Naturally, this trend has been accompanied by changes in software development norms that do not necessarily favor programmers. A prime example is the two most popular heterogeneous programming languages, CUDA and OpenCL, which expose several low-level features through their APIs, making them difficult for non-expert users. Instead of low-level programming languages, developers tend to prefer higher-level, object-oriented languages typically executed on managed runtime environments. Although many programmers might expect such languages to have already been adapted for execution on heterogeneous hardware, in reality their support is either very limited or absent altogether. This paper highlights the main reasons for, and complexities of, enabling heterogeneous managed runtime systems, and proposes a number of directions to address those challenges.
Industry is increasingly turning to reconfigurable architectures such as FPGAs and CGRAs for improved performance and energy efficiency. Unfortunately, adoption of these architectures has been limited by their programming models. HDLs lack abstractions for productivity and are difficult to target from higher-level languages. HLS tools are more productive, but offer an ad hoc mix of software and hardware abstractions that makes performance optimization difficult. In this work, we describe Spatial, a new domain-specific language and compiler for higher-level descriptions of application accelerators. We describe Spatial's hardware-centric abstractions for both programmer productivity and design performance, and summarize the compiler passes required to support these abstractions, including pipeline scheduling, automatic memory banking, and automated design tuning driven by active machine learning. We demonstrate the language's ability to target FPGAs and CGRAs from common source code. We show that applications written in Spatial are, on average, 42% shorter and achieve a mean speedup of 2.9x over SDAccel HLS when targeting a Xilinx UltraScale+ VU9P FPGA on an Amazon EC2 F1 instance.
Flying robots require a combination of accuracy and low latency in their state estimation in order to achieve stable and robust flight. However, due to the power and payload constraints of aerial platforms, state estimation algorithms must provide these qualities under the computational constraints of embedded hardware. Cameras and inertial measurement units (IMUs) satisfy these power and payload constraints, so visual-inertial odometry (VIO) algorithms are popular choices for state estimation in these scenarios, in addition to their ability to operate without external localization from motion capture or global positioning systems. It is not clear from existing results in the literature, however, which VIO algorithms perform well under the accuracy, latency, and computational constraints of a flying robot with onboard state estimation. This paper evaluates an array of publicly available VIO pipelines (MSCKF, OKVIS, ROVIO, VINS-Mono, SVO+MSF, and SVO+GTSAM) on different hardware configurations, including several single-board computer systems of the kind typically found on flying robots. The evaluation considers pose estimation accuracy, per-frame processing time, and CPU and memory load while processing the EuRoC datasets, which contain six-degree-of-freedom (6DoF) trajectories typical of flying robots. We present our complete results as a benchmark for the research community. Narrated video presentation:
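The per-frame processing time metric used in such evaluations can be gathered with a simple wall-clock harness; a minimal sketch, in which the frame-processing callable is a placeholder rather than any of the evaluated VIO pipelines:

```python
import time

def per_frame_latency_ms(process_frame, frames):
    """Return (mean, worst-case) wall-clock latency in milliseconds of a
    frame-processing callable over a sequence of frames."""
    latencies = []
    for frame in frames:
        start = time.perf_counter()
        process_frame(frame)  # e.g. one VIO update step on this frame
        latencies.append((time.perf_counter() - start) * 1e3)
    return sum(latencies) / len(latencies), max(latencies)
```

On an embedded single-board computer, the worst-case latency is often more important than the mean, since a single late state estimate can destabilise the flight controller.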
Heterogeneous accelerators often disappoint. They offer the prospect of great performance, but only deliver it through vendor-specific optimized libraries or domain-specific languages. This requires considerable modification of legacy code, hindering the adoption of heterogeneous computing. This paper develops a novel approach to automatically detecting opportunities for accelerator exploitation. We focus on calculations that are well supported by established APIs: sparse and dense linear algebra, stencil codes, and generalized reductions and histograms. We call them idioms and use a custom constraint-based Idiom Description Language (IDL) to discover them within user code. Detected idioms are then mapped to BLAS libraries, cuSPARSE and clSPARSE, and two DSLs: Halide and Lift. We implemented the approach in LLVM and evaluated it on the NAS and Parboil sequential C/C++ benchmarks, where we detect 60 idiom instances. In the cases where idioms account for a significant part of the sequential execution time, we generate code that achieves 1.26x to over 20x speedup on integrated and external GPUs.
We present a dense volumetric simultaneous localisation and mapping (SLAM) framework that uses an octree representation for efficient fusion and rendering of either a truncated signed distance field (TSDF) or an occupancy map. The primary aim of this letter is to use a single representation of the environment that can be used not only for robot pose tracking and high-resolution mapping, but seamlessly for planning as well. We show that our highly efficient octree representation of space fits SLAM and planning purposes in a real-time control loop. In a comprehensive evaluation, we demonstrate dense SLAM accuracy and runtime performance on par with flat hashing approaches when using TSDF-based maps, and considerable speed-ups when using occupancy mapping compared to standard occupancy mapping frameworks. Our SLAM system can run at 10–40 Hz on a modern quad-core CPU, without the need for massive parallelization on a GPU. We furthermore demonstrate probabilistic occupancy mapping as an alternative to TSDF mapping in dense SLAM and show its direct applicability to online motion planning, using the example of informed rapidly-exploring random trees (RRT*).
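TSDF fusion of the kind referred to above is, per voxel, a weighted running average of truncated signed-distance observations. A minimal sketch of that per-voxel update, following the standard KinectFusion-style formulation rather than this letter's octree implementation (parameter names and the weight cap are illustrative):

```python
import numpy as np

def fuse_tsdf(tsdf, weight, sdf_obs, mu=0.1, max_weight=100.0):
    """Fuse one signed-distance observation into a voxel's TSDF value.

    tsdf, weight : currently stored value and confidence weight
    sdf_obs      : observed signed distance (metres) along the camera ray
    mu           : truncation band; observations are clamped to [-mu, mu]
    """
    f = np.clip(sdf_obs / mu, -1.0, 1.0)        # truncated, normalised SDF
    new_tsdf = (tsdf * weight + f) / (weight + 1.0)
    new_weight = min(weight + 1.0, max_weight)  # cap keeps the map responsive
    return new_tsdf, new_weight
```

Capping the weight bounds the influence of old observations, so the map can still adapt when the scene changes.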