Abstract and Figures

Real-time 3D space understanding is becoming prevalent across a wide range of applications and hardware platforms. To meet the desired Quality of Service (QoS), computer vision applications tend to be heavily parallelized and exploit any available hardware accelerators. Current approaches to achieving real-time computer vision revolve around programming languages typically associated with High Performance Computing, along with binding extensions for OpenCL or CUDA execution. Such implementations, although high performing, lack portability across the wide range of diverse hardware resources and accelerators. In this paper, we showcase how a complex computer vision application can be implemented within a managed runtime system. We discuss the complexities of achieving high-performing and portable execution across embedded and desktop configurations. Furthermore, we demonstrate that it is possible to achieve the QoS target of over 30 frames per second (FPS) by exploiting FPGA and GPGPU acceleration transparently through the managed runtime system.
{christos.kotselidis,james.clarkson,andrey.rodchenko,andy.nisbet,john.mawer,mikel.lujan}@manchester.ac.uk
http://dx.doi.org/10.1145/3050748.3050764
VEE’17, Xi’an, China C. Kotselidis et al.
[Figure, panels (a) and (b): processing pipeline (Input, Acquisition, Preprocessing, Tracking, Integration, Raycast, Rendering) and the tracking loop, whose Track and Solve steps are labelled Host and Device and run for 4 iterations at 80x60, 5 iterations at 160x120, and 10 iterations at 320x240.]
Heterogeneous Managed Runtime Systems: A Computer Vision Case Study VEE’17, Xi’an, China
[Figure: software stack. Applications: KFusion implementations (derived from SLAMBench) in Java7, Java8, C++, and OpenMP. Runtime layer: Maxine VM (T1X, C1X/Graal), OpenJDK (Client), memory managers (GC), OpenCL heterogeneous accelerator, MAST FPGA accelerator framework, and native (C++/OpenMP). Hardware: ARMv7, x86-64, GPUs, FPGAs. Legend: thin, thick.]
[Figure: bar chart comparing Hotspot-C2-1.8.0.25, Hotspot-Graal-21075 (Original), Maxine-Graal-20290 (Original), and Maxine-Graal-20381 (Current); vertical axis 0 to 100. Data labels in extraction order, not attributable to individual bars: 57.08, 75.98, 81.63, 36.36, 86.30, 68.96, 90.49, 35.13, 26.13, 99.7, 44.8, 72.64, 37.53, 50.34.]
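[Figure: bar chart of benchmark scores for MaxineVM-ARMv7, OpenJDK_1.7.0_40-Client, and OpenJDK_1.7.0_40-Server; vertical axis 0 to 80. Data labels below are grouped by series in legend order, which is the order they appear in the extracted text.]
Benchmark    MaxineVM-ARMv7    OpenJDK_1.7.0_40-Client    OpenJDK_1.7.0_40-Server
geomean      12                27                         40
startup      20                20                         20
compiler     8                 18                         34
compress     29                57                         65
crypto       17                44                         76
derby        5                 13                         24
mpegaudio    14                31                         50
scimark      8                 22                         31
sunflow      25                38                         49
xml          6                 28                         47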
[Figure: from serial methods to task graphs built with the OpenCL/Java API. Example task graph shown in the figure:]
preprocessingGraph = new TaskGraph()
.streamIn(depthImageInput)
.add(ImagingOps::mm2metersKernel,
scaledDepthImage,
depthImageInput, scalingFactor)
.add(ImagingOps::bilateralFilter,
pyramidDepths[0],
scaledDepthImage,
gaussian, eDelta, radius)
.mapAllTo(deviceMapping);
[Figure stages: Task Graph, Graph Optimizer, Optimized Graph, Runtime (Code Cache, Memory, Task Queue), Devices.]
- Users create Task Graphs with our OpenCL API.
- The compiler expands graphs to include data movement.
- Graph is optimized to remove redundant data transfers.
- Runtime schedules tasks on devices.
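The expansion and optimization steps listed in the figure can be sketched in plain Java. This is only an illustration under stated assumptions, not the paper's compiler: the `Task` class, the `COPY_IN`/`RUN` plan strings, and the method names are hypothetical; the first argument of each kernel in the example graph is assumed to be its output buffer; scalar arguments are ignored; and the host is assumed not to modify a buffer once it has been copied to the device.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of graph expansion and redundant-transfer removal. */
final class TransferElisionSketch {

    /** One task: a kernel name, the buffer it writes, and the buffers it reads. */
    static final class Task {
        final String name;
        final String output;
        final List<String> inputs;

        Task(String name, String output, String... inputs) {
            this.name = name;
            this.output = output;
            this.inputs = Arrays.asList(inputs);
        }
    }

    /** Naive expansion: copy every input buffer to the device before each task. */
    static List<String> expand(List<Task> tasks) {
        List<String> plan = new ArrayList<>();
        for (Task t : tasks) {
            for (String in : t.inputs) {
                plan.add("COPY_IN " + in);
            }
            plan.add("RUN " + t.name);
        }
        return plan;
    }

    /** Optimized expansion: skip copies of buffers already resident on the device. */
    static List<String> optimize(List<Task> tasks) {
        List<String> plan = new ArrayList<>();
        Set<String> onDevice = new HashSet<>();
        for (Task t : tasks) {
            for (String in : t.inputs) {
                if (onDevice.add(in)) {   // true only the first time the buffer is needed
                    plan.add("COPY_IN " + in);
                }
            }
            plan.add("RUN " + t.name);
            onDevice.add(t.output);       // the result stays on the device for later tasks
        }
        return plan;
    }

    /** The two-task preprocessing graph from the figure (output-first argument order assumed). */
    static List<Task> preprocessing() {
        return Arrays.asList(
            new Task("mm2metersKernel", "scaledDepthImage", "depthImageInput"),
            new Task("bilateralFilter", "pyramidDepths[0]", "scaledDepthImage", "gaussian"));
    }

    public static void main(String[] args) {
        System.out.println("naive:     " + expand(preprocessing()));
        System.out.println("optimized: " + optimize(preprocessing()));
    }
}
```

In this sketch the naive plan issues three host-to-device copies, while the optimized plan issues two, because `scaledDepthImage` is produced on the device by the first task and never needs to be copied back in.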
[Figure: frames per second versus frame number (0 to 1000) for C++ (2.72 FPS), Java (0.81 FPS), and Java/OpenCL (33.13 FPS).]
[Figure: speedup over Java (log10 scale) per pipeline stage (Acq., Pre., Tra., Int., Ray., Rend., Total) for C++ and Java/OpenCL.]
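As a rough reading of the frame rates in the figure legend, and assuming they are averages over the same input sequence (the extracted text does not confirm this), the implied end-to-end ratios and the per-frame time against the 30 FPS target work out as follows:

```latex
\[
\frac{33.13}{0.81} \approx 40.9 \quad (\text{Java/OpenCL vs. Java}), \qquad
\frac{33.13}{2.72} \approx 12.2 \quad (\text{Java/OpenCL vs. C++}), \qquad
\frac{2.72}{0.81} \approx 3.4 \quad (\text{C++ vs. Java})
\]
\[
\frac{1000\ \text{ms}}{33.13} \approx 30.2\ \text{ms per frame}
\;<\; \frac{1000\ \text{ms}}{30} \approx 33.3\ \text{ms per frame}
\]
```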
... In this work, we utilize the TornadoVM framework [12,18] to design and build a real-time GPU-accelerated ray tracer fully in Java, mainly due to its proven capabilities to achieve high performance graphics applications [29]. TornadoVM offers an API that allows developers to identify which methods and loops to parallelize through a set of annotations, such as the @Parallel annotation. ...
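To make the annotation-based style mentioned above concrete, here is a minimal sketch of a loop marked for parallelization. The `@Parallel` annotation declared here is a local stand-in written for this example, not the framework's own class, and on a plain JVM it has no effect: the loop simply runs sequentially. The method name `vectorAdd` is likewise illustrative.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Target;
import java.util.Arrays;

/** Hypothetical sketch of marking a loop as parallelizable with an annotation. */
final class ParallelLoopSketch {

    /** Stand-in marker annotation; a real framework ships its own. */
    @Target(ElementType.LOCAL_VARIABLE)
    @interface Parallel {}

    /** Element-wise addition: iterations are independent, so the loop is a parallelization candidate. */
    static void vectorAdd(float[] a, float[] b, float[] c) {
        for (@Parallel int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = {10f, 20f, 30f};
        float[] c = new float[3];
        vectorAdd(a, b, c);
        System.out.println(Arrays.toString(c));
    }
}
```

The point of the design is that the method stays ordinary Java: the annotation only tells a supporting compiler that each iteration may be mapped to a separate device thread.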
Preprint
Ray tracing has been typically known as a graphics rendering method capable of producing highly realistic imagery and visual effects generated by computers. More recently the performance improvements in Graphics Processing Units (GPUs) have enabled developers to exploit sufficient computing power to build a fair number of ray tracing applications with the ability to run in real time. Typically, real-time ray tracing is achieved by utilizing high performance kernels written in CUDA, OpenCL, and Vulkan which can be invoked by high-level languages via native bindings, a technique that fragments application code bases as well as limits portability. This paper presents a hardware-accelerated ray tracing rendering engine, fully written in Java, that can seamlessly harness the performance of underlying GPUs via the TornadoVM framework. Through this paper, we show the potential of Java and acceleration frameworks to process a compute-intensive application in real time. Our results indicate that it is possible to enable real-time ray tracing from Java by achieving up to 234, 152, and 45 frames per second at 720p, 1080p, and 4K resolutions, respectively.
Article
The increase in computational capability of low-power Arm architectures has seen them diversify from their more traditional domain of portable battery powered devices into data center servers, personal computers, and even supercomputers. Thus, managed languages (Java, JavaScript, etc.) that require a managed runtime environment (MRE) need to be ported to the Arm architecture, requiring an understanding of different design trade-offs. This paper studies how the lack of strong hardware support for Self Modifying Code (SMC) in low-power architectures (e.g. absence of cache coherence between instruction cache and data caches) affects Just-In-Time (JIT) compilation and runtime behavior in MREs. Specifically, we focus on the implementation and treatment of call-sites, which must maintain code consistency in the face of concurrent execution and modification to redirect control (patching) by the MRE. The lack of coherence is compounded by the maximum distance (reach) a call-site can jump to, which is more constrained (smaller) on Arm than on Intel/AMD. We present four different robust implementations for call-sites and discuss their advantages and disadvantages in the absence of strong hardware support for SMC. Finally, we evaluate each approach using a microbenchmark, further evaluating the best three techniques using three JVM benchmark suites and the open source MaxineVM, showcasing performance differences of up to 12%. Based on these observations, we propose extending code-cache partitioning strategies for JIT compiled code to encourage more efficient local branching for architectures with limited direct branch ranges.
Conference Paper
This paper addresses the fundamental question of how modern Big Data frameworks can dynamically and transparently exploit heterogeneous hardware accelerators. After presenting the major challenges that have to be addressed towards this goal, we describe our proposed architecture for automatic and transparent hardware acceleration of Big Data frameworks and applications. Our vision is to retain the uniform programming model of Big Data frameworks and enable automatic, dynamic Just-In-Time compilation of the candidate code segments that benefit from hardware acceleration to the corresponding format. In conjunction with machine learning-based device selection that respects user-defined constraints (e.g., cost and time), we enable dynamic code execution on GPUs and FPGAs transparently to the user. In addition, we dynamically re-steer execution at runtime based on the availability of resources. Our preliminary results demonstrate that our approach can accelerate an existing Apache Flink application by up to 16.5x.
Conference Paper
With micro-services continuously gaining popularity and low-power processors making their way into data centers, efficient execution of managed runtime systems on low-power architectures is also gaining interest. Apart from the inherent performance differences between high and low power processors, porting a managed runtime system to a low-power architecture may result in spuriously introducing additional overheads and design trade-offs. In this work we investigate how the lack of strong hardware support for Self Modifying Code (SMC) in low-power architectures influences Just-In-Time (JIT) compilation and execution in modern virtual machines. In particular, we examine how low-power architectures, with no or limited hardware support for SMC, impose restrictions on call-site implementations when the latter need to be patchable by the runtime system. We present four different memory-safe implementations for call-site generation and discuss their advantages and disadvantages in the absence of strong hardware support for SMC. Finally, we evaluate each technique on different workloads using micro-benchmarks, and we evaluate the best two techniques on the DaCapo benchmark suite, showcasing performance differences of up to 15%.
Preprint
Parallel skeletons are essential structured design patterns for efficient heterogeneous and parallel programming. They allow programmers to express common algorithms in such a way that it is much easier to read, maintain, debug and implement for different parallel programming models and parallel architectures. Reductions are one of the most common parallel skeletons. Many programming frameworks have been proposed for accelerating reduction operations on heterogeneous hardware. However, for the Java programming language, little work has been done for automatically compiling and exploiting reductions in Java applications on GPUs. In this paper we present our work in progress in utilizing compiler snippets to express parallelism on heterogeneous hardware. In detail, we demonstrate the usage of Graal's snippets, in the context of the Tornado compiler, to express a set of Java reduction operations for GPU acceleration. The snippets are expressed in pure Java with OpenCL semantics, simplifying the JIT compiler optimizations and code generation. We showcase that with our technique we are able to execute a predefined set of reductions on GPUs within 85% of the performance of the native code and reach up to 20x over the Java sequential execution.
Article
Heterogeneous computing has emerged as a means to achieve high performance and energy efficiency. Naturally, this trend has been accompanied by changes in software development norms that do not necessarily favor programmers. A prime example is the two most popular heterogeneous programming languages, CUDA and OpenCL, which expose several low-level features to the API making them difficult to use by non-expert users. Instead of using low-level programming languages, developers tend to prefer more high-level, object-oriented languages typically executed on managed runtime environments. Although many programmers might expect that such languages would have already been adapted for execution on heterogeneous hardware, the reality is that their support is either very limited or totally absent. This paper highlights the main reasons and complexities of enabling heterogeneous managed runtime systems and proposes a number of directions to address those challenges.
Conference Paper
In this paper, we describe our experiences in co-designing a domain-specific compilation stack. Our motivation stems from the missed optimization opportunities we observed while implementing a computer vision library in Java. To tackle the performance shortcomings, we developed Indigo, a computer vision API co-designed with a compilation plugin for optimizing computer vision applications. Indigo exploits the extensible nature of the Graal compiler, which provides invocation plugins that replace methods with dedicated nodes, and generates machine code compatible with both the Java Virtual Machine (JVM) and the SIMD hardware unit. Our approach improves performance by up to 66.75× when compared to pure Java implementations and by up to 2.75× when compared to the original C++ implementation. These performance improvements are the result of low-level concurrency, idiomatic implementation of algorithms, and keeping temporary objects in the wider vector unit registers.
Conference Paper
In this paper we describe Jacc, an experimental framework which allows developers to program GPGPUs directly from Java. The goal of Jacc is to allow developers to benefit from using heterogeneous hardware whilst minimizing the amount of code refactoring required. Jacc utilizes two key abstractions: tasks, which encapsulate all the information needed to execute code on a GPGPU; and task graphs, which capture both inter-task control-flow and data dependencies. These abstractions enable the Jacc runtime system to automatically choreograph data movement and synchronization between the host and the GPGPU, eliminating the need to explicitly manage disparate memory spaces. We demonstrate the advantages of Jacc, both in terms of programmability and performance, by evaluating it against existing Java frameworks. Experimental results show an average performance speedup of 19x, using an NVIDIA Tesla K20m GPU, and a 4x decrease in code complexity when compared with writing multi-threaded Java code across eight evaluated benchmarks.
Article
The end of Dennard scaling combined with stagnation in architectural and compiler optimizations makes it challenging to achieve significant performance deltas. Solutions based solely in hardware or software are no longer sufficient to maintain the pace of improvements seen during the past few decades. In hardware, the end of single-core scaling resulted in the proliferation of multi-core system architectures; however, this has forced complex parallel programming techniques into the mainstream. To further exploit physical resources, systems are becoming increasingly heterogeneous with specialized computing elements and accelerators. Programming across a range of disparate architectures requires a new level of abstraction that programming languages will have to adapt to. In software, emerging complex applications, from domains such as Big Data and computer vision, run on multi-layered software stacks targeting hardware with a variety of constraints and resources. Hence, optimizing for the power-performance (and resiliency) space requires experimentation platforms that offer quick and easy prototyping of hardware/software co-designed techniques. To that end, we present Project Beehive: A Hardware/Software co-designed stack for runtime and architectural research. Project Beehive utilizes various state-of-the-art software and hardware components along with novel and extensible co-design techniques. The objective of Project Beehive is to provide a modern platform for experimentation on emerging applications, programming languages, compilers, runtimes, and low-power heterogeneous many-core architectures in a full-system co-designed manner.
Article
Real-time dense computer vision and SLAM offer great potential for a new level of scene modelling, tracking and real environmental interaction for many types of robot, but their high computational requirements mean that use on mass market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it difficult for robotics and vision researchers to implement their algorithms in a performance-portable way. In this paper we introduce SLAMBench, a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption of a dense RGB-D SLAM system. SLAMBench provides a KinectFusion implementation in C++, OpenMP, OpenCL and CUDA, and harnesses the ICL-NUIM dataset of synthetic RGB-D sequences with trajectory and scene ground truth for reliable accuracy comparison of different implementations and algorithms. We present an analysis and breakdown of the constituent algorithmic elements of KinectFusion, and experimentally investigate their execution time on a variety of multicore and GPU-accelerated platforms. For a popular embedded platform, we also present an analysis of energy efficiency for different configuration alternatives.
Conference Paper
Heterogeneous computing has now become mainstream with virtually every desktop machine featuring accelerators such as Graphics Processing Units (GPUs). While heterogeneity offers the promise of high performance and high efficiency, it comes at the cost of huge programming difficulties. Languages and interfaces for programming such systems tend to be low-level and require expert knowledge of the hardware in order to achieve its potential. A promising approach for programming such heterogeneous systems is the use of array programming. This style of programming relies on well known parallel patterns that can be easily translated into GPU or other accelerator code. However, only little work has been done on integrating such concepts in mainstream languages such as Java. In this work, we propose a new Array Function interface implemented with the new features from Java 8. While similar in spirit to the new Stream API of Java, our API follows a different design based on reusability and composability. We demonstrate that this API can be used to generate OpenCL code for a simple application. We present encouraging preliminary performance results showing the potential of our approach.
Conference Paper
Torch7 is a versatile numeric computing framework and machine learning library that extends Lua. Its goal is to provide a flexible environment to design and train learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can easily be interfaced to third-party software thanks to Lua's light interface.
Conference Paper
We introduce the Imperial College London and National University of Ireland Maynooth (ICL-NUIM) dataset for the evaluation of visual odometry, 3D reconstruction and SLAM algorithms that typically use RGB-D data. We present a collection of handheld RGB-D camera sequences within synthetically generated environments. RGB-D sequences with perfect ground truth poses are provided as well as a ground truth surface model that enables a method of quantitatively evaluating the final map or surface reconstruction accuracy. Care has been taken to simulate typically observed real-world artefacts in the synthetic imagery by modelling sensor noise in both RGB and depth data. While this dataset is useful for the evaluation of visual odometry and SLAM trajectory estimation, our main focus is on providing a method to benchmark the surface reconstruction accuracy which to date has been missing in the RGB-D community despite the plethora of ground truth RGB-D datasets available.
Conference Paper
Escape Analysis allows a compiler to determine whether an object is accessible outside the allocating method or thread. This information is used to perform optimizations such as Scalar Replacement, Stack Allocation and Lock Elision, allowing modern dynamic compilers to remove some of the abstractions introduced by advanced programming models. The all-or-nothing approach taken by most Escape Analysis algorithms prevents all these optimizations as soon as there is one branch where the object escapes, no matter how unlikely this branch is at runtime. This paper presents a new, practical algorithm that performs control flow sensitive Partial Escape Analysis in a dynamic Java compiler. It allows Escape Analysis, Scalar Replacement and Lock Elision to be performed on individual branches. We implemented the algorithm on top of Graal, an open-source Java just-in-time compiler, and it performs well on a diverse set of benchmarks. In this paper, we evaluate the effect of Partial Escape Analysis on the DaCapo, ScalaDaCapo and SpecJBB2005 benchmarks, in terms of run-time, number and size of allocations and number of monitor operations. It performs particularly well in situations with additional levels of abstraction, such as code generated by the Scala compiler. It reduces the amount of allocated memory by up to 58.5%, and improves performance by up to 33%.
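The situation described above, an object that escapes on only one unlikely branch, can be illustrated with a small Java method. This is a sketch of the code shape only: the class and field names are hypothetical, and whether a given JIT compiler actually removes or defers the allocation depends on the compiler and its version.

```java
/** Hypothetical sketch of a method whose object escapes on only one branch. */
final class PartialEscapeSketch {

    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    /** Global sink: storing a reference here makes the object reachable outside the method. */
    static Point lastOutlier;

    static int manhattan(int x, int y, int limit) {
        Point p = new Point(x, y);              // candidate for scalar replacement
        int d = Math.abs(p.x) + Math.abs(p.y);
        if (d > limit) {
            lastOutlier = p;                    // rare branch: the only place p escapes
        }
        return d;                               // common path: p never leaves the method
    }

    public static void main(String[] args) {
        System.out.println(manhattan(3, -4, 100));
        System.out.println(manhattan(30, 40, 10));
        System.out.println(lastOutlier.x + "," + lastOutlier.y);
    }
}
```

An all-or-nothing analysis sees the store to `lastOutlier` and must treat `p` as escaping on every path. A branch-sensitive analysis can instead keep `x` and `y` as scalars on the common path and materialize the `Point` only inside the `if` block.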