Preprint

Compact Native Code Generation for Dynamic Languages on Micro-core Architectures


Abstract

Micro-core architectures combine many simple, low-memory, low-power CPU cores onto a single chip. By offering significant performance at low power consumption, this technology is of great interest not only for embedded, edge, and IoT applications, but also potentially as an accelerator for data-center workloads. Due to the restricted nature of such CPUs, these architectures have traditionally been challenging to program, not least because of the very constrained amounts of memory (often around 32KB) and the idiosyncrasies of the technology. More recently, dynamic languages such as Python have been ported to a number of micro-cores, but these ports are often delivered as interpreters, which carry an associated performance limitation. We target four objectives: performance, unlimited code size, portability between architectures, and maintaining the programmer-productivity benefits of dynamic languages. The limited memory available means that classic techniques employed by dynamic-language compilers, such as just-in-time (JIT) compilation, are simply not feasible. In this paper we describe the construction of a compilation approach for dynamic languages on micro-core architectures that aims to meet these four objectives, using Python as a vehicle for exploring its application in replacing the existing micro-core interpreter. Our experiments focus on the metrics of performance, architecture portability, minimum memory size, and programmer productivity, comparing our approach against writing native C code. The outcome of this work is the identification of a series of techniques that are not only suitable for compiling Python code, but also applicable to a wide variety of dynamic languages on micro-cores.
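To illustrate the general idea behind the abstract's approach, and not the authors' actual compiler, the sketch below shows a minimal ahead-of-time translator for a tiny, hypothetical Python subset (integer parameters and arithmetic) that emits C source on the host. Because all translation happens before deployment, only the compiled native code needs to fit in the device's ~32KB memory, with no interpreter or JIT resident on the micro-core; every name and the supported subset here are assumptions for illustration only.

```python
import ast

# Maps a restricted set of Python binary operators to their C spellings.
BINOPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}

def emit_expr(node):
    """Recursively translate an expression AST node to C source text."""
    if isinstance(node, ast.BinOp):
        op = BINOPS[type(node.op)]
        return f"({emit_expr(node.left)} {op} {emit_expr(node.right)})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return str(node.value)
    raise NotImplementedError(f"unsupported node: {ast.dump(node)}")

def compile_to_c(source):
    """Ahead-of-time compile a single restricted Python function to C.

    Only functions whose body is a single `return <int expression>` are
    handled; all parameters are assumed to be C ints.
    """
    fn = ast.parse(source).body[0]
    assert isinstance(fn, ast.FunctionDef), "expected a function definition"
    ret = fn.body[-1]
    assert isinstance(ret, ast.Return), "expected a trailing return"
    params = ", ".join(f"int {a.arg}" for a in fn.args.args)
    return (f"int {fn.name}({params}) {{\n"
            f"    return {emit_expr(ret.value)};\n"
            f"}}\n")

if __name__ == "__main__":
    print(compile_to_c("def axpy(a, x, y):\n    return a * x + y"))
```

The emitted C can then be cross-compiled with the target's toolchain (e.g. for an Epiphany or MicroBlaze core), so the per-core memory budget is spent on the kernel itself rather than on compilation machinery.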

