Chapter

Performance of the Vipera Framework for DSLs on Micro-Core Architectures


Abstract

Vipera provides a compiler and runtime framework for implementing dynamic Domain-Specific Languages on micro-core architectures. The performance and code size of the generated code is critical on these architectures. In this paper we present the results of our investigations into the efficiency of Vipera in terms of code performance and size.

Keywords: Domain-specific languages, Python, native code generation, RISC-V, micro-core architectures
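The paper's own code is not reproduced here, but as a rough, hypothetical sketch of the kind of Python-level kernel a dynamic DSL framework of this sort compiles to native code for a micro-core, consider the following; the microcore_kernel decorator and the kernel itself are invented for illustration and are not Vipera's actual API.

    # Hypothetical sketch only: the decorator is invented for illustration and is
    # not part of Vipera's documented API.
    import numpy as np

    def microcore_kernel(fn):
        # A real DSL framework would compile fn to native code (e.g. RISC-V)
        # and run it on a micro-core; here we simply run it on the host.
        def wrapper(*args):
            return fn(*args)
        return wrapper

    @microcore_kernel
    def vector_scale(data, factor):
        # Simple data-parallel kernel: scale every element of a vector.
        out = np.empty_like(data)
        for i in range(len(data)):
            out[i] = data[i] * factor
        return out

    print(vector_scale(np.arange(8, dtype=np.float32), 2.0))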

References
Article
Constantly increasing hardware parallelism poses more and more challenges to programmers and language designers. One approach to harness the massive parallelism is to move to task-based programming models that rely on runtime systems for dependency analysis and scheduling. Such models generally benefit from the existence of a global address space. This paper presents the parallel memory allocator of the Myrmics runtime system, in which multiple allocator instances organized in a tree hierarchy cooperate to implement a global address space with dynamic region support on distributed memory machines. The Myrmics hierarchical memory allocator is a step towards improved productivity and performance in parallel programming. Productivity is improved through the use of dynamic regions in a global address space, which provide a convenient shared memory abstraction for dynamic and irregular data structures. Performance is improved through scaling on manycore systems without system-wide cache coherency. We evaluate the stand-alone allocator on an MPI-based x86 cluster and find that it scales well for up to 512 worker cores, while it can outperform Unified Parallel C by a factor of 3.7-10.7x.
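As a rough conceptual illustration of region-based allocation (not the Myrmics allocator itself, which is a distributed C runtime), the sketch below models regions that own their allocations, form a tree hierarchy, and are freed in a single step; all names are invented for illustration.

    # Conceptual sketch of region-based allocation in a tree hierarchy.
    # This is an illustration of the idea, not the Myrmics implementation.
    class Region:
        def __init__(self, parent=None):
            self.parent = parent          # regions form a tree hierarchy
            self.objects = []             # allocations owned by this region
            self.children = []

        def alloc(self, size):
            buf = bytearray(size)         # stand-in for a real memory block
            self.objects.append(buf)
            return buf

        def subregion(self):
            child = Region(parent=self)
            self.children.append(child)
            return child

        def free_all(self):
            # Freeing a region releases everything it owns, including children.
            for child in self.children:
                child.free_all()
            self.children.clear()
            self.objects.clear()

    root = Region()
    temp = root.subregion()
    temp.alloc(1024)
    temp.free_all()                       # one call releases the whole subregion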
Conference Paper
High-level productivity languages such as Python or Matlab enable the use of computational resources by non-expert programmers. However, these languages often sacrifice program speed for ease of use. This paper proposes Parakeet, a library which provides a just-in-time (JIT) parallel accelerator for Python. Parakeet bridges the gap between the usability of Python and the speed of code written in efficiency languages such as C++ or CUDA. Parakeet accelerates data-parallel sections of Python that use the standard NumPy scientific computing library. Parakeet JIT compiles efficient versions of Python functions and automatically manages their execution on both GPUs and CPUs. We assess Parakeet on a pair of benchmarks and achieve significant speedups.
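For context, the kind of NumPy-based, data-parallel function such a JIT accelerator targets looks like the following; this is ordinary NumPy code and does not use Parakeet's own API.

    import numpy as np

    # Ordinary NumPy data-parallel code of the kind a JIT accelerator can target:
    # element-wise arithmetic and reductions over whole arrays.
    def normalise(x):
        mean = x.mean()
        std = x.std()
        return (x - mean) / std

    data = np.random.rand(1_000_000)
    print(normalise(data)[:5])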
Article
Cython is a Python language extension that allows explicit type declarations and is compiled directly to C. As such, it addresses Python's large overhead for numerical loops and the difficulty of efficiently using existing C and Fortran code, which Cython can interact with natively.
Article
We discuss the main novelties of the implementation of Lua 5.0: its register-based virtual machine, the new algorithm for optimizing tables used as arrays, the implementation of closures, and the addition of coroutines.
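As an informal illustration of what "register-based" means here (the real Lua VM is written in C and is far more sophisticated), a dispatch loop over three-operand instructions might look like the following sketch; the opcode set is invented for illustration.

    # Toy register-based VM: instructions name registers directly instead of
    # pushing and popping an operand stack. Purely illustrative, not Lua's VM.
    LOADK, ADD, MUL, PRINT = range(4)

    def run(program, num_registers=8):
        regs = [0] * num_registers
        for op, a, b, c in program:
            if op == LOADK:
                regs[a] = b                  # R[a] = constant b (operand c unused)
            elif op == ADD:
                regs[a] = regs[b] + regs[c]  # R[a] = R[b] + R[c]
            elif op == MUL:
                regs[a] = regs[b] * regs[c]
            elif op == PRINT:
                print(regs[a])

    # Computes (2 + 3) * 4 and prints 20.
    run([(LOADK, 0, 2, 0), (LOADK, 1, 3, 0), (ADD, 2, 0, 1),
         (LOADK, 3, 4, 0), (MUL, 4, 2, 3), (PRINT, 4, 0, 0)])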
Conference Paper
Python is a popular language for end-user software development in many application domains. End-users want to harness parallel compute resources effectively, by exploiting commodity manycore technology including GPUs. However, existing approaches to parallelism in Python are esoteric, and generally seem too complex for the typical end-user developer. We argue that implicit, or automatic, parallelization is the best way to deliver the benefits of manycore to end-users, since it avoids domain-specific languages, specialist libraries, complex annotations or restrictive language subsets. Auto-parallelization fits the Python philosophy, provides effective performance, and is convenient for non-expert developers. Despite being a dynamic language, we show that Python is a suitable target for auto-parallelization. In an empirical study of 3000+ open-source Python notebooks, we demonstrate that typical loop behaviour ‘in the wild’ is amenable to auto-parallelization. We show that staging the dependence analysis is an effective way to maximize performance. We apply classical dependence analysis techniques, then leverage the Python runtime’s rich introspection capabilities to resolve additional loop bounds and variable types in a just-in-time manner. The parallel loop nest code is then converted to CUDA kernels for GPU execution. We achieve orders of magnitude speedup over baseline interpreted execution and some speedup (up to 50x, although not consistently) over CPU JIT-compiled execution, across 12 loop-intensive standard benchmarks.
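As a small illustration of the kind of loop nest that is amenable to this sort of auto-parallelization, consider the following independent-iteration loop; the example is generic and is not taken from the paper's benchmark set.

    import numpy as np

    # Each (i, j) iteration writes a distinct element and reads only the inputs,
    # so there are no loop-carried dependences and the nest can be parallelized.
    def scaled_sum(a, b, alpha):
        out = np.empty_like(a)
        for i in range(a.shape[0]):
            for j in range(a.shape[1]):
                out[i, j] = alpha * a[i, j] + b[i, j]
        return out

    a = np.ones((64, 64))
    b = np.full((64, 64), 2.0)
    print(scaled_sum(a, b, 0.5)[0, 0])   # 2.5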
Conference Paper
The simplicity and flexibility of dynamic languages make them popular for prototyping and scripting, but the lack of compile-time type information makes it very challenging to generate efficient executable code. Inspired by ideas from scripting, just-in-time compilers, and optional type systems, we are developing Pallene, a statically typed companion language to the Lua scripting language. Pallene is designed to be amenable to standard ahead-of-time compilation techniques, to interoperate seamlessly with Lua (even sharing its runtime), and to be familiar to Lua programmers. In this paper, we compare the performance of the Pallene compiler against LuaJIT, a just-in-time compiler for Lua, and against C extension modules. The results suggest that Pallene can achieve similar levels of performance.
Conference Paper
Dynamic, interpreted languages, like Python, are attractive for domain experts and scientists experimenting with new ideas. However, the performance of the interpreter is often a barrier when scaling to larger data sets. This paper presents a just-in-time compiler for Python that focuses on scientific and array-oriented computing. Starting with the simple syntax of Python, Numba compiles a subset of the language into efficient machine code that is comparable in performance to a traditional compiled language. In addition, we share our experience in building a JIT compiler using LLVM [1].
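A minimal example of the style of code such a compiler targets is shown below, assuming the standard numba.njit decorator; the exact speedup depends on the workload and hardware.

    import numpy as np
    from numba import njit

    @njit
    def pairwise_sum(x):
        # Explicit loop over a NumPy array: slow in the interpreter,
        # compiled to machine code by the JIT on first call.
        total = 0.0
        for i in range(x.shape[0]):
            total += x[i]
        return total

    print(pairwise_sum(np.arange(1_000_000, dtype=np.float64)))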
Article
Modern parallel microprocessors deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level languages is often an unnecessarily difficult task. The characteristics of parallel microprocessors and the limitations of current programming methodologies motivate our design of Copperhead, a high-level data parallel language embedded in Python. The Copperhead programmer describes parallel computations via composition of familiar data parallel primitives supporting both flat and nested data parallel computation on arrays of data. Copperhead programs are expressed in a subset of the widely used Python programming language and interoperate with standard Python modules, including libraries for numeric computation, data visualization, and analysis. In this paper, we discuss the language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code. We define the restricted subset of Python which Copperhead supports and introduce the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations. We also outline the runtime support by which Copperhead programs interoperate with standard Python modules. We demonstrate the effectiveness of our techniques with several examples targeting the CUDA platform for parallel programming on GPUs. Copperhead code is concise, on average requiring 3.6 times fewer lines of code than CUDA, and the compiler generates efficient code, yielding 45-100% of the performance of hand-crafted, well optimized CUDA code.
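To give a flavour of the data-parallel-primitive style described here, the sketch below composes map-like operations in plain Python; it illustrates the programming model only and is not Copperhead's actual API or compiler.

    # Flat and nested data parallelism expressed as compositions of map-style
    # primitives; a plain-Python stand-in for the Copperhead programming model.
    def saxpy(alpha, xs, ys):
        # Flat data parallelism: one independent operation per element pair.
        return [alpha * x + y for x, y in zip(xs, ys)]

    def batched_saxpy(alpha, batches_x, batches_y):
        # Nested data parallelism: a parallel map over rows, each of which is
        # itself a parallel map over elements.
        return [saxpy(alpha, xs, ys) for xs, ys in zip(batches_x, batches_y)]

    print(saxpy(2.0, [1.0, 2.0], [10.0, 20.0]))        # [12.0, 24.0]
    print(batched_saxpy(2.0, [[1.0]], [[10.0]]))       # [[12.0]]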
Article
This paper describes the LINPACK Benchmark (41) and some of its variations commonly used to assess the performance of computer systems. Aside from the LINPACK benchmark suite, the TOP500 (43) and the HPL (48) code are presented. The latter is frequently used to obtain results for TOP500 submissions. Information is also given on how to interpret the results of the benchmark and how the results fit into the performance evaluation process.
Keywords: BLAS, benchmarking, high performance computing, HPL, linear algebra, LINPACK, TOP500
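As a simplified sketch of how a LINPACK-style figure is obtained (not the HPL benchmark code), one times the solution of a dense system Ax = b and converts the conventional 2/3*n^3 + 2*n^2 floating-point operation count into a GFLOP/s rate:

    import time
    import numpy as np

    # Simplified LINPACK-style measurement: solve a dense system A x = b and
    # report GFLOP/s using the conventional 2/3*n^3 + 2*n^2 operation count.
    n = 2000
    rng = np.random.default_rng(42)
    A = rng.random((n, n))
    b = rng.random(n)

    start = time.perf_counter()
    x = np.linalg.solve(A, b)
    elapsed = time.perf_counter() - start

    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    print(f"n={n}: {elapsed:.3f} s, {flops / elapsed / 1e9:.2f} GFLOP/s")
    print("residual:", np.linalg.norm(A @ x - b))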
Eithne micro-core benchmarking framework
  • M Jamieson
Orchestrating the execution of stream programs on multicore platforms
  • M Kudlur
  • S Mahlke