Roland Leißa's research while affiliated with Universität Mannheim and other places
What is this page?
This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
If you're a ResearchGate member, you can follow this page to keep up with this author's work.
If you are this author, and you don't want us to display this page anymore, please let us know.
Publications (25)
The pursuit of scientific knowledge strongly depends on the ability to reproduce and validate research results. It is a well-known fact that the scientific community faces challenges related to transparency, reliability, and the reproducibility of empirical published results. Consequently, the design and preparation of reproducible artifacts has a...
In recent years, the rapidly increasing number of reads produced by next-generation sequencing (NGS) technologies has driven the demand for efficient implementations of sequence alignments in bioinformatics. However, current state-of-the-art approaches are not able to leverage the massively parallel processing capabilities of modern GPUs with close...
FPGAs have found their way into data centers as accelerator cards, making reconfigurable computing more accessible for high-performance applications. At the same time, new high-level synthesis compilers like Xilinx Vitis and runtime libraries such as XRT attract software programmers into the reconfigurable domain. While software programmers are fam...
This paper investigates the suitability of the AnyDSL partial evaluation framework to implement tinyMD: an efficient, scalable, and portable simulation of pairwise interactions among particles. We compare tinyMD with the miniMD proxy application that scales very well on parallel supercomputers. We discuss the differences between both implementation...
FPGAs excel in low power and high throughput computations, but they are challenging to program. Traditionally, developers rely on hardware description languages like Verilog or VHDL to specify the hardware behavior at the register-transfer level. High-Level Synthesis (HLS) raises the level of abstraction, but still requires FPGA design knowledge. P...
Sequence alignments are fundamental to bioinformatics which has resulted in a variety of optimized implementations. Unfortunately, the vast majority of them are hand-tuned and specific to certain architectures and execution models. This not only makes them challenging to understand and extend, but also difficult to port to other platforms. We prese...
Monte-Carlo Renderers must generate many color samples to produce a noise-free image, and for each of those, they must evaluate complex mathematical models representing the appearance of the objects in the scene. These models are usually in the form of shaders: Small programs that are executed during rendering in order to compute a value for the cu...
This paper advocates programming high-performance code using partial evaluation. We present a clean-slate programming system with a simple, annotation-based, online partial evaluator that operates on a CPS-style intermediate representation. Our system exposes code generation for accelerators (vectorization/parallelization for CPUs and GPUs) via com...
Modern processors are often equipped with vector instruction sets. Such instructions operate on multiple elements of data at once, and greatly improve performance for specific applications. A programmer has two options to take advantage of these instructions: writing manually vectorized code, or using an auto-vectorizing compiler. In the latter cas...
In order to achieve the highest possible performance, the ray traversal and intersection routines at the core of every high-performance ray tracer are usually hand-coded, heavily optimized, and implemented separately for each hardware platform—even though they share most of their algorithmic core. The results are implementations that heavily mix al...
This paper investigates shallow embedding of DSLs by means of online partial evaluation. To this end, we present a novel online partial evaluator for continuation-passing style languages. We argue that it has, in contrast to prior work, a predictable termination policy that works well in practice. We present our approach formally using a continuati...
Partial evaluation allows for specialization of program fragments. This can be realized by staging, where one fragment is executed earlier than its surrounding code. However, taking advantage of these capabilities is often a cumbersome endeavor. In this paper, we present a new metaprogramming concept using staging parameters that are first-class ci...
A straightforward implementation of an algorithm in a general-purpose programming language usually does not deliver peak performance: Compilers often fail to automatically tune the code for certain hardware peculiarities like memory hierarchy or vector execution units. Manually tuning the code is firstly error-prone as well as time-consuming and se...
Nowadays, SIMD hardware is omnipresent in computers. Nonetheless, many software projects hardly make use of SIMD instructions: Applications are usually written in general-purpose languages like C++. However, general-purpose languages only provide poor abstractions for SIMD programming, enforcing an error-prone, assembly-like programming style. An al...
We present a simple SSA construction algorithm, which allows direct translation from an abstract syntax tree or bytecode into an SSA-based intermediate representation. The algorithm requires no prior analysis and ensures that even during construction the intermediate representation is in SSA form. This allows the application of SSA-based optimizati...
SIMD instructions have been common in CPUs for years now. Using these instructions effectively requires not only vectorization of code, but also modifications to the data layout. However, automatic vectorization techniques are often not powerful enough and suffer from restricted scope of applicability; hence, programmers often vectorize their programs ma...
Citations
... The literature on accelerating the Smith-Waterman and Needleman-Wunsch algorithms for sequence alignment is extensive [37][38][39][40]. The classical sequence alignment problem without additional heuristics computes the entire dynamic programming matrix for alignment. ...
... In addition, dataflow circuits are necessary for efficient C-to-circuit translation of any software program, as they are capable of handling variable latencies and erratic memory dependencies. Many HLS compilers have included dataflow modeling to generate an FPGA design in which all tasks are pipelined and executed concurrently [34,35]. ...
... In previous work we developed tinyMD [11], a proxy-app (also based on miniMD) created to evaluate the portability of MD applications with the AnyDSL framework. TinyMD uses higher-order functions to abstract device iteration loops, data layouts and communication strategies, providing a domain-specific library to implement pair-wise interaction kernels that execute efficiently on multi-CPU and multi-GPU targets. ...
... In this article, a new development tool, the HLS development platform, is used in combination. To address problems in traditional FPGA development processes, such as long development cycles and tedious debugging processes, this tool uses the C/C++ high-level programming language to design and implement algorithms, and synthesizes the designed program into Verilog language to complete the transformation to the RTL level [19]. In the synthesis process, optimization instructions such as pipeline and loop parallelism can be added to reduce system delay. ...
... Implementing DSLs this way typically promotes a more functional style of programming which allows for abstraction from concerns such as hardware dependence by separating these issues into dedicated functions that are passed to the core algorithms as higher-order arguments. While functional programming is often associated with slower execution, AnyDSL has demonstrated that the potential overhead of functional programming is succinctly eliminated using partial evaluation [20,24]. ...
... Recently, domain-extensible compilers such as Delite [14], L [15,16], or AnyDSL [17] provide an extensible set of program abstractions and optimisations, showing potential to mitigate Challenge 1 (Section 2.2). However, domain-extensible compilers remain relatively immature and lack adoption compared to established domain-specific compilers. ...
... We stress that runtime schedules are entirely independent of the execution state: for any execution state of any P-LLVM program any sequence of basic blocks of that program is a valid runtime schedule. We emphasize this aspect of P-LLVM because it breaks with a standard staple in data-parallel languages: it is common for language semantics to derive scheduling decisions from the execution state, e.g. by defining a stack-based scheduling mechanism that takes the threads' branching decisions into account [Farrell and Kieronska, 1996; Leißa, 2017; Habermaier and Knapp, 2012; Karrenberg, 2015; Coutinho et al., 2011; Sampaio et al., 2013; Alur et al., 2017]. We discuss restrictions on the schedule space in Chapter 5. ...
... The goal of CoreGen is three fold. First, CoreGen seeks to provide a rapid high level verification flow that is competitive in performance to traditional optimizing compilers [33] [30] [29]. As a result, the performance of the CoreGen verification flow is a first order design and implementation constraint. ...
... In regular programming languages with vector types where control-flow statements are not overloaded, the code has to be linearized: Non-scalar control flow must be replaced with data flow. The library RaTrace [Pér+17] performs this transformation manually, and uses type inference to avoid writing complex vector types. However, this forces the programmer to mask out inactive lanes by hand in conditionals, and clutters APIs and functions signatures with execution masks. ...
Reference: Generating renderers
... Other applications rely on domain-specific languages to generate parallel particle methods [21], but this approach requires the development of specific compilation tools that are able to generate efficient code. In this paper we explore the benefits from using the AnyDSL framework, where we shallow-embed [22,23] our domain-specific library into its front-end Impala and can then abstract device, memory layout and communication pattern through higher-order functions. Thus, we use the compiler tool-chain provided by AnyDSL and do not need to develop specific compilation tools for MD or particle methods. ...