All content in this area was uploaded by Frank Hannig on Jan 23, 2019
Domain-specific and Resource-aware Computing
Domänenspezifisches und ressourcengewahres Rechnen
Habilitation treatise (Habilitationsschrift)
venia legendi
Mentoring committee (Fachmentorat):
Reviewers (Gutachter):
Abstract
Resource-aware computing
domain-specific computing
modeling and system simulation
architecture/compiler co-design of invasive tightly coupled processor arrays (TCPAs)
domain-specific high-level synthesis (HLS)
heterogeneous image processing acceleration framework HIPAcc
ExaStencils: Advanced stencil-code engineering
Contents
1 Introduction
2 Resource-aware Computing
3 Domain-specific Computing
4 Conclusions
A Bibliography
B Image Credits
C Paper Reprints
List of Abbreviations
ACD
ADAS
ALU
APGAS
ASIC
ASIP
AST
AVX
BRAM
CGRA
CMOS
CNC
CPU
DAG
DLP
DoP
DSL
DSP
FLOPS
FPGA
FU
GPU
HDL
HLS
HPC
HSA
IC
ILP
IR
LoC
LPGS
LSGP
LUT
MIPS
MPI
MPSoC
NoC
NPP
OpenCL
OpenCV
PC
PDE
PE
QoR
RISC
SDK
SIMD
SNR
SoC
SQL
SSE
TCPA
TI
TPDL
UML
VHDL
VHLL
VHSIC
VLIW
1 Introduction
power wall
[Figure: Microprocessor trends from 1970 to 2015 on a logarithmic axis (1E-1 to 1E+7): transistors (thousands), single-thread performance (SpecINT), frequency (MHz), typical power (watts), and number of cores.]
utilization wall
dark silicon
Customization heterogeneity
1.1 Contributions
to master the design and programming
complexity of parallel systems as well as their rising heterogeneity
resource-aware computing
domain-specific computing
resource-aware programming
Resource-aware Computing
(1) Modeling and System Simulation.
invasive
invasive
actor models
(2) Architecture/Compiler Co-Design of Invasive TCPAs.
invade retreat
compact code generation
symbolic tiling symbolic scheduling
Domain-specific Computing
(3) Domain-specific High-level Synthesis.
template metaprogramming
generative programming
(4) The Heterogeneous Image Processing Acceleration Framework.
(5) The ExaStencils Approach.
Architecture and Compiler Design (ACD)
1.2 Papers of this Habilitation Treatise
Resource-aware Computing
Modeling and System Simulation Papers
DAC ’15
page 87ff.
Roloff, Schafhauser, Hannig, and Teich. “Execution-driven parallel simulation
of PGAS applications on heterogeneous tiled architectures”
[P24]
X10 ’16
page 93ff.
Roloff, Pöppl, Schwarzer, Wildermann, Bader, Glaß, Hannig, and Teich. “ActorX10: An actor library for X10”
[P16]
ESTIMedia ’17
page 99ff.
Roloff, Hannig, and Teich. “High performance network-on-chip simulation by
interval-based timing predictions”
[P4]
Papers on Architecture/Compiler Co-Design of Invasive TCPAs
ACM TECS ’14
page 109ff.
Hannig, Lari, Boppu, Tanase, and Reiche. “Invasive tightly-coupled processor
arrays: A domain-specific architecture/compiler co-design approach”
[J18]
RSP ’17
page 139ff.
Witterauf, Hannig, and Teich. “Constructing fast and cycle-accurate simulators
for configurable accelerators using C++ templates”
[P5]
Springer JSPS ’14
page 147ff.
Teich, Tanase, and Hannig. “Symbolic mapping of loop programs onto processor
arrays”
[J15]
MEMOCODE ’14
page 177ff.
Tanase, Witterauf, Teich, and Hannig. “Symbolic inner loop parallelisation for
massively parallel processor arrays”
[P32]
ACM TECS ’17
page 187ff.
Tanase, Witterauf, Teich, and Hannig. “Symbolic multi-level loop mapping of
loop programs for massively parallel processor arrays”
[J1]
ASAP ’16
page 215ff.
Witterauf, Tanase, Hannig, and Teich. “Modulo scheduling of symbolically tiled
loops for tightly coupled processor arrays”
[P15]
Springer JSPS ’14
page 225ff.
Boppu, Hannig, and Teich. “Compact code generation for tightly-coupled processor arrays”
[J17]
Domain-specific Computing
Domain-specific HLS Papers
ASAP ’14
page 251ff.
Schmid, Tanase, Hannig, Teich, Bhadouria, and Ghoshal. “Domain-specific augmentations for high-level synthesis”
[P37]
FPL ’14
page 257ff.
Schmid, Apelt, Hannig, and Teich. “An image processing library for C-based
high-level synthesis”
[P33]
Springer JSPS ’17
page 261ff.
Bhadouria, Tanase, Schmid, Hannig, Teich, and Ghoshal. “A novel image impulse noise removal algorithm optimized for hardware accelerators”
[J2]
ASAP ’17
page 279ff.
Özkan, Reiche, Hannig, and Teich. “Hardware design and analysis of efficient
loop coarsening and border handling for image processing”
[P9]
HIPAcc Papers
IEEE TPDS ’16
page 289ff.
Membarth, Reiche, Hannig, Teich, Körner, and Eckert. “HIPAcc: A domain-specific language and compiler for image processing”
[J9]
DATE ’14
page 305ff.
Membarth, Reiche, Hannig, and Teich. “Code generation for embedded heterogeneous architectures on Android”
[P41]
CODES+ISSS ’14
page 311ff.
Reiche, Schmid, Hannig, Membarth, and Teich. “Code generation from a
domain-specific language for C-based HLS of hardware accelerators”
[P31]
Elsevier JPDC ’14
page 321ff.
Membarth, Reiche, Schmitt, Hannig, Teich, Stürmer, and Köstler. “Towards a
performance-portable description of geometric multigrid algorithms using a
domain-specific language”
[J12]
FPL ’16
page 333ff.
Özkan, Reiche, Hannig, and Teich. “FPGA-based accelerator design from a
domain-specific language”
[P13]
Springer JSPS ’17
page 343ff.
Reiche, Özkan, Hannig, Teich, and Schmid. “Loop parallelization techniques for
FPGA accelerator synthesis”
[J5]
LCTES ’17
page 369ff.
Reiche, Kobylko, Hannig, and Teich. “Auto-vectorization for image processing
DSLs”
[P11]
ExaStencils Papers
ICCSA ’14
page 379ff.
Schmitt, Kuckuk, Köstler, Hannig, and Teich. “An evaluation of domain-specific
language technologies for code generation”
[P38]
Euro-Par ’14
page 389ff.
Lengauer, Apel, Bolten, Größlinger, Hannig, Köstler, Rüde, Teich, Grebhahn,
Kronawitter, Kuckuk, Rittich, and Schmitt. “ExaStencils: Advanced stencil-code engineering”
[P35]
WOLFHPC ’14
page 401ff.
Schmitt, Kuckuk, Hannig, Köstler, and Teich. “ExaSlang: A domain-specific language for highly scalable multigrid solvers”
[P29]
Springer LNCSE ’16
page 411ff.
Schmitt, Kuckuk, Hannig, Teich, Köstler, Rüde, and Lengauer. “Systems of partial differential equations in ExaSlang”
[C1]
1.3 Structure of this Habilitation Treatise
resource-aware computing
modeling and system simulation architecture/compiler co-design
of invasive tightly coupled processor arrays
domain-specific computing
2 Resource-aware Computing
resource-aware computing
“resource”
noun
“aware”
adjective [with adverb or in combination]
“computing” noun
Resources physical
virtual
awareness
thieves
challenge of resource-aware program execution
https://en.oxforddictionaries.com
invasive computing
2.1 Invasive Computing
invasive algorithms invasive architectures
invasive computing
resource-aware programming
invade infect retreat
reinvasion partial retreat
reinfect
[Figure: State chart of an invasive program with the states start, invade, infect, retreat, and exit.]
modeling and simulation of invasive applications and invasive architectures
compilation and architecture research
2.2 Modeling and System Simulation
2.2.1 Goals
2.2.2 Approach
invade infect retreat
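[Figure: InvadeSIM overview with an architecture model and an application model (InvadeX10 / ActorX10). The application model is illustrated by the following X10 example:]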
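// Build a conjunction of constraints: two processing elements (PEs) of type RISC.
val c = new AND();
c.add(new TypeConstraint(PEType.RISC));
c.add(new PEQuantity(2));
// invade: request resources that satisfy the constraints and combine them with the home claim.
val claim = homeClaim + Claim.invade(c);
// An i-let is the piece of code to be executed on the claimed resources.
val ilet = (id:IncarnationID) => {
    Console.OUT.println("Hello, World!");
};
// infect: execute the i-let on the claimed resources.
claim.infect(ilet);
// retreat: release the claimed resources.
claim.retreat();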
[Figure, continued. Time warping: a time interval Δt on the host processor (wall clock time) is mapped to a time interval Δt on the target processor (simulated time) using performance counters (number of executed instructions I). Processor simulation: start processor simulation, simulation, stop processor simulation, time warping, event generation. Synchronization: simulation threads and a synchronization thread meet at a barrier at each synchronization point; the global time is advanced and checked (global time == local time, global time < local time). Architecture model: a tiled architecture with CPU, i-Core, TCPA, memory, and I/O tiles, each attached through a network adapter (NA) to a NoC router. Output: simulation results.]
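The invade, infect, and retreat phases of the X10 example in the figure above can be mimicked in a few lines of Python. This is only a minimal sketch under stated assumptions: `ResourcePool`, `Claim`, and their methods are hypothetical names invented for illustration and are not the InvadeX10 API, and the i-let is executed sequentially instead of in parallel on real processing elements.

```python
from dataclasses import dataclass


@dataclass
class ResourcePool:
    """Hypothetical pool of free processing elements (PEs), counted per PE type."""
    free: dict

    def invade(self, pe_type, quantity):
        """Reserve `quantity` PEs of `pe_type`; return None if too few are free."""
        if self.free.get(pe_type, 0) < quantity:
            return None
        self.free[pe_type] -= quantity
        return Claim(self, pe_type, quantity)


@dataclass
class Claim:
    """Hypothetical handle for the PEs reserved by one invade call."""
    pool: ResourcePool
    pe_type: str
    quantity: int
    active: bool = True

    def infect(self, ilet):
        """Run the i-let once per claimed PE (sequentially in this sketch)."""
        if not self.active:
            raise RuntimeError("claim was already released")
        return [ilet(pe_id) for pe_id in range(self.quantity)]

    def retreat(self):
        """Give the claimed PEs back to the pool."""
        if self.active:
            self.pool.free[self.pe_type] += self.quantity
            self.active = False


pool = ResourcePool({"RISC": 4})
claim = pool.invade("RISC", 2)                                  # invade
greetings = claim.infect(lambda pe_id: f"Hello, World! from PE {pe_id}")  # infect
free_while_claimed = pool.free["RISC"]
claim.retreat()                                                 # retreat
free_after_retreat = pool.free["RISC"]
```

The point of the sketch is the resource-aware contract: an application learns at invade time whether its request can be met (here, `None` signals failure) and explicitly returns resources when it no longer needs them.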
time warping
approximately timed simulation discrete event
synchronization
time warping
i-lets
parallel simulation
hybrid
network-on-chip simulation
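The idea behind time warping, as far as the figure labels convey it, is to convert work measured on the host (the number of executed instructions I read from performance counters) into simulated time on the target processor. The following Python sketch shows one simple way to do such a conversion. It is an assumption for illustration, not the model used by InvadeSIM: the function `warp_time`, the class `SimulatedClock`, and the parameters `target_cpi` (average cycles per instruction on the target) and `target_freq_hz` are hypothetical.

```python
def warp_time(instructions, target_cpi, target_freq_hz):
    """Estimate the simulated time in seconds for `instructions` on the target.

    Assumed model: time = instructions * cycles per instruction / clock frequency.
    """
    if instructions < 0 or target_cpi <= 0 or target_freq_hz <= 0:
        raise ValueError("instructions must be >= 0; CPI and frequency must be > 0")
    return instructions * target_cpi / target_freq_hz


class SimulatedClock:
    """Hypothetical local clock of one simulated processor."""

    def __init__(self):
        self.local_time = 0.0

    def advance(self, instructions, target_cpi, target_freq_hz):
        """Advance the local simulated time by the warped duration of a code section."""
        self.local_time += warp_time(instructions, target_cpi, target_freq_hz)
        return self.local_time


clock = SimulatedClock()
first = clock.advance(1_000_000, target_cpi=1.5, target_freq_hz=100e6)
second = clock.advance(500_000, target_cpi=1.5, target_freq_hz=100e6)
```

With these assumed numbers, one million instructions at 1.5 cycles per instruction on a 100 MHz target correspond to 15 ms of simulated time, regardless of how long the host needed to execute them.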
2.2.3 Results
4×4
16×16
2.2.4 Key Papers
DAC ’15
page 87ff.
Roloff, Schafhauser, Hannig, and Teich. “Execution-driven parallel simulation
of PGAS applications on heterogeneous tiled architectures”
[P24]
X10 ’16
page 93ff.
Roloff, Pöppl, Schwarzer, Wildermann, Bader, Glaß, Hannig, and Teich. “ActorX10: An actor library for X10”
[P16]
ESTIMedia ’17
page 99ff.
Roloff, Hannig, and Teich. “High performance network-on-chip simulation by
interval-based timing predictions”
[P4]
flit-by-flit
flow control digit
2.3 Architecture/Compiler Co-Design of Invasive Tightly Coupled Processor Arrays
compiler-friendly architectures
architecture-friendly / retargetable design tools and compilers
2.3.1 Challenges
2.3.2 Approach
invasion controller i
[Figure: Invasive tiled architecture with CPU, iCore, TCPA, memory, and I/O tiles, each attached through a network adapter (NA) to a NoC router, and the structure of a TCPA tile: an array of processing elements (PEs), each with an iCtrl unit, surrounded on four sides by I/O buffers with IM, GC, and AG blocks, together with a configuration manager, a Conf. & Com. processor (LEON3), and an IRQ Ctrl. attached to an Advanced High-performance Bus (AHB).]
orthotope
processor classes
2.3.3 Results
atomic iterations
Atomic execution tile
Atomic iteration
2.3.4 Key Papers
ACM TECS ’14
page 109ff.
Hannig, Lari, Boppu, Tanase, and Reiche. “Invasive tightly-coupled processor
arrays: A domain-specific architecture/compiler co-design approach”
[J18]
RSP ’17
page 139ff.
Witterauf, Hannig, and Teich. “Constructing fast and cycle-accurate simulators
for configurable accelerators using C++ templates”
[P5]
Springer JSPS ’14
page 147ff.
Teich, Tanase, and Hannig. “Symbolic mapping of loop programs onto processor
arrays”
[J15]
MEMOCODE ’14
page 177ff.
Tanase, Witterauf, Teich, and Hannig. “Symbolic inner loop parallelisation for
massively parallel processor arrays”
[P32]
ACM TECS ’17
page 187ff.
Tanase, Witterauf, Teich, and Hannig. “Symbolic multi-level loop mapping of
loop programs for massively parallel processor arrays”
[J1]
ASAP ’16
page 215ff.
Witterauf, Tanase, Hannig, and Teich. “Modulo scheduling of symbolically tiled
loops for tightly coupled processor arrays”
[P15]
Springer JSPS ’14
page 225ff.
Boppu, Hannig, and Teich. “Compact code generation for tightly-coupled processor arrays”
[J17]
path strides
3 Domain-specific Computing
programmability gap
performance generality productivity
Performance
Generality,
expressiveness
general-purpose
Productivity
Turing completeness computational universality
[Figure: Performance, generality, and productivity compared for C / C++, Ruby, Matlab, and domain-specific languages.]
3.1 Domain-specific Languages
machine independence
problem orientation
Design of Real-Time Computer Systems
natural
problem-oriented languages
libraries
Programming Languages:
History and Fundamentals
problem-oriented
knowledge
domain
domain knowledge
Domain-Specific programming Languages (DSLs)
math
array programming language
3.1.1 Definition
domain-specific language
Domain-Specific Languages
particular field of application domain
abstractions notations
small
Limited expressiveness
declarative
Nature of a programming language
Specification languages, little languages, micro-languages, minilanguages, task-specific programming languages, Very High-Level programming Languages (VHLLs), special purpose languages, languages for specialized application areas
3.1.2 Classification of DSLs
textual graphical
internal external
external DSL
internal DSL
host language
high-level
programming languages
meaning
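To make the internal/external distinction concrete: an internal DSL reuses the syntax and tooling of a host language, whereas an external DSL needs its own parser. The Python sketch below is a generic illustration of the internal style and is not taken from any of the frameworks named in this treatise. All class and function names (`Expr`, `Const`, `Var`, `BinOp`, `wrap`) are hypothetical. Operator overloading lets an ordinary host-language expression build an expression tree, which can then be evaluated or turned into code text.

```python
class Expr:
    """Base class: arithmetic operators build tree nodes instead of computing."""

    def __add__(self, other):
        return BinOp("+", self, wrap(other))

    def __radd__(self, other):
        return BinOp("+", wrap(other), self)

    def __mul__(self, other):
        return BinOp("*", self, wrap(other))

    def __rmul__(self, other):
        return BinOp("*", wrap(other), self)


class Const(Expr):
    def __init__(self, value):
        self.value = value

    def evaluate(self, env):
        return self.value

    def emit(self):
        return repr(self.value)


class Var(Expr):
    def __init__(self, name):
        self.name = name

    def evaluate(self, env):
        return env[self.name]

    def emit(self):
        return self.name


class BinOp(Expr):
    def __init__(self, op, lhs, rhs):
        self.op, self.lhs, self.rhs = op, lhs, rhs

    def evaluate(self, env):
        left, right = self.lhs.evaluate(env), self.rhs.evaluate(env)
        return left + right if self.op == "+" else left * right

    def emit(self):
        return f"({self.lhs.emit()} {self.op} {self.rhs.emit()})"


def wrap(value):
    """Turn plain host-language numbers into DSL constants."""
    return value if isinstance(value, Expr) else Const(value)


x, y = Var("x"), Var("y")
expr = 2 * x + y                           # host-language syntax, DSL semantics
code = expr.emit()                         # "((2 * x) + y)"
result = expr.evaluate({"x": 3, "y": 4})   # 10
```

The design trade-off is the usual one for internal DSLs: no separate parser is needed and host tooling keeps working, but the notation is limited to what the host language's syntax permits.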
[Figure: Relationship between domain-specific language, domain-specific extensions, and host language, shown in two panels: Extension and Reduction.]
embedded
embedded DSL
extensions
reduction
3.2 Domain-specific High-level Synthesis
polyhedron model
affine loop nests
dynamic
piecewise linear/regular algorithms
recurrence equations
system of uniform recurrence equations
constant and variable propagation, common subexpression elimination, loop perfectization, dead code elimination, strength reduction of operators, (partial) unrolling of l