
Domain-specific and Resource-aware Computing

Abstract

This cumulative habilitation treatise summarizes the research I have conducted with my group Architecture and Compiler Design (ACD) at the Chair of Hardware/Software Co-Design, focusing on selected results published within the last four years. My research falls mainly into two categories: resource-aware computing and domain-specific computing. Both paradigms tackle the very complex programming and design challenge of parallel heterogeneous computer architectures, with different but partly shared goals, e.g., performance, resource utilization, energy efficiency, predictability of even multiple execution qualities, or programming effort. While resource-aware computing provides a full control loop from hardware status information to the program level and back, domain-specific computing drastically separates the concerns of algorithm development and target architecture implementation (including parallelization and low-level implementation details). In the context of resource-aware computing, my research can be further subdivided into (1) modeling and system simulation and (2) architecture/compiler co-design of invasive tightly coupled processor arrays (TCPAs). In the area of domain-specific computing, three approaches are presented: (3) domain-specific high-level synthesis (HLS), (4) the heterogeneous image processing acceleration framework HIPAcc, and (5) the ExaStencils approach to advanced stencil-code engineering.
Domain-specic and Resource-aware Computing
Domänenspezisches und ressourcengewahres Rechnen
   
 

Habilitation treatise (Habilitationsschrift)

venia legendi

Mentoring committee (Fachmentorat):

Reviewers (Gutachter):
  
     
 

Contents
1 Introduction
2 Resource-aware Computing
3 Domain-specific Computing
4 Conclusions
A Bibliography
B Image Credits
C Paper Reprints

List of Abbreviations
ACD Architecture and Compiler Design
ADAS    
ALU   
APGAS     
ASIC   
ASIP    
AST   
AVX   
BRAM    
CGRA   
CMOS    
CNC   
CPU   
DAG   
DLP  
DoP   
DSL   
DSP   
FLOPS    
FPGA    
FU  
GPU   

HDL   
HLS high-level synthesis
HPC  
HSA   
IC  
ILP  
IR  
LoC   
LPGS    
LSGP    
LUT  
MIPS    
MPI   
MPSoC  
NoC 
NPP   
OpenCL   
OpenCV    
PC  
PDE   
PE  
QoR   
RISC    
SDK   
SIMD    
SNR  
SoC 
SQL   
SSE   
TCPA tightly coupled processor array
TI  
TPDL    
UML   
VHDL    
VHLL    
VHSIC    
VLIW    

1 Introduction
         
            
          
             
            
       

 
            
           
             
            
          

            
             
 

           

               
         
           
            
       
           
            
            
             
         
 

           
   

        
            

            
              
        power wall    
           
           
      

     
           
               
[Figure: transistors (thousands), single-thread performance (SpecINT), frequency (MHz), typical power (watts), and number of cores, plotted on a logarithmic axis from 1E-1 to 1E+7 over the years 1970 to 2015]
            
           
              
            
     
            
            
         
             
           
             
          utilization wall 
   dark silicon         
               
         
             
  

 Customization  heterogeneity    
       
          
     
 
 

      
           
        
 

          
          
          
         
         
           
     
1.1 Contributions
        to master the design and programming
complexity of parallel systems as well as their rising heterogeneity   
           


          
  

        
 

           
    

 

  
        
            
        
         resource-aware computing   
 domain-specic computing       
            
      resource-aware programming      
          
         
         
      
Resource-aware Computing
(1) Modeling and System Simulation.
     
          
 invasive        
           
     invasive    
         
    actor models  
   
  
         
         
     
   

  
       
         
(2) Architecture/Compiler Co-Design of Invasive TCPAs.
  
   

 
      
          
   invade    retreat    
   

      
 
         
  
         
        
       compact code generation 

     symbolic tiling  symbolic scheduling  
 
 

 
         
        
Domain-specic Computing
(3) Domain-specic High-level Synthesis.
  

         
            
       template metaprogramming
 generative programming      
      

          


  
 
            
        
(4) The Heterogeneous Image Processing Acceleration Framework.
  

            
           
       
         
            
    
 
    


     

 
   
       
     
(5) The ExaStencils Approach.
     
         
            
   
 
     
   

     
   
      
          
      
           
             
 Architecture and Compiler Design (ACD)
     
            
         
          
                
         
1.2 Papers of this Habilitation Treatise
           
              
            
              
 
Resource-aware Computing
Modeling and System Simulation Papers
DAC ’15 (page 87ff.): Roloff, Schafhauser, Hannig, and Teich. “Execution-driven parallel simulation of PGAS applications on heterogeneous tiled architectures” [P24]
X10 ’16 (page 93ff.): Roloff, Pöppl, Schwarzer, Wildermann, Bader, Glaß, Hannig, and Teich. “ActorX10: An actor library for X10” [P16]
ESTIMedia ’17 (page 99ff.): Roloff, Hannig, and Teich. “High performance network-on-chip simulation by interval-based timing predictions” [P4]
     

       
 
Papers on Architecture/Compiler Co-Design of Invasive TCPAs
ACM TECS ’14 (page 109ff.): Hannig, Lari, Boppu, Tanase, and Reiche. “Invasive tightly-coupled processor arrays: A domain-specific architecture/compiler co-design approach” [J18]
RSP ’17 (page 139ff.): Witterauf, Hannig, and Teich. “Constructing fast and cycle-accurate simulators for configurable accelerators using C++ templates” [P5]
Springer JSPS ’14 (page 147ff.): Teich, Tanase, and Hannig. “Symbolic mapping of loop programs onto processor arrays” [J15]
MEMOCODE ’14 (page 177ff.): Tanase, Witterauf, Teich, and Hannig. “Symbolic inner loop parallelisation for massively parallel processor arrays” [P32]
ACM TECS ’17 (page 187ff.): Tanase, Witterauf, Teich, and Hannig. “Symbolic multi-level loop mapping of loop programs for massively parallel processor arrays” [J1]
ASAP ’16 (page 215ff.): Witterauf, Tanase, Hannig, and Teich. “Modulo scheduling of symbolically tiled loops for tightly coupled processor arrays” [P15]
Springer JSPS ’14 (page 225ff.): Boppu, Hannig, and Teich. “Compact code generation for tightly-coupled processor arrays” [J17]
Domain-specic Computing
Domain-specic HLS Papers
ASAP ’14
page 251ff.
Schmid, Tanase, Hannig, Teich, Bhadouria, and Ghoshal. “Domain-specic aug-
mentations for high-level synthesis”
[P37]
FPL ’14
page 257ff.
Schmid, Apelt, Hannig, and Teich. “An image processing library for C-based
high-level synthesis”
[P33]
Springer JSPS ’17
page 261ff.
Bhadouria, Tanase, Schmid, Hannig, Teich, and Ghoshal. “A novel image im-
pulse noise removal algorithm optimized for hardware accelerators”
[J2]
ASAP ’17
page 279ff.
Özkan, Reiche, Hannig, and Teich. “Hardware design and analysis of efcient
loop coarsening and border handling for image processing”
[P9]
     
HIPAcc Papers
IEEE TPDS ’16 (page 289ff.): Membarth, Reiche, Hannig, Teich, Körner, and Eckert. “HIPAcc: A domain-specific language and compiler for image processing” [J9]
DATE ’14 (page 305ff.): Membarth, Reiche, Hannig, and Teich. “Code generation for embedded heterogeneous architectures on Android” [P41]
CODES+ISSS ’14 (page 311ff.): Reiche, Schmid, Hannig, Membarth, and Teich. “Code generation from a domain-specific language for C-based HLS of hardware accelerators” [P31]
Elsevier JPDC ’14 (page 321ff.): Membarth, Reiche, Schmitt, Hannig, Teich, Stürmer, and Köstler. “Towards a performance-portable description of geometric multigrid algorithms using a domain-specific language” [J12]
FPL ’16 (page 333ff.): Özkan, Reiche, Hannig, and Teich. “FPGA-based accelerator design from a domain-specific language” [P13]
Springer JSPS ’17 (page 343ff.): Reiche, Özkan, Hannig, Teich, and Schmid. “Loop parallelization techniques for FPGA accelerator synthesis” [J5]
LCTES ’17 (page 369ff.): Reiche, Kobylko, Hannig, and Teich. “Auto-vectorization for image processing DSLs” [P11]
ExaStencils Papers
ICCSA ’14 (page 379ff.): Schmitt, Kuckuk, Köstler, Hannig, and Teich. “An evaluation of domain-specific language technologies for code generation” [P38]
Euro-Par ’14 (page 389ff.): Lengauer, Apel, Bolten, Größlinger, Hannig, Köstler, Rüde, Teich, Grebhahn, Kronawitter, Kuckuk, Rittich, and Schmitt. “ExaStencils: Advanced stencil-code engineering” [P35]
WOLFHPC ’14 (page 401ff.): Schmitt, Kuckuk, Hannig, Köstler, and Teich. “ExaSlang: A domain-specific language for highly scalable multigrid solvers” [P29]
Springer LNCSE ’16 (page 411ff.): Schmitt, Kuckuk, Hannig, Teich, Köstler, Rüde, and Lengauer. “Systems of partial differential equations in ExaSlang” [C1]
1.3 Structure of this Habilitation Treatise
            
       
   resource-aware computing        
         
            
  modeling and system simulation    architecture/compiler co-design
of invasive tightly coupled processor arrays       
              
     
       
     domain-specic computing    
           
  

        
            
           
                

      
          
            

      
      
     
            
            
  
2 Resource-aware Computing
   resource-aware computing         
          
“resource” noun
“aware” adjective [with adverb or in combination]
“computing” noun
Resources          physical   
          
  virtual        
             
           
    awareness     
               
         
            
              
          
             
             
             
          
         
           
          
thieves              
   
 challenge of resource-aware program execution     
           
          
            
https://en.oxforddictionaries.com
           
               
         
       
          
  invasive computing
2.1 Invasive Computing
       invasive algorithms  invasive architectures
          
 invasive computing       
     
            
     resource-aware programming   
          
             

  
           
     

      
           
            
   invade, infect, and retreat     
             
             
             
           
              
               
               
 reinvasion    partial retreat     
reinfect       
            
    
          
            
          
         modeling and simulation of
           
                

    
[Figure: state chart of an invasive program with the states start, invade, infect, retreat, and exit.]
             
invasive applications and invasive architectures    compilation and architecture
research     
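
The phase sequence of an invasive program named above (start, invade, infect, retreat, exit, together with reinvasion, reinfect, and partial retreat) can be illustrated as a small transition table. This is a minimal Python sketch for illustration only: `TRANSITIONS` and `is_valid_run` are hypothetical names, and the extra edges for reinvasion, reinfect, and partial retreat are assumptions about where these loops attach, not the definitive state chart.

```python
# Allowed phase transitions of an invasive program.
# The chain start -> invade -> infect -> retreat -> exit follows the state chart;
# the additional edges (reinvasion, reinfect, partial retreat) are ASSUMPTIONS.
TRANSITIONS = {
    "start": {"invade"},
    "invade": {"infect"},
    "infect": {"infect", "invade", "retreat"},   # reinfect, reinvasion (assumed)
    "retreat": {"invade", "infect", "exit"},     # partial retreat may continue (assumed)
}


def is_valid_run(phases):
    """Return True if the phase list starts at 'start', ends at 'exit',
    and uses only transitions listed in TRANSITIONS."""
    if not phases or phases[0] != "start" or phases[-1] != "exit":
        return False
    return all(nxt in TRANSITIONS.get(cur, set())
               for cur, nxt in zip(phases, phases[1:]))


simple_run = ["start", "invade", "infect", "retreat", "exit"]
reinvading_run = ["start", "invade", "infect", "invade", "infect", "retreat", "exit"]
broken_run = ["start", "infect", "exit"]  # infects without having invaded
```

Under these assumptions, `is_valid_run(simple_run)` and `is_valid_run(reinvading_run)` hold, while `broken_run` is rejected because resources are never claimed before use.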
2.2 Modeling and System Simulation
2.2.1 Goals
            
        
            
       
            
           
           

2.2.2 Approach
            
 
           
          
        
              
      invade, infect, and retreat   
           
           
          
              
              
           
      
     
          
            
    
        
       

[Figure: InvadeSIM (labels recovered from the figure; the layout was lost in extraction).
 Application Model (InvadeX10 / ActorX10), with the example listing:

    val c = new AND();
    c.add(new TypeConstraint(PEType.RISC));
    c.add(new PEQuantity(2));
    val claim = homeClaim + Claim.invade(c);
    val ilet = (id:IncarnationID) => {
        Console.OUT.println("Hello, World!");
    };
    claim.infect(ilet);
    claim.retreat();

 Architecture Model: tiles with CPU, i-Core, TCPA, Memory, and I/O blocks, network adapters (NA), and NoC routers.
 Processor Simulation: Host Processor and Target Processor; Performance Counters; Number of executed instructions I; Time Warping; time interval t on the host processor (wall clock time); time interval t on the target processor (simulated time).
 Synchronization: Simulation Thread (Start Processor Simulation, Simulation, Stop Processor Simulation, Time Warping, Event Generation, synchronization point, Barrier, Check Global Time with the cases global time == local time and global time < local time); Synchronization Thread (Barrier, Advance Global Time).
 Simulation Results.]
          
        time warping  
   
          


          
      approximately timed simulation   discrete event
synchronization         
   

     
           
           

    
    time warping     
            
          
   i-lets          
          
            
              
            
             
             
     parallel simulation 
  hybrid
network-on-chip simulation 
      

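
The labels of the InvadeSIM figure suggest that time warping converts the number of executed instructions I, obtained from performance counters on the host, into simulated time on the target, and that simulation threads compare their local time with a global time at synchronization points. The following Python sketch illustrates one possible reading. The class `TargetModel`, its fields, the constant cycles-per-instruction model, and all function names are assumptions made for illustration; they are not taken from InvadeSIM.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TargetModel:
    """Hypothetical description of a simulated target core (assumption)."""
    frequency_hz: float             # clock frequency of the target core
    cycles_per_instruction: float   # assumed constant average CPI


def warp_time_ns(instructions: int, target: TargetModel) -> int:
    """Map an instruction count measured on the host to simulated
    nanoseconds on the target, using the assumed constant-CPI model."""
    cycles = instructions * target.cycles_per_instruction
    return round(cycles * 1e9 / target.frequency_hz)


def advance_global_time(local_times_ns: Sequence[int]) -> int:
    """Assumed rule: global time advances to the smallest local time
    among all simulation threads waiting at the barrier."""
    return min(local_times_ns)


def may_proceed(local_time_ns: int, global_time_ns: int) -> bool:
    """A thread continues when global time == local time;
    it keeps waiting while global time < local time."""
    return local_time_ns == global_time_ns


# Two simulated cores that executed different amounts of work.
risc = TargetModel(frequency_hz=100e6, cycles_per_instruction=1.5)
local_times = [warp_time_ns(2_000_000, risc), warp_time_ns(1_000_000, risc)]
global_time = advance_global_time(local_times)
ready = [i for i, t in enumerate(local_times) if may_proceed(t, global_time)]
```

In this sketch the second thread has the smaller local time, so it alone is released after the global time advances; the first thread waits until the global time catches up with its own local time.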
2.2.3 Results
            
           
    

     
 

            
 
            
              
           
  
  

  
    
       

 
    

  
4×4

16×16
 
       

    
             
        

    
        
2.2.4 Key Papers
DAC ’15
page 87ff.
Roloff, Schafhauser, Hannig, and Teich. “Execution-driven parallel simulation
of PGAS applications on heterogeneous tiled architectures”
[P24]
        
        
      
        

        
         
         
        
         
         
           
          
X10 ’16
page 93ff.
Roloff, Pöppl, Schwarzer, Wildermann, Bader, Glaß, Hannig, and Teich. “ActorX10:
An actor library for X10”
[P16]
       
         
           
 

       
           
         
          
          
         
       

     
ESTIMedia ’17
page 99ff.
Roloff, Hannig, and Teich. “High performance network-on-chip simulation by
interval-based timing predictions”
[P4]
     

   
         

 
        
         
     it-by-it
a
  
    

    
        
a
     flow control digit       
  
           
           
           
            
          

     
2.3 Architecture/Compiler Co-Design of Invasive Tightly Coupled Processor Arrays
             
     

      
           
          

      compiler-friendly architectures    architecture-
friendly / retargetable design tools and compilers       
           
   

  

   
            
           

 
            
            
            
 
2.3.1 Challenges
           
           
             
          
           
      

 
        

    
      

        
          
2.3.2 Approach
           

            
         

 


   invasion controller   i       
      

       
   
       
         


           
 

[Figure: tiled architecture (tiles with CPU, iCore, TCPA, Memory, and I/O blocks, network adapters (NA), and NoC routers) with a detail view of a TCPA tile. Labels in the detail view: an array of PEs, each with an iCtrl; four I/O Buffers; four groups of IM, GC, and AG blocks; Configuration Manager; Conf. & Com. Processor (LEON3); IRQ Ctrl.; Advanced High-performance Bus (AHB).]
            
        

     
            
         
        

    

        
          

   
     
    
        
 
  
       
    
          
           
             
  

         
         
 
             

           
          


       
            
       
 orthotope              
        

     
         
           
     

  

  processor classes 
               
             
              
             
          
 
 
          
              
   
2.3.3 Results
 
         

 
   

       
          
  

  

  



 
      

   
           
           
      
 
 

  
   
           
         
               
          
     
      
             
  

         
atomic iterations
          
           
  
        
 

      
     
  

        
Atomic execution   tile            
             
           
      
Atomic iteration                 
  


          

          
  

          
              
         
       


      
  
           
          
             
        
2.3.4 Key Papers
              
           
     
ACM TECS ’14
page 109ff.
Hannig, Lari, Boppu, Tanase, and Reiche. “Invasive tightly-coupled processor
arrays: A domain-specific architecture/compiler co-design approach”
[J18]
        


         
        
       
          
       
         


     
        
             
       
RSP ’17
page 139ff.
Witterauf, Hannig, and Teich. “Constructing fast and cycle-accurate simulators
for configurable accelerators using C++ templates”
[P5]
          
       

  
        
          
            
        

     
       
 
 

      
Springer JSPS ’14
page 147ff.
Teich, Tanase, and Hannig. “Symbolic mapping of loop programs onto processor
arrays”
[J15]
           
         
           
         
 

   
          
        
        
        
       
          
           
   
MEMOCODE ’14
page 177ff.
Tanase, Witterauf, Teich, and Hannig. “Symbolic inner loop parallelisation for
massively parallel processor arrays”
[P32]
          
        
   

    
         
        
        
         
           
       
          
          
        
 
ACM TECS ’17
page 187ff.
Tanase, Witterauf, Teich, and Hannig. “Symbolic multi-level loop mapping of
loop programs for massively parallel processor arrays”
[J1]

           
          
          



       
         
          
   

       
           
       
        
ASAP ’16
page 215ff.
Witterauf, Tanase, Hannig, and Teich. “Modulo scheduling of symbolically tiled
loops for tightly coupled processor arrays”
[P15]
         
         
             
  

     
         
           
       
          
            
  
Springer JSPS ’14
page 225ff.
Boppu, Hannig, and Teich. “Compact code generation for tightly-coupled processor
arrays”
[J17]
            
           
  

      
  

     
         
            
         
         
          
    
   

      
               
        


          


           
            

     
   path strides           
            
           
        


 
          
    

    
  

3 Domain-specic Computing
             
             
         
            
             
         programmability gap 
               
            
           
  performance  generality   productivity      
            

Performance
           
           
             
           
           
             
        
Generality,
    expressiveness     
              
    general-purpose      
           
         
 
          
            
       
Productivity
         
         
         
           

   
Turing completeness  computational universality        
           

3. Domain-specific Computing
[Figure: languages positioned with respect to Performance, Generality, and Productivity; labels: C / C++, Ruby, Matlab, Domain-specific Languages]
         
         
 
            
            
          
            
 
           
           

3.1 Domain-specific Languages
              
        machine independence  
          
           
   problem orientation   

  
        Design of Real-Time Computer Systems
           
          
           
     natural        
             
           
  problem-oriented languages     
           
    libraries       
          
            
           
     
Programming Languages: History and Fundamentals
  problem-oriented        
         
           knowledge 
             domain 
           
   domain knowledge      
Domain-Specific programming Languages (DSLs)

   
            
           
         

   
         
     

       
          
         
  

        
             
   

      
    
          
      math         
array programming language            
         

           
             
         


          
     

      
          
    
3.1.1 Definition
         

 
        
domain-specific language
        
          
     
       
        
         
        
     
Domain-Specific Languages

       
        
            
           
      
particular field of application  domain
  abstractions  notations
small          
     
Limited expressiveness       


  
 declarative  
Nature of a programming language       
            
 
      

        
Specification languages        little languages
micro-languages  minilanguages        task-specific programming languages      Very High-Level programming Languages (VHLLs)
         

special purpose languages     languages for specialized application areas
              
     
3.1.2 Classification of DSLs
    

    textual  graphical  

        
              

       

     
  

         
            
   
      

       
  

    

    
    internal  external 
external DSL
         


 

         
       

   
             
           
           
internal DSL
       
   host language 

        
           
             

high-level programming languages
       

         meaning     
            

[Figure: two panels, each relating Domain-specific Language, Domain-specific Extensions, and Host Language]
     

 Extension       
               
   Reduction         
          
             
     

    

 embedded       

embedded DSL
   

   extensions   
               
            
           
              
        

   
               
              

           reduction
          

    
        
       

   
   
  

        

          

              
   

     

 
         
               
             

   
3.2 Domain-specific High-level Synthesis
             
            
            
           

 
             
polyhedron model      

    
          ane loop nests  
           
           
             
             
dynamic piecewise linear/regular algorithms
              
recurrence equations             system of uniform recurrence equations
               
               
                
        
constant and variable propagation, common subexpression elimination, loop perfectization, dead code elimination, strength reduction of operators, (partial) unrolling of l