DFGenTool: A Dataflow Graph Generation Tool for
Coarse Grain Reconfigurable Architectures
Manideepa Mukherjee, Alexander Fell, Apala Guha
New Delhi, India
Email: {manideepam, alex, apala}
Abstract—In this paper, DFGenTool, a dataflow graph (DFG)
generation tool, is presented, which converts loops in a sequential program given in a high-level language such as C, into a DFG. DFGenTool adapts DFGs for mapping to Coarse Grain Reconfigurable Architectures (CGRA) to enable a variety of CGRA implementations and compilers to be benchmarked against a standard set of DFGs. Several kernels have been converted and are presented in this paper as case studies. The output of DFGenTool is in DOT, a popular graph description standard which could be used with a variety of CGRA compilers. Furthermore, DFGenTool has been released as open-source.
I. INTRODUCTION
In Systems-on-Chip (SoC), the increasing demand for high computation capacity has driven advancements in General Purpose Processors (GPP) in terms of high clock frequencies, complex instruction execution units and distributed computing. In addition, the aspect of mobility has emerged, in which battery-operated devices are also equipped with GPPs. To provide sufficient computation capability and to prolong the battery life of mobile devices, these processors are surrounded by Application Specific Integrated Circuits (ASIC) dedicated to specific tasks. The manufacture of ASICs incurs high Non-Recurring Engineering (NRE) costs and long time-to-market periods, so that high volumes are required to remain profitable. Coarse-Grained Reconfigurable Architectures (CGRA) are an emerging class of reconfigurable architectures. They are similar to FPGAs; however, the LUTs have been replaced by Processing Units (PU) and the interconnect by a packet-based segmented bus system called Network-on-Chip (NoC). A PU consists of a local memory, an Arithmetic Logic Unit (ALU) and an interface to the router that establishes connectivity to other PUs through the NoC. The set of PUs and the interconnect form the Execution Fabric, which provides the required computational capacity. In contrast to FPGAs, CGRAs are not programmed at the bit level but at the level of instructions, due to the presence of ALUs. Therefore a high-level language such as C can be used to implement algorithms, which are then compiled into instructions understood by the CGRA. To execute a sequential C program on a CGRA, a Data Flow Graph (DFG) is generated in which the vertices are instructions and the edges represent the dependencies among these instructions.
A DFG visualizes the Instruction Level Parallelism (ILP) in CGRAs. The DFG needs to be transformed to match the structure given by the Execution Fabric. Scheduling instructions temporally and mapping them spatially onto PUs has been the focus of much research, and many heuristics exist, such as [7], [8], [9], [10]. On the architectural side, many CGRAs have been proposed to date, such as ADRES [14], KressArray [11], Layered CGRA [19], [20], PACT XPP [5], [17] and REDEFINE [3]. For testing and benchmarking, algorithms have been adopted, compiled and executed for each of the platforms by the respective authors. However, it is difficult to compare performance in terms of execution time and power consumption across CGRAs because the code used by each research group is not available to the rest of the community. To alleviate the situation, we propose a DFG generator that converts an algorithm written in C/C++ into a DFG, which the community can use to benchmark their respective platforms.
We implemented our tool in LLVM [12], which is a popular, open-source compiler platform. We provide detailed documentation and maintain the tool in a GitHub repository so that it can be used and extended by the community. Furthermore, our tool produces the DFGs based on LLVM Intermediate Representation (IR) code, which has two advantages. The first advantage is that IR code is not tied to a particular front-end or high-level language; rather, IR code can be produced from any high-level language for which LLVM already provides a front-end. Since the LLVM front-ends cover several popular high-level languages and more may be added in the future, the tool can work with a large set of languages. The second advantage is that IR code is not tied to a particular back-end, and yet resembles the instruction set architecture of most hardware platforms, i.e. there are analogous hardware instructions for most IR instructions in common architectures. In addition, IR lacks register bindings, thereby providing a higher-level abstraction for data. Finally, we output the DFGs in the dot format [2], which is amenable to visualization as well as automated processing by popular graph processing software such as the Boost C++ Libraries [1] and Python NetworkX.
The key contributions that we make in this paper are the following:
• A tool to build DFGs from LLVM IR.
• Documentation and maintenance of the tool in a GitHub repository.
• Optimizations that adapt DFGs for CGRAs.
• An output module that stores graphs in the dot format.
• A set of DFGs of kernels in the Polybench 3.2 linear algebra [18] benchmark suite.
The rest of the paper is organized as follows: Section II discusses related work, section III continues with the implementation process and the optimizations tailoring DFGs for use in CGRAs, and results are shown in section IV, followed by the conclusion.
II. RELATED WORK
The paper in [4] describes a DFG generation method that transforms C code into a dataflow graph. The GNU Compiler Collection (GCC) is used to create an intermediate representation called GIMPLE [15], which is a reduced subset of the intermediate code GENERIC [15]. GIMPLE creates blocks of functions connected through control edges, and a DFG is created for each block. Although this method can be used to transform C code into a dataflow graph, it will not directly work for CGRAs, as the optimizations related to CGRAs (such as Φ node removal) have not been addressed. In addition, the user is forced to use the GCC compiler if GIMPLE is used, whereas DOT does not have this limitation. The paper in [13] gives an algorithm for automatic extraction of coarse-grained data-flow threads from imperative programs. It is a thread-based dataflow graph generation algorithm targeted at general purpose processors. Although it would be interesting to map these coarse-grain dataflow graphs onto CGRAs, choosing the optimum thread size that fits in the memory of a PU needs further research. Moreover, the algorithm is limited to scalar data only. Turbine [6] and DFTools [16] are DFG generation tools available online as open source. They generate a DFG randomly from parameters such as the number of nodes required, the input and output edges, their weights, etc. These tools cannot generate a DFG based on an algorithm and are therefore not suitable for benchmarking specific applications.
The purpose of this paper is to provide an open source
DFG generation tool that can be used to generate DFGs for
analysis, mapping, transformation and comparison of results.
III. IMPLEMENTATION
The goal of this work is to design a DFG generator tool that 1) is open-source, 2) works for multiple high-level languages, 3) is adapted to CGRAs, 4) is independent of the characteristics of a particular CGRA chip, 5) outputs results in a popular stand-alone format, and 6) is easily expandable.
Figure 1 shows the schematic diagram of the DFG generation tool from high-level C/C++ code to the dataflow graph output. The tool is composed of four parts. The first part generates intermediate code from high-level C/C++ code using the Clang compiler and is briefly discussed in section III-A. The second part, DFG generation, extracts the inner loop and forms the initial graph (section III-B). The third part applies two optimizations specific to CGRAs (section III-C). The last part, described in section III-D, converts the graph to a dot file as output.
A. Intermediate Code Generation
LLVM has a front-end compiler called Clang that translates high-level C/C++ code to Intermediate Representation (IR) code (.ll/.bc). We build our tool at the IR code level to make it platform independent. Listing 1 is an example that will be used to demonstrate DFGenTool.
[Figure 1 not reproduced. Recoverable labels: .c/.cpp; DFG Generation; CGRA Optimization; Inner Loop Extraction; Node Formation; Edge Insertion; Phi Removal; GEP Removal; Print Graph]
Fig. 1: Schematic diagram of DFG generation tool
    for (i = 0; i < n; i++) {
        if (i % 2 == 0)
            data[i] = i;
        else
            data[i] = 2 * i;
        sum = sum + data[i];
    }
Listing 1: A simple C code
Listing 2 shows the generated intermediate code for listing
1. It consists of four basic blocks with unique identifiers
(labels) 7,10,13 and 17. The first basic block contains the
intermediate code for the if condition. The second block
assigns values to data[i], if the condition is true, while
block 13 handles the assignment for the else branch. Block 17
is the intermediate code for the final summation. The generated
DFG for this intermediate code is shown in figure 2.
B. DFG Generation Tool
In this section the DFG generation tool is described in detail. As most scientific and linear algebra kernels spend a significant amount of their execution time in loops, different loop optimizations are used to improve loop performance [18]. Since the innermost loop has the largest number of iterations, any optimization applied at this loop level has a high impact on the overall performance of the kernel. DFGenTool therefore first extracts the innermost loop or function from the intermediate code using LLVM, after which it forms the nodes and edges. However, DFGenTool is not limited to loops; it can also be used to generate the DFG for a complete function in a program.
1) Node Formation: Each instruction in the inner loop is represented by a node in the graph. Most instructions have two source operands and one destination operand. Since LLVM intermediate code is in static single assignment (SSA) form, the destination operand (if present) uniquely identifies the instruction. Therefore, a source operand of an instruction can itself be an instruction node. If the source operands are not instruction nodes, new operand nodes are created as data nodes.
To increase the readability of the graph, different types of nodes are given distinguishable shapes and border lines. The different shapes and borders used for DFG nodes are shown in table I.
    ; <label>:7                    ; preds = %.lr.ph, %17
      %indvars.iv = phi i64 [ 0, ], [ , %17 ]
      %sum.02 = phi i32 [ 0, ], [ %20, %17 ]
      %8 = and i64 %indvars.iv, 1
      %9 = icmp eq i64 %8, 0
      br i1 %9, label %10, label %13
    ; <label>:10                   ; preds = %7
      %11 = getelementptr inbounds [1000 x i32]* %data, i64 0, i64 %indvars.iv
      %12 = trunc i64 %indvars.iv to i32
      store i32 %12, i32* %11, align 4, !tbaa !1
      br label %17
    ; <label>:13                   ; preds = %7
      %14 = trunc i64 %indvars.iv to i32
      %15 = shl nsw i32 %14, 1
      %16 = getelementptr inbounds [1000 x i32]* %data, i64 0, i64 %indvars.iv
      store i32 %15, i32* %16, align 4, !tbaa !1
      br label %17
    ; <label>:17                   ; preds = %13, %10
      %18 = getelementptr inbounds [1000 x i32]* %data, i64 0, i64 %indvars.iv
      %19 = load i32* %18, align 4, !tbaa !1
      %20 = add nsw i32 %19, %sum.02
      = add nuw nsw i64 %indvars.iv, 1
      %21 = trunc i64 to i32
      %22 = icmp slt i32 %21, %6
      br i1 %22, label %7, label %._crit_edge
Listing 2: Intermediate code generated for listing 1
Node Type     Node Shape    Border Line
                            Integer    Float     Vector
Compute       Circular      Single     Double    Triple
Load/store    Octagon       Single     Double    Triple
Data          Rectangle     Single     Double    Triple
TABLE I: Node types with their shapes and border lines
2) Edge Insertion: Instructions in the IR code can be
dependent on other instructions. Data edges are added between
the instruction that generates a result (producer), and the
instruction which uses that result (consumer). Dependences
between load and store instructions are discovered using the
LLVM alias analysis module. In addition control edges are
inserted between branch instructions and the instructions that
are control-dependent on these branches. In the DFG generated
by our tool, true dependencies are denoted by dotted lines,
while false dependencies are depicted by dashed lines.
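The node formation and data edge insertion steps described above can be illustrated with a minimal Python sketch. It is not the LLVM-based implementation of DFGenTool; the tuple encoding of instructions, the function name build_dfg and the sample instructions (loosely modeled on block 17 of listing 2) are assumptions made for illustration only. Memory and control dependencies are not covered.

```python
# Hypothetical, simplified encoding of three IR instructions:
# (destination or None, opcode, list of source operands).
INSTRUCTIONS = [
    ("%18", "getelementptr", ["%data", "0", "%indvars.iv"]),
    ("%19", "load", ["%18"]),
    ("%20", "add", ["%19", "%sum.02"]),
]


def build_dfg(instructions):
    """Return (nodes, edges); nodes maps a name to its kind, edges are (producer, consumer)."""
    nodes = {}
    edges = []
    # First pass: one node per instruction. In SSA form the destination
    # operand identifies the instruction; otherwise a synthetic name is used.
    names = []
    for index, (dest, opcode, _) in enumerate(instructions):
        name = dest if dest is not None else f"{opcode}#{index}"
        names.append(name)
        nodes[name] = opcode
    # Second pass: one data edge per source operand. Operands that are not
    # produced by an instruction in the list become data nodes.
    for name, (_, _, sources) in zip(names, instructions):
        for src in sources:
            if src not in nodes:
                nodes[src] = "data"
            edges.append((src, name))
    return nodes, edges


nodes, edges = build_dfg(INSTRUCTIONS)
for producer, consumer in edges:
    print(producer, "->", consumer)
```

In this sketch %18 and %19 are instruction nodes, while %data, 0, %indvars.iv and %sum.02 become data nodes because no instruction in the list defines them.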
C. Optimizations for CGRAs
In this section we describe two CGRA-targeted optimizations: Φ node modification and GEP instruction expansion.
1) Φ Node Modification: LLVM uses a static single assignment (SSA) based intermediate representation. SSA representation requires that each variable is defined before use and assigned exactly once. Therefore, for a variable in the original high-level code that is assigned on divergent control paths, multiple copies are created. At control merge points, computation may have to choose between values that were computed by different paths leading to the merge point. Φ nodes represent such merging of values. Φ nodes cannot be mapped onto the CGRA, as there is no corresponding hardware instruction for the Φ operator. Therefore, this node has to be modified in the DFG in such a way that all the original data dependencies are preserved. The Φ node is changed into a zero-latency no-operation (NOP) node, which produces its output as soon as any one of its inputs is available.
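The effect of this modification can be sketched in Python as follows. The dictionary layout and the attribute names op, latency and fire_on are hypothetical and chosen only for this illustration; the sketch shows one possible reading of a NOP that "gives output if any of the input is available", not the tool's actual data structures.

```python
def relabel_phi_nodes(nodes):
    """Return a copy of nodes in which every phi node becomes a zero-latency NOP."""
    result = {}
    for name, attrs in nodes.items():
        if attrs["op"] == "phi":
            # The NOP keeps the node name, so all existing edges stay valid.
            result[name] = {"op": "nop", "latency": 0, "fire_on": "any"}
        else:
            result[name] = dict(attrs, fire_on="all")
    return result


def can_fire(name, nodes, edges, available):
    """A NOP fires when any producer is available, other nodes when all are."""
    producers = [src for src, dst in edges if dst == name]
    ready = [src in available for src in producers]
    if nodes[name]["fire_on"] == "any":
        return any(ready)
    return all(ready)


# Hypothetical fragment around %sum.02 of listing 2.
nodes = {
    "0": {"op": "data", "latency": 0},
    "%19": {"op": "load", "latency": 1},
    "%sum.02": {"op": "phi", "latency": 1},
    "%20": {"op": "add", "latency": 1},
}
edges = [("0", "%sum.02"), ("%20", "%sum.02"), ("%19", "%20"), ("%sum.02", "%20")]
modified = relabel_phi_nodes(nodes)
print(can_fire("%sum.02", modified, edges, {"0"}))
```

Because only node attributes change, the edge list is untouched and all original data dependencies are preserved.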
2) GetElementPtr (GEP) Instruction Expansion: LLVM uses getelementptr (GEP) instructions to compute the address of a sub-element of an aggregate data structure. A GEP instruction can be arbitrarily large, with an undefined number of dereferences. However, a corresponding hardware instruction to represent a GEP is nonexistent, and therefore it needs to be broken up into a set of known operations. The first argument of getelementptr is the data type, while the second argument is the base pointer. All the remaining arguments are indices to be used by the respective dereferences.
Algorithm 1 shows pseudo code that expands a GEP instruction by inserting additions and a multiplication for static two-dimensional arrays. The function getAddress returns a set of source nodes for the base address of the structure to be accessed, while getOffset and getElementIndex return the nodes computing the offset and the element index respectively. The multiplication node multiplies the width of the data type (sizeofelement) by the desired index. Node add2 adds the offset to this product, and node add1 adds the base address to that sum to obtain the absolute address of the element. The edges are attached to the newly created compute nodes accordingly. Finally, the succeeding nodes of the GEP node are reattached to node add1 before the GEP node is removed. Figure 3 shows the generated DFG after expansion of the GEP instructions of listing 2. The shaded nodes are the newly added addition and multiplication nodes.
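As an illustration of Algorithm 1, the following Python sketch performs the same rewiring on a small graph. The function name expand_gep, the roles dictionary (standing in for getAddress, getOffset and getElementIndex) and the naming scheme of the new nodes are assumptions introduced here, not part of DFGenTool.

```python
def expand_gep(nodes, edges, gep, roles):
    """Replace node `gep` by add1, add2, mult and sizeofelement nodes.

    nodes: dict mapping node name to opcode (modified in place).
    edges: set of (source, destination) pairs.
    roles: maps each parent of `gep` to "address", "offset" or "index".
    """
    parents = [src for src, dst in edges if dst == gep]
    children = [dst for src, dst in edges if src == gep]
    add1, add2, mult, size = (
        f"{gep}.{suffix}" for suffix in ("add1", "add2", "mult", "sizeofelement")
    )
    nodes.update({add1: "add", add2: "add", mult: "mul", size: "data"})
    target = {"address": add1, "offset": add2, "index": mult}
    # Drop every edge that touches the GEP node, then rewire.
    new_edges = {(s, d) for s, d in edges if gep not in (s, d)}
    for parent in parents:
        new_edges.add((parent, target[roles[parent]]))
    new_edges |= {(size, mult), (mult, add2), (add2, add1)}
    for child in children:
        new_edges.add((add1, child))
    del nodes[gep]
    return nodes, new_edges


# Hypothetical graph fragment for %18 of listing 2.
nodes = {"%data": "data", "0": "data", "%indvars.iv": "phi",
         "%18": "getelementptr", "%19": "load"}
edges = {("%data", "%18"), ("0", "%18"), ("%indvars.iv", "%18"), ("%18", "%19")}
roles = {"%data": "address", "0": "offset", "%indvars.iv": "index"}
nodes, edges = expand_gep(nodes, edges, "%18", roles)
for edge in sorted(edges):
    print(edge)
```

After the call, the load %19 is fed by %18.add1, which corresponds to the reattachment of the succeeding nodes to add1 described above.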
[Figure 2 not reproduced: graph drawing of the DFG]
Fig. 2: The DFG generated by the tool for the intermediate code listing 2
D. Print Graph
The output of DFGenTool is in the dot format which is
human readable and can be used to process the DFG further
to adapt it to the desired architecture.
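A minimal sketch of such an output step is given below. The mapping of table I onto Graphviz attributes is an assumption: circle, octagon and box are used for the three node shapes, the peripheries attribute stands in for single, double and triple border lines, and the dotted and dashed edge styles follow the description in section III-B. The function name to_dot and the tuple layout are hypothetical.

```python
SHAPES = {"compute": "circle", "memory": "octagon", "data": "box"}
PERIPHERIES = {"integer": 1, "float": 2, "vector": 3}
EDGE_STYLES = {"true": "dotted", "false": "dashed"}


def to_dot(nodes, edges, name="dfg"):
    """Serialize a DFG to DOT text.

    nodes: dict mapping node name to (kind, data type).
    edges: list of (source, destination, dependency kind).
    """
    lines = [f"digraph {name} {{"]
    for node, (kind, dtype) in nodes.items():
        # Names such as %19 are quoted because they are not plain DOT identifiers.
        lines.append(
            f'  "{node}" [shape={SHAPES[kind]}, peripheries={PERIPHERIES[dtype]}];'
        )
    for src, dst, dep in edges:
        lines.append(f'  "{src}" -> "{dst}" [style={EDGE_STYLES[dep]}];')
    lines.append("}")
    return "\n".join(lines)


print(to_dot({"%19": ("memory", "integer"), "%20": ("compute", "integer")},
             [("%19", "%20", "true")]))
```

The resulting text can be rendered with Graphviz or loaded by any graph library that reads DOT files.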
IV. RESULTS
In this section we present results obtained by generating DFGs for common kernels taken from the Polybench 3.2 linear algebra benchmark [18] and a case study of the benchmark kernel 2mm. Table II shows, for selected kernels, the number of inner loops, the total size of the DFGs in nodes, the size of the largest DFG, and the numbers of GEP nodes and Φ nodes of the generated DFGs. The source code of each kernel has been compiled by the Clang compiler at the O2 optimization level. The table reveals that for these kernels the number of loops and the size of the DFGs are quite large, making it tedious to create the corresponding DFGs manually and to perform any transformation on them. By using our tool, DFGs for a large number of kernels can be generated.
A. Case study: 2mm
In the case study, the generated DFG for kernel 2mm is discussed in detail. It targets two matrix multiplications, D=A.B and E=C.D, in which the second multiplication depends on the result of the first one. For clarity, however, we discuss the DFG generation only for the first inner loop (see listing 3). Figure 4 depicts the resulting DFG after the GEP node has been expanded for the intermediate code in listing 3.
Benchmark   Inner Loops   Total size of DFGs   Largest DFG   Total GEP nodes   Total Φ nodes
atax        3             71                   29            4                 6
cholesky    4             92                   33            6                 6
gemm        5             115                  33            6                 6
gemver      8             212                  39            18                10
trmm        3             81                   33            5                 4
symm        4             107                  33            7                 5
3mm         8             182                  33            11                11
2mm         7             161                  33            9                 8
TABLE II: Comparison of generated DFGs for various kernels
This case study shows that, even for this small kernel, a single loop contains two Φ nodes and two getelementptr nodes. The complexity of the edges to be handled increases with the number of these nodes, and it is tedious to perform the modifications manually when the number of these nodes is large.
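The counts quoted for this loop can be checked by tallying the opcodes of listing 3. The sketch below does so on an abbreviated copy of the listing in which the operands have been omitted; the helper opcode_of and the abbreviation are assumptions made for illustration and do not reflect how DFGenTool gathers the statistics of table II.

```python
from collections import Counter

# Instructions of listing 3 with operands omitted for brevity.
LISTING_3 = """\
%52 = phi
%indvars.iv15.i11 = phi
%53 = getelementptr
%54 = load
%55 = fmul
%56 = getelementptr
%57 = load
%58 = fmul
%59 = fadd
store
%indvars.iv.next16.i12 = add
%exitcond17.i13 = icmp
br
"""


def opcode_of(line):
    """Return the opcode of one IR instruction line."""
    tokens = line.split()
    # "dest = opcode ..." for instructions with a result, "opcode ..." otherwise.
    if len(tokens) > 2 and tokens[1] == "=":
        return tokens[2]
    return tokens[0]


counts = Counter(opcode_of(line) for line in LISTING_3.splitlines() if line.strip())
print(counts["phi"], counts["getelementptr"], sum(counts.values()))
```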
V. CONCLUSION
In this paper we presented an open-source DFG generation tool which converts C code into DFGs automatically. The tool addresses researchers working in the domain of CGRAs who would like to compare the performance of various kernels across CGRAs easily. Because the DFG is presented in the standard, human-readable dot format, its dependencies can be processed further by scheduling and mapping algorithms.
[Figure 3 not reproduced: graph drawing of the DFG]
Fig. 3: Final DFG after GEP node expanded
    ; <label>:51                   ; preds = %51, %49
      %52 = phi double [ 0.000000e+00, %49 ], [ %59, %51 ]
      %indvars.iv15.i11 = phi i64 [ 0, %49 ], [ %indvars.iv.next16.i12, %51 ]
      %53 = getelementptr inbounds [1024 x double]* %6, i64 %indvars.iv22.i, i64 %indvars.iv15.i11
      %54 = load double* %53, align 8, !tbaa !1
      %55 = fmul double %54, 3.241200e+04
      %56 = getelementptr inbounds [1024 x double]* %17, i64 %indvars.iv15.i11, i64 %-
      %57 = load double* %56, align 8, !tbaa !1
      %58 = fmul double %55, %57
      %59 = fadd double %52, %58
      store double %59, double* %50, align 8, !tbaa !1
      %indvars.iv.next16.i12 = add nuw nsw i64 %indvars.iv15.i11, 1
      %exitcond17.i13 = icmp eq i64 %indvars.iv.next16.i12, 1024
      br i1 %exitcond17.i13, label %60, label %51
Listing 3: Generated intermediate code for 2mm.c
Therefore the DFG generation tool not only allows the performance of several CGRA implementations to be compared, but also allows the impact of various scheduling and mapping methods to be estimated for a specific CGRA.
REFERENCES
[1] Boost C++ Libraries. online:
[2] Graphviz - Graph Visualization Software. online: http://www.graphviz.
[3] Mythri Alle, Keshavan Varadarajan, Alexander Fell, C. Ramesh Reddy,
Nimmy Joseph, Saptarsi Das, Prasenjit Biswas, Jugantor Chetia, Adarsh
Rao, S. K. Nandy, and Ranjani Narayan. REDEFINE: Runtime
Reconfigurable Polymorphic ASIC. ACM Transactions on Embedded
Computing Systems, 2009.
[4] Péter Arató and Gergely Suba. A data flow graph generation method
starting from c description by handling loop nest hierarchy. In 9th
IEEE International Symposium on Applied Computational Intelligence
and Informatics, SACI 2014, Timisoara, Romania, May 15-17, 2014,
pages 269–274, 2014.
[5] Volker Baumgarten, G. Ehlers, Frank May, Armin Nückel, Martin Vorbach, and Markus Weinhardt. PACT XPP - A Self-Reconfigurable Data
Processing Architecture. The Journal of Supercomputing, 26(2):167–
184, 2003.
[6] Bruno Bodin, Youen Lesparre, Jean-Marc Delosme, and Alix Munier
Kordon. Fast and efficient dataflow graph generation. In 17th Inter-
national Workshop on Software and Compilers for Embedded Systems,
SCOPES ’14, Sankt Goar, Germany, June 10-11, 2014, pages 40–49,
[7] A. Fell, Z.E. Rakossy, and A. Chattopadhyay. Force-directed scheduling for Data Flow Graph mapping on Coarse-Grained Reconfigurable Architectures. In ReConFigurable Computing and FPGAs (ReConFig), 2014 International Conference on, pages 1–8, Dec 2014.
Algorithm 1 Algorithm for GEP Node Expansion
    for all node ∈ DFG do
        if node = GEP node then
            P ← Parent(node)
            C ← Child(node)
            addNode(add1)
            addNode(add2)
            addNode(mult)
            addNode(sizeofelement)
            for all p ∈ P do
                if p ∈ getAddress(node) then
                    addEdge(p, add1)
                end if
                if p ∈ getOffset(node) then
                    addEdge(p, add2)
                end if
                if p ∈ getElementIndex(node) then
                    addEdge(p, mult)
                end if
                addEdge(sizeofelement, mult)
                addEdge(mult, add2)
                addEdge(add2, add1)
                deleteEdge(node, p)
            end for
            for all c ∈ C do
                addEdge(add1, c)
                deleteEdge(node, c)
            end for
            deleteNode(node)
        end if
    end for
[8] Stephen Friedman, Allan Carroll, Brian Van Essen, Benjamin Ylvisaker,
Carl Ebeling, and Scott Hauck. SPR: an architecture-adaptive CGRA
mapping tool. In Proceedings of the ACM/SIGDA international sympo-
sium on Field programmable gate arrays, FPGA ’09, pages 191–200,
New York, NY, USA, 2009. ACM.
[9] Mahdi Hamzeh, Aviral Shrivastava, and Sarma Vrudhula. EPIMap:
Using Epimorphism to Map Applications on CGRAs. In Proceedings
of the 49th Design Automation Conference (DAC), 2012.
[10] Mahdi Hamzeh, Aviral Shrivastava, and Sarma B. K. Vrudhula.
REGIMap: Register-Aware Application Mapping on Coarse-Grained
Reconfigurable Architectures (CGRAs). In The 50th Annual Design
Automation Conference 2013, DAC ’13. ACM, May 29–June 7, 2013.
[11] R. Hartenstein, M. Herz, Th. Hoffmann, and U. Nageldinger. Genera-
tion of Design Suggestions for Coarse-Grain Reconfigurable Architec-
tures, pages 389–399. Springer Berlin Heidelberg, Berlin, Heidelberg,
[12] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for
Lifelong Program Analysis & Transformation. In Proceedings of the
2004 International Symposium on Code Generation and Optimization
(CGO’04), Palo Alto, California, Mar 2004.
[13] F. Li, A. Pop, and A. Cohen. Automatic extraction of coarse-grained
data-flow threads from imperative programs. IEEE Micro, 32(4):19–31,
July 2012.
[14] Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, and
Rudy Lauwereins. ADRES: An Architecture with Tightly Coupled
VLIW Processor and Coarse-Grained Reconfigurable Matrix. In Peter
Y. K. Cheung, George A. Constantinides, and José T. de Sousa, editors,
Field Programmable Logic and Application, 13th International Confer-
ence, FPL 2003, Lisbon, Portugal, September 1-3, 2003, Proceedings,
volume 2778 of Lecture Notes in Computer Science, pages 61–70.
Springer, 2003.
[Figure 4 not reproduced: graph drawing of the DFG]
Fig. 4: Final DFG generated after expansion of GEP nodes for the IR code in listing 3
[15] Jason Merrill. GENERIC and GIMPLE: A New Tree Representation
for Entire Functions. In Proceedings of the 2003 GCC Developers’
Summit, pages 171–179. Citeseer, 2003.
[16] M. Pelcat, K. Desnos, J. Heulot, C. Guy, J.-F. Nezan, and S. Aridhi.
Preesm: A dataflow-based rapid prototyping framework for simplifying
multicore dsp programming. In Education and Research Conference
(EDERC), 2014 6th European Embedded Design in, pages 36–40, 2014.
[17] Mihail Petrov, Tudor Murgan, Frank May, Martin Vorbach, Peter Zipf,
and Manfred Glesner. The XPP Architecture and Its Co-simulation
Within the Simulink Environment. In Field Programmable Logic and
Application, 14th International Conference, FPL, Leuven, Belgium,
volume 3203 of Lecture Notes in Computer Science, pages 761–770.
Springer, August 30 2004.
[18] Louis-Noël Pouchet. Polybench: The polyhedral benchmark suite.
online:, 2012.
[19] Zoltán Endre Rákossy, Tejas Naphade, and Anupam Chattopadhyay.
Design and Analysis of Layered Coarse-Grained Reconfigurable Ar-
chitecture. In International Conference on ReConFigurable Computing
and FPGAs (ReConFig), Cancun, Mexico, December 2012.
[20] Zoltán Endre Rákossy, Dominik Stengele, Axel Acosta-Aponte, Saumitra Chafekar, Paolo Bientinesi, and Anupam Chattopadhyay. Scalable
and Efficient Linear Algebra Kernel Mapping for Low Energy Con-
sumption on the Layers CGRA. 11th International Symposium on
Applied Reconfigurable Computing, Bochum, Germany, April 2015.