RuRot: Run-time Rotatable-expandable Partitions
for Efficient Mapping in CGRAs
Syed M. A. H. Jafri∗‡§, Guillermo Serrano†‡, Junaid Iqbal‡, Masoud Daneshtalab‡§, Ahmed Hemani§,
Kolin Paul¶, Juha Plosila‡, and Hannu Tenhunen‡§
Email: jafri@kth.se, guiserle@teleco.upv.es, junaid.iqbal@abo.fi, masdan@utu.fi, hemani@kth.se,
kolin.paul@gmail.com, juplos@utu.fi, hannu@kth.se
∗Turku Centre for Computer Science
†Universidad Politècnica de València, Spain
‡University of Turku, Finland
§Royal Institute of Technology, Sweden
¶Indian Institute of Technology, Delhi
Abstract—Today, Coarse Grained Reconfigurable Architec-
tures (CGRAs) host multiple applications, with arbitrary com-
munication and computation patterns. Compile-time mapping
decisions are neither optimal nor desirable to efficiently support
the diverse and unpredictable application requirements. As a
solution to this problem, recently proposed architectures offer
run-time remapping. The run-time remappers displace or expand
(parallelize/serialize) an application to optimize different parameters
(such as platform utilization). However, the existing remappers support
application displacement or expansion only in the horizontal or the
vertical direction. Moreover, most of the works only address dynamic
remapping in packet-switched networks and are therefore not applicable
to the CGRAs that exploit circuit-switching for low power and high
predictability. To enhance the
optimality of the run-time remappers, this paper presents a de-
sign framework called Run-time Rotatable-expandable Partitions
(RuRot). RuRot provides architectural support to dynamically
remap or expand (i.e. parallelize) the hosted applications in
CGRAs with circuit-switched interconnects. Compared to the state
of the art, the proposed design supports application rotation
(in clockwise and anticlockwise directions) and displacement (in
horizontal and vertical directions) at run-time. Simulation results
reveal that the additional flexibility significantly enhances device
utilization (by 50% on average for the tested applications). Synthesis
results confirm that the proposed remapper has negligible silicon
(0.2% of the platform) and timing (2 cycles per application) overheads.
I. INTRODUCTION AND MOTIVATION
The increasing processing and communication demands of
modern telecommunication applications coupled with a need
to reduce the non-recurring engineering costs have made re-
configurable architectures a popular implementation platform
[1]. Reconfigurable architectures can be classified on the basis
of granularity, i.e. the number of bits that can be explicitly
manipulated by the programmer. Coarse Grained Reconfigurable
Architectures (CGRAs) provide operator-level configurable functional
blocks, word-level datapaths, and very area-efficient routing switches.
Compared to fine-grained architectures (like FPGAs), CGRAs require
less configuration memory and configuration time (by two or more
orders of magnitude [2]). As a result, CGRAs enjoy a significant
reduction in area (from 66% to 99.06% [1]) and energy consumed per
computation (from 88% to 98% [1]), at the cost of a loss in flexibility
compared to bit-level operations. CGRAs have therefore been a subject
of intensive research over the last decade [1].
Today, platforms are required to simultaneously host multiple
applications with arbitrary inter-application communication and
concurrency patterns [3]. Prevailing technology trends (like the
utilization wall and the power wall) dictate the use of aggressive
optimization techniques. To enhance device utilization and energy
efficiency (for unpredictable scenarios), concepts like run-time
remapping [4], [3] and dynamic parallelism [5],
[6], [7] have been proposed. Run-time remapping changes the
physical placement of an application to reduce communication
[4], memory [8], and/or reconfiguration [9] costs. Dynamic
parallelism parallelizes an application to induce speedup and
generate additional time slacks that allow the platform to
operate at a lower voltage/frequency. However, the proposed remapping
techniques (in the CGRA domain) only allow an application to be
displaced or parallelized/serialized in either the horizontal or the
vertical direction. Moreover, they mostly target either packet-switched
Networks on Chip (NoCs) or FPGAs and are therefore not applicable to
most CGRAs (which employ circuit-switched interconnects for high
predictability and low power). To enhance the optimization potential of
existing works, this paper presents the Run-time Rotatable-expandable
Partitions (RuRot) framework. The proposed framework allows virtual
partitions to be created dynamically in a platform. Each partition can
be rotated (clockwise/anticlockwise) and expanded/contracted in the
horizontal and vertical directions.
To illustrate the motivation for our approach, consider Fig. 1,
which shows a CGRA with nine processing elements. Fig. 1
(a) shows the impact of dynamic rotation. Instance 1 in the
figure depicts a scenario, where all the processing elements
are occupied by the hosted applications (App1, App2, App3,
and App4). In Instance 2, App4 leaves and App5 requests
processing elements. Most of the existing run-time mappers
(in CGRAs with circuit-switched interconnects) support only
horizontal or vertical displacements and therefore will decline
the application request. As shown in Instance 3, the RuRot
framework will be able to accommodate App5 on the existing
resources, by rotating it. Fig. 1 (b) illustrates the benefits
of run-time mapping for application expansion. In the figure
App2 has three implementations, each with a different degree
of parallelism (serial, partially parallel, and parallel). We
assume that given the resources, a parallel implementation is
978-1-4799-3770-7/14/$31.00 ©2014 IEEE
2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV)
[Figure: (a) application rotation — three instances of a 3×3 CGRA hosting App1–App5; (b) diagonal application expansion — serial, partially parallel, and parallel implementations of App2.]
Fig. 1. Motivation for RuRot
preferred (since it allows the voltage and frequency to be lowered
[5]). In Instance 2 of the figure, sufficient resources to map the
parallel version of App2 are available. However, existing CGRA
remappers will be unable to parallelize the application, because
resource 5 (needed for the partially parallel version of App2) is not
free and they bind the mappings for the parallel versions at
compile-time [6], [5], [10]. As a solution to this problem, RuRot is
able to map the parallel version to resources 1 and 2 by performing
vertical expansion, as shown in Fig. 1 (b), Instance 3. It should be
noted that the same utilization efficiency can be achieved by storing
all the combinations at compile-time. However, as will be shown in
Sections V and VI, such compile-time storage incurs prohibitive
configuration memory overheads.
This paper is organized as follows. Section II presents a brief survey
of existing run-time mappers. Section III gives an overview of the
CGRA platform used in this paper. Section IV details the system-level
architectural modifications, hardware, and algorithm that implement
RuRot. In Section V, we formalize the potential benefits of our
methods. In Section VI, we evaluate the actual benefits and overheads
imposed by our method on an actual CGRA. Finally, in Section VII, we
summarize our contributions and suggest directions for future research.
II. RELATED WORK AND CONTRIBUTIONS
Application mapping is one of the most fundamental challenges in the
CGRA domain and has therefore been a subject of intensive research. In
this section, we discuss only the most prominent work related to our
approach. Broadly, a mapping technique can be classified on the basis
of adaptivity (static/dynamic mapping), the parameter to optimize
(energy, space, etc.), the targeted architecture (CGRA, NoC, FPGA,
etc.), and the flexibility offered (rotation, displacement, run-time
parallelism).
Traditionally, mapping decisions were taken by the compiler and
therefore offered no run-time adaptivity [11]. Static mappers enjoy
relaxed processing deadlines, which allow them to execute complex
algorithms such as modulo scheduling [12], [13] and affine loop
transformation [14] to efficiently exploit parallelism [15], [16].
Although they find optimal mappings, compile-time decisions are unable
to cope efficiently with the unpredictable scenarios found in many
real-world applications.
To deal with the unpredictability inherent in today's real-world
applications, recently proposed platforms offer run-time
mapping/remapping. However, most of the work on run-time remapping
targets packet-switched 2-D Networks on Chip (NoCs) [17], [18], [19].
Chou and Marculescu [17] proposed a run-time mapping strategy that
allows resources to be allocated at run-time based on user behavior.
Faruque et al. [18] presented a distributed approach to map
applications dynamically in a NoC. Holzenspies et al. [19] proposed a
run-time spatial application mapping strategy targeted at streaming
applications. In [20], a network congestion-aware run-time mapping
architecture was presented. To reduce communication load, [4] proposed
a run-time remapper that uses an adaptive task allocation algorithm.
Quan et al. [3] and Schor et al. [21] proposed scenario-based run-time
mapping. All these techniques target packet-switched NoCs and do not
support run-time parallelism or application rotation. Asad et al. [5],
[6] presented an architecture to realize run-time parallelism in
CGRAs. However, in the proposed method each element was tightly
coupled (similar to Instance 2 in Fig. 1 (b)). The only work that
deals with application rotation was presented by Compton et al. [22].
Their approach enhanced device utilization by minimizing the unusable
area created during reconfiguration, and it also allowed rotation.
However, Compton's approach was FPGA-centric, while ours is designed
for CGRAs. Moreover, Compton did not consider run-time application
expansion (parallelism), which is one of the main contributions of
this paper.
The related work reveals that most application mapping strategies (in
CGRAs) rely on compile-time decisions. The approaches that do address
run-time mapping target either packet-switched NoCs or FPGAs (in which
the interconnects are significantly different). Moreover, none of the
approaches allows generic run-time parallelism.
To cater for these problems, this paper makes three major
contributions:
1) We propose an architecture that allows an application to be
dynamically remapped in the horizontal, vertical, clockwise, and/or
anticlockwise direction;
2) We present a framework to parallelize an application at run-time.
Compared to state-of-the-art expansion techniques in CGRAs [6], [5],
[10] (where the mapping for each implementation is statically
determined), our approach allows an application to be parallelized
dynamically in the vertical or horizontal direction; and
3) We perform formal and gate-level analyses to evaluate the overheads
of implementing the proposed architecture.
III. SYSTEM OVERVIEW
We have chosen the Dynamically Reconfigurable Resource Array (DRRA)
[23] as a vehicle to experimentally evaluate the efficiency of RuRot
on a CGRA. Nevertheless, the results should be essentially applicable
to most grid-based CGRAs (with circuit-switched interconnects) as
well. DRRA is a CGRA template designed to support multiple DSP
standards on a single platform.
[Figure: the system controller on top of the DRRA storage layer and DRRA computation layer, with App1, App2, and App3 each occupying its own memory elements and cells.]
Fig. 2. Different applications executing in their private environments
As depicted in Fig. 2, DRRA is composed of three main components:
(i) a system controller, (ii) a computation layer, and (iii) a storage
layer. For each hosted application, a separate partition can be
created in the storage and computation layers. The partition is
optimal in terms of energy, power, and reliability [6], [24], [5].
Table I briefly describes the basic functionality of these components.
TABLE I
OVERVIEW OF DIFFERENT DRRA COMPONENTS

Component          | Functionality
System controller  | (i) Sends configware to the DRRA computation layer; (ii) creates memory partitions for each application
Computation layer  | Performs computations
DRRA storage layer | Stores data for the DRRA computation layer
A. System controller
Figure 3 illustrates the overall system architecture for man-
aging DRRA. The controlling intelligence is provided by a
Run-Time resource Manager (RTM). The RTM resides in the
LEON3 processor and has two main responsibilities: (i) to
configure DRRA by loading the binary from the configuration
memory, and (ii) to parallelize/serialize application tasks,
depending on the deadlines. Before our modifications, the
application-to-component mapping was done at compile-time. While the
architecture accommodated run-time parallelism, it did not allow an
implementation to be remapped dynamically (similar to Instance 3 in
Fig. 1 (b)). In this paper we present the architectural modifications
required to remap at run-time.
[Figure: the LEON3 processor and loader connected to the global configuration memory and DRRA.]
Fig. 3. System management layer of DRRA
B. DRRA Computation Layer
The DRRA computation layer is shown in Fig. 4. It is composed of four
elements: (i) Register Files (reg-files), (ii) morphable Data Path
Units (DPUs), (iii) circuit-Switched Boxes (SBs), and (iv) sequencers.
The reg-files store data for the DPUs. The DPUs are functional units
responsible for performing computations. The SBs provide
interconnectivity between the different components of DRRA. The
sequencers hold the configware which corresponds to the configuration
of the reg-files, DPUs, and SBs. Each sequencer can store up to 64
36-bit instructions and can reconfigure only the elements in its own
cell. As shown in Fig. 4, a cell consists of a reg-file, a DPU, SBs,
and a sequencer, all having the same row and column number as the cell
itself. The configware loaded in the sequencers contains a sequence of
instructions (reg-file, DPU, and SB instructions) that implements the
DRRA program.
[Figure: a 2×2 fragment of the DRRA computation layer (Cell0–Cell3); each cell contains a sequencer, a reg-file, a DPU, and SBs.]
Fig. 4. DRRA computation layer
C. DRRA Storage Layer (DiMArch)
DiMArch is a distributed scratch-pad (data/configware) memory that
complements DRRA with a scalable memory architecture. Its distributed
nature allows high-speed data and configware access for the DRRA
computation layer (compared to the global configuration memory).
Further discussion of DiMArch is beyond the scope of this paper; for
details we refer to [24].
D. DRRA configuration flow
As shown in Fig. 5, DRRA is programmed in two phases (off-line and
on-line) [10]. The configware (binary) for commonly used DSP functions
(FFT, FIR filter, etc.) is written in Vesyla (the HLS tool for DRRA)
and stored in an off-line library. For each function, multiple
versions with different degrees of parallelism are stored. The library
thus created is profiled with the frequency and worst-case execution
time of each version. To map an application, its (Simulink-type)
representation is fed to the compiler. The compiler, based on the
functions present in the library, constructs the binary for the
complete application (e.g. WLAN). Since the actual execution times are
unknown at compile-time, the compiler sends all the versions (of each
function) that meet the deadlines to the run-time configware memory.
To reduce the memory required for storing multiple versions, the
compiler generates a compact representation of these versions. Details
of the compression algorithm, and of how it is unraveled, are given in
[10]. The compact representation is unraveled
(parallelized/serialized) dynamically by the run-time resource manager
(running on the LEON3 processor).
[Figure: compile-time flow (Simulink model and application deadlines → Vesyla (HLS tool) → library → compiler → compressed configware versions) feeding the run-time flow (LEON3 parallelizes/serializes and maps the configware onto DRRA).]
Fig. 5. Configuration model
IV. RUN-TIME ROTATABLE-EXPANDABLE PARTITIONS (RUROT)
Before our enhancements, DRRA only accommodated compiler-based mapping
[6], [5], [24]. Although the architecture allowed an application to be
serialized/parallelized at run-time, the application-to-component
(i.e. DPU, reg-file, and SB) mapping for each implementation was
determined during compilation. The proposed approach guides the
resource manager to allocate the resources dynamically.
A. Configuration flow modifications
To realize run-time application remapping and rotation, as shown in
Fig. 6, we have added an additional block (called RuRot) to the
original configuration flow (cf. Fig. 5). RuRot is composed of three
blocks: (i) a Maximal Empty Rectangle (MER) calculator, (ii) a
rotation calculator, and (iii) a remapper. The MER calculator keeps a
record of the free spaces available in the platform and provides this
information to the rotation calculator. The rotation calculator
receives the number of rows/columns required by the application from
the RTM (residing in the LEON3 processor). It compares the required
space (provided by the RTM) with the available space (provided by the
MER calculator) to determine the location and the angle for remapping.
The calculated angle and displacement are sent to the remapper, which
applies the rotation and/or displacement transform to the
configuration bitstream received from the RTM. The MER calculator
works in the background, as soon as an application is mapped, while
the rotation calculator and the remapper operate at run-time when a
mapping request is made. Therefore, the MER calculator is implemented
in software while the rotation calculator and the remapper are
implemented in hardware. In this section, we discuss the remapper,
followed by a detailed description of the MER and rotation
calculators.
B. Remapper
The remapper applies the displacement and rotation transformation
(received from the rotation calculator) to the bitstream received from
the RTM. In packet-switched NoCs, application rotation/displacement is
easily implemented by modifying the destination and/or source location
in the configware. However, in circuit-switched NoCs (used in
prominent CGRAs), application displacement/rotation is not a trivial
issue: the transformation depends on the type of element (memory,
computation, or interconnect) being configured. The memory and
computational elements (i.e. reg-files and
[Figure: the RuRot block between the LEON3 (RTM) and DRRA; the software MER calculator feeds an MER table to the hardware rotation calculator, which drives the remapper with Rot/CW signals or reports the application as unmappable.]
Fig. 6. Modifications in the configuration flow
DPUs in DRRA) are remapped simply by migrating the configware to the
new location. The interconnect instructions not only need to be
migrated; the new circuits (connecting the sources and destinations)
also have to be recalculated. We have implemented the remapper in
hardware. The motivation for a hardware-based implementation is that
every application with a resource in the same column as the remapped
application has to be stalled during remapping. The block-level
implementation of the remapper is shown in Fig. 7. The remapper is
composed of two parts: (i) a rotation-to-offset generator and (ii) a
shifter.
[Figure: the rotation-to-offset block converts CW, Rot, X, and Y into offsets DX and DY; the shifter applies them to the configuration bitstream, producing the new bitstream and location.]
Fig. 7. Block-level diagram of the remapper
1) Rotation-to-offset generator: To clearly illustrate the
functionality of the rotation-to-offset generator, consider the
example depicted in Fig. 8. The figure shows an application, App1,
mapped to a platform with 12 cells. Initially, App1 is mapped to the
cells 2 (i.e. x=0, y=2), 7 (i.e. x=1, y=3), and 10 (i.e. x=2, y=2). To
rotate App1 by 90° anticlockwise with respect to cell 2, cells 7 and
10 should be remapped to cells 5 and 0, respectively. In other words,
the horizontal and vertical offsets for cell 7 (DX7 and DY7) and cell
10 (DX10 and DY10) are given by (DX7 = 0, DY7 = −2) and (DX10 = −2,
DY10 = −2), respectively. The rotation-to-offset block performs these
transformations.
[Figure: a 12-cell platform before and after rotation; App1 moves from cells 2, 7, and 10 to cells 2, 5, and 0.]
Fig. 8. Application rotation anticlockwise by 90 degrees
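The offset computation above can be sketched in a few lines of software. This is an illustrative model, not the paper's hardware: the column-major cell numbering (cell = x·height + y) and the rotation convention are assumptions inferred from the Fig. 8 example.

```python
# Illustrative software model of the rotation-to-offset computation.
# Assumption: cells are numbered column-major (cell = x * height + y),
# which matches Fig. 8 (cell 7 -> x = 1, y = 3 on a platform of height 4).

def rotate_offset(cell, pivot, height=4, clockwise=False):
    """Return (DX, DY) moving `cell` to its place after a 90-degree
    rotation about `pivot` (anticlockwise by default, as in Fig. 8)."""
    x, y = divmod(cell, height)
    px, py = divmod(pivot, height)
    rx, ry = x - px, y - py              # position relative to the pivot
    if clockwise:
        nx, ny = px - ry, py + rx        # 90-degree clockwise
    else:
        nx, ny = px + ry, py - rx        # 90-degree anticlockwise
    return nx - x, ny - y                # offsets (DX, DY)
```

Under these assumptions the sketch reproduces the Fig. 8 offsets: cell 7 yields (0, −2) and cell 10 yields (−2, −2).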
Fig. 9 illustrates the hardware used to calculate the offsets. The
circuit receives five inputs: (i) rotation (Rot), (ii) ClockWise (CW),
(iii) X, (iv) Y, and (v) the present address (Adrs). Rot determines
whether the rotation should be applied. Rot = 0 deactivates this
circuit in order to perform a displacement; the X and Y inputs are
then copied as the displacement offset values (DX, DY). Rot = 1
activates rotation. In this case, (X, Y) denotes the coordinates of
the point about which the application is rotated (like cell 2 in
Fig. 8). If rotation is activated, DX and DY are calculated depending
on whether clockwise or anticlockwise rotation is required. The
calculated offsets are sent to the shifter, which remaps the
application.
[Figure: adder/subtractor network computing DX and DY from Rot (rotation), CW (clockwise), X (column number), Y (row number), Adrs (address), and H (height).]
Fig. 9. Rotation-to-offset generator
2) DRRA interconnection network: We use the interconnect network of
DRRA as a vehicle to quantitatively analyze the complexity and
overheads of modifying interconnect instructions in CGRAs. The
motivation for choosing the DRRA network is that it is well
documented, its bandwidth has been tested on demanding industrial
applications (with Huawei) [14], and all its architectural details,
from RTL code to physical design, are available to us. DRRA hosts a
circuit-switched interconnect network. It allows any cell component
(reg-file or DPU) to be directly connected to another component up to
three hops away in a non-blocking fashion (i.e. in a single cycle).
Fig. 10 shows a fragment of the DRRA fabric arranged in two rows and
three columns. Each output is connected to a horizontal bus, while the
inputs are connected through a vertical bus. Further details about the
motivation and scalability of this interconnect can be found in [1].
[Figure: reg-file and DPU outputs attached to horizontal buses, inputs attached through vertical buses.]
Fig. 10. DRRA interconnect network [1]
Recall from Section III that SB instructions are used to configure the
switch boxes in DRRA. To configure a connection, the information is
sent to the sequencer of the destination cell. Since the remapper
modifies SB instructions, before explaining the modification hardware
we give a detailed description of the SB instruction. The various
fields of the SB instruction are shown in Fig. 11. To establish a
connection, the 36-bit SB instruction requires five parameters: (i)
Device (Dav), (ii) Out index (Oi), (iii) Hierarchical index (Hi), (iv)
Vertical index (Vi), and (v) Data. Dav identifies whether the source
is a reg-file or a DPU. The Oi field indicates the output port of the
source. The Hi field specifies the distance between the source and
destination columns; in the current implementation the maximum value
of Hi is 6. The Ri field determines the source row. The Vi field
indicates the input port of the destination. The Data field is unused
for SB instructions (it is only used in reg-file and DPU
instructions). It should be noted that, while different CGRAs will
have different field sizes (depending on the topology, the number of
input/output ports, and the maximum number of hops), the basic concept
of the circuit-switched NoC is the same. Therefore, the overheads
calculated for DRRA are expected to carry over to other architectures
with slight modifications.
[Figure: SB instruction word with fields Dav (device), Oi (out index), Hi (hierarchical index), Ri (row index), Vi (vertical index), and Data.]
Fig. 11. Contents of the SB instruction
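To make the field layout concrete, the sketch below packs and unpacks a 36-bit SB-style instruction word. The paper gives the field names but not their widths, so the widths used here are hypothetical placeholders chosen only to sum to 36 bits.

```python
# Hypothetical field widths (not given in the paper); they only need
# to sum to the 36-bit instruction width stated in Section III.
SB_FIELDS = [("Dav", 1), ("Oi", 3), ("Hi", 3), ("Ri", 3), ("Vi", 5), ("Data", 21)]

def pack_sb(**vals):
    """Pack field values into a single 36-bit word, MSB-first."""
    word = 0
    for name, width in SB_FIELDS:
        v = vals.get(name, 0)
        assert 0 <= v < (1 << width), f"{name} out of range"
        word = (word << width) | v
    return word

def unpack_sb(word):
    """Recover the field values from a packed word."""
    out = {}
    for name, width in reversed(SB_FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out
```

In this view, remapping an SB instruction amounts to unpacking the word, rewriting the Hi, Vi, and Ri fields, and repacking.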
3) Shifter: The shifter takes the current instructions, along with the
offsets (DX, DY), as inputs and generates the instructions for the new
location. The proposed hardware is shown in Fig. 12. We explain the
diagram from left to right, following the data flow. Initially, the SB
instruction (of the present location), along with DX and DY (from the
rotation-to-offset generator) and CW and Rot (from the rotation
calculator), is received. At this stage, the instruction type is
checked. If instr ≠ 7, implying that the received instruction is not
an interconnect (SB) instruction, only the address field of the
instruction is modified: the DX and DY values are added to the
original address. If instr = 7, indicating that the received
instruction is an SB instruction, the Hi, Vi, and Ri fields are
transformed. To calculate the indices, an intermediate index
correction is made by the block shown in Fig. 13. If rotation is
activated, the new Hi value is recalculated by first applying the
intermediate index calculation; then, depending on the values of Rot
and CW, the final index values are calculated.
When rotation = 0 (i.e. a vertical or horizontal shift is required),
Vi(i+1) = Vi(i) + 6·DY, where Vi(i+1) and Vi(i) denote the new and
present Vi values, respectively. The reason for an increment of 6·DY
is that the value of Vi in consecutive rows differs by 6 (i.e. each
cell has six inputs: two for the register file and four for the DPU).
Similarly, since each cell contains two rows of horizontal buses
(cf. Fig. 10), the new Ri value is given by
Ri(i+1) = Ri(i) + 2·DY. In this mode (i.e. when rotation = 0), Hi is
modified only when the original or final position lies in the first 3
columns (to compensate for the insufficient cells for 3-hop
connectivity on the left). The new value of Hi is given by
Hi(i+1) = Hi(i) + PC − NC, where PC and NC denote the positive and
negative corrections, respectively. They are given by PC = 3 − OC
(when the application is presently mapped to the first three columns)
and NC = 3 − DC (when the final address of the application lies in the
first three columns), where OC and DC denote the present and final
column of the application, respectively. When rotation = 1 (i.e.
rotation is activated), the final value of Ri(i+1) is given by:
R_i(i+1) = \begin{cases} R_i(i) + 2(DY + HD), & \text{clockwise} \\ R_i(i) - 2(DY + HD), & \text{otherwise} \end{cases} \quad (1)
where HD denotes the horizontal distance, given by
HD = Hi(i) − min(OrigCol, 3), and OrigCol is obtained by dividing the
original address by the height. Simply put, the new Ri(i+1) is
calculated by either incrementing or decrementing the original Ri(i),
depending on the rotation direction and the original horizontal
connection distance. In rotation mode, the final value of Vi(i+1) is
given by Vi(i+1) = (Ri(i)/2 + DY)·6 + Vi(i) mod 6. The final value of
Hi, Hi(i+1), is calculated by summing the vertical connection distance
(i.e. the distance from the switch box to the cell that contains the
connected input) with the correction needed for the sliding-window
connection. Hi(i+1) is given by:
H_i(i+1) = \begin{cases} (V_i(i)/6 + R_i(i)/2) + mxhpd, & \text{clockwise} \\ -(V_i(i)/6 + R_i(i)/2) - mxhpd, & \text{otherwise} \end{cases} \quad (2)
where mxhpd = min(destination column, 3).
Fig. 12. Shifter hardware
Fig. 13. Index corrector
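The rotation = 0 (pure displacement) update rules above can be summarized in a short sketch. This models the arithmetic only, not the shifter hardware, and the max(0, …) form of the PC/NC corrections is our reading of the "first 3 columns" condition.

```python
# Software model of the non-rotating (rotation = 0) index update,
# using the per-cell constants stated in the text: 6 inputs per cell
# (Vi step) and 2 horizontal-bus rows per cell (Ri step).

def shift_indices(vi, ri, hi, dy, orig_col, dest_col):
    """Displace an SB instruction's indices vertically by dy rows."""
    vi_new = vi + 6 * dy        # Vi(i+1) = Vi(i) + 6*DY
    ri_new = ri + 2 * dy        # Ri(i+1) = Ri(i) + 2*DY
    pc = max(0, 3 - orig_col)   # positive correction: source in first 3 columns
    nc = max(0, 3 - dest_col)   # negative correction: destination in first 3 columns
    hi_new = hi + pc - nc       # Hi(i+1) = Hi(i) + PC - NC
    return vi_new, ri_new, hi_new
```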
C. MER and rotation calculation
The remapper receives the displacement and rotation values calculated
by the MER and rotation calculators. Fig. 14 illustrates the
functionality of the MER and rotation calculators.
1) MER calculator: The MER calculator is activated when an application
is mapped, when an existing application is parallelized/serialized, or
when an existing application leaves the platform. Once activated, it
operates in the background to find the complete set of Maximal Empty
Rectangles (MERs). The MERs are calculated using the Enhanced Scan
Line Algorithm [25]. After isolating all the MERs, the MER calculator
arranges them in order of decreasing area (where area is the product
of MER rows and columns). Finally, the reference address (the
bottom-left element address in our case), the number of rows MERrows,
and the number of columns MERcols are sent to the rotation calculator.
Since the MER calculator operates in the background, we have opted to
implement it in software, as a separate thread on the LEON3 processor.
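For illustration, a naive brute-force MER enumeration is sketched below. The paper uses the far more efficient Enhanced Scan Line Algorithm [25]; this sketch only demonstrates the output the MER calculator hands to the rotation calculator (a reference address plus MERrows × MERcols, sorted by decreasing area). The top-left (row, col) reference used here is an assumption; the paper uses the bottom-left element address.

```python
from itertools import product

def find_mers(grid):
    """Return maximal empty rectangles as (row, col, rows, cols),
    sorted by decreasing area. grid: 0 = free cell, 1 = occupied."""
    n_rows, n_cols = len(grid), len(grid[0])

    def empty(r, c, h, w):
        return all(grid[r + i][c + j] == 0
                   for i in range(h) for j in range(w))

    # enumerate every empty rectangle (exponentially simpler, but slower,
    # than the scan-line approach used in the paper)
    rects = {(r, c, h, w)
             for r, c in product(range(n_rows), range(n_cols))
             for h in range(1, n_rows - r + 1)
             for w in range(1, n_cols - c + 1)
             if empty(r, c, h, w)}

    def inside(a, b):           # a strictly contained in b
        return (a != b and b[0] <= a[0] and b[1] <= a[1]
                and b[0] + b[2] >= a[0] + a[2]
                and b[1] + b[3] >= a[1] + a[3])

    # keep only rectangles not contained in a larger empty rectangle
    mers = [r for r in rects if not any(inside(r, b) for b in rects)]
    return sorted(mers, key=lambda r: r[2] * r[3], reverse=True)
```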
2) Rotation calculator: The rotation calculator is activated when an
application requests platform resources (i.e. a new application
arrives or an application is parallelized) or when the MER calculator
sends an update-MER request. If the required resources are not free,
the rotation calculator determines the number of rows Approws and
columns Appcols required by the application. It then attempts to find
a MER with MERrows ≥ Approws and MERcols ≥ Appcols. If one of the MERs
satisfies this condition, the MER location (X, Y) is sent to the
shifter with Rot = 0. If none of the MERs meets this condition, the
rotation calculator tries to find a MER such that MERrows ≥ Appcols
and MERcols ≥ Approws. If one of the MERs satisfies this condition,
the MER location (X, Y) is sent to the shifter with Rot = 1. If the
rotation calculator is unable to find a MER with the required rows and
columns, it informs the RTM.
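The two-pass search just described amounts to the following sketch, which mirrors the flow of Fig. 14 rather than the actual hardware:

```python
def place(app_rows, app_cols, mers):
    """mers: list of (x, y, mer_rows, mer_cols), in decreasing-area order.
    Returns (x, y, rot) for the first fitting MER, or None (unmappable)."""
    for x, y, mr, mc in mers:        # first pass: direct fit, Rot = 0
        if mr >= app_rows and mc >= app_cols:
            return x, y, 0
    for x, y, mr, mc in mers:        # second pass: rotated fit, Rot = 1
        if mr >= app_cols and mc >= app_rows:
            return x, y, 1
    return None                      # rotation calculator informs the RTM
```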
V. FORMAL EVALUATION OF ENHANCED UTILIZATION
The fundamental advantage of using RuRot is that it increases the
probability of finding free resources, compared to static mapping.
Thereby, it potentially offers higher resource utilization and
energy/power reduction (when combined with dynamic parallelism and
DVFS) [5]. In this section we formally evaluate the probability of
accommodating an application using static and RuRot-based application
mapping.
Consider a platform plat that contains n resources. A resource can be
either free or occupied; for simplicity, we assume that both states
are equally likely. At a given instant, all possible combinations
Comall of i free resources
[Figure: flowcharts of the MER calculator (on an application being mapped or leaving: calculate MERs, arrange them, and update the MER table) and the rotation calculator (on a mapping request: scan the MER table first for a direct fit (Rot = 0), then for a rotated fit (Rot = 1), updating the mapping or reporting failure). Legend: App = application, Calc = calculate, MER = maximal empty rectangle, Col = column, Req = request.]
Fig. 14. Information flow between MER and rotation calculator
are given by:
Com_{all} = \sum_{i=1}^{n} \binom{n}{i}. \quad (3)
Suppose that an application app requests r resources from the
platform. For traditional CGRAs (with static compile-time binding),
the total number of combinations Coms with sufficient resources to
accommodate app is given by:
Com_{s} = \sum_{i=r}^{n} \binom{n-r}{i-r}. \quad (4)
So, at any instant, the probability Ps of accommodating app using
static mapping is given by:
P_{s} = Com_{s}/Com_{all}. \quad (5)
To evaluate the benefit of displacing an application, consider that
the platform contains platr rows and platc columns. If the platform
supports horizontal application displacement, the total number of
combinations that can accommodate app is given by:
Com_{disx} = \left(\sum_{i=r}^{n} \binom{n-r}{i-r}\right)(plat_{c} - app_{c} + 1), \quad (6)
where appc denotes the number of columns in the application
to be mapped. If the platform supports vertical application
displacement, the total number of combinations that can accommodate
app is given by:
Com_{disy} = \left(\sum_{i=r}^{n} \binom{n-r}{i-r}\right)(plat_{r} - app_{r} + 1), \quad (7)
where appr denotes the number of rows in the application to be mapped.
If appc ≠ appr, rotating the application by 90° yields additional
placements, and the total number of combinations that can accommodate
app is given by:
Com_{rot} = (Com_{disx} + Com_{disy}) \cdot 2. \quad (8)
TABLE II
APPLICATION AND RESOURCES

Application | Resources (Cells) | Memory (bits)
FFT1        |  6                |  5148
FFT2        | 12                |  6588
FFT3        | 18                |  7998
MM1         |  3                |   756
MM2         |  6                |  1548
MM3         |  9                |  2340
WLAN1       | 18                |  7380
WLAN2       | 24                |  8820
WLAN3       | 30                | 10320
Therefore, the total number of combinations Com_dr with support for
RuRot is given by:

$Com_{dr} = \begin{cases} Com_{disx} + Com_{disy}, & \text{if } app_{c} = app_{r} \\ Com_{rot}, & \text{otherwise} \end{cases}$   (9)

So the probability of finding a free space using RuRot is given
by:

$P_{RuRot} = Com_{dr}/Com_{all}$.   (10)
Equations 5, 6, 7, 8, and 10 clearly indicate that by
allowing an application to be rotated and displaced, the probability
of finding a free space increases significantly.
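The probabilities above can be evaluated numerically with a short script. This is a minimal sketch of Eqs. (3)-(10); the function names and the example platform (a 4x4 grid, n = 16, hosting a 2x3 application with r = 6) are illustrative assumptions, not values from the paper.

```python
from math import comb

def p_static(n, r):
    """Probability that a static (fixed-position) mapping finds room,
    per Eqs. (3)-(5): Com_s / Com_all."""
    com_all = sum(comb(n, i) for i in range(1, n + 1))         # Eq. (3)
    com_s = sum(comb(n - r, i - r) for i in range(r, n + 1))   # Eq. (4)
    return com_s / com_all                                     # Eq. (5)

def p_rurot(n, r, plat_r, plat_c, app_r, app_c):
    """Probability with displacement and (for non-square applications)
    rotation, per Eqs. (6)-(10)."""
    com_s = sum(comb(n - r, i - r) for i in range(r, n + 1))
    com_disx = com_s * (plat_c - app_c + 1)                    # Eq. (6)
    com_disy = com_s * (plat_r - app_r + 1)                    # Eq. (7)
    if app_c == app_r:
        com_dr = com_disx + com_disy      # square: rotation adds nothing
    else:
        com_dr = (com_disx + com_disy) * 2                     # Eq. (8)
    com_all = sum(comb(n, i) for i in range(1, n + 1))
    return com_dr / com_all                                    # Eq. (10)
```

For the assumed 4x4 platform and 2x3 application, `p_rurot(16, 6, 4, 4, 2, 3)` is an order of magnitude larger than `p_static(16, 6)`, in line with the claim above.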
VI. RESULTS
The formalizations in the previous section were generic and
should be applicable to most grid-based CGRAs. However,
unlike FPGAs (which in general have a standard structure),
existing CGRAs vary greatly in architecture, and it is not
possible to provide concrete generic results applicable to all
CGRAs. Therefore, we have chosen DRRA as a representative
platform for the reasons already stated in Section IV-B2. To
analyze the utilization and memory savings of the proposed
approach on real applications, we mapped three representative
applications on the DRRA: (i) Fast Fourier Transform (FFT),
(ii) Matrix Multiplication (MM), and (iii) Wireless LAN
transmitter (WLAN) (see Table II). For each application,
implementations with different levels of parallelism were
simulated. The table shows the resources and the configuration
memory needed by each application.
A. Overhead analysis
To estimate the additional overhead incurred by the proposed
architecture, we synthesized the DRRA fabric with RuRot
for 65 nm technology at 400 MHz, using the Synopsys Design
Compiler. The modified architecture contains two additional
hardware components: (i) the remapper and (ii) the rotation
calculator. Table III shows that the remapper (capable of
supporting up to 32 rows) incurs negligible overhead (1.3 %
power and 0.2 % area) compared to the DRRA computation layer
(containing 14 cells). The motivation for selecting 14 cells
for DRRA is that this is the minimum number of cells needed to
support a common CDMA application with Huawei [14]. Of course,
hosting multiple applications will require more DRRA cells,
which will further reduce the comparative hardware overhead
incurred by our approach.
The rotation calculator requires only a simple comparator, which
has negligible overhead. The memory overhead of the
TABLE III
AREA AND POWER COMPARISON

           | RS   | DRRA    | Overhead (%)
Power (µW) | 919  | 70406   | 1.3
Area (µm²) | 2300 | 1199506 | 0.2
rotation calculator, Rot_mem, is given by
$Rot_{mem} = \lceil \log_{2}(cells) \rceil * 8 * MER$.
In our experiments we used MER = 10 and cells = 60, giving an
overhead of only 480 bits.
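The memory figure above can be checked directly. This is a one-line sketch of the formula as given in the text (the factor of 8 is taken from the formula as stated); the function name is illustrative.

```python
from math import ceil, log2

def rot_mem_bits(cells, mer_entries):
    """Rotation-calculator memory in bits, per the formula in the text:
    ceil(log2(cells)) bits per field, times 8, per MER entry."""
    return ceil(log2(cells)) * 8 * mer_entries

# The paper's configuration: 60 cells and 10 MER entries -> 480 bits.
```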
B. Scalability analysis
As explained in Section IV, the size of the remapper
depends on the rows in the platform. To analyze the scalability
of the remapper, we synthesized the remapper with different
number of rows. Figures 15 and 16 show respectively the
area and the power overheads. Two distinct overhead patterns
(linear and pow 2 in the figures) were observed, depending on
the number of rows. If the number of rows in the platform is a
power of 2 (i.e. 2, 4, 8....) the overhead is significantly lesser.
Consider for example that with 32 rows the silicon overhead is
almost 28 % lesser than a platform with 22 rows. The reason
for this is that the architecture shown in Section IV contains
many divisions and module operations using the number of
rows as an operand. By using a power of 2, these values
can be computed simply by shifting the bits. To conclude,
the synthesis results reveal that the proposed architecture is
scalable and the overheads can be significantly reduced by
using the rows as powers of 2.
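The shift-and-mask identity behind this saving can be sketched as follows; the function name is illustrative, and in hardware the shift and mask amount to mere rewiring rather than arithmetic logic.

```python
def div_mod_pow2(x, rows):
    """When rows = 2^k, the remapper's division and modulo collapse to a
    shift and a mask -- the reason power-of-2 row counts synthesize to
    much smaller logic."""
    assert rows > 0 and rows & (rows - 1) == 0, "rows must be a power of 2"
    k = rows.bit_length() - 1           # rows = 2^k
    return x >> k, x & (rows - 1)       # equals (x // rows, x % rows)
```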
Fig. 15. Area overheads for different row sizes
Fig. 16. Power overheads for different row sizes
C. Utilization enhancement
To demonstrate the benefits of our scheme we mapped the
applications, shown in Table II, on DRRA containing 60 cells.
TABLE IV
SUCCESSFUL MAPPINGS: STATIC MAPPING VS RUROT

Parallel apps allowed | RuRot (mappings) | Static (mappings) | Improvement (%)
 3                    | 935              | 423               | 54
 6                    | 318              | 129               | 59
 9                    | 125              |  63               | 49
12                    |  40              |  35               | 12
15                    |  32              |  13               | 59
18                    |  20              |  10               | 53
The entry time and lifetime of each application were chosen
randomly using an online random number generator; in total,
1000 application instances and their lifetimes were generated.
Figure 17 shows the mapping requests that were handled
successfully by the platform. As expected, the figure reveals
that the number of successful mappings, regardless of the
mapping technique (static or RuRot), decreases as the maximum
number of allowed parallel (simultaneous) applications
increases. The figure clearly shows that RuRot achieves more
successful mappings than the static application mapper used
in [5], [6].
Fig. 17. Successful mappings using static mapping and RuRot
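The experiment can be approximated with a toy Monte-Carlo sketch. This is not the paper's simulator: the free-region model (a single random empty rectangle per request), the platform size, the trial count, and all names are illustrative assumptions; it only demonstrates why allowing rotation can never reduce, and usually raises, the success count.

```python
import random

def fits(free_rows, free_cols, app, allow_rotation):
    """Can an (r, c) application be hosted in a free region, optionally
    after a 90-degree rotation?"""
    r, c = app
    if free_rows >= r and free_cols >= c:
        return True
    return allow_rotation and free_rows >= c and free_cols >= r

def simulate(trials=1000, plat=(6, 10), seed=0):
    """Count successful mappings of randomly shaped applications into a
    randomly sized free region, with and without rotation."""
    rng = random.Random(seed)
    static_ok = rotate_ok = 0
    for _ in range(trials):
        app = (rng.randint(1, plat[0]), rng.randint(1, plat[1]))
        free = (rng.randint(1, plat[0]), rng.randint(1, plat[1]))
        static_ok += fits(free[0], free[1], app, allow_rotation=False)
        rotate_ok += fits(free[0], free[1], app, allow_rotation=True)
    return static_ok, rotate_ok
```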
Table IV shows the improvement obtained with the RuRot-based
approach. It can be seen that RuRot accepts approximately
50 % more applications than the static mapper. An anomaly can
be seen when 12 parallel applications are allowed: in this
case RuRot allowed only 12 % more mappings than static
mapping. After investigation, we found that in this case
WLAN3 (with 30 resources) and FFT3 were successfully mapped to
the platform and remained active during most of the
simulation. Therefore, most of the remaining mapping requests
were declined by both mapping algorithms.
We also evaluated the configuration memory needed by existing
state-of-the-art remappers [5], [6], [7] to achieve the same
device utilization. Figure 18 shows the memory requirements of
the memory-based and RuRot-based remappers. It clearly shows
that the RuRot-based remapper not only needs less
configuration memory but is also independent of the number of
cells in DRRA. These massive savings come from the
hardware-implemented remapper described in Section IV-B.
Fig. 18. Memory requirements for pre-calculated mappings vs RuRot

VII. CONCLUSION

In this paper, we have presented a design framework called
Run-time Rotatable-expandable Partitions (RuRot) to enhance
the optimization potential of dynamic remappers. The proposed
architecture allows an application to be repositioned and
expanded/contracted. Compared to the state of the art, RuRot
supports repositioning an application in four directions:
(i) horizontal, (ii) vertical, (iii) clockwise, and (iv)
anti-clockwise. In addition, it allows an application to be
serialized and/or parallelized horizontally and vertically.
The obtained results suggest that the added flexibility
increases device utilization significantly (on average 50 %
for the tested applications). Synthesis results confirm a
negligible penalty (0.2 % area and 1.3 % power) compared to
the fabric. In future work, we plan to investigate deadlocks
and congestion resulting from repositioning. We will also
combine RuRot with configuration defragmentation to further
enhance device utilization.
ACKNOWLEDGMENT
This work was supported by the Nokia Foundation, the Higher
Education Commission of Pakistan, and VINNOVA (Swedish Agency
for Innovation Systems) within the CUBRIC project.
REFERENCES
[1] M. A. Shami, “Dynamically reconfigurable resource array,” Ph.D.
dissertation, Royal Institute of Technology (KTH), Stockholm, Sweden,
2012. [Online]. Available: web.it.kth.se/~hemani/Athesis15.pdf
[2] D. Alnajjar, H. Konoura, Y. Ko, Y. Mitsuyama, M. Hashimoto, and
T. Onoye, “Implementing flexible reliability in a coarse-grained reconfig-
urable architecture,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. PP, no. 99, pp. 1–1, 2012.
[3] W. Quan and A. D. Pimentel, “A scenario-based run-time task
mapping algorithm for mpsocs,” in Proceedings of the 50th
Annual Design Automation Conference, ser. DAC ’13. New York,
NY, USA: ACM, 2013, pp. 131:1–131:6. [Online]. Available:
http://doi.acm.org/10.1145/2463209.2488895
[4] J. Huang, A. Raabe, C. Buckl, and A. Knoll, “A workflow for runtime
adaptive task allocation on heterogeneous mpsocs,” in Design, Automa-
tion Test in Europe Conference Exhibition (DATE), 2011, March 2011,
pp. 1–6.
[5] S. M. A. H. Jafri, O. Ozbak, A. Hemani, N. Farahini, K. Paul,
J. Plosila, and H. Tenhunen, “Energy-aware CGRAs using dynamically
reconfigurable isolation cells.” in Proc. International symposium for
quality and design (ISQED), 2013, pp. 104–111.
[6] S. Jafri, M. A. Tajammul, A. Hemani, K. Paul, J. Plosila, and H. Tenhunen, “Energy-aware-task-parallelism for efficient dynamic voltage,
and frequency scaling, in cgras,” in Embedded Computer Systems: Ar-
chitectures, Modeling, and Simulation (SAMOS XIII), 2013 International
Conference on, 2013, pp. 104–112.
[7] N. Abbas and Z. MA, “Run-time parallelization switching for resource
optimization on an MPSoC platform,” Springer design and Automation
of embedded systems, 2014.
[8] Y. Kim, J. Lee, A. Shrivastava, J. Yoon, and Y. Paek, “Memory-aware
application mapping on coarse-grained reconfigurable arrays,” in High
Performance Embedded Architectures and Compilers, ser. Lecture
Notes in Computer Science, Y. Patt, P. Foglia, E. Duesterwald,
P. Faraboschi, and X. Martorell, Eds. Springer Berlin Heidelberg,
2010, vol. 5952, pp. 171–185. [Online]. Available: http://dx.doi.org/10.
1007/978-3-642-11515-8_14
[9] M. Santambrogio, D. Pnevmatikatos, K. Papadimitriou, C. Pilato,
G. Gaydadjiev, D. Stroobandt, T. Davidson, T. Becker, T. Todman,
W. Luk, A. Bonetto, A. Cazzaniga, G. Durelli, and D. Sciuto, “Smart
technologies for effective reconfiguration: The faster approach,” in Re-
configurable Communication-centric Systems-on-Chip (ReCoSoC), 2012
7th International Workshop on, July 2012, pp. 1–7.
[10] S. Jafri, A. Hemani, K. Paul, J. Plosila, and H. Tenhunen, “Compact
generic intermediate representation (CGIR) to enable late binding in
coarse grained reconfigurable architectures,” in Proc. International Conference on Field-Programmable Technology (FPT), Dec. 2011, pp. 1–6.
[11] W. Böhm, J. Hammes, B. Draper, M. Chawathe, C. Ross, R. Rinker,
and W. Najjar, “Mapping a single assignment programming language
to reconfigurable systems,” J. Supercomput., vol. 21, no. 2, pp.
117–130, Feb. 2002. [Online]. Available: http://dx.doi.org/10.1023/A:
1013623303037
[12] H. Park, K. Fan, S. A. Mahlke, T. Oh, H. Kim, and H.-S. Kim,
“Edge-centric modulo scheduling for coarse-grained reconfigurable
architectures,” in Proceedings of the 17th International Conference
on Parallel Architectures and Compilation Techniques, ser. PACT ’08.
New York, NY, USA: ACM, 2008, pp. 166–176. [Online]. Available:
http://doi.acm.org/10.1145/1454115.1454140
[13] M. Hamzeh, A. Shrivastava, and S. Vrudhula, “Epimap: Using
epimorphism to map applications on cgras,” in Proceedings of the
49th Annual Design Automation Conference, ser. DAC ’12. New
York, NY, USA: ACM, 2012, pp. 1284–1291. [Online]. Available:
http://doi.acm.org/10.1145/2228360.2228600
[14] N. Farahini, S. Li, M. A. Tajammul, M. A. Shami, G. Chen, A. Hemani,
and W. Ye, “39.9 GOPs/Watt multi-mode CGRA accelerator for a multi-
standard base station,” in Proc. IEEE Int. Symp. Circuits and Systems
(ISCAS), 2013.
[15] O. Dragomir, T. Stefanov, and K. Bertels, “Loop unrolling and shifting
for reconfigurable architectures,” in Field Programmable Logic and
Applications, 2008. FPL 2008. International Conference on, Sept 2008,
pp. 167–172.
[16] S. Yin, C. Yin, L. Liu, M. Zhu, and S. Wei, “Configuration context
reduction for coarse-grained reconfigurable architecture.” IEICE TRANS-
ACTIONS on Information and Systems, vol. 95, no. 2, pp. 335–344,
2012.
[17] C.-L. Chou and R. Marculescu, “User-aware dynamic task allocation
in networks-on-chip,” in Design, Automation and Test in Europe, 2008.
DATE ’08, March 2008, pp. 1232–1237.
[18] M. A. Al Faruque, R. Krist, and J. Henkel, “Adam: Run-time agent-
based distributed application mapping for on-chip communication,” in
Proceedings of the 45th Annual Design Automation Conference, ser.
DAC ’08. New York, NY, USA: ACM, 2008, pp. 760–765. [Online].
Available: http://doi.acm.org/10.1145/1391469.1391664
[19] P. Holzenspies, J. Hurink, J. Kuper, and G. J. M. Smit, “Run-time spatial
mapping of streaming applications to a heterogeneous multi-processor
system-on-chip (mpsoc),” in Design, Automation and Test in Europe,
2008. DATE ’08, March 2008, pp. 212–217.
[20] E. Carvalho and F. Moraes, “Congestion-aware task mapping in hetero-
geneous mpsocs,” in System-on-Chip, 2008. SOC 2008. International
Symposium on, Nov 2008, pp. 1–4.
[21] L. Schor, I. Bacivarov, D. Rai, H. Yang, S.-H. Kang, and L. Thiele,
“Scenario-based design flow for mapping streaming applications onto
on-chip many-core systems,” in Proceedings of the 2012 International
Conference on Compilers, Architectures and Synthesis for Embedded
Systems, ser. CASES ’12. New York, NY, USA: ACM, 2012, pp. 71–80.
[Online]. Available: http://doi.acm.org/10.1145/2380403.2380422
[22] K. Compton, Z. Li, J. Cooley, S. Knol, and S. Hauck, “Configuration
relocation and defragmentation for run-time reconfigurable computing,”
Very Large Scale Integration (VLSI) Systems, IEEE Transactions on,
vol. 10, no. 3, pp. 209–220, 2002.
[23] M. A. Shami and A. Hemani, “Classification of massively parallel
computer architectures,” in Proc. IEEE Int. Parallel and Distributed
Processing Symposium Workshops PhD Forum (IPDPSW), May 2012,
pp. 344–351.
[24] M. A. Tajammul, S. M. A. H. Jafri, A. Hemani, J. Plosila, and H. Ten-
hunen, “Private configuration environments for efficient configuration
in CGRAs,” in Proc. Application Specific Systems Architectures and
Processors (ASAP), Washington, D.C., USA, 5–7 June 2013.
[25] J. Cui, Q. Deng, X. He, and Z. Gu, “An efficient algorithm for online
management of 2d area of partially reconfigurable fpgas,” in Design,
Automation Test in Europe Conference Exhibition, 2007. DATE ’07,
April 2007, pp. 1–6.