RuRot: Run-time Rotatable-expandable Partitions
for Efficient Mapping in CGRAs
Syed M. A. H. Jafri∗‡§, Guilermo Serrano †‡, Junaid Iqbal, Masoud Daneshtalab‡§, Ahmed Hemani§,
Kolin Paul, Juha Plosila, and Hannu Tenhunen‡§
Email: jafri@kth.se, guiserle@teleco.upv.es, junaid.iqbal@abo.fi, masdan@utu.fi, hemani@kth.se,
kolin.paul@gmail.com, juplos@utu.fi, hannu@kth.se
Turku Centre for Computer Science
Universidad Politécnica de Valencia, Spain
University of Turku, Finland
§Royal Institute of Technology, Sweden
Indian Institute of Technology, Delhi
Abstract—Today, Coarse Grained Reconfigurable Architec-
tures (CGRAs) host multiple applications, with arbitrary com-
munication and computation patterns. Compile-time mapping
decisions are neither optimal nor desirable to efficiently support
the diverse and unpredictable application requirements. As a
solution to this problem, recently proposed architectures offer
run-time remapping. The run-time remappers displace or expand
(parallelize/serialize) an application to optimize different param-
eters (such as platform utilization). However, the existing remap-
pers support application displacement or expansion in either
horizontal or vertical direction. Moreover, most of the works only
address dynamic remapping in packet-switched networks and
therefore are not applicable to the CGRAs that exploit circuit-
switching for low-power and high predictability. To enhance the
optimality of the run-time remappers, this paper presents a de-
sign framework called Run-time Rotatable-expandable Partitions
(RuRot). RuRot provides architectural support to dynamically
remap or expand (i.e. parallelize) the hosted applications in
CGRAs with circuit-switched interconnects. Compared to state
of the art, the proposed design supports application rotation
(in clockwise and anticlockwise directions) and displacement (in
horizontal and vertical directions), at run-time. Simulation results
using a few applications reveal that the additional flexibility
significantly enhances device utilization (on average 50%
for the tested applications). Synthesis results confirm that the
proposed remapper has negligible silicon (0.2% of the platform)
and timing (2 cycles per application) overheads.
I. INTRODUCTION AND MOTIVATION
The increasing processing and communication demands of
modern telecommunication applications coupled with a need
to reduce the non-recurring engineering costs have made re-
configurable architectures a popular implementation platform
[1]. The reconfigurable architectures can be classified on
the basis of granularity i.e. the number of bits that can be
explicitly manipulated by the programmer. Coarse Grained
Reconfigurable Architectures (CGRAs), provide operator level
configurable functional blocks, word level datapaths, and very
area-efficient routing switches. Compared to the fine-grained
architectures (like FPGAs), CGRAs require less configuration
memory and configuration time (by two or more orders
of magnitude [2]). As a result, CGRAs enjoy a significant
reduction in area (from 66% to 99.06% [1]) and energy
consumed per computation (from 88% to 98% [1]), at the
cost of a loss in flexibility compared to bit-level operations.
Therefore, CGRAs have been a subject of intensive research
over the last decade [1].
Today, platforms are required to simultaneously host mul-
tiple applications with arbitrary inter-application communica-
tion and concurrency patterns [3]. Prevailing technology trends
(like utilization wall and power wall) dictate the use of aggres-
sive optimization techniques. To enhance device utilization
and energy efficiency (for unpredictable scenarios) concepts
like run-time remapping [4], [3] and dynamic parallelism [5],
[6], [7] have been proposed. Run-time remapping changes the
physical placement of an application to reduce communication
[4], memory [8], and/or reconfiguration [9] costs. Dynamic
parallelism parallelizes an application to induce speedup and
generate additional time slacks that allow the platform to
operate at a lower voltage/frequency. However, the proposed
remapping techniques (in the CGRA domain) only allow displacing
or parallelizing/serializing an application in either the horizontal
or the vertical direction. Moreover, they are mostly targeted at either
packet-switched Network on Chips (NoCs) or FPGAs and
are therefore not applicable to most CGRAs (that employ
circuit-switched interconnects for high predictability and low
power). To enhance the optimization potential of the presented
works, this paper presents the Run-time Rotatable-expandable
Partitions (RuRot) framework. The proposed framework allows
virtual partitions to be created dynamically in a platform.
Each partition can be rotated (clockwise/anticlockwise) and
expanded/contracted in the horizontal and vertical directions.
To illustrate the motivation for our approach, consider Fig.
1, which shows a CGRA with nine processing elements. Fig. 1
(a) shows the impact of dynamic rotation. Instance 1 in the
figure depicts a scenario, where all the processing elements
are occupied by the hosted applications (App1, App2, App3,
and App4). In Instance 2, App4 leaves and App5 requests
processing elements. Most of the existing run-time mappers
(in CGRAs with circuit-switched interconnects) support only
horizontal or vertical displacements and therefore will decline
the application request. As shown in Instance 3, the RuRot
framework will be able to accommodate App5 on the existing
resources by rotating it. Fig. 1 (b) illustrates the benefits
of run-time mapping for application expansion. In the figure,
App2 has three implementations, each with a different degree
of parallelism (serial, partially parallel, and parallel). We
assume that given the resources, a parallel implementation is
978-1-4799-3770-7/14/$31.00 ©2014 IEEE
2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV)
Fig. 1. Motivation for RuRot: (a) application rotation; (b) diagonal application expansion (App2 in serial, partially parallel, and parallel versions)
preferred (since it allows lowering the voltage and frequency
[5]). In Instance 2 of the figure, sufficient resources to map
the parallel version of App2 are available. However, existing
CGRA remappers will be unable to parallelize the application
because resource 5 (needed for the partially parallel version of
App2) is not free and because they bind the mappings for the parallel
versions at compile-time [6], [5], [10]. As a solution to this
problem, the RuRot will be able to map the parallel version to
resources 1 and 2 by performing vertical expansion, as shown
in Fig. 1 (b) Instance 3. It should be noted that the same utilization
efficiency can be achieved by storing all the combinations at
compile-time. However, it will be shown in Sections V
and VI that compile-time storage incurs prohibitive
configuration memory overheads.
This paper is organized as follows. In Section II, a brief
survey of existing run-time mappers is presented. In Section
III, an overview of the CGRA platform, used in this paper
is described. In Section IV, details of the system level archi-
tectural modifications, hardware, and algorithm to implement
RuRot is explained. In Section V, we formalize the potential
benefits of our methods. In Section VI, we evaluate the actual
benefits and redundancies imposed by our method on an
actual CGRA. Finally, in Section VII, we summarize our
contributions and suggest directions for future research.
II. RELATED WORK AND CONTRIBUTIONS
Application mapping is one of the most fundamental challenges
in the CGRA domain and has therefore been a subject
of intensive research. In this section, we discuss only
the most prominent work related to our approach. Broadly, a
mapping technique can be classified on the basis of adaptivity
(static/dynamic mapping), the parameter to optimize (energy,
space, etc.), the targeted architecture (CGRA, NoC, FPGA,
etc.), and the flexibility offered (rotation, displacement, run-time
parallelism).
Traditionally, the mapping decisions were taken by the com-
piler and therefore offered no run-time adaptivity [11].
Static mappers enjoy relaxed processing deadlines, which
allow them to execute complex algorithms such as modulo
scheduling [12], [13] and affine loop transformation [14] to
efficiently exploit parallelism [15], [16]. Although they find
optimal mappings, compile-time decisions are unable to
efficiently cope with the unpredictable scenarios found in many
real-world applications.
To deal with the unpredictability inherent in today's real-world
applications, recently proposed platforms offer run-time map-
ping/remapping. However, most of the work on run-time
remapping targets packet-switched 2-D Network on Chips
(NoCs) [17], [18], [19]. Chou and Marculescu [17] proposed a
run-time mapping strategy that allocates resources at
run-time based on user behavior. Faruque et al. [18] presented
a distributed approach to map applications dynamically in a
NoC. Holzenspies et al. [19] proposed a run-time spatial appli-
cation mapping strategy targeted for streaming applications. In
[20], network congestion-aware run-time mapping architecture
was presented. To reduce communication load, [4] proposed
a run-time remapper that uses an adaptive task allocation
algorithm. Quan et al. [3] and Schor et al. [21] proposed
scenario-based run-time mapping. All these techniques target
packet-switched NoCs and they do not support run-time par-
allelism or application rotation. Asad et al. [5], [6] presented
an architecture to realize run-time parallelism in CGRAs.
However, in the proposed method each element was tightly
coupled (similar to Instance 2 in Fig. 1 (b)). The only work
that deals with application rotation was presented by Compton
et al. [22]. Their approach enhanced device utilization by
minimizing the unusable area created during reconfiguration. The
proposed method also allowed rotation. However, Compton's
approach was FPGA-centric, while our approach is designed
for CGRAs. Moreover, Compton did not consider run-time
application expansion (parallelism), which is one of the main
contributions of this paper.
The related work reveals that most application mapping
strategies (in CGRAs) rely on compile-time decisions. The
approaches that do address run-time mapping are targeted
at either packet-switched NoCs or FPGAs (in which the
interconnects are significantly different). Moreover, none of
the approaches allow generic run-time parallelism.
To address these problems, this paper makes three major
contributions:
1) We propose an architecture that allows an application to be
dynamically remapped in the horizontal, vertical, clockwise,
and/or anticlockwise directions;
2) We present a framework to parallelize an application
at run-time. Compared to state of the art expansion
techniques in CGRAs [6], [5], [10] (where the mapping
for each implementation is statically determined), our
approach allows an application to be dynamically parallelized
in the vertical or horizontal direction; and
3) We perform formal and gate-level analysis to evaluate
overheads for implementing the proposed architecture.
III. SYSTEM OVERVIEW
We have chosen the Dynamically Reconfigurable Resource
Array (DRRA) [23] as a vehicle to evaluate experimentally
the efficiency of RuRot on a CGRA. Nevertheless, it seems
that the results are essentially applicable to most grid-based
CGRAs (with circuit-switched interconnects) as well. DRRA
is a CGRA template designed to support multiple DSP stan-
dards on a single platform.
Fig. 2. Different applications executing in their private environments
As depicted in Fig. 2, it is composed of three main
components: (i) system controller, (ii) computation layer, and
(iii) memory layer. For each hosted application, a separate
partition can be created in memory and computation layers.
The partition is optimal in terms of energy, power, and
reliability [6], [24], [5]. Table I briefly describes the basic
functionality of these components.
TABLE I
OVERVIEW OF DIFFERENT DRRA COMPONENTS

Component          | Functionality
System controller  | (i) Sends configware to the DRRA computation layer; (ii) creates memory partitions for each application
Computation layer  | Performs computations
DRRA storage layer | Stores data for the DRRA computation layer
A. System controller
Figure 3 illustrates the overall system architecture for man-
aging DRRA. The controlling intelligence is provided by a
Run-Time resource Manager (RTM). The RTM resides in the
LEON3 processor and has two main responsibilities: (i) to
configure DRRA by loading the binary from the configuration
memory, and (ii) to parallelize/serialize application tasks,
depending on the deadlines. Before our modifications, the
application-to-component mapping was done at compile-time.
While the architecture accommodated run-time parallelism, it
did not allow an implementation to be dynamically remapped (similar
to Instance 3 in Fig. 1 (b)). In this paper we present the
architectural modifications required to remap at run-time.
Fig. 3. System management layer of DRRA
B. DRRA Computation Layer
DRRA computational layer is shown in Fig. 4. It is com-
posed of four elements: (i) Register Files (reg-files), (ii) mor-
phable Data Path Units (DPUs), (iii) circuit-Switched Boxes
(SBs), and (iv) sequencers. The reg-files store data for DPUs.
The DPUs are functional units responsible for performing
computations. SBs provide interconnectivity between different
components of DRRA. The sequencers hold the configware
which corresponds to the configuration of the reg-files, DPUs,
and SBs. Each sequencer can store up to 64 36-bit instructions
and can reconfigure the elements only in its own cell. As
shown in Fig. 4, a cell consists of a reg-file, a DPU, SBs, and
a sequencer, all having the same row and column number as the
cell. The configware loaded in the sequencers contains a
sequence of instructions (reg-file, DPU, and SB instructions)
that implements the DRRA program.
Fig. 4. DRRA computation layer
C. DRRA Storage Layer (DiMArch)
DiMArch is a distributed scratch pad (data/configware)
memory that complements DRRA with a scalable memory
architecture. Its distributed nature allows high-speed data
and configware access by the DRRA computational layer (com-
pared to the global configuration memory). Further discussion
of DiMArch is beyond the scope of this paper and for details
we refer to [24].
D. DRRA configuration flow
As shown in Figure 5, DRRA is programmed in two
phases (off-line and on-line) [10]. The configware (binary) for
commonly used DSP functions (FFT, FIR filter, etc.) is written
in VESYLA (an HLS tool for DRRA) and stored in an off-
line library. For each function, multiple versions, with different
degrees of parallelism, are stored. The library, thus created, is
profiled with the frequencies and worst-case execution time of each version.
To map an application, its (Simulink-type) representation is
fed to the compiler. The compiler, based on the available
functions (present in the library), constructs the binary for the
complete application (e.g. WLAN). Since the actual execution
times are unknown at compile-time, the compiler sends all
the versions (of each function) meeting the deadlines to the run-
time configware memory. To reduce the memory requirements
for storing multiple versions, the compiler generates a com-
pact representation of these versions. Details of the compression
algorithm and how it is unraveled are given in [10]. The
compact representation is unraveled (parallelized/serialized)
dynamically by the run-time resource manager (running on
LEON3 processor).
Fig. 5. Configuration Model
IV. RUN-TIME ROTATABLE-EXPANDABLE PARTITIONS
(RUROT)
Before our enhancements, DRRA accommodated only
compiler-based mapping [6], [5], [24]. Although the architecture
allowed an application to be serialized/parallelized at run-time,
the application-to-component (i.e. DPUs, reg-files, and SBs)
mapping for each implementation was determined during com-
pilation. The proposed approach guides the resource manager
to dynamically allocate the resources.
A. Configuration flow modifications
To realize run-time application remapping and rotation, as
shown in Fig. 6, we have added an additional block (called
RuRot), to the original configuration flow (cf. Fig. 5). RuRot
is composed of three blocks: (i) a Maximal Empty Rectangle
(MER) calculator, (ii) a rotation calculator, and (iii) a remapper.
The MER calculator keeps a record of the free spaces available
in the platform and provides the information to the rotation
calculator. The rotation calculator receives the information
about the rows/columns required by the application, from
the RTM (residing in the LEON3 processor). It compares
the required space (provided by RTM) with the available
space (provided by MER calculator) to determine the loca-
tion and the angle for remapping. The calculated angle and
displacement is sent to the remapper. The remapper applies
the rotation and/or displacement transform to the configuration
bitstream received from the RTM. The MER calculator works
in the background, as soon as an application is mapped, while the
rotation calculator and the remapper operate at run-time when
a mapping request is made. Therefore, the MER calculator
is implemented in software while the rotation calculator and
remapper are implemented in hardware. In this section, we
discuss the remapper, followed by a detailed description
of the MER and rotation calculators.
B. Remapper
The remapper applies displacement and rotation transfor-
mation (received from rotation calculator) to the bitstream
received from the RTM. In packet-switched NoCs, application
rotation/displacement is easily implemented by modifying the
destination and/or source location in the configware. How-
ever, in circuit-switched NoCs (used in prominent CGRAs),
application displacement/rotation is not a trivial issue. The
displacement/rotation transformation depends on the type of
element (memory, computation, or interconnect) configured.
Fig. 6. Modifications in configuration flow

The memory and computational elements (i.e. reg-files and
DPUs in DRRA) are remapped simply by migrating the
configware to the new location. The interconnect instructions
not only need to be migrated; the new circuits (connecting
the source and destination) also have to be recalculated. We have
implemented the remapper in hardware. The motivation for a
hardware-based implementation is that all the applications
with any resource in the same column as the remapped
application have to be stalled during remapping. The block
level implementation of the remapper is shown in Fig. 7.
The remapper is composed of two parts: (i) a rotation-to-offset
generator and (ii) a shifter.
Fig. 7. Block level diagram of remapper
1) Rotation to offset generator: To clearly illustrate the
functionality of the rotation-to-offset generator, consider the ex-
ample depicted in Fig. 8. The figure shows an application,
App1, mapped to a platform with 12 cells. Initially, App1
is mapped to cells 2 (i.e. x=0, y=2), 7 (i.e. x=1, y=3),
and 10 (i.e. x=2, y=2). To rotate App1 by 90° anti-clockwise
with respect to cell 2, cells 7 and 10 should be remapped
to cells 5 and 0, respectively. In other words, the horizontal and
vertical offsets for cell 7 (DX7 and DY7) and cell 10 (DX10
and DY10) are given by (DX7 = 0, DY7 = 2) and (DX10 = -2,
DY10 = 2), respectively. The rotation-to-offset block
performs these transformations.
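As a cross-check of this example, the rotation can be sketched in software. This is an illustrative model only, not the paper's hardware: the column-major cell addressing (cell = rows·x + y) is inferred from the coordinates given in the text, and the downward-pointing y axis is an assumption chosen so that the stated result is reproduced.

```python
# Sketch (not the paper's RTL): reproduce the Fig. 8 example, where App1
# occupies cells 2, 7, 10 of a 12-cell platform and is rotated 90 degrees
# anticlockwise about cell 2. Cell addressing is assumed column-major:
# cell = ROWS*x + y, matching the (x, y) pairs given in the text.

ROWS = 4  # cells per column, assumed from the 12-cell figure

def to_xy(cell):
    return cell // ROWS, cell % ROWS

def to_cell(x, y):
    return ROWS * x + y

def rotate_acw(cells, pivot):
    """Rotate each cell 90 deg anticlockwise about `pivot` (y axis assumed
    to point downward, so the anticlockwise map is (x, y) -> (y, -x))."""
    px, py = to_xy(pivot)
    out = []
    for c in cells:
        x, y = to_xy(c)
        rx, ry = x - px, y - py   # position relative to the pivot
        nx, ny = ry, -rx          # 90 deg anticlockwise with y pointing down
        out.append(to_cell(px + nx, py + ny))
    return out

# App1 = cells {2, 7, 10}; the text states 7 -> 5 and 10 -> 0 after rotation.
print(rotate_acw([2, 7, 10], pivot=2))  # [2, 5, 0]
```

The per-cell offsets (DX, DY) follow as the difference between the new and old coordinates, which is what the rotation-to-offset block computes in hardware.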
Fig. 8. Application rotation anticlockwise by 90 degrees
Fig. 9 illustrates the hardware used to calculate the offsets.
The circuit receives five inputs: (i) rotation (Rot), (ii) Clock-
Wise (CW), (iii) X, (iv) Y, and (v) the present Address (Adrs). Rot
determines whether the rotation should be applied. Rot = 0
deactivates this circuit in order to perform a displacement, and
the X and Y inputs are therefore copied as the displacement offset
values (DX, DY). Rot = 1 activates rotation. In this case,
(X, Y) denotes the coordinates of the point about which the
application is rotated (like cell 2 in Fig. 8). If the rotation is
activated, DX and DY are calculated depending on whether
clockwise or anticlockwise rotation is required. The calculated
offsets are sent to the shifter, which remaps the application.
Fig. 9. Rotation to offset generator (Rot = rotation, CW = clockwise, X = column number, Y = row number, Adrs = address, H = height)
2) DRRA interconnection network: We will use the in-
terconnect network of DRRA as a vehicle to quantitatively
analyze complexity and overheads of modifying interconnect
instructions in CGRAs. The motivation for choosing DRRA
network is that it is well documented, its bandwidth has been
tested on demanding industrial applications (with Huawei)
[14], and we have available all its architectural details from
RTL codes to physical design. DRRA hosts a circuit-switched
interconnect network, which allows any cell component
(reg-files and DPUs) to be directly connected with another component
up to three hops away in a non-blocking fashion (i.e. in a
single cycle). Fig. 10 shows a fragment of the DRRA fabric,
with cells arranged in two rows and three
columns. Each output is connected to a horizontal bus while
the inputs are connected through a vertical bus. Further details
about the motivation and scalability of this interconnect can
be found in [1].
Fig. 10. DRRA interconnect network [1]
Recall from Section III that SB instructions are used to con-
figure the switch boxes in DRRA. To configure a connection,
the information is sent to the sequencer of the destination cell.
Since the remapper will modify the SB instructions, before
explaining the modification hardware, we give a detailed
description of the SB instruction. The various fields of the SB
instruction are shown in Fig. 11. To establish a connection,
the 36-bit SB instruction requires five parameters: (i) Device
(Dav), (ii) Out index (Oi), (iii) Hierarchical index (Hi), (iv)
Vertical index (Vi), and (v) Data. Dav identifies whether the
source is a reg-file or a DPU. The Oi field indicates the output port
of the source. The Hi field specifies the distance between the source
and destination columns; in the current implementation, the
maximum value of Hi is 6. The Ri field determines the source
row. The Vi field indicates the input port of the destination. The Data
field is unused for SB instructions (it is only used in reg-
file and DPU instructions). It should be noted that while
different CGRAs will have different field sizes (depending
on the topology, the number of input/output ports, and the maximum
hop count), the basic concept of the circuit-switched NoC is the
same. Therefore, it is expected that the overheads calculated
for DRRA can be migrated to other architectures as well with
slight modifications.
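The field list above can be captured as a small record. The paper gives the SB instruction's fields but not their bit widths or encodings, so this sketch deliberately models the instruction as a plain record rather than a packed 36-bit word; the reg-file/DPU encoding of Dav is an assumption.

```python
# Illustrative record for the SB instruction's fields. Bit widths and the
# Dav encoding (0 = reg-file, 1 = DPU) are assumptions, not from the paper.
from dataclasses import dataclass

@dataclass
class SBInstr:
    dav: int       # Device: source is a reg-file or a DPU
    oi: int        # Out index: output port of the source
    hi: int        # Hierarchical index: source-to-destination column distance
    vi: int        # Vertical index: input port of the destination
    ri: int        # Row index: source row
    data: int = 0  # unused for SB instructions (reg-file/DPU instructions only)

    def __post_init__(self):
        # The paper states Hi is at most 6 in the current implementation.
        assert 0 <= self.hi <= 6, "Hi exceeds the 6-column window"

# A connection three columns away, from a DPU output to destination input 4.
sb = SBInstr(dav=1, oi=0, hi=3, vi=4, ri=1)
```

The remapper's job (next subsection) is precisely to rewrite the `hi`, `vi`, and `ri` fields of such records when an application is displaced or rotated.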
Fig. 11. Contents of the SB instruction (Dav = device, Oi = out index, Hi = hierarchical index, Vi = vertical index, Ri = row index)
3) Shifter: The shifter takes the current instructions along
with the offsets (DX, DY) as inputs and generates the instruc-
tions for mapping them to the new location. The
proposed hardware is shown in Fig. 12. We explain the
diagram from left to right, considering how the data
flows through it. Initially, the SB instruction (of the present location),
along with DX and DY (from the rotation-to-offset generator) and
CW and Rot (from the rotation calculator), are received. At this
stage, the instruction type is checked. If instr ≠ 7, implying
that the received instruction is not an interconnect (SB) instruction, only
the address field of the instruction is modified. To calculate the
new address, the DX and DY values are added to the original
address. If instr = 7, indicating that the received instruction
is an SB instruction, the Hi, Vi, and Ri fields are transformed. To
calculate the indices, an intermediate index correction is made
by the block shown in Fig. 13. If rotation is activated, the
new Hi value is recalculated by first applying the intermediate
index correction. Then, depending on the values of Rot and
CW, the final index values are calculated.
When rotation = 0 (i.e. a vertical or horizontal shift is re-
quired), Vi(i+1) = Vi(i) + 6·DY, where Vi(i+1) and Vi(i)
denote the new and present Vi values, respectively. The reason
for an increment of 6·DY is that the value of Vi in consecutive
rows differs by 6 (i.e. each cell has six inputs, two for the register
file and four for the DPU). Similarly, since each cell contains
two rows of horizontal buses (cf. Fig. 10), the new Ri value is
Ri(i+1) = Ri(i) + 2·DY. In this mode (i.e. when rotation = 0),
Hi is modified only when the original or final position lies
in the first 3 columns (to compensate for the insufficient cells for 3-
hop connectivity on the left). The new value of Hi is
given by Hi(i+1) = Hi(i) + PC − NC, where PC and NC
denote the positive and negative corrections, respectively. PC
and NC themselves are given by PC = 3 − OC (when the
application is presently mapped to the first three columns) and
NC = 3 − DC (when the final address of the application lies
in the first three columns). OC and DC denote the
present and final column of the application, respectively. When rotation = 1
(i.e. rotation is activated), the final value of Ri(i+1) is given by:

Ri(i+1) = Ri(i) + 2(DY + HD), clockwise
Ri(i+1) = Ri(i) − 2(DY + HD), otherwise    (1)

where HD denotes the horizontal distance. HD itself is given by
HD = Hi(i) − min(OrigCol, 3), where OrigCol is obtained
by dividing the original address by the height. Simply, the new
Ri(i+1) is calculated by either incrementing or decrementing
the original Ri(i) by 2·DY, depending on the rotation direction
and the original horizontal connection distance. In rotation
mode, the final value of Vi(i+1) is given by Vi(i+1) =
(Ri(i)/2 + DY)·6 + Vi(i) % 6. The final value of Hi, Hi(i+1),
is calculated by summing the vertical connection distance (i.e.
the distance from the switch box to the cell that contains the
connected input) with the correction needed for the sliding-
window connection. Hi(i+1) is given by:

Hi(i+1) = (Vi(i)/6 + Ri(i)/2) + mxhpd, clockwise
Hi(i+1) = (Vi(i)/6 + Ri(i)/2) − mxhpd, otherwise    (2)

where mxhpd = min(destination column, 3).
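The rotation = 0 (pure displacement) update can be sketched as follows. This is a software model of the stated formulas, not the shifter hardware; the handling of columns at or beyond the three-column window (PC = NC = 0 there) is an assumption filling in what the text leaves implicit.

```python
# Sketch of the rotation = 0 index update: Vi += 6*DY, Ri += 2*DY, and the
# Hi sliding-window correction PC = 3 - OC / NC = 3 - DC, applied only when
# the original (OC) or destination (DC) column lies in the first 3 columns.

def shift_indices(vi, ri, hi, dy, oc, dc):
    """Return (Vi, Ri, Hi) after a vertical shift of DY rows.

    vi, ri, hi : current SB-instruction indices
    dy         : vertical displacement in rows
    oc, dc     : original and destination column of the instruction
    """
    new_vi = vi + 6 * dy            # each cell has 6 inputs (2 reg-file + 4 DPU)
    new_ri = ri + 2 * dy            # each cell spans 2 rows of horizontal buses
    pc = (3 - oc) if oc < 3 else 0  # positive correction near the left edge
    nc = (3 - dc) if dc < 3 else 0  # negative correction near the left edge
    new_hi = hi + pc - nc
    return new_vi, new_ri, new_hi

# A one-row-down shift away from the left edge leaves Hi untouched:
print(shift_indices(4, 1, 3, dy=1, oc=5, dc=5))  # (10, 3, 3)
```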
Fig. 12. Shifter hardware
Fig. 13. Index corrector
C. MER and rotation calculation
The remapper receives the displacement and rotation values
calculated by the MER and rotation calculators. Fig. 14
illustrates the functionality of the MER and rotation calculators.
1) MER calculator: The MER calculator is activated when
an application is mapped, an existing application is
parallelized/serialized, or an existing application leaves the
platform. Once activated, it operates in the background to find
a complete set of Maximal Empty Rectangles (MERs). The
MERs are calculated using the Enhanced Scan Line Algo-
rithm [25]. After isolating all the MERs, the MER calculator
arranges them in order of decreasing area (where area is the
product of the MER's rows and columns). Finally, the reference
address (the bottom-left element's address in our case), the number
of rows MERrows, and the number of columns MERcols
are sent to the rotation calculator. Since the MER calculator
operates in the background, we have opted to implement it in
software, as a separate thread on the LEON3 processor.
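The MER calculator's output can be illustrated with a brute-force stand-in. This sketch is not the Enhanced Scan Line Algorithm of [25]; it only shows the intended result: all maximal empty rectangles of an occupancy grid, sorted by decreasing area as the MER calculator does.

```python
# Brute-force MER enumeration (illustrative stand-in for the Enhanced Scan
# Line Algorithm [25]): collect every all-empty rectangle, then keep only
# those not strictly contained in another, sorted by decreasing area.

def find_mers(grid):
    """grid[r][c] is True when the cell is occupied. Returns a list of
    (row, col, rows, cols) maximal empty rectangles, largest area first."""
    nrows, ncols = len(grid), len(grid[0])
    empties = []
    for r in range(nrows):
        for c in range(ncols):
            for h in range(1, nrows - r + 1):
                for w in range(1, ncols - c + 1):
                    if all(not grid[rr][cc]
                           for rr in range(r, r + h)
                           for cc in range(c, c + w)):
                        empties.append((r, c, h, w))

    def contains(a, b):  # does rectangle a contain a different rectangle b?
        return (a != b and a[0] <= b[0] and a[1] <= b[1]
                and a[0] + a[2] >= b[0] + b[2] and a[1] + a[3] >= b[1] + b[3])

    mers = [e for e in empties if not any(contains(a, e) for a in empties)]
    return sorted(mers, key=lambda m: m[2] * m[3], reverse=True)

# 3x3 grid with column 0, rows 0-1 occupied: MERs are the 3x2 block on the
# right and the 1x3 bottom row.
grid = [[True, False, False],
        [True, False, False],
        [False, False, False]]
print(find_mers(grid))  # [(0, 1, 3, 2), (2, 0, 1, 3)]
```

The real calculator runs as a background software thread on LEON3, so an O(n^4)-style search like this is only acceptable as an illustration of the output, not of the algorithm.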
2) Rotation calculator: The rotation calculator is activated
when an application requests platform resources (i.e. a new
application arrives or an application is parallelized) or the MER
calculator sends an update-MER request. If the required re-
sources are not free, the rotation calculator determines the
number of rows Approws and columns Appcols required by
the application. It then attempts to find a MER with
MERrows ≥ Approws and MERcols ≥ Appcols. If one of
the MERs satisfies this condition, the MER location (X, Y) is
sent to the shifter with Rot = 0. If none of the MERs meets
this condition, the rotation calculator tries to find a MER such
that MERrows ≥ Appcols and MERcols ≥ Approws. If one
of the MERs satisfies this condition, the MER location (X, Y)
is sent to the shifter with Rot = 1. If the rotation calculator
is unable to find a MER with the required rows and columns,
it informs the RTM.
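The decision procedure above amounts to two scans over the MER list (which the MER calculator already delivers sorted by decreasing area); a minimal sketch, with the MER tuple layout assumed:

```python
# Sketch of the rotation calculator's decision: first look for a direct fit
# (Rot = 0), then for a fit with rows and columns swapped (Rot = 1), and
# otherwise report failure to the RTM.

def place(mers, app_rows, app_cols):
    """mers: list of (x, y, mer_rows, mer_cols), largest area first.
    Returns (x, y, rot) for the chosen MER, or None if unmappable."""
    for x, y, mr, mc in mers:
        if mr >= app_rows and mc >= app_cols:
            return (x, y, 0)   # fits without rotation
    for x, y, mr, mc in mers:
        if mr >= app_cols and mc >= app_rows:
            return (x, y, 1)   # fits after a 90-degree rotation
    return None                # unmappable: the RTM is informed
```

For example, a 1x2 application offered only a 2x1 MER is placed with Rot = 1, which is exactly the App5 scenario of Fig. 1 (a).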
V. FORMAL EVALUATION OF ENHANCED UTILIZATION
The fundamental advantage of using RuRot is that it in-
creases the probability of finding a free resource, compared to
static mapping. Thereby, it potentially offers higher resource
utilization and energy/power reduction (when combined with
dynamic parallelism and DVFS) [5]. In this section we
formally evaluate the probability of accommodating an applica-
tion using static and RuRot-based application mapping.
Consider a platform plat containing n resources. A resource
can be either free or occupied. For simplicity we assume that
both states (free and occupied) are equally likely. At a given
instant, all possible combinations Comall of i free resources
2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV)
Fig. 14. Information flow between MER and rotation calculator
are given by the equation:

$Com_{all} = \sum_{i=1}^{n} \binom{n}{i}$.  (3)
Suppose that an application app requests r resources from
the platform. For traditional CGRAs (with static compile-
time binding), the total number of combinations Com_s with
sufficient resources to accommodate app is given by the equation:

$Com_{s} = \sum_{i=r}^{n} \binom{n-r}{i-r}$.  (4)
So at any instant, the probability P_s of accommodating app using
static mapping is given by:

$P_{s} = Com_{s} / Com_{all}$.  (5)
To evaluate the benefit of displacing an application, consider
that the platform contains plat_r rows and plat_c columns. If
the platform supports horizontal application displacement, the
total number of combinations that can accommodate app is given by
the equation:

$Com_{disx} = \left( \sum_{i=r}^{n} \binom{n-r}{i-r} \right) (plat_c - app_c + 1)$,  (6)

where app_c denotes the number of columns in the application
to be mapped. If the platform supports vertical application
displacement, the total number of combinations that can accommodate
app is given by the equation:

$Com_{disy} = \left( \sum_{i=r}^{n} \binom{n-r}{i-r} \right) (plat_r - app_r + 1)$,  (7)

where app_r denotes the number of rows in the application
to be mapped. If app_c ≠ app_r, a rotation by 90° yields
additional placements, and the total number of combinations that
can accommodate app is given by the equation:

$Com_{rot} = 2\,(Com_{disx} + Com_{disy})$.  (8)
TABLE II
APPLICATION AND RESOURCES

Application   Resources (Cells)   Memory (bits)
FFT1          6                   5148
FFT2          12                  6588
FFT3          18                  7998
MM1           3                   756
MM2           6                   1548
MM3           9                   2340
WLAN1         18                  7380
WLAN2         24                  8820
WLAN3         30                  10320
Therefore, the total number of combinations Com_dr with support
for RuRot is given by the equation:

$Com_{dr} = \begin{cases} Com_{disx} + Com_{disy}, & \text{if } app_c = app_r \\ Com_{rot}, & \text{otherwise} \end{cases}$  (9)

So the probability of finding a free space using RuRot is given
by:

$P_{RuRot} = Com_{dr} / Com_{all}$.  (10)
Equations 5, 6, 7, 8, and 10 clearly indicate that, by
allowing an application to be rotated and displaced, the probability
of finding a free space increases significantly.
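The probabilities above can be checked numerically. The sketch below (hypothetical helper names, using Python's exact integer arithmetic) follows the paper's counting from Eqs. (3)-(10), with the factor of two for a 90° rotation of a non-square application being our reading of Eq. (8):

```python
from math import comb

def p_static(n, r):
    """Eqs. (3)-(5): probability that the specific r resources of a
    compile-time binding are free, with every free/occupied pattern
    of the n resources assumed equally likely."""
    com_all = sum(comb(n, i) for i in range(1, n + 1))          # Eq. (3)
    com_s = sum(comb(n - r, i - r) for i in range(r, n + 1))    # Eq. (4)
    return com_s / com_all                                      # Eq. (5)

def p_rurot(plat_r, plat_c, app_r, app_c):
    """Eqs. (6)-(10): probability of finding a placement when the
    app_r x app_c application may be displaced (and, if non-square,
    rotated by 90 degrees) on a plat_r x plat_c platform."""
    n, r = plat_r * plat_c, app_r * app_c
    com_all = sum(comb(n, i) for i in range(1, n + 1))
    com_s = sum(comb(n - r, i - r) for i in range(r, n + 1))
    com_disx = com_s * (plat_c - app_c + 1)                 # Eq. (6)
    com_disy = com_s * (plat_r - app_r + 1)                 # Eq. (7)
    if app_c == app_r:
        com_dr = com_disx + com_disy        # rotation adds nothing
    else:
        com_dr = 2 * (com_disx + com_disy)  # rotation doubles the options
    return com_dr / com_all                 # Eq. (10)
```

For a 2 x 3 application on a 4 x 4 platform, p_rurot(4, 4, 2, 3) ≈ 0.156 versus p_static(16, 6) ≈ 0.016, a roughly tenfold improvement in this toy case.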
VI. RESULTS
The formalizations in the previous section were generic and
should be applicable to most grid-based CGRAs. However,
unlike FPGAs (which in general have a standard structure),
existing CGRAs vary greatly in architecture, and it is not
possible to provide concrete generic results applicable to all
CGRAs. Therefore, we have chosen the DRRA as a representative
platform, for the reasons already stated in Section
IV-B2. To analyze the utilization and memory savings of the
proposed approach on real applications, we mapped three
representative applications on the DRRA: (i) Fast Fourier
Transform (FFT), (ii) Matrix Multiplication (MM), and (iii)
Wireless LAN transmitter (WLAN) (see Table II). For each
application, implementations with different levels of paral-
lelism were simulated. The table shows the resources and the
configuration memory needed by each application.
A. Overhead analysis
To estimate the additional overhead incurred by the proposed
architecture, we synthesized the DRRA fabric with RuRot
in 65-nanometer technology at a 400 MHz frequency, using the
Synopsys Design Compiler. The modified architecture contains
two additional hardware components: (i) the remapper and (ii) the
rotation calculator. Table III shows that the remapper (capable
of supporting up to 32 rows) incurs negligible overhead (1.3 %
power and 0.2 % area) compared to the DRRA computation layer
(containing 14 cells). The motivation for selecting 14 cells
for the DRRA is that this is the minimum number of cells needed
to support a common CDMA application with Huawei [14]. Of course,
hosting multiple applications will require more DRRA cells,
which will further reduce the comparative hardware overhead
incurred by our approach.
The rotation calculator requires a simple comparator that
has negligible overhead. The memory overheads for the
TABLE III
AREA AND POWER COMPARISON

             RS      DRRA      Overhead (%)
Power (µW)   919     70406     1.3
Area (µm²)   2300    1199506   0.2
rotation calculator, Rot_mem, are given by $Rot_{mem} =
\lceil \log_2(cells) \rceil \times 8 \times MER$. In our experiments
we used MER = 10 and cells = 60, giving an overhead of only 480
bits.
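Reading the memory expression as Rot_mem = ⌈log2(cells)⌉ × 8 × MER (our interpretation of the formula, which reproduces the 480-bit figure quoted above), the overhead can be checked directly:

```python
from math import ceil, log2

def rot_mem_bits(cells, mer_entries, fields=8):
    """MER-table footprint of the rotation calculator: each of the
    `mer_entries` table rows stores `fields` values of
    ceil(log2(cells)) bits each. The default of 8 fields per entry is
    an assumption that reproduces the paper's 480-bit figure."""
    return ceil(log2(cells)) * fields * mer_entries

# The configuration used in the paper's experiments:
assert rot_mem_bits(cells=60, mer_entries=10) == 480
```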
B. Scalability analysis
As explained in Section IV, the size of the remapper
depends on the rows in the platform. To analyze the scalability
of the remapper, we synthesized the remapper with different
numbers of rows. Figures 15 and 16 show, respectively, the
area and the power overheads. Two distinct overhead patterns
(linear and pow 2 in the figures) were observed, depending on
the number of rows. If the number of rows in the platform is a
power of 2 (i.e. 2, 4, 8, ...), the overhead is significantly lower.
Consider, for example, that with 32 rows the silicon overhead is
almost 28 % lower than for a platform with 22 rows. The reason
for this is that the architecture shown in Section IV contains
many division and modulo operations that use the number of
rows as an operand. By using a power of 2, these values
can be computed simply by shifting bits. To conclude,
the synthesis results reveal that the proposed architecture is
scalable and that the overheads can be significantly reduced by
choosing the number of rows as a power of 2.
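The power-of-2 saving is classic strength reduction: division by the row count becomes a right shift and modulo becomes a bit mask, so no divider hardware is needed. A minimal sketch (hypothetical helper):

```python
def div_mod_pow2(x, rows):
    """Strength reduction available to the remapper when the row count
    is a power of two: division becomes a right shift and modulo
    becomes a bit mask."""
    assert rows > 0 and rows & (rows - 1) == 0, "rows must be a power of 2"
    shift = rows.bit_length() - 1
    return x >> shift, x & (rows - 1)   # equals (x // rows, x % rows)

assert div_mod_pow2(45, 8) == (45 // 8, 45 % 8)
```

With a non-power-of-2 row count such as 22, no such reduction exists, and the synthesizer must emit full divider/modulo logic, which explains the two overhead curves in Figures 15 and 16.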
Fig. 15. Area overheads for different row sizes
Fig. 16. Power overheads for different row sizes
C. Utilization enhancement
To demonstrate the benefits of our scheme, we mapped the
applications shown in Table II on a DRRA containing 60 cells.
TABLE IV
SUCCESSFUL MAPPINGS: STATIC MAPPING VS RUROT

Parallel apps   RuRot        Static       Improvement
allowed         (mappings)   (mappings)   (%)
3               935          423          54
6               318          129          59
9               125          63           49
12              40           35           12
15              32           13           59
18              20           10           53
The entry time and the lifetime of each application were chosen
randomly using an online random number generator. In total,
1000 application instances with random lifetimes were
generated. Figure 17 shows the mapping requests
that were handled successfully by the platform. As expected,
the figure reveals that the number of successful mappings, regardless
of the mapping technique (static or RuRot), decreases as
the maximum number of allowable parallel (simultaneous) applications
increases. The figure clearly shows that RuRot achieves more
successful mappings than the static application mapper used
in [6], [5].
Fig. 17. Successful mappings using static mapping and RuRot
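The experiment can be approximated by a heavily simplified sketch: the MER/rotation placement machinery is abstracted into a pluggable acceptance check, and the application sizes come from Table II, so the absolute numbers are purely illustrative and all names are hypothetical:

```python
import random

def simulate(placement_ok, n_requests=1000, seed=0):
    """Generate application instances with random entry times and
    lifetimes, and count how many mapping requests succeed. The real
    experiment decides success with the MER/rotation machinery; here
    the decision is delegated to `placement_ok(active, cells)`."""
    rng = random.Random(seed)
    sizes = [6, 12, 18, 3, 6, 9, 18, 24, 30]   # cells per app (Table II)
    active, accepted, t = [], 0, 0             # active = [(end_time, cells)]
    for _ in range(n_requests):
        t += rng.randint(1, 10)                              # random entry time
        active = [(end, c) for end, c in active if end > t]  # retire finished apps
        cells = rng.choice(sizes)
        if placement_ok(active, cells):
            active.append((t + rng.randint(10, 100), cells))  # random lifetime
            accepted += 1
    return accepted

# Crude stand-in for a mapper: accept while free cells remain on a
# 60-cell DRRA (ignores fragmentation, so it upper-bounds both mappers).
free_ok = lambda active, cells: sum(c for _, c in active) + cells <= 60
```

Plugging in a fragmentation-aware static check versus a rotation-aware check would reproduce the qualitative gap between the two curves of Figure 17.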
Table IV shows the improvements of the RuRot-based
approach. It can be seen that RuRot accepts approximately
50 % more applications than the static mapper. An
anomaly can be seen when 12 parallel applications are
allowed: in this case, RuRot allowed only 12 % more mappings
than static mapping. After investigation, we found
that in this case WLAN3 (with 30 resources) and FFT3
were successfully mapped to the platform and remained
active during most of the simulation. Therefore, most of
the remaining mapping requests were declined by both
mapping algorithms.
To achieve the same device utilization, we evaluated the
configuration memory needed by existing state-of-the-art remap-
pers [6], [5], [7]. Fig. 18 shows the memory requirements
of the memory-based and RuRot-based remappers. It
clearly shows that the RuRot-based remapper not only needs less
configuration memory but is also independent of the number of
cells in the DRRA. These massive savings come from the hardware-
implemented remapper described in Section IV-B.
VII. CONCLUSION
In this paper, we have presented a design framework called
Run-time Rotatable-expandable Partitions (RuRot), to enhance
Fig. 18. Memory requirements for pre-calculated mappings vs RuRot
the optimization potential of dynamic remappers. The pro-
posed architecture allows an application to be repositioned and
expanded/contracted. Compared to the state of the art, RuRot supports
repositioning an application in four directions: (i) horizontal,
(ii) vertical, (iii) clockwise, and (iv) anti-clockwise. In addi-
tion, it also allows an application to be serialized and/or parallelized
horizontally and vertically. The obtained results suggest that
the added flexibility increases the device utilization signifi-
cantly (on average 50 % for the tested applications). Synthesis
results confirm a negligible penalty (0.2 % area and 1.3 %
power) compared to the fabric. In future work, we plan
to investigate deadlocks and congestion resulting from
repositioning. In addition, we will also combine RuRot with
configuration defragmentation to further enhance the device
utilization.
ACKNOWLEDGMENT
This work was supported by the Nokia Foundation, the Higher Ed-
ucation Commission of Pakistan, and VINNOVA (Swedish Agency
for Innovation Systems) within the CUBRIC project.
REFERENCES
[1] M. A. Shami, “Dynamically reconfigurable resource array,” Ph.D.
dissertation, Royal Institute of Technology (KTH), Stockholm, Sweden,
2012. [Online]. Available: web.it.kth.se/hemani/Athesis15.pdf
[2] D. Alnajjar, H. Konoura, Y. Ko, Y. Mitsuyama, M. Hashimoto, and
T. Onoye, “Implementing flexible reliability in a coarse-grained reconfig-
urable architecture,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. PP, no. 99, pp. 1–1, 2012.
[3] W. Quan and A. D. Pimentel, “A scenario-based run-time task
mapping algorithm for mpsocs,” in Proceedings of the 50th
Annual Design Automation Conference, ser. DAC ’13. New York,
NY, USA: ACM, 2013, pp. 131:1–131:6. [Online]. Available:
http://doi.acm.org/10.1145/2463209.2488895
[4] J. Huang, A. Raabe, C. Buckl, and A. Knoll, “A workflow for runtime
adaptive task allocation on heterogeneous mpsocs,” in Design, Automa-
tion Test in Europe Conference Exhibition (DATE), 2011, March 2011,
pp. 1–6.
[5] S. M. A. H. Jafri, O. Ozbak, A. Hemani, N. Farahini, K. Paul,
J. Plosila, and H. Tenhunen, “Energy-aware CGRAs using dynamically
reconfigurable isolation cells.” in Proc. International symposium for
quality and design (ISQED), 2013, pp. 104–111.
[6] S. Jafri, M. A. Tajammul, A. Hemani, K. Paul, J. Plosila, and H. Ten-
hunen, “Energy-aware task parallelism for efficient dynamic voltage
and frequency scaling in CGRAs,” in Embedded Computer Systems: Ar-
chitectures, Modeling, and Simulation (SAMOS XIII), 2013 International
Conference on, 2013, pp. 104–112.
[7] N. Abbas and Z. MA, “Run-time parallelization switching for resource
optimization on an MPSoC platform,” Springer design and Automation
of embedded systems, 2014.
[8] Y. Kim, J. Lee, A. Shrivastava, J. Yoon, and Y. Paek, “Memory-aware
application mapping on coarse-grained reconfigurable arrays,” in High
Performance Embedded Architectures and Compilers, ser. Lecture
Notes in Computer Science, Y. Patt, P. Foglia, E. Duesterwald,
P. Faraboschi, and X. Martorell, Eds. Springer Berlin Heidelberg,
2010, vol. 5952, pp. 171–185. [Online]. Available: http://dx.doi.org/10.
1007/978-3-642-11515-8_14
[9] M. Santambrogio, D. Pnevmatikatos, K. Papadimitriou, C. Pilato,
G. Gaydadjiev, D. Stroobandt, T. Davidson, T. Becker, T. Todman,
W. Luk, A. Bonetto, A. Cazzaniga, G. Durelli, and D. Sciuto, “Smart
technologies for effective reconfiguration: The faster approach,” in Re-
configurable Communication-centric Systems-on-Chip (ReCoSoC), 2012
7th International Workshop on, July 2012, pp. 1–7.
[10] S. Jafri, A. Hemani, K. Paul, J. Plosila, and H. Tenhunen, “Compact
generic intermediate representation (CGIR) to enable late binding in
coarse grained reconfigurable architectures,” in Proc. International Con-
ference on Field-Programmable Technology (FPT),, Dec. 2011, pp. 1 –6.
[11] W. Böhm, J. Hammes, B. Draper, M. Chawathe, C. Ross, R. Rinker,
and W. Najjar, “Mapping a single assignment programming language
to reconfigurable systems,” J. Supercomput., vol. 21, no. 2, pp.
117–130, Feb. 2002. [Online]. Available: http://dx.doi.org/10.1023/A:
1013623303037
[12] H. Park, K. Fan, S. A. Mahlke, T. Oh, H. Kim, and H.-s. Kim,
“Edge-centric modulo scheduling for coarse-grained reconfigurable
architectures,” in Proceedings of the 17th International Conference
on Parallel Architectures and Compilation Techniques, ser. PACT ’08.
New York, NY, USA: ACM, 2008, pp. 166–176. [Online]. Available:
http://doi.acm.org/10.1145/1454115.1454140
[13] M. Hamzeh, A. Shrivastava, and S. Vrudhula, “Epimap: Using
epimorphism to map applications on cgras,” in Proceedings of the
49th Annual Design Automation Conference, ser. DAC ’12. New
York, NY, USA: ACM, 2012, pp. 1284–1291. [Online]. Available:
http://doi.acm.org/10.1145/2228360.2228600
[14] N. Farahini, S. Li, M. A. Tajammul, M. A. Shami, G. Chen, A. Hemani,
and W. Ye, “39.9 GOPs/Watt multi-mode CGRA accelerator for a multi-
standard base station,” in Proc. IEEE Int. Symp. Circuits and Systems
(ISCAS), 2013.
[15] O. Dragomir, T. Stefanov, and K. Bertels, “Loop unrolling and shifting
for reconfigurable architectures,” in Field Programmable Logic and
Applications, 2008. FPL 2008. International Conference on, Sept 2008,
pp. 167–172.
[16] S. Yin, C. Yin, L. Liu, M. Zhu, and S. Wei, “Configuration context
reduction for coarse-grained reconfigurable architecture.” IEICE TRANS-
ACTIONS on Information and Systems, vol. 95, no. 2, pp. 335–344,
2012.
[17] C.-L. Chou and R. Marculescu, “User-aware dynamic task allocation
in networks-on-chip,” in Design, Automation and Test in Europe, 2008.
DATE ’08 , March 2008, pp. 1232–1237.
[18] M. A. Al Faruque, R. Krist, and J. Henkel, “Adam: Run-time agent-
based distributed application mapping for on-chip communication,” in
Proceedings of the 45th Annual Design Automation Conference, ser.
DAC ’08. New York, NY, USA: ACM, 2008, pp. 760–765. [Online].
Available: http://doi.acm.org/10.1145/1391469.1391664
[19] P. Holzenspies, J. Hurink, J. Kuper, and G. J. M. Smit, “Run-time spatial
mapping of streaming applications to a heterogeneous multi-processor
system-on-chip (mpsoc),” in Design, Automation and Test in Europe,
2008. DATE ’08, March 2008, pp. 212–217.
[20] E. Carvalho and F. Moraes, “Congestion-aware task mapping in hetero-
geneous mpsocs,” in System-on-Chip, 2008. SOC 2008. International
Symposium on, Nov 2008, pp. 1–4.
[21] L. Schor, I. Bacivarov, D. Rai, H. Yang, S.-H. Kang, and L. Thiele,
“Scenario-based design flow for mapping streaming applications onto
on-chip many-core systems,” in Proceedings of the 2012 International
Conference on Compilers, Architectures and Synthesis for Embedded
Systems, ser. CASES ’12. New York, NY, USA: ACM, 2012, pp. 71–80.
[Online]. Available: http://doi.acm.org/10.1145/2380403.2380422
[22] K. Compton, Z. Li, J. Cooley, S. Knol, and S. Hauck, “Configuration
relocation and defragmentation for run-time reconfigurable computing,”
Very Large Scale Integration (VLSI) Systems, IEEE Transactions on,
vol. 10, no. 3, pp. 209–220, 2002.
[23] M. A. Shami and A. Hemani, “Classification of massively parallel
computer architectures,” in Proc. IEEE Int. Parallel and Distributed
Processing Symposium Workshops PhD Forum (IPDPSW), May 2012,
pp. 344–351.
[24] M. A. Tajammul, S. M. A. H. Jafri, A. Hemani, J. Plosila, and H. Ten-
hunen, “Private configuration environments for efficient configuration
in CGRAs,” in Proc. Application Specific Systems Architectures and
Processors (ASAP), Washington, D.C., USA, 5–7 June 2013.
[25] J. Cui, Q. Deng, X. He, and Z. Gu, “An efficient algorithm for online
management of 2d area of partially reconfigurable fpgas,” in Design,
Automation Test in Europe Conference Exhibition, 2007. DATE ’07,
April 2007, pp. 1–6.