REDEFINE: Runtime Reconfigurable
Polymorphic ASIC
Mythri Alle, Keshavan Varadarajan, Alexander Fell, Ramesh Reddy C, Nimmy
Joseph, Saptarsi Das, Prasenjit Biswas, Jugantor Chetia, Adarsh Rao, S K Nandy
CAD Lab, SERC, Indian Institute of Science, Bangalore
{mythri, keshavan, alefel, crreddy, jnimmy, sdas, prasenjit, jugantor, adarsh,
nandy}@cadl.iisc.ernet.in
and
Ranjani Narayan
Morphing Machines, Bangalore, India
ranjani.narayan@morphingmachines.com
Emerging embedded applications are based on evolving standards (e.g., MPEG2/4, H.264/265, IEEE 802.11a/b/g/n). Since most of these applications run on handheld devices, there is an increasing need for a single chip solution that can dynamically interoperate between different standards and their derivatives. In order to achieve high resource utilization and low power dissipation, we propose REDEFINE, a Polymorphic ASIC in which specialized hardware units are replaced with basic hardware units that can create the same functionality by runtime re-composition. It is a “future-proof” custom hardware solution for multiple applications and their derivatives in a domain. In this paper, we describe a compiler framework and supporting hardware comprising compute, storage and communication resources.
Applications described in a High Level Language (for example, C) are compiled into application substructures. For each application substructure, a set of Compute Elements (CEs) on the hardware is interconnected at runtime to form a pattern that closely matches the communication pattern of that particular substructure. The advantage is that the CEs so bound are neither processor cores nor logic elements as in FPGAs. Hence REDEFINE offers the power and performance advantage of an ASIC along with the hardware reconfigurability and programmability of an FPGA or an Instruction Set Processor.
In addition, the hardware supports Custom Instruction pipelining. Existing instruction-set extensible processors determine, at design time, sequences of instructions that occur repeatedly within the application, and create Custom Instructions to speed up the execution of these sequences. We extend this scheme further: a kernel is compiled into Custom Instructions that bear a strong producer-consumer relationship (and are not limited to frequently occurring sequences of instructions). Custom Instructions, realized as hardware compositions effected at runtime, allow several instances of the same Custom Instruction to be active in parallel. A key distinguishing factor of the majority of emerging embedded applications is stream processing. To reduce the overhead of data transfer between Custom Instructions, direct communication paths are employed among them.
In this paper, we present an overview of the hardware-aware compiler framework, which determines the NoC-aware schedule of transports of the data exchanged between the Custom Instructions on the interconnect. The results for the FFT kernel indicate a 25% reduction in the number of loads/stores, and throughput improves by log(n) for an n-point FFT when compared to a sequential implementation. Overall, REDEFINE offers flexibility and runtime reconfigurability at the expense of 1.16× in power and 8× in area when compared to an ASIC. The REDEFINE implementation consumes 0.1× the power of an FPGA implementation, and the configuration overhead of the FPGA implementation is 1000× more than that of REDEFINE.
Categories and Subject Descriptors: C.1.3 [Computer Systems Organization]: Processor Architectures—Other
Architecture Styles
General Terms: Design
Additional Key Words and Phrases: Polymorphic ASIC, Runtime Reconfiguration, Honeycomb,
NoC, Router, Dataflow Software pipeline, Custom Instruction Extension, Application Synthesis
1. INTRODUCTION
Mobile handsets are emerging as the unified embedded platform supporting a plethora of applications such as communications, multimedia and image processing. Such a vast range of applications requires flexible computing platforms that meet the different needs of each application and its derivatives. General purpose processing platforms are good candidates for the flexibility they offer, but they cannot meet the critical performance, throughput and power criteria; unlike traditional desktop devices, embedded platforms have stringent requirements on power and performance. On the other hand, customized processors and ASICs do not scale to accommodate new application features or application derivatives. The emergence of embedded devices has created a need for a new generation of vertical processors which are necessarily “domain specific”, to meet the performance and power requirements, yet sufficiently “general purpose” to handle such a large number of applications.
The embedded industry has successfully used the paradigm of the custom co-processor to meet the power-performance goals of these devices. These co-processors are designed to accelerate specific instruction sequences which occur frequently in the application under consideration. Such solutions are offered by several companies: Xtensa from Tensilica Inc. [Tensilica Inc. 2007], Triton Builder from Poseidon Inc. [Poseidon Design Systems Inc. 2007], LISA from CoWare [CoWare Inc. 2007], Pico from Synfora Inc. [Synfora Inc. 2007], etc. However, such solutions are useful only for accelerating individual applications. In order to provide accelerators for an entire application domain, we necessarily need to support the following:
(1) Domain specific Compute Elements (CEs) and communication structures: Applications that contain the same kernels exhibit similar computation and communication patterns. This similarity allows us to determine suitable CEs and interconnect characteristics for an application domain well in advance.
(2) Dynamic reconfigurability: The accelerators need to configure themselves at runtime to cater to the performance requirements of the particular application or its derivatives (in the domain).
ASIC vs Reconfigurable Hardware: It is well known that ASIC solutions win over other configurable/reconfigurable solutions with respect to power and performance. ASICs play a role in the solution space where high NRE costs are recovered through high-volume sales. A primary reason that ASICs are not the universal solution is that ASICs, as the name denotes, are application specific custom hardware solutions. In an ever changing world of varying market demands and multiple variants of applications catering to different customer needs, spinning an ASIC for every application is prohibitively expensive. Another dimension of user demand is real-time or close to real-time response without compromising performance. An example is decoding over a variety of standards, viz.,
MPEG2, MPEG4 and H.264. This has led to the evolution of flexible ASIC architectures. Runtime reconfigurability is the ideal solution to meet the above-mentioned requirements; however, the overheads associated with it pose a challenge that needs to be addressed. Below we present abstract models of an ASIC and of a runtime reconfigurable hardware fabric, before proceeding to give an overview of our proposal. These models help us describe our proposal more formally.

Fig. 1. Abstract Model of an ASIC.
An ASIC can be thought of as a binary relation R : F → F, where F is the set of macro functional blocks used in the ASIC. A macro functional block executes a specific and pre-designed task, whereas the binary relation R captures the communication requirements between the various macro functional blocks of the ASIC. Hence R determines the static interconnect between the macro functional blocks (figure 1). If the fixed interconnect described by the relation R is replaced by a network topology that is the cross product F × F, then the abstraction can support a more flexible interconnection among the
macro functional blocks, which aids in supporting some application derivatives. Any general purpose network provides communication between all ordered pairs of macro functional blocks, at the cost of increased communication latencies. Replacing only the interconnect and not the macro functional blocks limits the kinds of application derivatives that can be emulated on a platform using subsets of the existing macro functional blocks. For example, the macro functional blocks used for an H.264 video decoder can be used to generate an MPEG2 decoder, since the difference is primarily the absence of the Deblocking filter in the MPEG2 decoder. However, the reverse is not true, i.e., the macro functional blocks in an MPEG2 decoder cannot be used to construct the Deblocking filter for an H.264 decoder. In order to emulate a wider range of applications and their derivatives, the macro functional blocks need to be replaced by elementary CEs that can be aggregated to behave as the desired macro functional blocks. Figure 2 is a representation of such a reconfigurable hardware fabric.

Fig. 2. Abstract Model of a Reconfigurable Hardware Fabric. Macro functional blocks composed from elementary Compute Elements are indicated by dashed contours.
To obtain a desired functionality from the reconfigurable hardware fabric, a binary relation R′ : G → G (where G is the set of all elementary CEs g1, g2, ..., gm) has to be defined. R′ captures the interconnect required to obtain the desired macro function from elementary CEs and the interconnection required across the “composed” macro functional blocks (given by relation R). The set of all R′ represents all possible distinct interconnections that can be realized on the interconnect. Stated in a different way, R′ ⊆ G × G, i.e., R′ relates ordered pairs of elementary CEs drawn from G.
Consequently, an application that is to be executed on this reconfigurable fabric needs two kinds of Metadata. The selection of the specific relation representing the communication requirement of the application that is to be realized on the interconnect is determined by the Transport Metadata. Additional Metadata is needed to aggregate the CEs to perform a specific macro function. We call this the Compute Metadata. The Compute and Transport
Metadata together are called a Configuration. The process of achieving the functionality of an ASIC on a reconfigurable hardware fabric involves loading the Configuration. In figure 2, g1, g2, ..., gm refer to the elementary CEs aggregated to emulate the macro functional blocks f1, f2, ..., fn. Chip Multi Processors (CMPs) and Multi-core SoCs are possible realizations of such a reconfigurable hardware fabric. These solutions offer different power-performance tradeoffs at a lower cost. However, the limitations of these solutions, both in terms of interconnections and programmability, are an impediment to their interoperability between applications and their derivatives. Compute elements in these realizations are limited to processor cores/co-processors. In order to have a solution closer to an ASIC implementation, we need compute elements at the granularity of ALUs/FUs, which are composed to form macro functional blocks.
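Summarizing the model in one line (in our own notation; μ is a name we introduce here for the opcode assignment and is not used elsewhere in the paper):

    Configuration = ( μ : G → Opcodes ,  R′ ⊆ G × G )

where the first component is the Compute Metadata and the second is the Transport Metadata.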
Polymorphic ASIC: To achieve reconfigurability at runtime, it is therefore necessary to determine Compute Metadata and Transport Metadata to realize the desired macro functional blocks on demand, for which additional support hardware is needed. It is also required that the overhead of reconfiguration is kept small. In this paper we propose a runtime reconfigurable architecture that can serve as a vertical processor. We refer to this as a Polymorphic ASIC. Composition of macro functional blocks using the CEs (as indicated by the Compute Metadata), and interconnection among them achieved through an NoC (as indicated by the Transport Metadata), in space and time on the reconfigurable hardware fabric, is the distinguishing characteristic of the Polymorphic ASIC.
Overview: We propose REDEFINE - the Polymorphic ASIC, in which reconfiguration
is achieved through the synthesis of hardware execution structures at runtime based on the
Compute Metadata and Transport Metadata, determined through a compilation process.
The compiler identifies coarse grained operations for which hardware execution structures
are synthesized. The coarse grained operations are called Hyper Operations (HyperOps). A block schematic of REDEFINE is presented in figure 3; details appear in section 2. The compilation techniques to transform applications into HyperOps are discussed in section 3.

Fig. 3. Schematic diagram of REDEFINE.
The execution engine of REDEFINE (referred to as the Reconfigurable Hardware Fabric in figure 3) comprises a matrix of tiles as shown in figure 4. Each tile comprises a CE, which is either an ALU or an FU (depending on the granularity of operations that constitute HyperOps), and a router that connects the CEs. The set of routers together serves as the Network on Chip (NoC). With these choices, the Compute Metadata is a set of opcodes (akin to general purpose processor opcodes) and the Transport Metadata is a location on the fabric represented as the (x, y)-coordinates of the target CE.
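As a purely illustrative reading of this choice, one configured operation could be described as below. The struct and field names are our assumptions, not the actual REDEFINE encoding (which is detailed in section 4).

    #include <cstdint>

    // Illustrative only: one configured operation on the fabric pairs an
    // opcode (Compute Metadata) with the coordinates of the CE that consumes
    // its result (Transport Metadata). Names and widths are assumptions.
    struct ComputeMetadata {
        uint8_t opcode;          // which ALU/FU operation to perform
    };

    struct TransportMetadata {
        int8_t x, y;             // relative (x, y) coordinates of the target CE
    };

    struct ConfiguredOp {
        ComputeMetadata compute;
        TransportMetadata dest;  // where the result must be transported
    };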
The proposed approach/methodology to realize applications on REDEFINE relies on a strong interplay between the microarchitecture and the compiler. The basic unit assumed for compilation (which is executed atomically) is a HyperOp, which is a subgraph of the application's dataflow graph. We refer to a set of HyperOps as a Custom Instruction. As will be elaborated in section 3, selected HyperOps are aggregated into Custom Instructions and are synthesized on REDEFINE at runtime. Further, support for stream processing requires the compiler to be throughput aware. Compiled code for such applications contains Transport and Compute Metadata which are used to synthesize ASIC-like control and data paths on the hardware fabric at runtime. This is achieved by maintaining a Custom Instruction pipeline on REDEFINE. Details of compiler and NoC enhancements for the Custom Instruction pipeline are provided in section 5.

Fig. 4. The Reconfigurable Hardware Fabric containing 64 tiles (Tile = CE + Router). Express Lanes connect the fabric to the Memory Access Unit; each packet is 141 bits wide.
Related Work: Unlike NoC-based processor arrays such as the Intel 80-tile processor [Vangal et al. 2007], we group CEs to form several macro functional blocks at runtime. Interconnections among them are also established dynamically at runtime. In [Vangal et al. 2007], [Sankaralingam et al. 2003], [Ambric Inc. 2007], [Bauer et al. 2007] and [Taylor et al. 2002], by contrast, each tile is an independent processor and there is no aggregation of tiles to form a single logical entity. ElementCXI [Element CXI 2008] and QuickSilver Technology [QuickSilver 2008] provide dynamically reconfigurable solutions. Both adopt a similar organization, in
which the FUs (in the case of ElementCXI) and processing cores (in QuickSilver's Adaptive Computing Machine) are organized in a hierarchy through matrix interconnects. Our proposal architecturally differs from them in that we use a distributed NoC to establish the desired interconnections between the CEs on demand at runtime, supported by a dynamic dataflow execution paradigm. On an FPGA, loading a configuration involves bit-by-bit programming of the multiplexers of the interconnect and programming of the truth table in each Logic Element (LE/LUT). Such a style of configuration makes it difficult to reconfigure dynamically. MathStar's Field Programmable Object Array (FPOA) [MathStar 2008] is a solution in which silicon objects can be interconnected in a manner similar to FPGAs. This enables the FPOA to support large, computationally intensive applications. However, FPOAs are not runtime reconfigurable and share limitations similar to those of FPGAs. In order to reduce the configuration overhead, we choose ALUs/FUs as opposed to Logic Elements and replace the programmable interconnect with an NoC (refer to [Joseph et al. 2008] and section 5.3).
This paper is organized as follows. Frequently used terms and abbreviations are listed
immediately following this section. Section 2 gives an overview of the various components
of REDEFINE. Section 3 discusses details of the compilation framework for REDEFINE.
Implementation details and results of our compiler framework are given in section 4. Support for Custom Instructions provided by the compiler framework and the NoC appears in section 5, in which streaming FFT is taken as a case study to quantify improvements both in
compiler and architecture. We conclude our findings in section 6.
Frequently used Terms and Abbreviations
This section introduces the abbreviations and terms used in this paper. Table I lists the abbreviations. We also introduce a few terms in this section; formal definitions of these terms appear later in the text.
Table I. Terms and Abbreviations used in text
Abbreviation Expansion
ASIC Application Specific Integrated Circuit
BDD Binary Decision Diagram [S. B. Akers 1978]
CE Compute Element
CMP Chip Multi Processor
CFG Control Flow Graph [Cytron et al. 1991]
DFG Data Flow Graph [Cytron et al. 1991]
GWMU Global Wait Match Unit (section 4)
HL HyperOp Launcher
HRM Hardware Resource Manager
HyperOp Hyper Operation (section 3)
IHDF Inter HyperOp Data Forwarder (section 4)
LLVM Low Level Virtual Machine [Chris Lattner and Vikram Adve 2004]
LWMU Local Wait Match Unit (section 4.1)
LOpOr Local Operation Orchestrator (section 4.1)
NoC Network on Chip
NRE Non Recurring Engineering Costs
NPI New Packet Indicator
p-HyperOp Partitioned HyperOp (section 3.4)
ROBDD Reduced Order Binary Decision Diagram [S. B. Akers 1978]
SSA Static Single Assignment Form [Cytron et al. 1991]
SoC System on Chip
Tile CE + Router
ViCi Virtual Circuit (section 5.3)
—HyperOp: A collection of operations; the atomic entity of execution in our proposal. HyperOps are acyclic subgraphs of the dataflow graph.
—Custom Instruction: A group of HyperOps. These are identified by the compiler and are launched simultaneously for execution. They are introduced to reduce launch and communication overheads.
—Sticky Tokens: Sticky tokens [Inagami and Foley 1989] are values which are generated once and consumed multiple times, for example loop-invariant data.
—Instance number of a HyperOp: This is used to distinguish between various dynamic
instances of the HyperOp when they are active simultaneously. These are akin to tags
used in dynamic dataflow machines.
—Predicate: Predicates are used in the dataflow graph to specify on what condition an
operation has to be executed [Mahlke et al. 1992].
—Configuration: The bit stream used to configure the proposed solution. It consists of Compute Metadata, which determines what to perform, and Transport Metadata, which determines the communication required between the operations.
ACM Journal Name, Vol. V, No. N, Month 20YY.
8·Polymorphic ASIC
—hops: The number of routers a packet traverses from source to destination.
—Steer Nodes: Steer nodes [Inagami and Foley 1989] are used in the dataflow graph to
direct data to appropriate destinations.
2. MICROARCHITECTURE
2.1 Overview
The HyperOps are synthesized on the CEs that constitute the reconfigurable hardware fabric (see figure 3). As shown in figure 4, the reconfigurable fabric comprises CEs and communication elements that are interconnected in a honeycomb topology [Joseph et al. 2008].
Each CE includes storage for operations and data, and an ALU/FU to perform computations. Operations of a HyperOp can occupy several CEs. Each CE follows the static dataflow execution paradigm [Vinod et al. 1980]. Intra-HyperOp data communication (for scalar variables) is achieved over the interconnect, i.e., no explicit intermediate storage (such as registers) is used for this data communication. Inter-HyperOp data communication is achieved through the Inter-HyperOp Data Forwarder (see figure 3); these communications have a longer latency than communication over the interconnect. A Hardware Resource Manager (see figure 3) is responsible for scheduling HyperOps for execution. A HyperOp is ready for execution when all its input operands are available; the input operands are the operands of those operations that have no source within the HyperOp. The Global Wait Match Unit (GWMU) holds the HyperOps waiting for their inputs.
When several HyperOps are grouped together to form a Custom Instruction, and the Custom Instruction is launched for execution on the fabric, communication between the HyperOps is accomplished through Virtual Circuits on the interconnect between the Compute Elements. Special support for establishing virtual circuits is provided by the routers; this is explained in further detail in section 5.3. The compiler identifies producer-consumer relationships among the HyperOps constituting a Custom Instruction and generates the necessary Transport Metadata to specify the inter-HyperOp communication in a Custom Instruction. If the desired communication bandwidth is not available, or resources for synthesizing all HyperOps of a Custom Instruction are not available on the fabric, then the HyperOps are executed in a data driven manner.
The different components of REDEFINE are detailed in the following sections.
2.2 Reconfigurable Hardware Fabric
The reconfigurable hardware fabric is an interconnection of tiles. Each tile includes a CE and a router. The operations and data of the HyperOps are transferred to the CEs via Express Lanes (refer to figure 4). The Express Lanes are connected to the router ports along the periphery. An appropriate Express Lane is chosen based on the destination. This ensures that no tile is more than a certain number of hops away from the Memory Access Unit; for a 64-tile fabric, a tile is at most 4 hops away. In a CE the execution of operations proceeds in a data driven manner: an operation whose data is available executes and provides data for its consumer operations. An operation is invalidated as soon as its execution completes. Once all the operations loaded onto a CE complete execution, the tile declares its free state so that other HyperOps can be scheduled.
Fig. 5. The figure shows two different ways of organizing a multi-banked memory. (a) shows an H-tree based
organization. This provides uniform access delay across all banks but has a higher area overhead. (b) shows
a “flower-pot” based organization. This organization has lower area overhead but different banks experience
different delays. Combinations of these two organizations can also be employed.
2.3 Memory Access Unit
The architecture (see figure 3) includes an Operation Store and a Data Store. The Operation Store contains the Compute and Transport Metadata of all the HyperOps. The Data Store holds all the data elements accessed from memory, including external inputs and vector variables (for example, array variables). As seen in figure 4, the Memory Access Unit has 22 access points for a fabric comprising 64 tiles. The memory can therefore be organized to support a maximum of 22 parallel accesses. To enable parallel accesses, we multi-bank the Operation and Data Stores. (We employ multi-banking as opposed to multi-porting due to the power advantages of multi-banking.) At any particular instant, the number of operations that are accessed in parallel (for them to be loaded to the CEs) and the number of data elements that can be accessed in parallel influence the organization of these memory banks. In addition, the area occupied and the power expended determine the viability of an organization. Two such organizations are shown in figure 5. These choices are highly application/domain dependent.
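For illustration only: one common way to let unit-stride parallel accesses proceed conflict-free is low-order interleaving, sketched below. The bank count and the interleaving scheme are assumptions made for this sketch; the paper deliberately leaves the organization application/domain dependent.

    #include <cstdint>

    // Low-order interleaved banking (a sketch, not the REDEFINE layout):
    // consecutive words land in consecutive banks, so a burst of parallel
    // accesses with unit stride hits distinct banks.
    constexpr uint32_t kNumBanks = 16;  // assumed power of two for cheap indexing

    struct BankedAddress {
        uint32_t bank;    // which bank services the access
        uint32_t offset;  // word offset within that bank
    };

    BankedAddress map_address(uint32_t word_addr) {
        return { word_addr % kNumBanks, word_addr / kNumBanks };
    }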
2.4 Hardware Resource Manager
The task of the Hardware Resource Manager (HRM) is to store the inputs of the HyperOps and to provide CEs on the fabric when a HyperOp is ready to be launched. A HyperOp is ready to be launched when all of its inputs are available. The HRM comprises two modules, viz. the Global Wait Match Unit (GWMU) and the HyperOp Selection Logic. The GWMU contains the HyperOps for which only partial inputs are available, and constantly checks for the availability of all inputs of a HyperOp to mark it ready for launching. Arbitration among multiple ready HyperOps is done by the HyperOp Selection Logic using a prioritized scheduling algorithm, described as algorithm 4 in section 4. The selected HyperOp is launched on the reconfigurable fabric.
2.5 Inter HyperOp Data Forwarder
Inter HyperOp Data Forwarder (IHDF) is responsible for transporting results produced by
the HyperOps on the fabric to the destination HyperOps that are not resident on the fabric.
Details of the IHDF can be found in section 4.3.
3. COMPILATION FRAMEWORK
In this section we describe the process of compiling applications onto REDEFINE. The input to the compiler is an application described in the C language; our compiler is ANSI C compliant. Before we describe the compilation framework used to identify HyperOps, we list below the microarchitectural features of REDEFINE exposed to the compiler.
(1) Communication between any two operations in a HyperOp executing on the hardware is accomplished through the interconnect for scalar variables and through memory for vector variables. (There is no central register file visible to the compiler. The use of the interconnect enables direct communication of the result and avoids the overhead of accessing a register file for a read or write.)
(2) The interconnect follows a Honeycomb topology. Details of this topology are provided
in [Satrawala et al. 2007].
(3) All CEs are homogeneous. Each CE is capable of executing a set of arithmetic, logic, compare and memory access operations. Apart from these, a few special operations are used to transfer data directly to other CEs.
(4) In-order delivery of data is guaranteed between each pair of communicating HyperOps that constitute a Custom Instruction.
The compilation process is divided into various phases:
—Phase I - Formation of DFG: Our framework follows a data driven execution paradigm. The first phase transforms the application into a dataflow graph (DFG) and performs several optimizations to reduce the overhead of data transfer.
—Phase II - HyperOp formation: The basic entity in our paradigm is a HyperOp. This
phase divides the application into several HyperOps.
—Phase III - Tag generation: In our execution paradigm multiple HyperOp instances can be active on the fabric simultaneously. To distinguish these instances, tags (similar to tags in dynamic dataflow [Vinod et al. 1980]) are generated at runtime by the hardware. The information required for generating the tags is identified in this phase. To reduce the overhead of tag generation, we generate tags only for the inputs and outputs of a HyperOp; the data tokens within a HyperOp do not carry a tag.
—Phase IV - Mapping HyperOps: This phase of compilation is aware of the intercon-
nect topology between the tiles of the reconfigurable fabric. The process of Metadata
generation involves identifying HyperOp partitions called p-HyperOps, such that all op-
erations in a p-HyperOp can be assigned to a single CE. These p-HyperOps are mapped
onto multiple CEs in the reconfigurable fabric based on communication patterns between
them.
—Phase V - Formation of Custom Instructions: This step identifies HyperOps that can
be aggregated into Custom Instructions. Custom Instructions are necessary to reduce
the overhead of inter HyperOp communication. Unlike HyperOps, Custom Instructions
ACM Journal Name, Vol. V, No. N, Month 20YY.
REDEFINE ·11
need not be acyclic. We assume special hardware support to execute a Custom Instruc-
tion (as explained in section 5).
Each phase is detailed in the following sections.
3.1 Phase I: Formation of DFG
The first phase of the compilation is to generate a DFG from the C application specification. As a first step, we convert the C application into SSA form using the LLVM [Chris Lattner and Vikram Adve 2004] infrastructure, which compiles the application into a virtual instruction set. Most of the operations are arithmetic operations and can be converted directly into dataflow nodes. φ instructions and branches require special attention: φ instructions are handled using predicated steer nodes and branches are handled using predicate edges. Memory dependencies are also handled using predicates (refer to [Alle et al. 2008] for more details). A naive approach to adding predicates results in a large overhead, especially when the control flow is irregular. As mentioned in [Hicks et al. 1993] and [Petersen et al. 2006], the overhead can be as high as the computation itself. We perform the following optimizations to reduce this overhead:
3.1.1 Predicate Hoisting. According to imperative language semantics, an instruction is executed only when control reaches that instruction. The data for the instruction might be produced irrespective of whether control reaches the instruction or not. Therefore, while building a DFG these control dependencies have to be marked explicitly by appropriately predicating the operation. For any operation O in a basic block B, the predicate is the boolean AND of the predicates of all the basic blocks on the path from the entry basic block to the basic block B. However, the execution of the immediate parent of the basic block B ensures that all the other predicates are True. Hence we add predicates only from the immediate parent. In the case where there is more than one immediate parent (i.e., the basic block has more than one in-edge), the predicate added is a boolean OR of the predicates of all the immediate parents of the basic block B. Program semantics ensure that only one of the predecessors will be active: the basic block B executes when one of its parent basic blocks executes, so when any one of the predicates is True, the basic block B will execute. If there are several predecessor basic blocks, the expression controlling the execution of operation O involves several input predicates (and therefore several predicate in-edges). To minimize the predicate in-edges, the following optimizations are performed:
—If some of the predecessor basic blocks have a common predecessor A and all paths from A terminate at B, then the predicates of the predecessors of B descending from A can be replaced with the predicate of A.
—If no such common predecessor A exists, then in order to minimize the expression for computing the predicate of O, the Control Flow Graph (CFG) of the application is transformed into a Binary Decision Diagram (BDD). This BDD is reduced to a Reduced Ordered Binary Decision Diagram (ROBDD) [S. B. Akers 1978]; the ROBDD gives the minimized expression for the predicate. A detailed explanation of this technique is presented in [Alle et al. 2008].
Figure 6 shows an example of this optimization.

Fig. 6. Example showing predicate hoisting. (a) shows a control flow graph with instructions in basic blocks. (b) shows the DFG after the control dependencies are added; n1, n2 and n3 are nodes in basic blocks B4, B5 and B3 respectively. (c) displays the DFG after predicate hoisting is performed. The ROBDD of the CFG reduces to node B1, hence only one control dependency is added.
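A minimal sketch of the immediate-parent rule follows, assuming predicates are kept as symbolic strings and that each in-edge already carries the predicate under which it is taken (both our assumptions; the actual compiler works on LLVM-derived structures and minimizes the result with an ROBDD, which is out of scope here).

    #include <string>
    #include <vector>

    // Each block records, per in-edge, the predicate under which that edge
    // is taken (the parent's predicate ANDed with its branch condition,
    // already resolved by the caller).
    struct Block {
        std::vector<std::string> in_edge_preds;
    };

    // Immediate-parent rule: the predicate of a block is the OR over its
    // in-edges only; deeper ancestors are implied by the parents' execution.
    std::string block_predicate(const Block& b) {
        if (b.in_edge_preds.empty()) return "true";  // entry block
        std::string p = b.in_edge_preds.front();
        for (std::size_t i = 1; i < b.in_edge_preds.size(); ++i)
            p += " | " + b.in_edge_preds[i];
        return p;
    }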
3.1.2 Optimizations on steer nodes. φ nodes [Cytron et al. 1991] serve as merge points: during execution an appropriate value is chosen depending on the path taken. In the DFG
this information is captured by adding steer nodes on each possible path that delivers data to the merge point. Further, we ensure that only one token is delivered, by appropriately predicating the steer node. Steer nodes are an overhead both in the number of nodes and in the delay added. To reduce this overhead we use the same steer node for multiple paths if they share the same source. To identify the minimal set of paths which require a steer node, we obtain the ROBDD for the subgraph of the CFG (the same technique mentioned earlier). Further, it is not always necessary to add a special steer node: the source node can act as a steer node if it has only one consumer, and a steer node is not required at all if it is not predicated (figure 7). A sketch of these two elimination rules appears below.
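The sketch below illustrates the two elimination rules over a toy DFG representation of our own; it is not the compiler's actual data structure.

    #include <vector>

    // Toy DFG node for illustrating steer-node elimination. An unpredicated
    // steer simply forwards its value; a predicated steer can be folded into
    // its producer when it is that producer's only consumer.
    struct Node {
        bool is_steer = false;
        bool predicated = false;
        int producer = -1;            // index of the single data source
        std::vector<int> consumers;   // indices of nodes fed by this one
    };

    // Returns true if the steer node at index s can be removed from the DFG.
    bool steer_removable(const std::vector<Node>& dfg, int s) {
        const Node& n = dfg[s];
        if (!n.is_steer) return false;
        if (!n.predicated) return true;                  // rule 1: unpredicated
        return dfg[n.producer].consumers.size() == 1;    // rule 2: sole consumer
    }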
Fig. 7. Example showing optimization of steer nodes. (a) shows a control flow graph and the instructions in its basic blocks. (b) displays the DFG after “steer” nodes are added. Since a3 can receive data from two possible control paths, two steer nodes are inserted; the dotted edges represent predicate edges. There is an unconditional branch from basic block B2 to B3, hence steer node s2 is not predicated; steer node s1 is predicated using b. (c) shows the DFG after optimizations are performed. Steer node s2 is not predicated and hence is eliminated. Steer node s1 is the only consumer of its parent a1, so steer node s1 is eliminated.
3.2 Phase II: HyperOp formation
We reproduce the definition of HyperOp (from [Alle et al. 2008]) below:
Definition. A HyperOp is a directed acyclic graph H(V′, E′). H is a vertex-induced subgraph of the DFG G(V, E), where V, V′ are the vertex sets of the DFG G and the HyperOp H respectively and E, E′ are the corresponding edge sets, such that V′ ⊆ V. For every edge (vi, vj) ∈ E such that vi, vj ∈ V′, there exists an edge (vi, vj) in E′. Given two HyperOps Hi(V′i, E′i) and Hj(V′j, E′j), for all i, j with i ≠ j, V′i ∩ V′j = {}, where V′i is the vertex set of Hi and V′j is the vertex set of Hj. A HyperOp must also satisfy the convexity condition.
Convexity condition: The HyperOp interconnection graph must be a directed acyclic graph. The HyperOp interconnection graph captures the producer-consumer relationship between the set of HyperOps. More formally, it is the graph I(H′′, E′′), where H′′ is the set of HyperOps: every HyperOp Hi is represented by a vertex VHi ∈ H′′, and for every edge (vi, vj) in the DFG with vi ∈ Hi and vj ∈ Hj, i ≠ j, there exists an edge between VHi and VHj.
HyperOps can be constructed by grouping together either operations or basic blocks.
Grouping operations: This provides finer granularity for selection and hence more flexibility in forming HyperOps. However, it is more difficult to satisfy the convexity condition required to form valid HyperOps. The problem of forming HyperOps maps to partitioning the graph such that there are no cyclic dependencies across the partitions. This problem is NP-complete, so an optimal solution cannot be obtained in polynomial time, and our experiments with a few heuristics did not yield good results. (A sketch of the validity check on a candidate partitioning appears below.)
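For concreteness, checking that a candidate partitioning has no cyclic cross-partition dependencies amounts to testing that the quotient graph (one vertex per partition) is acyclic. A minimal sketch using Kahn's topological sort, with our own graph representation:

    #include <queue>
    #include <utility>
    #include <vector>

    // Vertices of the quotient graph are partitions; there is an edge P->Q
    // whenever some DFG edge crosses from P to Q. The partitioning is convex
    // iff this graph is acyclic (all partitions can be topologically sorted).
    bool partitioning_is_convex(int num_parts,
                                const std::vector<std::pair<int, int>>& dfg_edges,
                                const std::vector<int>& part_of /*per DFG vertex*/) {
        std::vector<std::vector<int>> adj(num_parts);
        std::vector<int> indeg(num_parts, 0);
        for (auto [u, v] : dfg_edges) {
            int p = part_of[u], q = part_of[v];
            if (p == q) continue;        // intra-partition edges are irrelevant
            adj[p].push_back(q);
            ++indeg[q];
        }
        std::queue<int> ready;
        for (int p = 0; p < num_parts; ++p)
            if (indeg[p] == 0) ready.push(p);
        int seen = 0;
        while (!ready.empty()) {
            int p = ready.front(); ready.pop();
            ++seen;
            for (int q : adj[p])
                if (--indeg[q] == 0) ready.push(q);
        }
        return seen == num_parts;        // every partition sorted => no cycle
    }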
Grouping basic blocks: When a HyperOp is formed by grouping together basic blocks, additional information from the Control Flow Graph (CFG) can be used to perform consistency checks. In most cases the operations within a basic block have close dependences, so it is beneficial to group all these operations into one HyperOp. In this paper, we describe an algorithm to group together basic blocks. This algorithm allows us to include a partial basic block when a HyperOp cannot include the complete basic block due to structural constraints (i.e., the number of operations in a HyperOp crosses a predetermined threshold). HyperOps are scheduled as atomic entities, hence the entire HyperOp must be launched onto the hardware. The compiler assumes the availability of a certain minimum number of resources and ensures that the resource requirement of no HyperOp exceeds this limit.
The algorithm presented in this paper uses a greedy strategy which forms one HyperOp at a time. That is, once a HyperOp is created, basic blocks are added until resource constraints are met or until a basic block cannot be added due to the convexity constraint. Basic blocks are considered in a depth-biased topologically sorted order for inclusion into a HyperOp. When multiple paths in the CFG merge, topological order forces us to traverse different paths of the CFG. This increases the number of basic blocks that are complementary to each other (as they are included from different paths). We bias the search towards depth whenever possible (refer to figure 8) to minimize the complementary basic blocks included in the HyperOp. If a complete basic block cannot be added, it is included into the HyperOp partially, such that it does not violate the resource constraints. An overview of the algorithm to create HyperOps is presented in algorithm 1. The function choose_bb, which identifies a basic block for inclusion into the HyperOp, is described in algorithm 2. Algorithm 3 describes the method to partition a basic block: the operations of the basic block are considered in topologically sorted order and an appropriate number of operations are included into the HyperOp to ensure that the resulting HyperOp does not exceed the resource limits.
function create_HyperOp()
/* start with the entry basic block */
active_list = entry_bb;
while true do
  /* choose a basic block from the active list and start a new HyperOp with it */
  new_bb = choose_bb(active_list);
  add new_bb to a new HyperOp;
  for each child of new_bb do
    if all parents of the child basic block have been assigned to HyperOps then
      add the child basic block to the active list;
    endif
  endfor
  while true do
    /* choose a basic block from the active list that can be included into the
       current HyperOp under construction, based on correctness conditions */
    new_bb = choose_bb(active_list, new_HyperOp);
    if no such basic block exists then
      break;
    endif
    /* if the resource requirement of the HyperOp after including the chosen
       basic block is within the maximum allowed, add the basic block */
    if resource_required_HyperOp(new_HyperOp, new_bb) ≤ max_resources then
      add new_bb to new_HyperOp;
      for each child of new_bb do
        if all parents of the child basic block have been assigned to HyperOps then
          add the child basic block to the active list;
        endif
      endfor
    else
      /* partition the basic block into two basic blocks such that the first
         portion does not violate the maximum resource requirements */
      partition_bb(new_bb, portion_of_new_bb);
      add the first portion of new_bb to new_HyperOp;
      break;
    endif
    /* no basic blocks left to be processed in the active list */
    if active_list is empty then
      break;
    endif
  endw
  /* no basic blocks left to be processed in the active list */
  if active_list is empty then
    break;
  endif
endw
Algorithm 1: An algorithm to create HyperOps
Fig. 8. (a) Example showing the order in which basic blocks are included. We go depth-first and include basic blocks B, D, E, H before we include basic block C, as indicated by the dashed line. (b) shows the candidate basic blocks for the partially formed HyperOp.
To check the convexity condition, it is sufficient to check whether the immediate dominator [Cytron et al. 1991] of a basic block is part of the HyperOp, when basic blocks are included in topologically sorted order. Theorem 3.1 gives a proof of this.
function choose_bb()
for each basic block bb in the active list, considered in the reverse order of inclusion do
  if bb is not in the same loop as the other basic blocks of the HyperOp then
    continue;
  endif
  if the immediate dominator of bb is not in the HyperOp then
    continue;
  endif
  return bb;
endfor
return NULL;
Algorithm 2: An algorithm to choose a basic block for inclusion into a HyperOp
THEOREM 3.1. The HyperOp formation technique described in algorithm 1 forms valid HyperOps.
PROOF. Consider a scenario where there is a partially formed HyperOp and a candidate basic block (say A) being considered for inclusion into the HyperOp. Depending on the basic blocks already included, the HyperOps can be classified into four groups.
function partition_basicblock()
/* add all operations which do not receive any inputs from within the basic block
   to the active list */
active_list = all top-level operations of the basic block;
while true do
  if there are no operations in the active list then
    break;
  else if the number of operations added to the partitioned basic block > node_limit then
    break;
  endif
  /* choose the first operation from the active list and add it to the
     partitioned basic block */
  for each child of the operation added to the partitioned basic block do
    if all parents of the child operation have been added to the partitioned basic block then
      add the child operation to the active list;
    endif
  endfor
endw
Algorithm 3: An algorithm to choose a subset of operations from a basic block to be included into a HyperOp
Case I: The HyperOp contains basic blocks that are ancestors of the immediate dominator of BB A, but not the immediate dominator of BB A itself. HyperOp H1 (in figure 9) is representative of this case. If BB A were included in such a HyperOp, there would be a mutual dependency between the HyperOp containing the immediate dominator of A and the HyperOp H1 (in which BB A is included), which violates the convexity condition. Such basic blocks are therefore not included in the HyperOp.
Case II: The HyperOp contains the immediate dominator of BB A and some of the ancestors of BB A that are descendants of the immediate dominator. Including BB A into HyperOp H2 would lead to a mutual dependency between HyperOp H2 and the HyperOp including the remaining ancestors of BB A (BB B, BB C in figure 9), which violates the convexity condition. Hence such basic blocks are not included.
Case III: The HyperOp contains basic blocks that are neither ancestors nor descendants of A. Including BB A into HyperOp H3 (in figure 9) would lead to a case where the predicates governing BB A and H3 are complementary. Hence the HyperOp H3 cannot contain BB A.
Case IV: The HyperOp contains the immediate dominator of BB A and all ancestors of BB A that are descendants of the immediate dominator. The predicate governing BB A is generated inside the HyperOp H4. Hence it is safe to include BB A into such a HyperOp.
From the above discussion it can be seen that BB A is included into a HyperOp only when the immediate dominator of BB A and all basic blocks on the paths from the immediate dominator to BB A are included into the HyperOp. This is true for any BB A and hence
all HyperOps formed are valid.
Fig. 9. Example showing different kinds of HyperOps
3.3 Phase III: Tag Generation
When a single instance of the producer and consumer exists, it is sufficient to have static tags to identify the consumer. Likewise, if it is ensured (either by adding dependencies or through hardware support) that only a single instance is active, a static identifier is sufficient. In cases where multiple producers/consumers can be active simultaneously, each data item needs a dynamic tag along with the static identifier. When multiple producers exist for the same consumer instance, only the latest token has to reach the consumer; such cases are handled by adding a steer node and appropriately predicating it so that only the latest token reaches the consumer. When a producer produces once and the consumer consumes the value multiple times (e.g., loop-invariant data), the produced token has to be stored and retrieved whenever it is required. Such tokens are marked as sticky tokens [Inagami and Foley 1989] and the hardware performs the necessary actions as described in section 2. These scenarios are illustrated in figure 10 with an example, and summarized in table II.
Fig. 10. The code sequence in (a) is an example of a nested loop along with the grouping of instructions into HyperOps. As shown in (b), HyperOp A and HyperOp B have only one instance each, hence a static tag is sufficient. HyperOp C consumes data produced by HyperOp A multiple times, hence a sticky token is required. HyperOp C and HyperOp D have multiple instances, hence a dynamic tag is required. A predicated steer node is necessary for communication between HyperOp D and HyperOp E, since HyperOp D produces the token multiple times and only the last value should be given as input to HyperOp E.
Table II. Producer-consumer relationship based on cardinality of their instances

                               Single Producer Instance   Multiple Producer Instances
Single Consumer Instance       Static ID                  Predicated Steer Instruction
Multiple Consumer Instances    Sticky token               Dynamic Tag
As mentioned earlier, to identify tokens of the multiple active instances uniquely, tags have to be generated dynamically using the iteration number of each of the loops in the nesting hierarchy. We use two bits to encode the iteration number, so four instances of a loop can be active simultaneously. We allow a nesting depth of four; the compiler adds the necessary dependencies to ensure that only one instance of a loop is active when the nesting depth is greater than four. The tags of tokens belonging to consumer HyperOps are calculated from producer tags; the compiler provides the necessary information to compute them. This information includes:
—Producer-consumer loop nesting relation: When creating the DFG we ensure that a loop communicates only with its immediate parent loop, by adding additional steer nodes where required. An example is shown in figure 11. This results in three possible producer-consumer relationships: (a) producer and consumer belong to the same loop, (b) the producer is in the parent loop of the consumer, and (c) the consumer is in the parent loop of the producer. These three cases are encoded using two bits. Limiting the number of producer-consumer relationships reduces hardware complexity.
—Iteration dependence distance: The iteration dependence distance can only be 0 or 1, since any scalar data produced in a loop iteration has to be consumed in the same iteration or in the next iteration. This is encoded using one bit and serves as a part of the compiler information for tag generation. Vector data are not sent directly to the consumer: the producer stores the data in the Data Store and the consumer loads it from there (refer to figure 3). Hence tags need not be generated for these variables.
These three bits (two identifying the producer-consumer nesting relation and one identifying the iteration dependence distance) are sufficient to generate the tag at execution time. Tag generation involves only increment and shift operations. For example, consider the scenario where the producer is the child loop, the consumer is the parent loop and the iteration dependence distance is 1: the tag of the destination is generated by discarding the child instance bits of the producer instance and incrementing the immediate parent instance by 1. This method is expected to be less hardware intensive and reduces the overhead of tag generation. The tag generation scheme is also applicable to sticky tokens: it eliminates the need to re-circulate the same data with different tags as in Wavescalar [Swanson et al. 2007]. Instead, the consumer picks the data up from the sticky token store (section 2.5).
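As an illustration of the increment-and-shift scheme, the sketch below packs the four 2-bit iteration numbers into one byte. The packing, the field order and the function names are our assumptions; the paper does not specify the encoding.

    #include <cstdint>

    // Illustrative tag layout (an assumption): four nesting levels x 2 bits,
    // level 0 (outermost) in the least-significant bits.
    using Tag = uint8_t;

    enum class Rel { SameLoop, ProducerIsParent, ConsumerIsParent };

    // Increment the 2-bit iteration field of `level` by dist (mod 4).
    static Tag bump(Tag t, unsigned level, unsigned dist) {
        unsigned sh = 2 * level;
        unsigned it = ((t >> sh) + dist) & 0x3u;
        return (Tag)((t & ~(0x3u << sh)) | (it << sh));
    }

    // depth: nesting level of the producer loop (>= 1 for ConsumerIsParent);
    // dist: iteration dependence distance (0 or 1).
    Tag consumer_tag(Tag producer, Rel rel, unsigned depth, unsigned dist) {
        switch (rel) {
        case Rel::SameLoop:
            return bump(producer, depth, dist);
        case Rel::ProducerIsParent:
            // Token enters the child loop, whose instance count starts at 0.
            return (Tag)(producer & ~(0x3u << (2 * (depth + 1))));
        case Rel::ConsumerIsParent:
            // Discard the child's instance bits, then advance the parent by
            // dist (the worked example in the text, with dist = 1).
            return bump((Tag)(producer & ~(0x3u << (2 * depth))), depth - 1, dist);
        }
        return producer;
    }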
3.4 Phase IV: Mapping HyperOps to the reconfigurable fabric
The mapping of HyperOps onto the reconfigurable fabric is done in three steps, as detailed below.
—Creation of partitioned HyperOps: Each CE has a fixed amount of storage to hold operations. A HyperOp may contain more operations than the available storage in a single CE. This requires subdivision of the HyperOp into several partitions, each referred to as a partitioned HyperOp (p-HyperOp). Each p-HyperOp is loaded onto a CE. Since each CE of the reconfigurable fabric executes one operation at a time, better performance is obtained when parallel operations are executed on different CEs. Alternatively, parallel operations can be executed on the same CE in a pipelined fashion, by overlapping computation of the next operation with the communication of the results of the current operation. In order to identify p-HyperOps that satisfy these constraints, several sequential paths (identified through a pre-order depth-first traversal of the dataflow graph of the HyperOp) are interleaved within the same p-HyperOp. Other heuristics proposed in [Swanson et al. 2007] and [Chu et al. 2003] can also be adapted for this purpose.
Fig. 11. In the example shown, HyperOp A and HyperOp C communicate, but HyperOp A is not in the immediate parent loop of HyperOp C; hence an additional node is added in HyperOp B to ensure each HyperOp communicates only with HyperOps in its parent loop.
—Mapping the communication needs of the HyperOps on the interconnect: The p-HyperOp interaction graph captures the interactions among the operations contained in different p-HyperOps. We employ a greedy algorithm to map a p-HyperOp interaction graph onto the honeycomb topology by removing certain edges of the p-HyperOp interaction graph. The edges are considered in decreasing order of the number of data exchanges that occur between the two p-HyperOps. This has the effect of placing highly interacting p-HyperOps onto CEs that are located close to each other. (A sketch of this greedy placement appears after this list.)
—Compute and Transport Metadata generation: This step translates all the operations present in the p-HyperOps into their equivalent opcodes as understood by the ALUs/FUs. The generated code is stored in the Operation Store as Compute Metadata. Transport Metadata is a homeomorph of the p-HyperOp interaction graph that specifies the network topology needed for the HyperOp. Transport Metadata is encoded as a part of the instruction and is used by the HRM to launch the HyperOp on appropriate CEs.
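The following is an illustrative rendering of the greedy placement described in the second step above. It sorts interaction-graph edges by traffic and places endpoints on nearby free tiles; Manhattan distance on a square grid stands in for the honeycomb metric, so treat it purely as a sketch (it also assumes the fabric has at least as many tiles as there are p-HyperOps).

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    struct Edge { int a, b, traffic; };

    // Heaviest-traffic edges first; endpoints go onto nearby free tiles.
    int greedy_place(std::vector<Edge> edges, int num_phops, int side,
                     std::vector<int>& tile_of /* out: tile per p-HyperOp */) {
        std::sort(edges.begin(), edges.end(),
                  [](const Edge& x, const Edge& y) { return x.traffic > y.traffic; });
        tile_of.assign(num_phops, -1);
        std::vector<bool> used(side * side, false);
        auto dist = [side](int t, int u) {    // grid stand-in for honeycomb hops
            return std::abs(t / side - u / side) + std::abs(t % side - u % side);
        };
        auto nearest_free = [&](int anchor) {
            int best = -1;
            for (int t = 0; t < side * side; ++t)
                if (!used[t] && (best < 0 || dist(t, anchor) < dist(best, anchor)))
                    best = t;
            return best;
        };
        auto place = [&](int v, int anchor) {
            if (tile_of[v] < 0) { tile_of[v] = nearest_free(anchor); used[tile_of[v]] = true; }
        };
        int cost = 0;
        for (const Edge& e : edges) {
            place(e.a, tile_of[e.b] >= 0 ? tile_of[e.b] : 0);
            place(e.b, tile_of[e.a]);
            cost += e.traffic * dist(tile_of[e.a], tile_of[e.b]);
        }
        return cost;  // total traffic-weighted hop count of the placement
    }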
3.5 Phase V: Formation of Custom Instructions
The main purpose of Custom Instructions is to enhance efficiency by minimizing the overheads incurred while launching and executing HyperOps. The overhead of repeatedly launching a HyperOp can be avoided if the Metadata loaded on the fabric is retained across multiple executions of the HyperOp; the compiler identifies HyperOps that execute in a loop as candidates for this optimization. Further, when multiple HyperOps with a strong producer-consumer relationship are kept persistent on the fabric, the overhead of delivering data between these HyperOps can be reduced by delivering the data directly over the NoC (instead of through the IHDF). For the compiler to generate appropriate code, it must know which of these HyperOps will be active simultaneously on the fabric.
This phase groups HyperOps together to form Custom Instructions. All the HyperOps in a Custom Instruction are launched simultaneously. An example of a Custom Instruction is shown in figure 12. Since the HyperOps within a Custom Instruction are persistent on the fabric, all iterations of the loops within a Custom Instruction reuse the same set of CEs; hence only one iteration is active at any point of time. As mentioned earlier, dynamic instances of HyperOps in a Custom Instruction do not need to be tagged. However, to ensure flow control, special hardware support is required, as described in section 5.3. These iterations can potentially be pipelined depending on the data dependencies between the HyperOps. Section 5 extends the concept of the Custom Instruction to form Custom Instruction pipelines, which are necessary to achieve the high throughput required in streaming applications.
Fig. 12. An example of a Custom Instruction. The NoC ensures in-order delivery of packets, and HyperOps execute all instructions of one iteration before starting a new iteration. These two conditions ensure correctness.
4. IMPLEMENTATION DETAILS AND RESULTS
In this section we provide implementation details of a tile and the other components of REDEFINE. As mentioned earlier, a tile comprises a CE and a router. The interconnection of routers forms the NoC, as detailed in [Joseph et al. 2008].
4.1 Compute Element
The CE comprises an ALU, Local Wait Match Unit (LWMU), transporter, Local Operation
Orchestrator (LOpOr) and the control FSM. Figure 13 shows the overall organization of a
CE along with control and data flow.
The LWMU provides storage space for operations that have been assigned to the CE.
It consists of a set of registers which are logically organized as ’slots’. The composition
of a single slot is shown in figure 14(a).

Fig. 13. Block diagram of a Compute Element.

The Compute Metadata (OPCODE field) in a
slot specifies the operation to be performed by the ALU. The Transport Metadata (DEST1, DEST2, DEST3, C0 and C1 fields) specifies the destinations for the result of the operation. Further, storage for two operands (OPERAND1 and OPERAND2 fields) and a predicate bit (PV field) is provided in each slot. If the predicate bit is True, the instruction in the slot is executed; if not, the instruction is discarded. The fields V1, V2 and PA indicate the availability of the two operands and of the predicate respectively. For operations which are not predicated, both the PV and PA fields are set to True. An operation is launched onto the ALU only when all its operands and its predicate are available. The Launch bit (L field) distinguishes operations that have already been launched from operations which are yet to be executed.
The LOpOr is responsible for launching the operations onto the ALU. An operation is
ready to be launched, (a) if the Launch bit indicates that it has not been launched, (b) all
its operands are available and (c) the predicate is available and asserted. Apart from an
operation being ready, the ALU of the CE should be free to accept a new operation before
any operation can be launched. The check for launch-readiness of an operation is per-
formed by logically ANDing the V1, V2 and PV bits from each slot. Multiple operations
may be ready to be launched simultaneously. In such scenario, the LOpOr selects one
operation among all the ready operations in the LWMU based on a fixed priority scheme.
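A minimal sketch of the launch-readiness check and the fixed-priority selection performed by the LOpOr, reusing the LwmuSlot sketch above. Taking the lowest-indexed ready slot is one possible fixed-priority scheme; the text does not fix the exact priority order.

// Returns the index of the slot to launch, or -1 if no operation is ready or
// the ALU is busy. Readiness: not yet launched, V1 and V2 set, and the
// predicate available and asserted (PA and PV are preset to true for
// non-predicated operations).
int select_ready_slot(const LwmuSlot* slots, int num_slots, bool alu_free) {
    if (!alu_free) return -1;
    for (int i = 0; i < num_slots; ++i) {   // fixed priority: lowest index wins
        const LwmuSlot& s = slots[i];
        if (!s.launched && s.v1 && s.v2 && s.pa && s.pv)
            return i;
    }
    return -1;
}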
Fig. 14. (a) LWMU slot format; (b) Transport packet format (64-bit router payload, 15-bit address)
The selected operation is launched onto the ALU. The LOpOr also forwards the Transport
Metadata to the transporter.
An operation is invalidated after it is launched or when its predicate is False. The
schematic of the invalidation logic is shown in figure 15. The invalidation logic invalidates
an operation by resetting the operand availability bits and setting the Launch bit in the
corresponding LWMU slot.
Fig. 15. Block diagram of the Invalidation Logic
The ALU receives the Compute Metadata of the launched operation and performs the
computation specified in the OPCODE field. After the completion of the computation,
the results are forwarded to the transporter. The transporter creates the transport packets
using the result of the ALU computation and the Transport Metadata. The composition of
transport packet is shown in figure 14(b). The X and Y fields in the packet indicate the
relative coordinates of the destination CE. The SLOT field represents the address of the
slot in destination LWMU. The OPS field of the packet indicates the type of the payload
as shown in table III. The New Packet Indicator (NPI) field in the transport packet marks
the arrival of a new packet. The value of this bit is inverted for every new packet that is
generated.
Table III. OPS encoding and the destined field in a slot

OPS   Destined field in a slot
00    Metadata field
01    Operand 1
10    Operand 2
11    Predicate
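The transport packet and the OPS decoding of table III can be sketched as follows; the field meanings come from figure 14(b) and the text (64-bit payload, X/Y relative address, SLOT index, 2-bit OPS, NPI toggle bit), while the exact types and the helper name decode_ops are illustrative.

#include <cstdint>

// Sketch of the transport packet of figure 14(b).
struct TransportPacket {
    int8_t   x, y;       // relative coordinates of the destination CE
    uint8_t  slot;       // LWMU slot address at the destination
    uint8_t  ops;        // payload type, decoded per table III
    bool     npi;        // New Packet Indicator: inverted for every new packet
    uint64_t payload;    // 64-bit router payload
};

// Decode the OPS field into the destined slot field (table III).
const char* decode_ops(uint8_t ops) {
    switch (ops & 0x3) {
        case 0x0: return "Metadata field";
        case 0x1: return "Operand 1";
        case 0x2: return "Operand 2";
        default:  return "Predicate";
    }
}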
The block diagram of the transporter is shown in figure 16. The Transport Metadata
and the ALU result are stored in a register till all the packets are sent to their respective
destinations. The bits C0 and C1 in the LWMU slot specify the number of valid destination
fields in the Transport Metadata. The Adr select signal specifies whether the transport
packet is destined to the same CE or to a different CE. If the result is destined to operations
contained in the same CE, the multiplexer at the input of the LWMU selects the packets
from the bypass channel (refer figure 13). Otherwise the packet is sent to the router. If
the router is not ready to accept the packet, the transporter waits till the router becomes
available again.
Fig. 16. Block diagram of the transporter
The different components of the CE are controlled by a Moore type FSM. The state
diagram of the FSM is shown in figure 17 and different signals generated by the FSM are
listed in table IV. The significance of the different input and output signals of the FSM is
shown in table V and table VI. The CE free signal is asserted when all the operations in
the LWMU have been launched.
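Reading figure 17 together with tables IV-VI, the write-back portion of the FSM can be sketched as below. This is a loose, illustrative model: the state names come from the figure and the conditions are paraphrased from the tables, so it is not the exact RTL.

enum class CeState { RESET, BEGIN, ALU, WB1, WB2, WB3 };

// One write-back transition; C1C0 encodes the number of valid destinations
// (00: one, 01: two, 10 or 11: three, per table V). The bypass channel to the
// same CE never stalls; a busy router stalls the current write-back state.
CeState next_wb_state(CeState s, unsigned c1c0, bool same_ce, bool router_busy) {
    bool can_send = same_ce || !router_busy;
    switch (s) {
        case CeState::WB1:
            if (!can_send) return CeState::WB1;
            return (c1c0 >= 1) ? CeState::WB2 : CeState::ALU;  // more destinations?
        case CeState::WB2:
            if (!can_send) return CeState::WB2;
            return (c1c0 >= 2) ? CeState::WB3 : CeState::ALU;
        case CeState::WB3:
            return can_send ? CeState::ALU : CeState::WB3;
        default:
            return s;  // RESET/BEGIN/ALU transitions omitted in this sketch
    }
}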
Fig. 17. State diagram of the FSM
Table IV. Output signals generated in different states

State   Generated output signals
RESET   alu_free = 0, WB = 0
BEGIN   alu_free = 0, WB = 0
ALU     alu_free = 1, WB = 0
WB1     Adr_select = 00, alu_free = 0, WB = 1
WB2     Adr_select = 01, alu_free = 0, WB = 1
WB3     Adr_select = 10, alu_free = 0, WB = 1
4.2 Scheduling algorithm used in HRM
As mentioned in section 2, the HyperOp Selection Logic of the HRM uses a prioritized
algorithm, presented in algorithm 4, to resolve contention among all the HyperOps that are
ready to be launched. Our objective is to contain the explosion of parallelism at the time
of execution of HyperOps. We need to restrict the explosion of parallelism because of the
limitation of the number of compute, communication and storage elements on the fabric
as well as the limitation in the storage space of GWMU. If a HyperOp that is a source for
a large number of HyperOps is selected for execution, it will result in a large number of
Table V. Input signals fed to the FSM

Input signal    Value     Significance
EOT             0         p-HyperOp is being launched into the LWMU.
EOT             1         p-HyperOp launching is complete and execution may commence, subject to
                          availability of the operands.
Launch_enable   0         None of the operations in the LWMU has both operands available.
Launch_enable   1         At least one of the operations in the LWMU has both operands available.
C1 C0           00        One destination is valid in the Transport Metadata.
C1 C0           01        Two destinations are valid in the Transport Metadata.
C1 C0           10 or 11  Three destinations are valid in the Transport Metadata.
Router_busy     1         The router cannot accept a packet.
Router_busy     0         The router can accept a packet.
CE_free         0         Some or all of the operations in the LWMU are yet to be executed.
CE_free         1         All the operations in the LWMU have been executed and hence the CE is
                          declared free.
Same_CE         0         Result is destined for some other CE.
Same_CE         1         Result is destined for the same CE.
Table VI. Output signals derived from the FSM

Output signal   Value   Significance
ALU_free        0       The ALU is busy and hence no operation should be launched.
ALU_free        1       The ALU is free and hence one operation may be launched.
Adr_select      00      The destination in the result packet is selected from the first destination in the
                        Transport Metadata.
Adr_select      01      The destination in the result packet is selected from the second destination in the
                        Transport Metadata.
Adr_select      10      The destination in the result packet is selected from the third destination in the
                        Transport Metadata.
WB              0       Packet formation is stalled.
WB              1       Packet formation is in progress.
HyperOps which need to be stored in GWMU or a large number of HyperOps which are
ready to be launched. This will result in an explosion of parallelism. We hence assign priority
to a HyperOp based on the number of HyperOps for which it serves as a source. If the
number of HyperOps that are currently launched is low we choose a HyperOp with max-
imum priority to enhance the parallelism. If the number of HyperOps that are currently
launched is high we choose a HyperOp with minimum priority. The threshold where this
inversion happens is determined by the number of HyperOps that can be accommodated
on the fabric and by the storage space of the GWMU.
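As a compact illustration of this policy, the sketch below picks one HyperOp from the ready set, inverting the priority order once the size of the ready set exceeds the threshold, with the ready time as the tie-breaker. The HyperOp structure and the name specified_limit are illustrative; the policy mirrors algorithm 4.

#include <vector>
#include <algorithm>

struct HyperOp {
    int  priority;    // number of HyperOps for which this one provides inputs
    long ready_time;  // cycle at which it moved from the waiting to the ready set
};

// Pick one HyperOp from the ready set (a sketch of algorithm 4).
const HyperOp* pick(const std::vector<HyperOp>& rset, std::size_t specified_limit) {
    if (rset.empty()) return nullptr;
    const bool too_many = rset.size() > specified_limit;
    return &*std::min_element(rset.begin(), rset.end(),
        [too_many](const HyperOp& a, const HyperOp& b) {
            // minimum priority when parallelism must be contained,
            // maximum priority when it should be enhanced
            const int pa = too_many ? a.priority : -a.priority;
            const int pb = too_many ? b.priority : -b.priority;
            if (pa != pb) return pa < pb;
            return a.ready_time < b.ready_time;  // tie-break: least ready time
        });
}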
4.3 Inter HyperOp Data Forwarder
The IHDF receives the data and the static identifier of the consumer HyperOp. This module
computes the consumer instance number based on the producer instance number and the
compiler hint. Once the instance number of the consumer HyperOp is computed, a lookup
is performed to determine the location of the HyperOp in the GWMU. The lookup table,
called the HyperOp Instance Lookup Table (refer figure 18), operates in a manner similar to
a traditional cache. Each entry of the lookup table is indexed by the static HyperOp Id and
is tagged with the HyperOp Instance Number. The data contains the slot number within the
GWMU where the HyperOp instance is present. The data is forwarded to that slot of the
function schedule_HyperOps()
while there are HyperOps executing in the fabric OR there are HyperOps waiting in the HRM
OR there are HyperOps ready for execution in the HRM do
let rset = set of ready HyperOps
let wset = set of waiting HyperOps
/* wset and rset are mutually disjoint */
if |rset| ≥ 1 then
if |rset| > specified_limit then
pset = subset of HyperOps from rset with minimum priority
/* priority is assigned to a HyperOp based on the number of HyperOps for
which it provides inputs */
else
pset = subset of HyperOps from rset with maximum priority
endif
if |pset| > 1 then
chosen_hyperop = HyperOp with least ready time in pset
/* ready time is defined as the time when a HyperOp is removed from
wset and added to rset */
else
chosen_hyperop = the only element of pset
endif
if chosen_hyperop can be mapped on the reconfigurable hardware fabric then
schedule chosen_hyperop for execution
endif
endif
wait for updates on rset and wset
endw
Algorithm 4: An algorithm to schedule HyperOps in HRM for execution
GWMU. If there is a miss in the lookup table, a new slot is created and the instance lookup
table is updated with this information. As in the earlier case, the data is forwarded to the
appropriate slot of GWMU. The lookup table is four way associative, hence allowing only
four instances to be active at any point in time. If a request cannot be processed (due to
lack of slots in the GWMU or lack of space in the cache), the request is appended to the
end of the queue.
The IHDF is also responsible for storing the loop invariants in its “sticky token” [In-
agami and Foley 1989] store. These sticky tokens are inputs to several HyperOp instances.
In order to minimize the load on the GWMU, the compiler identifies all sticky tokens a
priori. These sticky tokens are stored in a separate storage and are copied to the GWMU
when the first input operand which is not sticky arrives. The sticky token store operates in
a manner similar to a cache. It is indexed by the HyperOp identifier and tagged with the
instance number of the parent loop.
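A sketch of the HyperOp Instance Lookup Table as a small set-associative structure: the four ways per static HyperOp Id and the GWMU slot number as data follow the text, while the table size of 256 static identifiers is an illustrative assumption.

#include <cstdint>

// Sketch of the HyperOp Instance Lookup Table: indexed by the static HyperOp
// Id, tagged with the instance number, 4-way associative.
struct InstanceLookupTable {
    static const int WAYS = 4;
    struct Way { bool valid = false; uint32_t instance = 0; int gwmu_slot = -1; };
    Way ways[256][WAYS];   // 256 static ids is an illustrative assumption

    // Returns the GWMU slot holding (id, instance), or -1 on a miss; on a miss
    // a new GWMU slot is created and install() records it.
    int lookup(uint8_t id, uint32_t instance) const {
        for (int w = 0; w < WAYS; ++w)
            if (ways[id][w].valid && ways[id][w].instance == instance)
                return ways[id][w].gwmu_slot;
        return -1;
    }

    // Install a new instance in a free way. Returns false when all four ways
    // are busy; the request is then re-queued, as described above.
    bool install(uint8_t id, uint32_t instance, int slot) {
        for (int w = 0; w < WAYS; ++w)
            if (!ways[id][w].valid) {
                ways[id][w] = Way{true, instance, slot};
                return true;
            }
        return false;
    }
};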
Fig. 18. Two lookup tables of the Inter-HyperOp Data Forwarder (the Sticky Token Store and the HyperOp
Instance Lookup Table), fed by a request buffer from the fabric
Table VII. Dataflow overheads for various media kernels.

Application         Number of Compute Nodes   Total number of DFG nodes   Percentage Overhead
Deblocking Filter   162                       199                         18.59
IDCT                677                       754                         10.21
FIR                 19                        25                          24
FFT                 47                        61                          22.95
4.4 Results
In this section we present the results and discuss the efficiency of the compiler framework.
Compiler efficiency is measured in terms of the overheads that the compiler adds to the DFG
and with regard to the time spent executing operations other than those directly
compiled from the application.
The compiler overhead includes all the additional nodes introduced in the DFG. The addi-
tional nodes include the steer nodes and nodes which are used to transport data. This over-
head impacts the storage space required and hence the number of CEs required to execute
the HyperOp. The impact on total execution time depends on the ability of the compiler to
overlap compute operations with overheads associated with executing a HyperOp. Another
measure of the quality of the DFG is in terms of extra edges added. An edge between any
two compute nodes is considered essential. All the other edges, which include
the predicate edges, are considered control overhead that impacts the communication la-
tency. The optimizations presented in sections 3.1.1 and 3.1.2 mainly target reduction of
latency. These optimizations also decrease the DFG overhead. Table VII summarizes the
DFG overheads for various media kernels. We observe a higher percentage of overhead for
small kernels, like FIR and FFT, since the DFG overhead is amortized over a smaller number
of nodes. We observe lower overhead for applications like IDCT and the Deblocking Filter,
which are highly parallelizable. Refer to [Alle et al. 2008] for a more detailed study on the
impact of these optimizations on the number of edges.
For the purpose of measuring the execution time we have built an execution-driven
simulator that mimics the hardware. The functionality of each of the modules
is described in C++. SystemC is used to connect all the modules appropriately.
The SystemC framework has a built-in notion of time and hence it is easy to keep track of
Table VIII. Execution time overheads for various media kernels.

Application         Time taken for actual computation   Total time taken   Percentage overhead for management
Deblocking Filter   10296                               21012              51
IDCT                16109                               31741              49
FIR                 95                                  167                43
FFT                 5593                                9124               38
cycles. Since we describe the functionality in C++, the simulation time is reduced drastically.
The interfaces of the modules are modeled as they would
exist in hardware, so that each module can be replaced by its cycle-accurate counterpart.
This also helps in testing the HDL description of each module independently. The
execution time includes the actual time spent in the computation on the fabric and the time
spent in performing other management tasks like launching the HyperOps on the fabric,
delivering the results through the IHDF, etc. These activities, although necessary for ex-
ecution, are not a part of the actual computation. Hence we term the time spent in these
activities the time for management. Table VIII provides details of the same. A more detailed
analysis is presented in [Alle et al. 2008].
Analysis of the Results. From table VIII we can see that the time taken for man-
agement is around 50% of the total execution time. The current implementation of the
compiler does not use alias information. The memory dependencies are
added conservatively and we do not observe overlap in the execution of HyperOps. As
a result, the management overheads are not overlapped with useful computations in most
of the cases. In applications like the deblocking filter and IDCT, the same computation is
performed on multiple sets of data. In such cases alias analysis will help identify paral-
lelism, and hence computation can be overlapped with other tasks to reduce the execution
time. However, in cases like FIR and FFT where multiple operations are performed on
the same data, alias analysis does not uncover more parallelism. In such cases, to reduce
overhead we build Custom Instructions. In addition, we form a pipeline to achieve higher
throughput. This technique is described in section 5.
5. EXTENSIONS TO SUPPORT CUSTOM INSTRUCTION PIPELINE
In this section we present details of compiler extensions and NoC support for Custom
Instruction Pipeline. We also present results of the same.
5.1 Compiler extensions
In streaming applications, Custom Instruction sequences that realize application kernels
must meet a desired throughput to satisfy real-time constraints. Maintaining the required
application throughput (at a lower operating frequency) requires that the kernel be pipelined.
To break up the kernel into pipeline stages (where each stage is a Custom Instruction), we
employ a variant of software pipelining. In software pipelining, speedup is achieved by
overlapping different instructions from multiple iterations. However, software pipelining
implementations in current compilers only pipeline the innermost loop in a loop nesting.
A recent work proposed an extension of the same to allow pipelining of outer loops as
well [Rong et al. 2007]. This work also chooses the best loop to pipeline based on various
criteria. Though this is a major step towards overlapping computations at a desired
nesting level, it is not applicable to all loops in general, since the work assumes a rectangular
iteration space. There are several loop structures which do not fall under this category.
Further, it is not possible to software pipeline if the inter-iteration dependence distance is
not a known constant. In this paper we extend software pipelining techniques to support
loops whose iteration indices are in a Geometric Progression (GP), as occurs commonly in
application kernels such as the FFT and other kernels in wireless and media applications.
For applications that require high memory bandwidth, a Custom Instruction pipeline alone
is not sufficient to improve the performance of the application. A large amount of time is
spent in the exchange of data between the Custom Instructions, if the communication be-
tween pipeline stages is through memory. To improve the performance of the pipeline, the
data between neighboring stages is streamed (when possible). The compiler is enhanced to
determine when streaming is possible and when the result cannot be streamed. In order to
stream data between the pipeline stages, it is essential to ensure flow control. We assume
an in-order delivery of tokens to ensure the same. Further to achieve a constant through-
put, a dedicated channel is required for streaming the data between pipeline stages. The NoC
presented in [Joseph et al. 2008] is extended to support this. Details of the NoC and
the other compiler enhancements are presented in this section.
5.1.1 Forming Custom Instruction pipeline. Consider a generic streaming application
that has a nested hierarchy of loops, as shown in the code sequence given in figure 19. In order
to identify pipeline stages, the compiler chooses a loop in the loop nest and unrolls it.
Each iteration of the loop is assigned a pipeline stage. Such unrolling is possible only when
there is no negative dependence with regard to the parent of the loop being unrolled. In
the example shown in figure 19, Loop 3 is pipelined. Using the notation for dependence
distance employed in [Rong et al. 2007], in the distance vector $d = \langle d_1, d_2, d_3, d_4 \rangle$,
$d_1, d_2, d_3, d_4$ are the distances at Loop 1, Loop 2, Loop 3 and Loop 4 respectively. $d_2$ (the
distance at Loop 2) should be non-negative if Loop 3 is chosen for pipelining. We assume
that sufficient hardware resources are available on the fabric to support arbitrary number of
pipeline stages. Deeper pipelines can be realized by unrolling other loops in the hierarchy
to deliver higher throughput. For example, in addition to Loop 3, Loop 2 can be unrolled
into $N_2/d_2$ (refer figure 19) pipeline stages corresponding to each iteration of Loop 2.
Each stage of the pipeline is a Custom Instruction that constitutes all the HyperOps in the
body of the loop being unrolled. Each HyperOp executes in a data-driven manner. The
use of predicates based on the initiation interval for pipeline control obviates the need for
synchronous control.
5.1.2 Identifying memory accesses for streaming. To identify potential memory ac-
cesses for streaming, first we find store-load pairs that share a producer-consumer rela-
tionship with focus on pairs across iterations of the loop that is pipelined. For example,
in figure 19, we should find memory accesses across the iterations of Loop 3. The iteration
distance between the producer and the consumer determines the number of cycles data has to
be buffered before it is delivered to the consumer. We term this distance the temporal distance.
As mentioned earlier we use NoC to deliver data on the fabric. The number of routing
hops between the producer pipeline stage and consumer pipeline stage on the fabric is de-
fined as spatial distance. The spatial distance determines the amount of buffer available
for the data on the fabric. Hence, data can be streamed if the temporal distance is equal to
the spatial distance. If the temporal distance is greater than the spatial distance (i.e. the buffer
requirement exceeds the buffer available along the route), the producer can be
for (i = 0; i < N1; i += d1)          // Loop 1
    for (j = 0; j < N2; j += d2)      // Loop 2
        for (k = 0; k < N3; k += d3)  // Loop 3
            for (l = 0; l < N4; l += d4)  // Loop 4

Fig. 19. A general case of Custom Instruction pipeline. HyperOp 1-4 correspond to the code blocks of
Loop 1-4. HyperOp m_n represents the nth dynamic instance of the mth HyperOp; HyperOp m_* indicates all
dynamic instances of HyperOp m.
delayed to enable streaming support. However, these optimizations are not addressed in
this paper. When temporal distance is less than the spatial distance consumer stalls till the
data is available. The method used to determine store-load pairs is described in a later sec-
tion. This analysis produces store-load pairs and the range of indices (of the inner loops)
over which the store and load have the same value. If an operation requires multiple mem-
ory accesses to perform a computation, only the latest store-load pair (in temporal order)
is a candidate for streaming. Algorithm 5 shows the identification of store-load pairs that
can be streamed.
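The streaming decision itself then reduces to a comparison of the two distances; a minimal sketch, assuming both distances are known constants produced by the analysis of section 5.1.4.

enum class StreamDecision {
    Stream,         // temporal == spatial: data is streamed directly
    DelayProducer,  // temporal >  spatial: buffer demand exceeds the on-route buffers
    ConsumerStalls  // temporal <  spatial: the consumer stalls till data arrives
};

// Classify a store-load pair from its temporal distance (iterations the data
// must be buffered) and spatial distance (routing hops between the stages).
StreamDecision classify(int temporal_distance, int spatial_distance) {
    if (temporal_distance == spatial_distance) return StreamDecision::Stream;
    if (temporal_distance >  spatial_distance) return StreamDecision::DelayProducer;
    return StreamDecision::ConsumerStalls;
}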
5.1.3 Case study. FFT is chosen as a case study. FFT has a very high communication
requirement and belongs to the spectral Dwarf [Asanovic et al. 2006].
Algorithm 6 describes the code sequence for the FFT kernel. This code sequence is a
representation of the technique proposed by Cooley and Tukey [Cooley and Tukey 1965]. The
kernel contains three loops. The index of the outermost loop is in a Geometric Progression
with a ratio $r_1 = 1/2$. The indices of the two inner loops are in Arithmetic Progression.
The initial value of the index of Loop 2 is 0 and its common difference is $i$. The initial value of
the index of Loop 3 is $j$ and its common difference is 1. The indices $\langle i, j, k \rangle$ for loads and stores
are represented as $\langle i_R, j_R, k_R \rangle$ and $\langle i_W, j_W, k_W \rangle$ respectively (the annotations
of the loads and stores for the butterfly operation are shown within comments).
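For reference, the loop structure of Algorithm 6 corresponds to the runnable C++ sketch below. The twiddle-factor computation, which the algorithm abstracts as "fetch twiddle factor", is filled in here with the standard decimation-in-frequency form, where the twiddle depends on the position within the group.

#include <complex>
#include <vector>
#include <cmath>

// Runnable sketch of the decimation-in-frequency FFT of Algorithm 6.
// n must be a power of two; the output is in bit-reversed order, as in
// figure 20.
void fft_dif(std::vector<std::complex<double>>& a) {
    const std::size_t n = a.size();
    const double pi = std::acos(-1.0);
    for (std::size_t i = n; i >= 2; i /= 2) {              // Loop 1: GP index, ratio 1/2
        const double ang = -2.0 * pi / double(i);
        for (std::size_t j = 0; j < n; j += i) {           // Loop 2: AP, difference i
            for (std::size_t k = j; k < j + i / 2; ++k) {  // Loop 3: AP, difference 1
                const std::complex<double> w = std::polar(1.0, ang * double(k - j));
                const std::complex<double> u = a[k], v = a[k + i / 2];
                a[k]         = u + v;         // store a[kW]
                a[k + i / 2] = (u - v) * w;   // store a[kW + i/2]
            }
        }
    }
}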
The butterfly computation requires two memory accesses. For the iteration $\langle i = n/2, j = 0, k = 0 \rangle$
the array elements $a[0]$ and $a[n/4]$ are required. In the earlier
stage, i.e. in the iteration $\langle i = n, j = 0, k = 0 \rangle$, $a[0]$ is produced $n/4$ iterations ear-
lier than $a[n/4]$. The temporal distance of the store-load pair $(a[k_W = 0], a[k_R = 0])$ is
$n/4$ and the spatial distance between them is 1 (to keep the description simple we assume a
spatial distance of one hop between two consecutive stages). Thus $a[0]$ is not a candidate
for streaming. By similar analysis it can be shown that $a[n/4]$ is a candidate for streaming.
Figure 20 displays the signal flow diagram of a 16-point FFT. It can be clearly seen that
stage 2 requires data $x[0]$ and $x[4]$ for computation. In stage 1, $x[4]$ is produced 4 iterations
after $x[0]$.
To compute the memory requirement for this algorithm we need to compute the number
function identify_store_load_pairs()
for each loop do
set_of_loads = load instructions in the loop chosen for pipelining
set_of_stores = store instructions in the loop chosen for pipelining
for each load in set_of_loads do
for each store in set_of_stores do
/* check whether the progressions overlap */
if starting index of load > ending index of store then
continue
endif
/* record the store-load pair for streaming using the analysis summarized in
tables X and XI */
endfor
endfor
endfor
for each recorded pair do
if number of streams for a load > 1 then
if there is any overlap then
/* choose the latest store for the overlapped interval */
choose_store()
endif
endif
endfor
Algorithm 5: An algorithm to identify store-load pairs for streaming
function fft()
for i = n; i ≥ 2; i = i/2 do /* Loop 1 */
for j = 0; j < n; j += i do /* Loop 2 */
twiddle_factor = fetch twiddle factor
for k = j; k < j + i/2; k++ do /* Loop 3 */
(a[k], a[k+i/2]) = butterfly(a[k], a[k+i/2], twiddle_factor)
/* the loads of a[k] and a[k+i/2] are annotated a[kR] and a[kR+i/2];
the stores a[kW] and a[kW+i/2] */
endfor
endfor
endfor
Algorithm 6: Code sequence for the FFT Decimation-in-Frequency kernel.
of data tokens that can be streamed. There are four potential store-load pairs: $(a[k_W], a[k_R])$,
$(a[k_W], a[k_R+i/2])$, $(a[k_W+i/2], a[k_R])$ and $(a[k_W+i/2], a[k_R+i/2])$. The
index $k_W$ takes values in the interval $[j_W, j_W+i)$ and the index $k_R$ takes values
in the interval $[j_R, j_R+i/2^p)$, where $p$ is the difference between the iterations of the store
and the load. For streaming data between a load and a store, the load index ($k_R$) and the store
index ($k_W$) should be the same. To determine if there is an overlap between the load and the store,
Fig. 20. The signal flow graph of a 16-point FFT, Decimation in Frequency.
Table IX. Overlaps for $p = 1$ for the different store-load pairs

Store-Load Pair                     Interval of Overlap
$(a[k_W], a[k_R])$                  $[j_1, j_1 + i/4)$
$(a[k_W + i/2], a[k_R])$            $[j_1 + i/2, j_1 + 3i/4)$
$(a[k_W], a[k_R + i/2])$            $[j_1, j_1 + i/4)$
$(a[k_W + i/2], a[k_R + i/2])$      $[j_1 + i/2, j_1 + 3i/4)$
the starting index of the load should be less than the ending index of the store:
$$\max(k_R) - \min(k_W) > 0 \Rightarrow j_W - j_R < i/2^{p+1}, \quad p \in [0, \log_2(n)] \qquad (1)$$
The first overlap occurs when $p = 1$. Therefore
$$j_W - j_R < i/4 \qquad (2)$$
$j_W$ is an Arithmetic Progression (AP) with a common difference of $i$ and $j_R$ is also an AP with
a common difference of $i/2$. The $n$th term of $j_W$ is given by $n \cdot i$ and the $n$th term
of $j_R$ is given by $n \cdot i/2$. Hence the above condition is valid only for every
alternate value of $j_R$ when $p = 1$. The overlap between $k_W$ and $k_R$ therefore occurs in the
interval $[j_W, j_W + i/2)$, i.e., $(a[k_W], a[k_R])$ is a candidate for streaming in this interval.
The overlap intervals for the other cases with $p = 1$ are shown in table IX.
While all four store-load pairs are candidates for streaming, only two of them can
be streamed: only for the pairs $(a[k_W+i/2], a[k_R+i/2])$ and $(a[k_W+i/2], a[k_R])$
is the temporal distance equal to the spatial distance. The stores are distributed between
the sequences $a[k_W]$ and $a[k_W+i/2]$. Of these sequences, $a[k_W+i/2]$ is the only set of
stores which are streamed. Only half of these stores are streamed, causing a 25% reduction
in the number of stores sent to the memory subsystem. The number of stores which are
eliminated causes a corresponding reduction in the number of loads. The speedup achieved
by the proposed technique depends on the number of pipeline stages. After the initial
latency, all stages of the pipeline run in parallel. The speedup achieved is proportional to
the number of stages. In the FFT kernel, since the outermost loop is unrolled $\log_2(n)$ times,
the speedup achieved is $\log_2(n)$.
5.1.4 Analysis of sequences. In this section we present a generic technique for ana-
lyzing the memory access sequence to identify store-load pairs for streaming. We present
the analysis for a two loop nesting hierarchy. The memory access sequences depend on
the iteration indices. Depending on the pattern of the iteration index, the memory access
sequence also changes.
Let $m_W \cdot i + c_W$ denote a store array index and $m_R \cdot i + c_R$ denote a load array index,
where $m_W$, $c_W$, $m_R$ and $c_R$ are constants and $i$ is the index variable. The $n$th element of
the index variable is given by $I + d \cdot n$. We consider the following situations:

—Index variable is independent of the outer loop indices. The memory access pattern is
the same for any iteration of the outer loops. The index variable can be either an
AP or a GP; we present the analysis for both cases.

—Index variable is an AP: The access patterns of both the load and the store are in AP. The
common differences of these two sequences are $m_W \cdot d$ and $m_R \cdot d$, where $d$ is the
common difference of the index variable. We determine the first point of overlap, within
the loop bounds, by equating the $x$th store and the $y$th load:
$$x \cdot m_W = ((c_R - c_W) + I(m_R - m_W))/d + y \cdot m_R \qquad (3)$$
where $I$ and $d$ are the initial value and common difference of the sequence respectively. To
obtain an integer solution, $(c_R - c_W) + I(m_R - m_W)$ should be divisible by $d$;
if it is not, the two sequences do not overlap. Since both indices are in
AP, the overlaps are also in AP, with the initial value determined by equation 3 and a
common difference given by $\mathrm{LCM}(m_W \cdot d, m_R \cdot d)$.

—Index variable is a GP: As in the earlier case, equating the $x$th store and the $y$th
load:
$$m_W \cdot a \cdot r^x - m_R \cdot a \cdot r^y = c_R - c_W \qquad (4)$$
where $a$ is the initial value of the GP and $r$ is the common ratio. In the special case where
$c_R = c_W$, the equation reduces to
$$x = \log_r(m_R/m_W) + y \qquad (5)$$
From equation 5, we see that there is an overlap if $m_R/m_W$ is a power of $r$.
Since the common ratio of both sequences is the same, there is reuse till the
end of the iteration. If $m_R/m_W$ is not a power of $r$, the two sequences have no common
points. When $c_R \neq c_W$, no simple analysis is possible.

—The common difference or common ratio ($d$ or $r$) depends on the outer loop. If the loop
index is in AP, the difference is determined by the index of the outer loop; if the loop
index is in GP, the ratio is determined by the index of the outer loop. The results of this
analysis are summarized in table X.
Table X. Condition for overlap for various cases when the initial value is independent of the outer loop.

Index variable of inner loop in AP:
—Common difference depends on an outer loop index that varies in AP:
  $m_R \cdot y - m_W \cdot x \cdot (1 + p \cdot d_2/d_1) = (I \cdot (m_W - m_R) + (c_W - c_R))/d_1$,
  where $d_1$, $d_2$ are the common differences of the outer and inner loops respectively.
—Common difference depends on an outer loop index that varies in GP:
  $I \cdot r_1^{p} \cdot (m_W \cdot x - m_R \cdot y \cdot r_1^{p'}) = (m_R \cdot I - m_W \cdot I) + (c_R - c_W)$,
  where $r_1$ is the common ratio and $I$ is the initial value of the inner loop; this needs to be
  evaluated for all values of $p$.

Index variable of inner loop in GP:
—Common ratio depends on an outer loop index that varies in AP: analysis is not possible.
—Common ratio depends on an outer loop index that varies in GP:
  $x - p \cdot y = \log_r(m_R/m_W)$, where $c_W = c_R$ and $r_1 = r_2^{p}$;
  here $r_1$ and $r_2$ are the common ratios of the index variable and the difference.
Table XI. Condition for overlap for various cases when the initial value is dependent on the outer loop.

Index variable of inner loop in AP:
—Initial values depend on the outer loop, and the outer loop index varies in AP:
  $d_2 \cdot (m_W \cdot x - m_R \cdot y) + d_1 \cdot (m_W \cdot p - m_R \cdot p') = m_R \cdot I_1 - m_W \cdot I_1 + c_R - c_W$,
  where $I_1$ is the initial value of the outer loop, $d_1$ is the difference of the outer loop and
  $d_2$ is the difference of the loop under consideration.
—Initial values depend on the outer loop, and the outer loop index varies in GP:
  $d_2 \cdot (m_W \cdot x - m_R \cdot y) + m_W \cdot I_1 \cdot r_1^{p'} - m_R \cdot I_1 \cdot r_1^{p} = c_W - c_R$,
  where $I_1$ is the initial value of the outer loop and $r_1$ is the ratio of the outer loop.

Index variable of inner loop in GP:
—Initial values depend on the outer loop and vary in AP:
  $x - y = \log_{r_2}(m_R \cdot I_1) - \log_{r_2}(m_W \cdot I_1 + m_W \cdot p \cdot d_1)$,
  where $I_1$ is the initial value of the outer loop, $d_1$ is the difference of the outer loop and
  $r_2$ is the ratio of the loop under consideration.
—Initial values depend on the outer loop, and the outer loop index varies in GP:
  $x - y = \log_{r_2}(m_R/m_W) - p \cdot \log_{r_2}(r_1)$,
  where $I_1$ is the initial value of the outer loop, $r_1$ is the ratio of the outer loop and
  $r_2$ is the ratio of the loop under consideration.
—The initial and final values depend on outer loops. The analysis for loops in which the
initial and final values depend on the outer loops but the common difference of
the index variable is a known constant is presented here. The analysis differs for an index
variable that follows a GP versus an AP; both cases are shown in table XI.

The equations given in table X and table XI need to be evaluated for the values of $p$
in the ranges given by equations 6 and 7 below:
$$p > M \cdot (d/d_1) \qquad (6)$$
where $M$ is the number of iterations of the outer loop, $d$ is the common difference of the
inner loop and $d_1$ is the common difference of the outer loop;
$$M \cdot \log_r(r_1) < p \qquad (7)$$
where $M$ is the number of iterations of the outer loop, $r$ is the common ratio of the inner
loop and $r_1$ is the common ratio of the outer loop.
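As a worked example, the divisibility test of equation 3 (the AP-AP case) can be evaluated directly; a sketch assuming all quantities are integers.

#include <cstdlib>
#include <numeric>   // std::lcm, C++17

// Overlap test for the AP-AP case of equation 3: store index mW*i + cW and
// load index mR*i + cR, with index variable of initial value I and common
// difference d. The sequences can overlap only if (cR - cW) + I*(mR - mW)
// is divisible by d; the overlapping accesses then recur in AP with common
// difference lcm(mW*d, mR*d).
bool ap_sequences_overlap(long mW, long cW, long mR, long cR,
                          long I, long d, long* overlap_period) {
    const long num = (cR - cW) + I * (mR - mW);
    if (d == 0 || num % d != 0)
        return false;                 // no integer solution of equation 3
    *overlap_period = std::lcm(std::labs(mW * d), std::labs(mR * d));
    return true;                      // the first point solves equation 3
}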
5.2 Results of FFT
The Custom Instruction pipeline for a 1024 point FFT is realized using ten pipeline stages.
Each stage implements the butterfly operation together with other computations necessary
for pipeline control as a Custom Instruction. Each Custom Instruction is realized using
two CEs, each with a LWMU (refer section 4.1) of 16 entries. Each pipeline stage is given
one dedicated path to access memory. A semi-automatic mapping of the 1024-point FFT
pipeline on a 5×5 fabric is shown in figure 21. This is an example of the actual realization
of the generic Custom Instruction shown in figure 12. The mapping assumes the NoC is
exclusively available for the FFT application.

Fig. 21. A mapping for the Custom Instruction pipeline for a 1024-point FFT. Each ellipse represents a Custom
Instruction. The dark lines represent the paths through which the various Custom Instructions communicate.

In our implementation one input is processed
every sixteen cycles. The ALU used in the CE is a general-purpose ALU; hence the butterfly
operation takes three cycles to complete. The memory model used is a pull model, i.e.
memory waits for a request and then transfers the data. This adds additional latency for
processing the inputs. However, the memory modules can be optimized by adding a push
model or a burst mode for streaming applications. This will have a high impact on the
throughput supported.
To study the performance of the proposed technique we compare the area and power
with a custom ASIC implementation of the same FFT unit. The custom ASIC implemen-
tation uses hardware butterfly compute units and the memory requests are overlapped with
Table XII. Results for a 1024-point FFT implementation: dedicated hardware (custom ASIC) vs. REDEFINE

           REDEFINE                            Custom ASIC                          REDEFINE / Custom ASIC
Library    Area     Frequency   Power          Area     Frequency   Power          Area    Power (normalized
           (mm2)    (MHz)       (mW)           (mm2)    (MHz)       (mW)                   to freq.)
130nm SP   4.76     250         112.6          0.59     167         88.6           8.1     0.85
130nm LL   4.87     167         118.2          0.64     71          45.1           7.6     1.1
90nm SP    4.56     333         129.1          0.44     250         54.8           10.3    1.7
90nm LL    4.87     167         64.0           0.75     66          24.5           6.5     1.0
computation. The custom ASIC implementation is capable of processing one input every
cycle.
Both custom ASIC and REDEFINE implementations of the FFT unit were synthesized
using Synopsys Design Vision. Since the leakage power is sensitive to the technology
node, the design was synthesized using 130nm and 90nm Faraday libraries [Faraday 2008].
Since speed and leakage optimizations on standard cell libraries can impact the power
consumption of the synthesized chip, separate standard cell libraries for Low Leakage
(LL) and Standard Performance (SP) were used. The Low Leakage library was used with
high threshold voltage to minimize leakage power. The High Speed library was used with
low threshold voltage to maximize speed. For all implementations, the operating voltage
of the chip was set to 1 Volt.
Synopsys Prime Power was used for power estimation. The signal toggle rates for power
estimation were derived from a VCD file, generated using the Modelsim simulator [Mentor
Graphics 2008], obtained from a post-synthesis netlist. The same set of test vectors was
used for both the custom ASIC and REDEFINE implementations. The area, power and fre-
quency numbers for the two implementations are tabulated in table XII. On average,
across all the libraries, REDEFINE occupies 8× more area and dissipates 1.16× more
power compared to an ASIC implementation. However, since the throughput of the
ASIC solution is 16× that of the REDEFINE solution, the average energy dissipation of
REDEFINE is 18.6× higher than that of the ASIC solution. The area reported for REDEFINE is
that of a 5×5 fabric (comprising 25 CEs) though only 20 CEs are used. The leakage
power of REDEFINE at 90nm is 31.7mW with the high-speed, low-threshold
library and 5.7µW with the low-leakage, high-threshold library. For the ASIC solu-
tion with the same set of libraries the leakage power is 4.4mW and 2.04µW respectively.
REDEFINE dissipates 4.9× more leakage power than the ASIC solution.
The 1024 point FFT was also mapped on a Xilinx Virtex 5 (xc5vlx50t) FPGA. The
FPGA implementation occupies 1179 slice LUTs and dissipates 334mW (does not include
I/O pad power), running at 121MHz. The Xilinx Virtex 5 is in 65nm technology. After
normalizing the power dissipation for frequency and technology, it is observed that the FPGA
consumes 17.8× more power than the comparable ASIC (90nm SP). This result is in line
with the result reported in [Kuon and Rose 2006]. Our FPGA implementation does
not use hard blocks (for ex: the multipliers present on the FPGA). The FPGA consumes 10×
more power than the equivalent REDEFINE implementation (90nm SP). The Compute
and Transport Metadata in REDEFINE is responsible for runtime reconfiguration. For the
1024 point FFT this amounts to 1.65KB. The configuration information for the 1024 point
FFT FPGA implementation on Virtex 5 amounts to 1716KB, which is about 1000× more
than that of REDEFINE.
5.3 NoC Support for Custom Instruction Pipeline
An in-order delivery is mandatory for data streaming among the HyperOps. In addition, it
is necessary to maintain a guaranteed bandwidth in order to sustain the Custom Instruction
pipeline without stalls. The design of the routers of the interconnect supporting Virtual
Circuits (ViCi) is described in this section.
Fig. 22. Block diagram of the BE-GB router
The implemented router is divided into two different logical sections:
—The first section provides best-effort (BE) routing for the packets
—The next section maintains and terminates virtual circuits, and ensures a guaranteed
bandwidth (GB) and bounded latency.
We call this type of router a BE-GB router (figures 22 and 24). To achieve hardware recon-
figuration, the routers must be able to create a ViCi at any time during execution. The buffers of
the GB section are not accessible to BE-routed traffic. If the buffers of the GB path were
usable by the BE section, they would have to be cleared before a ViCi could be established,
which adds considerable hardware effort, since no packet can be dropped. Using GB buffers
for the BE section is similar to adding an additional virtual channel to the BE section. As
shown in [Joseph et al. 2008], the performance benefit of additional Virtual Channels added
to the input port of the router is negligible compared to the power the necessary logic would
consume. Hence, in our implementation the GB path is not reused for BE traffic even when it is free.
Fig. 23. Packet format of a BE-GB router (a 66-bit Guaranteed Bandwidth part and a 75-bit Best Effort part)
5.3.1 Packet format. Figure 23 displays the packet format. At the input port of the
router, the packet is divided and forwarded to the appropriate sections where they are pro-
cessed independently. At the output port two packets are merged (if data is available for
both sections) and sent to subsequent routers. Due to the arbitration algorithm of the BE
section, a packet from the GB section might be merged with a completely different BE
packet compared to the one on arrival.
The first part of the packet sent over the network belongs to the GB services. It has a
total width of 66 bits including 2 Control Bits as overhead. The NPI indicates that new
data is available, the Tail Indicator bit is used to identify a tail packet to terminate a ViCi.
64 bits are assigned to carry the GB payload. The second part providing BE services is 75
bits wide. It also carries a New Packet Indicator; its 2 Control Bits indicate the type of
the BE payload (header, error, or acknowledgement [ACK]). Based on these control bits
a ViCi is established, maintained, or terminated on the GB path. The next 8 bits are used for
the X-Y relative addressing scheme to the destination. The remaining 64 bits contain the BE
payload.
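The 141-bit packet of figure 23 can be modeled as two sub-packets that travel together on the link but are processed independently inside the router; the field widths follow the text, while the exact bit packing is abstracted away in this sketch.

#include <cstdint>

struct GbPart {            // 66 bits on the wire
    bool     npi;          // New Packet Indicator
    bool     tail;         // Tail Indicator: terminates the ViCi
    uint64_t payload;      // 64-bit GB payload
};

struct BePart {            // 75 bits on the wire
    bool     npi;          // New Packet Indicator
    uint8_t  control;      // 2 Control Bits: header, error, or ACK
    int8_t   x_rel, y_rel; // 8 bits of X-Y relative address
    uint64_t payload;      // 64-bit BE payload
};

struct BeGbPacket {        // merged at the output port, split at the input port
    GbPart gb;
    BePart be;
};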
5.3.2 Best Effort Service. To emulate a non-blocking network, the BE portion of the
router provides four Virtual Channels as discussed in [Dally 1992]. It supports a simple
X-Y relative addressing scheme to limit the control overhead to 11 bits per packet. The
router design is optimized to avoid FIFO buffers at the input port and loop-back at the output
crossbar. Arbitration among packets which are ready to be routed is done using a Round
Robin algorithm. The implementation details and performance analysis of the BE router
can be found in [Joseph et al. 2008].
5.3.3 Guaranteed Bandwidth Service. The Guaranteed Bandwidth (GB) path provides GB
service using ViCis. It also ensures bounded latency after a ViCi is established. To estab-
lish a ViCi, a header packet is sent through the BE path. This packet is processed by the Header
Parsing Logic, which generates control signals for the GB Control Logic. The GB Control Logic
reserves ViCi buffers at each node from source to destination between which the ViCi is
established. When the header packet successfully reaches the destination, an ACK is sent
back to the source. Once the ACK is received, GB packets are routed on the established
ViCi path. When the ViCi needs to be terminated, the source sends a tail packet through the
GB path, which resets the reserved ViCi buffers. Since our router provides one ViCi buffer,
there can be only one ViCi between two adjacent routers in each direction. When the path
cannot be set up, the router at which the failure occurs sends an error packet to the source.
On receiving an error packet, the source sends a tail packet through the GB path to tear
down the partial path.
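From the source's point of view, the setup and teardown handshake reduces to a small state machine; an illustrative sketch of the protocol described above, with hypothetical state names.

enum class ViCiState { Idle, SetupPending, Established };

// React to a BE control packet at the source (only ACK and error packets
// reach the source during setup).
ViCiState on_be_control(ViCiState s, bool is_ack, bool is_error) {
    switch (s) {
        case ViCiState::SetupPending:
            if (is_ack)   return ViCiState::Established;  // GB path is ready
            if (is_error) return ViCiState::Idle;         // send a tail packet to
                                                          // tear down partial paths
            return s;
        case ViCiState::Established:
            return s;  // ended by sending a tail packet on the GB path
        default:
            return s;  // Idle: setup starts by sending a header packet via BE
    }
}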
5.3.4 Related Work. Several research groups have proposed NoC router architectures
with a wide variety of features. The basic design tradeoffs are silicon area, power and frequency
of operation; the performance tradeoffs are latency and throughput.
Fig. 24. Data and control path scheme of the BE-GB router
The Æthereal Network on Chip [Goossens et al. 2005] offers guaranteed services - throughput and latency - in
addition to best effort services. It also offers alternative programming models and router
architectures to facilitate design space exploration. Æthereal NoC uses contention-free
routing, or pipelined time-division-multiplexed circuit switching, to implement its guaran-
teed performance services. However it does not incorporate virtual channels. We provide
virtual channels to emulate a non-blocking network and improve throughput. Our NoC can
provide both GB and BE traffic paths simultaneously.
The Nostrum Network on Chip [Millberg et al. 2004] implements a GB service along-
side the existing BE packet delivery service. The guaranteed bandwidth is pro-
vided by ViCis, which are implemented using a combination of two concepts called Looped
Containers and Temporally Disjoint Networks (TDNs). Looped Containers are used to
guarantee access to the network independently of the current network load without drop-
ping packets. The TDNs are used in order to achieve several ViCis, along with BE traffic.
The virtual circuits are setup semi-statically, i.e. the route is decided at design time, but
the bandwidth is variable at runtime. In our implementation bandwidth is fixed at design
time and not modifiable afterwards.
ViChaR [Nicopoulos et al. 2006] is a dynamic virtual channel regulator for Network on
Chip routers. ViChaR maximizes throughput by dispensing a variable number of virtual
channels on demand. Our implementation has a fixed number of four virtual channels.
ViChaR is only useful if packet sizes differ, resulting in underutilized and expensive buffer
space in the case of many small packets. However, the packet size in the proposed architecture
is always equal and known in advance. A ViChaR implementation would therefore result in
area and power overhead.
The supporting hardware and the compiler framework ensure a short distance be-
tween a traffic source and its destinations. Hence the workload of the NoC mostly
consists of nearest-neighbor patterns, exploiting locality to achieve maximum through-
put and minimum latency. The communication patterns are matched to the static topology
of the network. In [Kim et al. 2008] a NoC is introduced that can be reconfigured
to a particular topology according to the needs of the loaded application. Since
the compiler framework of our approach already takes the topology into account, a reconfigurable
NoC would only add a noticeable overhead in area and power dissipation.
5.4 Results
In REDEFINE 64 routers are connected in a honeycomb topology as shown in figure 4
to form the NoC with a bisection bandwidth of 91.5Gbps. The 8×8 honeycomb NoC is
implemented using RTL code in Verilog HDL and simulated using Mentor Graphics Mod-
elsim [Mentor Graphics 2008] by driving it through a test-bench written in Verilog HDL.
Synopsys Design Compiler is used for synthesis of the NoC. We evaluated the performance
of the NoC for different traffic patterns that are representative of real applications.
Based on the results given in [Joseph et al. 2008], we also tested the GB services with
the same traffic patterns to allow a comparison between BE and GB. The traffic patterns
are random, self-similar and two application-specific patterns that model the execution of
multimedia and DSP kernels on REDEFINE. In random traffic, each tile generates packets with
random destinations. Packets are injected into the network from all ports every clock cycle
whenever the ports are available. Self-similar traffic has been observed in the bursty traffic
between on-chip modules in typical MPEG-2 video applications [Varatkar and Marculescu
2002]. The two application-specific traffic patterns refer to the traffic generated from the traffic
trace given by a topology-aware (honeycomb) simulator of REDEFINE. The application-
specific traffic has a lot of near-neighbor communication. The DSP kernel occupies an area
of 2×2 tiles on the fabric and the multimedia kernel occupies an area of 4×4 tiles.
Latency and Throughput. We compare the throughput and latency for the traffic pat-
terns mentioned earlier on the NoC. Figure 25 shows average network latency vs. injected
load for the different traffic patterns. The latency approaches infinity when the injected traffic is
beyond the saturation throughput. Higher saturation throughput is achieved for application-
specific traffic as compared to the random or self-similar traffic patterns, due to higher near-
neighbor communication across tiles. The average length of the paths from source to
destination is highest for random traffic; hence its saturation occurs much earlier than for
the other traffic patterns. As mentioned earlier, these kernels do not occupy the entire 8×8
fabric and hence the resource congestion is less than that of random traffic.
Throughput is the maximum traffic accepted by the network and it relates to the peak
data rates sustainable by the system. The accepted traffic depends on the rate at which data
is injected into the network. Ideally accepted traffic should increase linearly with injection
load. However, due to the limitation of routing resources, accepted traffic will saturate at a
certain injection load.
For evaluating GB path, non-overlapping ViCis are established in the network along
with the BE traffic using the same traffic patterns. The bandwidth for GB path and BE path
are independent, and once the ViCi is established, the latency of the GB path is bounded.
Figures 26, 27, 28 and 29 show latency variation with offered traffic for both BE and GB
paths for random, self-similar, multimedia kernel, and DSP kernel traffic patterns respec-
tively. As it can be observed from the plots average latency remains constant for GB path.
For BE path as the traffic load is increased on the network, latency increases slowly ini-
tially and latency becomes infinite when the offered traffic reaches saturation throughput
of the network. This due to the limitation of routing resources.
Synthesis Results and Comparison. To get an estimate of the size and speed of the
routers, we synthesized our NoC design for the 90nm Faraday CMOS standard cell library
using Synopsys Design Compiler. Area, power and frequency numbers are tabulated for the
BE-GB router with 4 virtual channels and 1 virtual circuit. Power is measured using
Synopsys Design Vision with an activity factor of 50%.
Table XIII. Synthesis results of a BE-GB router using the 90nm Faraday CMOS library

Router   Area (mm2)   Power (mW)   Frequency (MHz)
BE-GB    0.082572     31.2         909
6. CONCLUSION
In this paper we presented REDEFINE, a Polymorphic ASIC, to obtain programmable
solutions whose performance is comparable to that expected of an ASIC. We described
both the microarchitecture and compiler support required for the same. The microarchi-
tecture comprises a reconfigurable fabric and necessary support logic to execute the ap-
plications. The reconfigurable fabric is an interconnection of tiles in honeycomb topology
where each tile consists of a data-driven Compute Element and a router. We obviate the
overheads of a central register file by providing local storage at each Compute Element and
by delivering data directly to its destination. We synthesized the reconfigurable fabric
using Synopsys Design Vision with the 90nm and 130nm Faraday libraries.
We presented a compiler for REDEFINE to realize applications described in a High
Level Language (for ex: C) on the reconfigurable fabric. The compiler aggregates basic
blocks to form larger acyclic code blocks called HyperOps. Execution of HyperOps in RE-
DEFINE follows a dynamic dataflow schedule incurring management overheads. In order
to accelerate loops in a nested loop hierarchy, the compiler aggregates multiple HyperOps
to form a Custom Instruction. All HyperOps in a Custom Instruction are persistent on the
reconfigurable fabric for all iterations of the loop and hence do not need to be scheduled for
every iteration. The reduced management overheads lead to a performance gain. The Cus-
tom Instructions can be further staged in a pipeline across various iterations of the loop to
accelerate application kernels. This allows us to stream data between the pipeline stages,
as determined by analyzing the memory access sequences, which helps in reducing memory
traffic.
We presented a 1024-point FFT implementation as a case study for the Custom Instruction
pipeline. In the FFT implementation we eliminated 25% of all loads and stores by stream-
ing data directly to the consumer stage. We further realized the Custom Instruction pipeline
on a 5×5 fabric. For a 1024-point FFT implementation, REDEFINE occupies 8×
more area and dissipates 1.16× more power when compared to an ASIC.
Fig. 25. Latency variation with injected traffic for various traffic patterns.
Fig. 26. Latency variation with injected random traffic.
Fig. 27. Latency variation with injected self-similar traffic.
Fig. 28. Latency variation with injected multimedia kernel traffic.
Fig. 29. Latency variation with injected DSP kernel traffic.
Further, the energy consumed is 18.6× higher than that of an ASIC implementation. The REDEFINE
implementation of the FFT consumes 0.1× the power of its corresponding FPGA imple-
mentation. Since the FPGA is configurable at a very fine-grained level, namely the bit
level, the configuration overhead of the FPGA implementation is 1000× more than that of
REDEFINE.
ACKNOWLEDGMENTS
The authors acknowledge the support for this research obtained from the Ministry of Com-
munication and Information Technology, Govt. of India. The authors acknowledge the
many fruitful discussions on this paper with Prof. Ed Deprettere, Prof. Patrick Dewilde,
Dr. Rishiyur Nikhil, Prof. Arvind and Dr. Vinod Kathail, Nitin Chawla and Amar Nath
Satrawala.
REFERENCES
ALLE, M., VARADARAJAN, K., FELL, A., NANDY, S. K., AND NARAYAN, R. 2008. Compiling Techniques
for Coarse Grained Runtime Reconfigurable Architectures. In ARC’09: Proceedings of the 5th IEEE Interna-
tional Workshop on Applied Reconfigurable Computing.
AMBRIC INC. 2007. Am2000 massively parallel processor array.
ASANOVIC, K., BODIK, R., CATANZARO, B. C., GEBIS, J. J., HUSBANDS, P., KEUTZER, K., PATTERSON,
D. A., PLISHKER, W. L., SHALF, J., WILLIAMS, S. W., AND YELICK, K. A. 2006. The Landscape of
Parallel Computing Research: A View from Berkeley. Tech. Rep. UCB/EECS-2006-183, EECS Department,
University of California, Berkeley. Dec.
BAUER, L., SHAFIQUE, M., KRAMER, S., AND HENKEL, J. 2007. RISPP: Rotating Instruction Set Processing
Platform. In DAC ’07: Proceedings of the 44th annual conference on Design automation. IEEE, 791–796.
LATTNER, C. AND ADVE, V. 2004. LLVM: A Compilation Framework for Lifelong Program Anal-
ysis & Transformation. In CGO ’04: Proceedings of the international symposium on Code generation and
optimization. Washington, DC, USA.
CHU, M., FAN, K., AND MAHLKE, S. 2003. Region-based hierarchical operation partitioning for multicluster
processors. In PLDI ’03: Proceedings of the ACM SIGPLAN 2003 conference on Programming language
design and implementation. Vol. 38. ACM Press, New York, NY, USA, 300–311.
COOLEY, J. W. AND TUKEY, J. W. 1965. An algorithm for the machine calculation of complex fourier series.
Mathematics of Computation 19, 90, 297–301.
COWARE INC. 2007. Processor designer.
CYTRON, R., F ERRANTE, J. , ROSEN, B . K., W EGMAN, M. N ., AND ZADECK, F. K . 1991. Efficiently Comput-
ing Static Single Assignment Form and the Control Dependence Graph. ACM Transactions on Programming
Languages and Systems 13, 4 (Oct.), 451–490.
DALLY, W. J. 1992. Virtual-Channel Flow Control. IEEE Transactions on Parallel and Distributed Systems 3, 2,
194–205.
ELEMENT CXI. 2008. ECA 64 http://www.elementcxi.com/.
FARADAY. 2008. UMC Free Library http://freelibrary.faraday-tech.com/.
GOOSS ENS, K. , DIELISSEN, J., AND RADULESCU, A. 2005. Æthereal Network on Chip: Concepts, Architec-
tures and Implementations. IEEE Design & Test of Computers 22, 5, 414–421.
HICKS, J. E., CHI OU, D., ANG, B. S., AND ARVIND. 1993. Performance Studies of Id on the Monsoon
Dataflow System. Journal of Parallel and Distributed Computing [JPDC] 18, 3 (July), 273–300.
INAGAMI, Y. AND FOLEY, J. F. 1989. The specification of a new Manchester Dataflow Machine. In ICS ’89:
Proceedings of the 3rd International Conference on Supercomputing. New York, NY, USA.
JOSEPH, N., REDDY, C. R., VARADA RAJAN, K., ALLE, M., FELL, A., NANDY, S. K., AND NARAYAN, R.
2008. RECONNECT: A NoC for polymorphic ASICs using a Low Overhead Single Cycle Router. In ASAP
’08: Proceedings of the 19th IEEE International Conference on Application specific Systems, Architectures
and Processors.
KIM, M. M., DAVI S, J. D., OSKIN , M., AND AUSTIN, T. 2008. Polymorphic On-Chip Networks. In Proc. 35th
International Symposium on Computer Architecture (35th ISCA’08). ACM SIGARCH, Beijing.
KUON, I. AND ROSE, J. 2006. Measuring the gap between fpgas and asics. In FPGA ’06: Proceedings of the
2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays. ACM, New York, NY,
USA, 21–30.
MAHLKE, S. A., LIN, D. C., CH EN, W. Y., HANK, R. E ., AND BRINGMANN, R. A. 1992. Effective Compiler
Support for Predicated Execution Using the Hyperblock. In MICRO 25: Proceedings of the 25th Annual
International Symposium on Microarchitecture. Portland, Oregon.
ACM Journal Name, Vol. V, No. N, Month 20YY.
REDEFINE ·45
MATHSTAR. 2008. FPOA http://www.mathstar.com/.
MENTOR GRAPHICS. 2008. Modelsim SE http://www.m entor.com/products/fv/digital verification/modelsim se/.
MILLBERG, M., NILSSON, E., THID, R., AND JANTSCH, A. 2004. Guaranteed Bandwidth using Looped
Containers in Temporally Disjoint Networks within the NostrunNetwork on Chip. In DATE. IEEE Computer
Society, 890–895.
NICOPOULOS, C., PARK, D., KIM, J., VIJAYKRISHNAN, N ., YOUSIF, M. S ., AND DAS, C. R. 2006. VichaR:
A Dynamic Virtual Channel Regulator for Network-on-Chip Routers. In MICRO. IEEE Computer Society,
333–346.
PETERSEN, A., PUTNAM, A., MERCALD I, M., SCHWERIN, A., EGGERS, S. J., SWANS ON, S., AND OSKI N,
M. 2006. Reducing control overhead in dataflow architectures. In 15th PACT’06: Proceedings of the 15th
International Conference on Parallel Architecture and Compilation Techniques, E. Altman, K. Skadron, and
B. G. Zorn, Eds. ACM, Seattle, Washington, USA, 182–191.
POSEI DON DESIGN SYSTEM S INC. 2007. Triton builder.
QUICKSILVER. 2008. Adapt2400 http://qstech.com/.
RONG, H., TANG, Z., GOVINDARAJAN, R., DOUI LLET, A. , AND GAO, G. R. 2007. Single-dimension software
pipelining for multidimensional loops. ACM Transactions on Architecture and Code Optimization 4, 1 (Mar.).
S. B. AKERS. 1978. Binary Decision Diagrams. IEEE Transactions on Computers C-27, 6 (June), 509–516.
SANKARALINGAM, K., NAGARA JAN, R., LIU, H., KIM, C., HUH, J., BURGER, D., KEC KLER, S. W., AND
MOORE, C. R. 2003. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In Proc.
30th Ann. Intl Symp. on Computer Architecture (30th ISCA 2003), FCRC’03, ACM Computer Architecture
News (CAN). ACM SIGARCH/IEEE CS, San Diego, CA, 422–432. Published as Proc. 30th Ann. Intl Symp.
on Computer Architecture (30th ISCA 2003), FCRC’03, ACM Computer Architecture News (CAN), volume
31, number 2, UT Austin.
SATRAWALA, A. N., VARADARAJAN, K., ALLE, M., NANDY, S. K., AND NARAYAN, R. 2007. REDEFINE:
Architecture of a SOC Fabric for Runtime Composition of Computation Structures. In FPL ’07: Proceedings
of the International Conference on Field Programmable Logic and Applications.
SWANSON, S., S CHWERIN, A., MERCALDI, M., PETERSEN, A., PUTNAM, A., MICHELS ON, K ., OSKIN, M .,
AND EGGERS, S. J. 2007. The WaveScalar architecture. ACM Transactions on Computer Systems 25, 2
(May), 1–54.
SYNFORA INC. 2007. Aspen.
TAYLOR, M. B., KI M, J., MILLER, J., WENTZLAFF, D., GHODRAT, F., GREENWALD, B., HOFFMAN, H.,
JOHNSON, P., LEE, J.-W., LEE, W., MA, A., SARAF, A., SENESKI, M., SHNIDMA N, N., STRUMPEN, V.,
FRANK, M., AMARASINGHE, S. , AND AGARWAL, A. 2002. The raw microprocessor: A computational fabric
for software circuits and general-purpose programs. IEEE Micro 22, 2, 25–35.
TENSI LICA INC. 2007. Xtensa configurable processors.
VANGAL, S., HOWARD, J., RUHL, G., DIGHE, S., WILS ON, H., TSCHANZ, J., FINAN, D., IYER, P., SINGH,
A., JACOB, T., JAIN, S., VENKATARAMAN, S., HOSKOTE, Y., AND BORKAR, N. 2007. An 80-Tile
1.28TFLOPS Network-on-Chip in 65nm CMOS. Solid-State Circuits Conference, 2007. ISSCC 2007. Di-
gest of Technical Papers. IEEE International, 98–589.
VARATKAR, G. AND MARCU LESCU, R. 2002. Traffic Analysis for On-Chip Networks Design of Multimedia
Applications. In DAC 2002: Proceedings of the 39th Design Automation Conference. ACM Press, New York,
795–800.
VINOD, K., ARVIND,AND PINGALI, K. 1980. A Dataflow Architecure with tagged Tokens. Tech. Rep.
MIT/LCS/TM-174, Massachusetts Institute of Technology, Laboratory for Computer Science. Sept.
Received Month Year; revised Month Year; accepted Month Year