Communication-Aware Module Placement for Power Reduction in Large FPGA Designs

Kalindu Herath, Alok Prakash, Udaree Kanewala, Thambipillai Srikanthan
Nanyang Technological University, Singapore
{kalindub001@e, alok@, n1702552l@e, astsrikan@}ntu.edu.sg
Abstract—Modern multi-million logic FPGAs allow hardware designers to map increasingly large designs into FPGAs. However, traditional FPGA CAD flows scale poorly for large designs, often producing low quality solutions in terms of performance and power in such cases. To improve design productivity, modular design methodology partitions a large design into subsystems, compiles them individually and finally collates the individual solutions to complete the mapping process. Existing work has attempted to partition large designs into smaller subsystems, based on the intra-subsystem communication, to reduce routing power dissipation. However, inter-subsystem communication has not been considered, especially during the placement stage. In this work, we first show the adverse effect of ignoring the inter-subsystem communication during the placement stage. Next, we propose an inter-subsystem communication-aware placement technique using a Simulated Annealing based approach to achieve significant power savings. Experimental results show over 7% reduction in routing power when compared to the existing state-of-the-art partitioning flow that ignores inter-subsystem communication, while the routing power reduction is over 11% when compared to a commercial CAD tool such as Altera Quartus.
Keywords-FPGA, CAD, modular design, placement
I. INTRODUCTION
Rapid scaling of transistors has enabled commercial Field Programmable Gate Array (FPGA) manufacturers like Altera [1] and Xilinx [2] to produce FPGAs with millions of logic gates alongside diverse hard Intellectual Property (IP) cores like Digital Signal Processors (DSPs) and Block RAMs (BRAMs) with varying capacities. This resource richness encourages designers to port immensely large designs onto modern FPGAs. Apart from the resource abundance, the lower non-recurring engineering (NRE) cost and shorter time-to-market (TTM) with respect to Application Specific Integrated Circuits (ASICs) [3] have also made FPGAs popular.
Commercial FPGA Computer-Aided Design (CAD) tools
are well known to produce high quality mappings for small
to medium-scale designs. However, compared to the rapid
development in FPGAs, the available FPGA CAD tools do
not show sufficient maturity yet to efficiently map large
designs [4]. This typically results in inferior quality-of-result
(QoR) in terms of performance and power consumption for
such designs. Longer compilation time [5] is also commonly
observed with existing CAD flows, which eventually lowers
design productivity and adversely affects the TTM. Among
the major steps in CAD flow, the placement step is known
to consume a significant amount of time, often taking close
to 50% of the total CAD runtime [6]. Hence, commercial
FPGA vendors as well as the research community have
focused on improving the tools used for the placement step.
Existing placement techniques can be categorized into
three types, namely, Simulated Annealing (SA)-based,
partition-based and analytical. Altera uses parallel SA tech-
niques [7] in their Q2P [6] placer to improve their Quartus
CAD tool, while Xilinx incorporates analytical placement
techniques [8] to improve theirs. Despite the current endeav-
ors, significant effort is required to achieve better runtime
and performance while mapping large FPGA designs.
Recently, Modular Design Methodology (MDM) has been
gaining traction in both the research as well as commercial
communities, especially in order to reduce the runtime
of the existing CAD flows for large designs. The MDM
technique divides a large design into a number of smaller
modules. This technique is inspired by the fact that some
sections in a design, for instance, board support package
or external memory controllers, are not frequently changed
during design time. Such sections can be compiled and
mapped into an FPGA once, while continuing to iterate the
rest of the sections as required. As a result, MDM enables
code reuse and is effective in reducing the compile time as
well as in improving design productivity. Many approaches
which consider the reuse of pre-compiled modules (macros)
and IP cores along with library-based design can be seen in the literature. Dividing a large design into smaller modules in the
MDM technique also allows designers to leverage the exist-
ing CAD tools that already produce high quality mappings
for small to medium scale designs. However, MDM still
suffers from performance degradation and does not explicitly
focus on developing power-efficient solutions [9].
A significant contributor to the dynamic power consump-
tion in FPGAs is the charging and discharging of capacitive
loads that occur during the flow or communication of data
via the interconnect fabric [10] [11], especially at long
distances. This issue is further exacerbated by frequent com-
munication over long distance links [9]. Hence, to reduce
the overall power consumption in FPGAs, it is necessary
to reduce the long-distance links, especially for high fre-
quency communications. Authors in [9] proposed to divide
a large FPGA design into modules such that frequently
communicating entities fall within a single module, thereby
maximizing intra-module communication frequency. In the
subsequent compilation step, all the design entities from
these modules were placed in close proximity by the CAD flow, resulting in reduced interconnect power dissipation. However, the authors did not consider the inter-module communication during the placement step, which can potentially result in further power savings.
In this paper, we propose an inter-module communication-aware placement strategy that builds on the work in [9] to further lower the interconnect power dissipation of large designs on FPGAs.
The rest of the paper is organized as follows. In Section II, we briefly review related work, followed by a motivational example in Section III. In Section IV, we explain the proposed methodology. Next, we show the effectiveness of the technique in Section V and conclude the paper in Section VI.
II. RELATED WORK
MDM includes two phases: module creation and module assembly. During the first phase, a large design is divided into modules. Each module is treated as an isolated design and can hence go through the CAD flow independently. During the next phase, the pre-compiled modules are assembled and the inter-module links are routed to obtain the programmable bitstream of the entire design. Placement techniques used in the module assembly phase are discussed in this section.
Frontier [12] is a placement technique based on pre-placed
IP cores, also known as macro-blocks. These macros are
combined to form fixed sized and shaped clusters and then
assigned to FPGA regions while minimizing the timing and
wiring costs. Pre-compiled macro-block based design flows have been discussed in BPR [5] and HMFlow [13]. BPR introduces a new FPGA CAD tool for fast compilation of FPGA circuits, where the placement of each macro-block is selected from a database such that the distance and the expected routing congestion between macro-blocks are minimized. A modified version of VPR's SA placement algorithm [14] is used here to resolve the placement overlaps that occur due to the non-atomic type and size of the blocks. HMFlow focuses on accelerating the compile time of designs having a 50% or less FPGA resource utilization. Three module placement algorithms are discussed in their work: a recursive bi-partitioning placer, a random placer and a heuristic placer. The heuristic placer, which is the fastest among them, considers the amount of connectivity between hard macros and prioritizes large macros and macros with BRAMs and DSPs.
Qflow [15] separates the design into invariant and evolv-
ing sets. Placement of evolving modules is handled by an SA
placer specialized for bigger (coarse-grained) modules. In
[16], a cluster placement methodology provides an estimation of each cluster's wire-length and criticality to the annealer. A floor planner based on Mixed Integer Linear Programming (MILP) [17] allows the designer to customize the objective function such that metrics like total wire length, area occupancy and aspect ratio can be weighted and incorporated as a linear combination. In [18], an analytical placer that optimizes wire-length is used instead of an SA-based placer. Power-aware CAD techniques, for instance [19] and [20], have not discussed their suitability for large designs. On the other hand, scalable techniques mainly focus on improving performance rather than power consumption. Although MDM is a scalable proposition, its module placement phase aims at improving compile time while optimizing wire-length or performance. Previous work lacks a module placement technique that considers both data communication and power consumption.
III. MOTIVATION
There are two major modes of power dissipation in FPGAs: (i) static power and (ii) dynamic power. The former is due to leakage and subthreshold currents, whereas the latter is due to the charging and discharging of capacitive loads. A typical island-style FPGA consists of columns of Configurable Logic Blocks (CLBs) interleaved with BRAM and DSP columns, with input/output (IO) pins surrounding the chip area. The interconnect fabric, which connects CLBs, BRAMs, DSPs, and IO pins to each other, consumes 80% of the total chip area. As a result, the dynamic power consumption of the interconnect fabric dominates the total dynamic power dissipation. The dynamic power dissipation (P) of an FPGA is modeled as follows [19]:
P = \frac{1}{2} \sum_{i \in nets} C_i \alpha_i v_{dd}^2   (1)

where nets is the set of all connections between the components of the FPGA, C_i is the capacitance of net i, α_i is the signal toggle rate of net i (analogous to communication in this paper) and v_dd is the operating voltage.
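As a minimal numerical illustration of Equation (1) (the capacitance and toggle-rate values below are made up for illustration and are not taken from the paper), the model simply sums the toggle-weighted net capacitances and scales them by the square of the supply voltage:

    # Illustrative only: per-net capacitance (F) and toggle rate (toggles/s)
    # are assumed values, not measurements from the paper.
    nets = [
        {"C": 1.2e-12, "alpha": 50e6},   # short intra-subsystem net
        {"C": 8.5e-12, "alpha": 50e6},   # long inter-subsystem net, same activity
    ]
    vdd = 1.2  # operating voltage (V)

    P = 0.5 * sum(net["C"] * net["alpha"] * vdd ** 2 for net in nets)
    print(f"dynamic routing power = {P * 1e3:.3f} mW")
    # The longer net dominates the total for the same toggle rate, which is
    # why high-activity links should be mapped onto short wires.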
In [9], the authors have explained that in a typical application, some code sections execute more frequently than the rest. Hence, the data communication of these code sections is higher as well. Consequently, on the FPGA, nets corresponding to such code segments show a higher toggle rate α_i due to data communication. The authors of [9] have proposed to break down a large design into a set of modules based on the communication between nodes. We refer to these modules as subsystems throughout this paper. Nodes with high communication activity are included in the same subsystem and the CAD tool is instructed to place them in close proximity. Having shorter routing wires for such connections reduces both C_i and α_i, reducing the overall dynamic power dissipation considerably. Ideally, the subsystem creation should not introduce large inter-subsystem communication. While the approach manages to have shorter routing wires for the connections within the subsystems, it does not guarantee that the communication between the subsystems will also be minimized at the same time. Hence, a poor placement of such subsystems on an FPGA can allocate longer interconnect wires to the links with high inter-subsystem communication, thereby increasing dynamic power dissipation. The work in [9] maps subsystems into FPGAs using the existing CAD placement, which does not consider communication during the placement stage.
We elaborate on this aspect using the following example. Consider a design with 7 design units, illustrated as a graph in Figure 1(a). Each node is annotated with its area requirement on the FPGA and its intra-node communication as [l, m, d][α], where l, m and d represent the required number of CLBs, BRAMs and DSPs, and α represents the communication within the node/subsystem. An edge is a connection between two design units over which data flows, and it is annotated with the communication between the nodes/subsystems.
Figure 1: Subsystem creation based on communication: (a) design units annotated with area and communication requirements, (b) resulting network of subsystems S1-S3.
Figure 2: Mapping subsystems to FPGA without placement information.
Performing a partitioning technique could result in a network of subsystems as shown in Figure 1(b). At this point, previous work has guided the CAD flow to group the relevant design units into subsystems, but the location of these subsystems on the FPGA is decided by the CAD flow. However, we observe that the CAD flow could map subsystems with high communication between them far apart. For instance, the subsystems in Figure 1(b) could be mapped onto the FPGA as shown in Figure 2. The mapping in Figure 2(a) would use shorter wires for the high inter-subsystem communication links than the mapping in Figure 2(b), leading to lower power consumption for option (a) than for option (b). In this paper, we strive to explicitly account for inter-subsystem communication during placement to achieve further power savings.
IV. METHODOLOGY
In this section, we describe the communication-aware subsystem placement methodology in detail. Similar to some existing workflows on subsystem placement in MDM, we use an SA-based algorithm. The inputs to our approach are (i) a network of subsystems and (ii) a target FPGA architecture model, and the output is the placement of each subsystem on the target FPGA space.
A. Problem Formulation
A network of subsystems can be represented as an undirected graph G(V, E), where V is the set of vertices, each representing a subsystem, and E is the set of edges. An edge between two vertices indicates that there are one or more inter-subsystem connections between the components of the two subsystems. Each edge is associated with a cost value α that represents the inter-subsystem communication frequency between the associated subsystems. Each subsystem is annotated with an area requirement A(l, m, d), where l, m and d represent the required number of CLBs, BRAMs and DSPs respectively.

Given a network of subsystems G(V, E), the area requirement A_v of each subsystem v ∈ V, and the communication frequency α_e of each inter-subsystem connection e ∈ E, the placement problem is to find a set of non-overlapping rectangular regions L_v on the FPGA space, one for each subsystem v, such that L_v satisfies the area requirement A_v.
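A minimal sketch of this formulation in Python may help fix the notation (the class names are ours, and the edge-weight pairing below is only illustrative of Figure 1(b), not taken verbatim from it):

    from dataclasses import dataclass
    from typing import Dict, Tuple

    @dataclass
    class Subsystem:
        name: str
        area: Tuple[int, int, int]     # A_v = (CLBs, BRAMs, DSPs)

    @dataclass
    class Region:
        x: int                         # bottom-left column of L_v
        y: int                         # bottom-left row of L_v
        w: int                         # width of L_v
        h: int                         # height of L_v

    # Vertices V with their area requirements and edges E with weights alpha_e.
    V = {s: Subsystem(s, (120, 20, 0)) for s in ("S1", "S2", "S3")}
    E = {("S1", "S2"): 140, ("S1", "S3"): 60, ("S2", "S3"): 80}

    # A placement assigns one rectangular region L_v to each subsystem v;
    # the optimizer must keep the regions non-overlapping and large enough
    # to satisfy each A_v.
    placement: Dict[str, Region] = {}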
B. Algorithm
1) Simulated Annealing Algorithm: We explain our SA-based placement methodology in Algorithm 1. A typical SA run begins with a random initial placement solution s (line 2). Placement in a conventional CAD flow allocates atomic components such as CLBs and BRAMs on the FPGA space. In subsystem placement, however, a number of such atomic components need to be placed together. Instead of deciding the location of each atomic component, subsystem placement decides a region in which all the components that belong to a subsystem can be placed on the FPGA. COST(s) (line 3) refers to the placement quality of solution s. The temperature T is analogous to the current iteration of annealing, and there can be n iterations (line 5) before the algorithm converges to a solution. In each iteration of the algorithm, the solution of the previous iteration is altered by moving one or more subsystems within the FPGA space (line 8), producing a neighbor solution s_N. If s_N is better than the reported best s_best, s_N is recorded as the best solution so far (lines 11-12). Even if the new solution s_N is not the reported best solution, it is still accepted as the current solution if it satisfies a probability function P (lines 13-14).
2) Neighbor generation: In each iteration, a subsystem is selected randomly and moved within the FPGA space to create a neighbor solution. The maximum allowed movement of the selected subsystem along the horizontal and vertical directions of the FPGA space (δx_max, δy_max) is a function of T, where δx_max and δy_max gradually decrease as T reduces. Once the maximum movement for the current T is obtained, the selected subsystem is moved within the range [-δx_max, δx_max] horizontally and [-δy_max, δy_max] vertically.
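The following Python sketch mirrors the structure of Algorithm 1 and the neighbor move described above, under our own simplifying assumptions: the cost, overlap and overlap-budget functions are passed in as callables, a linear decay stands in for F(T), and a Metropolis-style acceptance test stands in for the probability function P. It reuses the region representation from the sketch in Section IV-A.

    import math
    import random
    from collections import namedtuple

    Region = namedtuple("Region", "x y w h")

    def f_max_move(T, T0, span):
        # One possible F(T): the maximum move shrinks linearly with temperature.
        return max(1, int(span * T / T0))

    def neighbor(placement, T, T0, fpga_w, fpga_h):
        s_new = dict(placement)
        name = random.choice(list(s_new))          # SELECTSUBSYSTEM
        r = s_new[name]
        dx_max = f_max_move(T, T0, fpga_w)
        dy_max = f_max_move(T, T0, fpga_h)
        dx = random.randint(-dx_max, dx_max)
        dy = random.randint(-dy_max, dy_max)
        # Keep the moved region inside the FPGA boundaries.
        x = min(max(r.x + dx, 0), fpga_w - r.w)
        y = min(max(r.y + dy, 0), fpga_h - r.h)
        s_new[name] = Region(x, y, r.w, r.h)
        return s_new

    def anneal(s, cost, overlap, max_overlap, T0, fpga_w, fpga_h):
        best, best_cost = s, cost(s)
        cur_cost = best_cost
        for T in range(T0, 0, -1):                 # T is a positive integer here
            s_new = neighbor(s, T, T0, fpga_w, fpga_h)
            while overlap(s_new) > max_overlap(T): # re-draw until overlap is legal
                s_new = neighbor(s, T, T0, fpga_w, fpga_h)
            new_cost = cost(s_new)
            if new_cost < best_cost:
                best, best_cost = s_new, new_cost
            delta = new_cost - cur_cost
            if delta < 0 or math.exp(-delta / T) > random.random():
                s, cur_cost = s_new, new_cost      # accept as current solution
        return best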
3) Shape of the subsystems: Subsystems that require multiple resource types (CLBs and BRAMs) are split into sub-modules such that each sub-module contains only one resource type. For instance, as shown in Figure 3(a), subsystem S3 with area requirement A3 = (120, 20, 0) is divided into two sub-modules whose area requirements are A3 = (120, 0, 0) and A3.1 = (0, 20, 0) respectively. We set the communication between sub-modules S3 and S3.1 to a very large number (N), while the original edges in the graph G are preserved. This subdivision of subsystems is done to cater for the area requirement A_v during neighbor generation. For instance, moving a subsystem having both CLBs and BRAMs from location L1 to L2 might not find
Algorithm 1 Communication-aware subsystem placement
1:  procedure SIMULATEDANNEALING
2:      s ← INITPLACEMENT()
3:      φ ← COST(s)
4:      s_best ← s, φ_best ← φ
5:      T0 ← n
6:      for T ← T0 ... 0 do
7:          do
8:              s_N ← NEIGHBOR(s, T)
9:          while MAXOVERLAP(T) < OVERLAP(s_N)
10:         φ_N ← COST(s_N)
11:         if φ_N < φ_best then
12:             s_best ← s_N, φ_best ← φ_N
13:         if P(φ, φ_N, T) > RAND() then
14:             s ← s_N, φ ← φ_N
15:     return s_best
16: procedure NEIGHBOR(s, T)
17:     c ← SELECTSUBSYSTEM(s)
18:     δx_max = F(T), δy_max = F(T)
19:     δx ← RAND(-δx_max, δx_max)
20:     δy ← RAND(-δy_max, δy_max)
21:     c(x, y) ← c(x + δx, y + δy)
22:     return s
BRAMs at L2. However, in this work, we do not consider the impact of the shape of a subsystem on the FPGA space. Therefore, in each iteration, we set the shape of sub-modules with CLBs to have a height-to-width ratio close to 1, as shown in Figure 3(b). It has been shown in [21] that shapes with an aspect ratio close to 1 help in reducing the wire length within a subsystem.
4) Overlap of subsystems: Typical SA placement algorithms do not allow overlaps of atomic components during each iteration. However, since subsystems are non-atomic components, it is difficult to produce a non-overlapping solution in every iteration. The search space is also limited if overlapping of subsystems is not allowed, which might cause the algorithm to converge to a local minimum. To avoid such situations, our approach allows subsystems to overlap. We use an exponential function of T to decide the maximum allowable overlap of subsystems, which reaches zero as T decreases. The amount of overlap is counted as the area of the overlapped region. Since we divide the subsystems into sub-modules with a single resource type, the intersection of S2 and S3.1 in Figure 3 is not considered an overlap.
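A sketch of the overlap measure described above, assuming axis-aligned rectangular regions and an exponential overlap budget; the helper names and the exact decay shape are our assumptions, not the paper's exact implementation:

    import math
    from collections import namedtuple
    from itertools import combinations

    Region = namedtuple("Region", "x y w h")

    def rect_overlap_area(a, b):
        # Area of the intersection of two axis-aligned regions (0 if disjoint).
        w = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
        h = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
        return max(0, w) * max(0, h)

    def total_overlap(placement, resource_of):
        # Sub-modules of different resource types (e.g. a CLB sub-module over a
        # BRAM sub-module, like S2 and S3.1 above) are not counted as overlaps.
        total = 0
        for (n1, r1), (n2, r2) in combinations(placement.items(), 2):
            if resource_of[n1] == resource_of[n2]:
                total += rect_overlap_area(r1, r2)
        return total

    def max_allowed_overlap(T, T0, budget0):
        # Exponentially decaying overlap budget that reaches 0 as T reaches 0.
        return budget0 * (math.exp(T / T0) - 1) / (math.e - 1)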
5) Congestion model: Routability evaluation is an important step in a placement algorithm. For instance, placing a large number of components in a small area could lead to an excessive requirement of routing resources. The routing process avoids highly congested areas and may therefore use longer wires to connect two components, degrading the quality of result (performance and power). In our approach, we incorporate a bounding-box based congestion analysis to estimate the routing congestion, similar to [22]. In this model, we consider the FPGA space as a 2D array, which we call the
Figure 3: (a) Split subsystems, (b) Shape of subsystems
Figure 4: Congestion model: (a) bounding boxes for inter-subsystem connections, (b) number of bounding-box overlaps per congestion map element
congestion map. All congestion map elements are initialized to zero at the start. For each inter-subsystem connection in Figure 1(b), a bounding box can be formed as shown in Figure 4(a); note that the bounding box for the edge between S1 and S3 is not shown. Next, the congestion map elements covered by each bounding box are incremented. Overlapping bounding boxes therefore reflect highly congested regions. The congestion map corresponding to the subsystem placement shown in Figure 4(a) is shown in Figure 4(b). We update the congestion map in each iteration of the algorithm.
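A possible sketch of this congestion map follows; the grid resolution (one element per column/row position) and the helper names are our assumptions:

    from collections import namedtuple

    Region = namedtuple("Region", "x y w h")

    def congestion_map(placement, edges, cols, rows):
        # One counter per congestion-map element; each inter-subsystem
        # connection increments every element covered by the bounding box
        # that encloses its two subsystem regions.
        cmap = [[0] * cols for _ in range(rows)]
        for (u, v), _alpha in edges.items():
            a, b = placement[u], placement[v]
            x_lo, x_hi = min(a.x, b.x), max(a.x + a.w, b.x + b.w)
            y_lo, y_hi = min(a.y, b.y), max(a.y + a.h, b.y + b.h)
            for y in range(max(y_lo, 0), min(y_hi, rows)):
                for x in range(max(x_lo, 0), min(x_hi, cols)):
                    cmap[y][x] += 1
        return cmap

    def max_congestion(cmap):
        return max(max(row) for row in cmap)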
6) Cost Function and constraints: The main purpose of our work is to avoid long routing wires, especially for links with high inter-subsystem communication. Therefore, our objective is to minimize the cost function

COST(s) = \sum_{e \in E} l_e \alpha_e   (2)

where l_e is the half-perimeter wire length of the bounding box enclosing the subsystems connected by edge e, and α_e is the inter-subsystem communication of e. While minimizing the cost function, we also consider a routing congestion parameter C: for a valid placement, the maximum congestion derived above must be less than the empirically determined threshold C.
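Equation (2) and the congestion constraint can be sketched as follows (helper names are ours; the half-perimeter is taken over the bounding box enclosing both subsystem regions, and congestion_map/max_congestion refer to the sketch above):

    from collections import namedtuple

    Region = namedtuple("Region", "x y w h")

    def hpwl(a, b):
        # Half-perimeter wire length of the bounding box enclosing regions a, b.
        width = max(a.x + a.w, b.x + b.w) - min(a.x, b.x)
        height = max(a.y + a.h, b.y + b.h) - min(a.y, b.y)
        return width + height

    def cost(placement, edges):
        # Equation (2): communication-weighted half-perimeter wire length.
        return sum(alpha * hpwl(placement[u], placement[v])
                   for (u, v), alpha in edges.items())

    def is_valid(placement, edges, cols, rows, C):
        # A placement is accepted only if its peak congestion stays below
        # the empirically chosen threshold C.
        cmap = congestion_map(placement, edges, cols, rows)
        return max_congestion(cmap) < C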
C. Implementation on FPGA
It is important to note that our communication-aware subsystem placement approach does not depend on a specific CAD tool. For the implementation and evaluation of our methodology, we use Altera's Quartus CAD flow [23], but an equivalent approach could be followed with the Xilinx CAD flow as well [24]. The first phase of our subsystem placement approach partitions a
Figure 5: Communication-aware mapping methodology
large design into subsystems as done in [9]. This partitioning technique works at the entity/module level of an RTL design. Entities which have high communication between them are grouped into subsystems. For that, a given RTL design is processed to extract the following details:
1) Area of the RTL entities: The RTL synthesis step in the Quartus CAD flow estimates the resource requirement of each entity. This step is faster than the full CAD compilation.
2) Communication frequency between entities: An RTL simulation can dump all the signal toggles of the desired connections between entities.
3) Connectivity of entities: Entity-level RTL connection information is extracted by text processing of the code.
Once these parameters are extracted, the communication-aware partitioning algorithm forms subsystems such that the intra-subsystem communication is much higher than the inter-subsystem communication. The output of the partitioning framework is a list of subsystems and a list of design entities that belong to each subsystem. We modified the output slightly in order to obtain (i) a graph representation of the subsystems, (ii) the inter-subsystem communication frequency α_e of each connection e between subsystems, and (iii) the area requirement A_v of each subsystem v. Additionally, our placement algorithm needs the floorplan information of the target FPGA architecture, namely the positions of the CLB, BRAM and DSP columns and the number of resource columns and rows in the floorplan. The placement algorithm produces, for each subsystem v, (i) the bottom-left coordinate (x_v, y_v) of its rectangular region in the FPGA space and (ii) the width w_v and height h_v of that region. We use these parameters to invoke the Quartus LogicLock feature during compilation to define the subsystem regions on the FPGA. Note that in [9], no placement information is given to the LogicLock feature, allowing the CAD flow to decide the placement of each subsystem, whereas this work overrides the tool's placement decision with our placement information.
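For concreteness, the placement output (x_v, y_v, w_v, h_v) can be turned into per-subsystem region constraints. The sketch below emits settings lines in the style of Quartus LogicLock assignments; the exact assignment names and coordinate format are recalled from the LogicLock documentation [23] rather than verified here, so treat them as an approximation, and the subsystem name and entity path are hypothetical:

    def logiclock_assignments(region_name, entity_path, x, y, w, h):
        # Emit .qsf-style lines that pin one subsystem to a fixed rectangle.
        # Assignment names (LL_ORIGIN, LL_WIDTH, ...) follow the LogicLock
        # convention but should be checked against the tool's documentation.
        return [
            f"set_global_assignment -name LL_ORIGIN X{x}_Y{y} -section_id {region_name}",
            f"set_global_assignment -name LL_WIDTH {w} -section_id {region_name}",
            f"set_global_assignment -name LL_HEIGHT {h} -section_id {region_name}",
            f"set_global_assignment -name LL_STATE LOCKED -section_id {region_name}",
            f"set_instance_assignment -name LL_MEMBER_OF {region_name} "
            f"-to \"{entity_path}\" -section_id {region_name}",
        ]

    # Example: hypothetical subsystem S1 placed at column 10, row 4 in a 12x10 region.
    for line in logiclock_assignments("S1", "subsystem_s1|*", 10, 4, 12, 10):
        print(line)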
V. RESULTS AND DISCUSSION
Next, we discuss the performance of the proposed communication-aware placement strategy. We select a set of applications coded in RTL for the evaluation. Compilation reports produced by the Quartus CAD flow are used for the comparison.
A. Comparison strategy
We compare our methodology with two existing compilation flows: (i) the original Quartus compilation without subsystems and (ii) Quartus compilation with subsystems but without placement information, as proposed in [9]. A full Quartus compilation is performed for these two cases and for the proposed solution, followed by a gate-level simulation to obtain the most accurate power estimations from Quartus PowerPlay. Since we mainly target reducing the power dissipation in the FPGA interconnect, we compare the routing power dissipation of the three approaches. Since the operating frequency affects the power dissipation, the routing power is measured at a common operating frequency for all tests.
B. Benchmark Applications and Target FPGAs
The selected benchmark applications are similar to the applications in [9]. We use the three handwritten applications presented in [9]. In addition, two applications have been developed by modifying applications from the popular polybench [25] benchmark suite. We further extend this polybench-inspired benchmark set with three more applications by following the same concept for the Gesummv, Cholesky and Symm kernels to develop gesummv′, cholesky′ and symm′ respectively. In addition, for a comprehensive evaluation, the benchmark set is mapped onto three different devices from the Altera Cyclone IV and Stratix IV series. Although the methodology is compatible with any device series, including the latest Cyclone V and Cyclone 10 series (or the latest Xilinx devices), gate-level simulation, which is required to obtain a precise power measurement, is not supported for the latest series. It should be noted that this is not a limitation of the proposed work. Therefore, Cyclone IV and Stratix IV were the latest series we could use to show the effectiveness of the methodology.
C. Results and Discussion
The three hand-coded applications are mapped to the Cyclone IV EP4CGX50 device, gesummv′ is mapped to the slightly bigger Stratix IV EP4SGX70 device and the rest of the applications are mapped to the Cyclone IV EP4CGX110 device. Each application is compiled in the three different test versions stated above. Table I reports the percentage routing power reduction of the new methodology over (i) the default Quartus compilation and (ii) [9]. As observed from these results, the proposed communication-aware placement technique for subsystem placement during the CAD flow helps to further reduce routing power. In particular, the proposed approach reduces the routing power of the benchmark applications by 3.55% to 10.75% (7.08% on average) when compared to [9]. The routing power reduction, when compared to the default Quartus compilation, ranges from 3.69% to 18.21% for the application set, with an average of 11.51%.
Table I: Routing Power Saving

Application   Target Device   P_routing reduction over default   P_routing reduction over [9]
Synth1        EP4CGX50        18.21%                             8.10%
Synth2        EP4CGX50        16.40%                             10.49%
Synth3        EP4CGX50        15.91%                             6.19%
atax′         EP4CGX110       12.69%                             10.75%
bicg′         EP4CGX110       3.69%                              3.55%
gesummv′      EP4SGX70        6.68%                              5.48%
cholesky′     EP4CGX110       8.19%                              6.05%
symm′         EP4CGX110       10.31%                             6.05%
The effectiveness of communication-aware subsystem placement depends on how the application is partitioned into subsystems. Some partitionings might not create connections with high inter-subsystem communication; for instance, the inter-subsystem links in bicg′ are not as significant as those of applications like Synth1 or atax′. As a result, the effectiveness of this methodology depends on the inter-subsystem communication. One could argue that a placement technique would not be necessary if all the highly communicating nodes were included in the same subsystem during partition generation. However, it should be remembered that making larger subsystems reduces the effectiveness of creating subsystems in the first place, since it leads back to the original problem of mapping large designs on the FPGA. Instead, the current subsystems are generated to strike a balance between size and intra-subsystem communication, as described in [9]. In the future, we will explore the subsystem generation step with both intra- as well as inter-subsystem communication frequencies in order to achieve even better results.
VI. CONCLUSION
In this paper, we presented an inter-subsystem communication-aware placement technique that produces high quality solutions with lower power consumption, especially for large FPGA designs. The proposed Simulated Annealing based approach reduces the distance between subsystems with high inter-subsystem communication. This approach reduces the routing power consumption by over 7% when compared to an existing methodology that does not consider the inter-subsystem communication frequency, and by over 11% when compared to a commercial CAD tool such as Quartus. In the future, we propose to also evaluate the footprints of subsystems for further power savings.
ACKNOWLEDGMENT
This project was partially funded by the National Research
Foundation Singapore under its Campus for Research Excel-
lence and Technological Enterprise (CREATE) programme
with the Technical University of Munich at TUMCREATE.
REFERENCES
[1] Altera. [Online]. Available: https://www.altera.com/
[2] Xilinx. [Online]. Available: https://www.xilinx.com/
[3] S. M. Trimberger, “Three Ages of FPGAs: A Retrospective on the First Thirty Years of FPGA Technology,” Proceedings of the IEEE, 2015.
[4] H. Bian et al., “Towards scalable placement for FPGAs,” in Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays, 2010.
[5] J. Coole et al., “BPR: fast FPGA placement and routing using
macroblocks,” in Proceedings of the eighth IEEE/ACM/IFIP
international conference on Hardware/software codesign and
system synthesis, 2012.
[6] M. An et al., “Speeding up FPGA placement: Parallel
algorithms and methods,” in Field-Programmable Custom
Computing Machines (FCCM), 2014 IEEE 22nd Annual
International Symposium on, 2014.
[7] A. Ludwin et al., “Efficient and deterministic parallel place-
ment for FPGAs,” ACM Transactions on Design Automation
of Electronic Systems (TODAES), 2011.
[8] T.-H. Lin et al., “An efficient and effective analytical placer
for FPGAs,” in Proceedings of the 50th Annual Design
Automation Conference, 2013.
[9] K. Herath et al., “Communication-aware Partitioning for Energy Optimization of Large FPGA Designs,” in Proceedings of the Great Lakes Symposium on VLSI 2017, 2017.
[10] L. Shang et al., “Dynamic power consumption in Virtex-II
FPGA family,” in Proceedings of the 2002 ACM/SIGDA tenth
international symposium on Field-programmable gate arrays.
[11] T. Tuan et al., “A 90nm low-power FPGA for battery-powered
applications,” in Proceedings of the 2006 ACM/SIGDA 14th
international symposium on Field programmable gate arrays.
[12] R. Tessier, “Fast placement approaches for FPGAs,” ACM Transactions on Design Automation of Electronic Systems (TODAES), 2002.
[13] C. Lavin et al., “HMFlow: Accelerating FPGA compila-
tion with hard macros for rapid prototyping,” in Field-
Programmable Custom Computing Machines (FCCM), 2011
IEEE 19th Annual International Symposium on, 2011.
[14] V. Betz et al., “VPR: A new packing, placement and routing
tool for FPGA research,” in International Workshop on Field
Programmable Logic and Applications, 1997.
[15] T. Frangieh et al., “A design assembly framework for FPGA
back-end acceleration,” in Reconfigurable Computing and
FPGAs (ReConFig), 2012 International Conference on, 2012.
[16] F. Gharibian et al., “Identifying and placing heterogeneously-
sized cluster groupings based on FPGA placement data,” in
Field Programmable Logic and Applications (FPL), 2014
24th International Conference on, 2014.
[17] M. Rabozzi et al., “Floorplanning for partially-reconfigurable FPGA systems via mixed-integer linear programming,” in Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on, 2014.
[18] M. Gort et al., “Design re-use for compile time reduction in FPGA high-level synthesis flows,” in Field-Programmable Technology (FPT), 2014 International Conference on, 2014.
[19] S. Gupta et al., “CAD techniques for power optimization in Virtex-5 FPGAs,” in Custom Integrated Circuits Conference, 2007. CICC’07. IEEE, 2007.
[20] J. Lamoureux et al., “On the interaction between power-aware
FPGA CAD algorithms,” in 2003 IEEE/ACM international
conference on Computer-aided design.
[21] M. Wang et al., “Multi-million gate FPGA physical design
challenges,” in Proceedings of the 2003 IEEE/ACM interna-
tional conference on Computer-aided design, 2003.
[22] Y. Zhuo et al., “A congestion driven placement algorithm
for FPGA synthesis,” in Field Programmable Logic and
Applications, 2006. FPL’06. International Conference on.
[23] “Increasing Productivity with Quartus II Incremental Compi-
lation,” https://goo.gl/uy225f.
[24] “Vivado Design Suite User Guide-Hierarchical Design,”
https://goo.gl/6bUqqD.
[25] L.-N. Pouchet, “Polybench: The Polyhedral Benchmark
Suite,” http://web.cs.ucla.edu/˜pouchet/software/polybench/.
Conference Paper
The FPGA compilation process (synthesis, map, place, and route) is a time consuming task that severely limits designer productivity. Compilation time can be reduced by saving implementation data in the form of hard macros. Hard macros consist of previously synthesized, placed and routed circuits that enable rapid design assembly because of the native FPGA circuitry (primitives and nets)which they encapsulate. This work presents results from creating a new FPGA design flow based on hard macros called HMF low. HMF low has shown speedups of 10-50X over the fastest configuration of the Xilinx tools. Designed for rapid prototyping, HMF low achieves these speedups by only utilizing up to 50 percent of the resources on an FPGA and produces implementations that run 2-4X slower than those produced by Xilinx. These speedups are obtained on a wide range of benchmark designs with some exceeding 18,000 slices on a Virtex 4 LX200.