IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 4, APRIL 2010
Improving FPGA Performance for Carry-Save Arithmetic
Hadi Parandeh-Afshar, Ajay Kumar Verma, Philip Brisk, and Paolo Ienne
Abstract—The selective use of carry-save arithmetic, where appropriate, can accelerate a variety of arithmetic-dominated circuits. Carry-save arithmetic occurs naturally in a variety of DSP applications, and further opportunities to exploit it can be exposed through systematic data flow transformations that can be applied by a hardware compiler. Field-programmable gate arrays (FPGAs), however, are not particularly well suited to carry-save arithmetic. To address this concern, we introduce the “field programmable counter array” (FPCA), an accelerator for carry-save arithmetic intended for integration into an FPGA as an alternative to DSP blocks. In addition to multiplication and multiply accumulation, the FPCA can accelerate more general carry-save operations, such as multi-input addition (e.g., add $k > 2$ integers) and multipliers that have been fused with other adders. Our experiments show that the FPCA accelerates a wider variety of applications than DSP blocks and improves performance, area utilization, and energy consumption compared with soft FPGA logic.
Index Terms—Carry-save arithmetic, field-programmable gate array (FPGA), generalized parallel counter (GPC).
I. INTRODUCTION
Field-programmable gate array (FPGA) performance is lacking for arithmetic circuits. Generally, arithmetic circuits do not map well onto lookup tables (LUTs), the primary building block for general logic in FPGAs. To address this concern, FPGAs offer two solutions: First, LUTs are now tightly integrated with fast carry chains that perform efficient carry-propagate addition; second, FPGAs contain DSP blocks that perform multiplication and multiply accumulation (MAC). Although an improvement over LUTs alone, these enhancements lack generality; specifically, they cannot effectively accelerate carry-save arithmetic.
Manuscript received May 27, 2008; revised September 26, 2008. First published June 16, 2009; current version published March 24, 2010.
The authors are with the Processor Architecture Laboratory, School of Computer and Communications Sciences, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland (e-mail: hparande@gmail.com; ajaykumar.verma@epfl.ch; philip.brisk@epfl.ch; paolo.ienne@epfl.ch).
Digital Object Identifier 10.1109/TVLSI.2009.2014380
Carry-save arithmetic is a technique to add sets of numbers that eliminates much of the carry propagation that would otherwise occur. Carry-save arithmetic has been the method of choice for partial-product reduction in parallel multipliers for more than 40 years [1], [2]. More recently, Verma et al. [3] developed a set of arithmetic-oriented data flow transformations that can be applied to a computation in order to maximize the use of carry-save arithmetic. These transformations systematically reorder the operations in a circuit in order to cluster disparate adders together and to merge adders with the partial-product reduction trees of parallel multipliers. Each cluster of adders is then replaced with a compressor tree, i.e., a circuit that reduces $k$ integers, $A_0, A_1, \ldots, A_{k-1}$, down to two, $S$ (sum) and $C$ (carry), such that
$$S + C = \sum_{i=0}^{k-1} A_i. \tag{1}$$
A carry-propagate adder (CPA), i.e., a two-input adder, then performs the final addition, $S + C$, to compute the result. Aside from the transformations of Verma et al. [3], compressor trees occur naturally in a variety of applications [4]–[8].
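As a concrete illustration of the sum/carry decomposition, the following Python sketch reduces a list of non-negative integers to a sum word and a carry word using only bitwise 3:2 compression (full adders applied column-wise), so that a single carry-propagate addition remains at the end. The function names and the sequential reduction order are illustrative assumptions; the sketch does not reproduce the compressor-tree structures of [3] or the GPC-based trees discussed later.

```python
def compress_3_to_2(a: int, b: int, c: int) -> tuple[int, int]:
    """Bitwise 3:2 compression: a + b + c == s + carry, with no carry rippling."""
    s = a ^ b ^ c                                  # per-column sum bits
    carry = ((a & b) | (a & c) | (b & c)) << 1     # per-column carries, moved up one rank
    return s, carry


def reduce_to_sum_and_carry(operands: list[int]) -> tuple[int, int]:
    """Reduce k non-negative integers to two words S and C with S + C == sum(operands)."""
    words = list(operands)
    while len(words) > 2:
        a, b, c = words.pop(), words.pop(), words.pop()
        words.extend(compress_3_to_2(a, b, c))     # three words in, two words out
    words += [0] * (2 - len(words))                # pad if fewer than two operands
    return words[0], words[1]


operands = [13, 7, 22, 5, 9]
S, C = reduce_to_sum_and_carry(operands)
total = S + C                                      # the one carry-propagate addition
```

Each loop iteration removes one word without propagating a carry across bit positions; only the final `S + C` requires a carry-propagate adder.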
The arithmetic capabilities of FPGAs are not well attuned to the needs of carry-save arithmetic. Programmable LUTs have been augmented with fast carry chains that are good building blocks for CPAs but cannot be used for carry-save arithmetic. The fastest methods to synthesize compressor trees on FPGA general logic [9], [10] do not use the carry chains except for the final CPA.
FPGAs also integrate DSP blocks, which perform integer multiplication and MAC. Although useful, DSP blocks cannot accelerate multi-input addition; likewise, when the transformations of Verma et al. merge multipliers with adders, the resulting operation can no longer map onto a DSP block. Moreover, certain multiplication operations whose bitwidths do not match up well with the bitwidths of the DSP blocks are faster when performed on the general logic of an FPGA [11].
This paper advocates the use of a field programmable counter array (FPCA) for carry-save arithmetic on FPGAs. The FPCA is a programmable accelerator that can be integrated into an FPGA as an alternative to DSP blocks. An early FPCA, introduced by Brisk et al. [12], is a lattice of $m{:}n$ counters. An $m{:}n$ counter is a circuit that takes $m$ input bits, counts the number of them that are set to 1, and outputs the result, a value in the range of $[0, m]$, as an $n$-bit unsigned binary number. The number of output bits is
$$n = \lceil \log_2 (m + 1) \rceil. \tag{2}$$
The FPCA described in this paper, in contrast, is built using generalized parallel counters (GPCs) [13]–[15], an extended type of counter that can sum bits having different input ranks; GPCs, which will be defined formally in Section III, are built using $m{:}n$ counters as building blocks.
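The behavior of a single-rank parallel counter can be sketched in a few lines of Python. This is a behavioral model only, not a circuit description; the function names are illustrative assumptions, and the output width is computed with `int.bit_length()`, which for $m \ge 1$ equals $\lceil \log_2(m + 1) \rceil$.

```python
def counter_output_bits(m: int) -> int:
    """Output width n of an m:n counter; for m >= 1, m.bit_length() == ceil(log2(m + 1))."""
    if m < 1:
        raise ValueError("a counter needs at least one input bit")
    return m.bit_length()


def parallel_counter(bits: list[int]) -> list[int]:
    """Count the 1s among the inputs and return the count as n bits, least significant first."""
    n = counter_output_bits(len(bits))
    count = sum(bits)
    return [(count >> i) & 1 for i in range(n)]


# Output widths for 8, 12, 16, and 20 inputs.
widths = [counter_output_bits(m) for m in (8, 12, 16, 20)]
```

For 8, 12, 16, and 20 inputs the widths come out as 4, 4, 5, and 5, which matches the 8:4, 12:4, 16:5, and 20:5 counter sizes used in the experiments.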
Our experiments compare the FPCA with DSP blocks for multiplication-dominated circuits and with the best methods to synthesize compressor trees on general FPGA logic for other circuits that feature carry-save arithmetic. As the DSP blocks are fixed-bitwidth multipliers/MACs, they perform better than the FPCA for those operations and bitwidths; however, the FPCA retains an advantage when bitwidth mismatches occur. The FPCA also benefits from the transformations of Verma et al. [3], whereas the DSP blocks do not. The FPCA offers advantages over DSP blocks and FPGA logic in terms of critical path delay, area utilization, and energy consumption.
Fig. 1. Illustration of the methodology underlying the proposed reconfigurable lattice. Using arithmetic transformations [3], a circuit/flow graph is transformed to expose one (or more) compressor trees. The compressor tree is mapped onto a new reconfigurable lattice (FPCA) that is integrated into an FPGA. The final addition is mapped onto a dedicated adder (shown here integrated into the new lattice, but it could easily be implemented using a carry chain instead). The remaining non-arithmetic operations in the circuit/flow graph are mapped onto the FPGA logic blocks.
1063-8210/$26.00 © 2010 IEEE
In our experiments, we considered GPCs built from four parallel counter sizes: 8:4, 12:4, 16:5, and 20:5; although no counter size was uniformly better than the others across all benchmarks, our results suggest that increasing the counter size beyond 12:4 yields diminishing returns. We conclude that GPCs built from 12:4 counters are the ideal choice for our specific set of benchmarks.
A. Illustrative Example
Fig. 1 shows our approach. A circuit transformed as described previously is partitioned into a (set of) compressor tree(s) with corresponding CPA(s) and a set of non-arithmetic operations. The compressor tree is mapped onto an FPCA, which is embedded within a larger FPGA. Fig. 1 assumes that a dedicated CPA is integrated into the FPCA; alternatively, the carry chains in the logic-block structure of the FPGA could be used to perform the final addition. The non-arithmetic portions of the circuit are mapped onto the FPGA. Following the lead of Xilinx and Altera, the FPGA shown in Fig. 1 is organized into columns. Each column contains a set of logic clusters [e.g., the Altera Logic Array Block (LAB)], which contain several logic blocks [e.g., the Altera Adaptive Logic Module (ALM)] connected by local routing. A global routing network connects the different logic clusters. Due to the column structure, the horizontal and vertical routing channels are nonuniform.
B. Paper Organization
The paper is organized as follows. Section II summarizes related work, Section III introduces GPCs, Section IV presents the FPCA architecture, Sections V and VI present the experimental framework and results, and Section VII concludes the paper.
II. RELATED WORK
A. Commercial FPGA Architectures and Mapping
This section summarizes the arithmetic features in the Altera Stratix III [16] and Xilinx Virtex-5 [17] FPGAs, both of which are high-end FPGAs realized in 65-nm CMOS technology. The logic architectures of both of these FPGAs feature six-input LUTs with carry chains that perform efficient carry-propagate addition without using the routing network. The Stratix III carry chain is a ripple-carry adder; the Virtex-5 carry chain includes an XOR gate and a multiplexor (mux) which enable carry-lookahead addition.
Stratix II introduced a method to combine the LUTs with the carry chain to perform ternary (three-input) addition, which remained in place for the Stratix III; the Virtex-5 similarly supports ternary addition.
Due to the peculiar nature of FPGA architectures, it has long been thought that multi-input addition is best realized using trees of adders rather than compressor trees. The use of ternary adders rather than binary (two-input) adders could reduce the height of the trees, thereby reducing delay and/or pipeline depth. Parandeh-Afshar et al. [9], [10], however, showed that compressor trees could be synthesized on FPGAs using GPCs (see Section III), significantly reducing the delay compared to ternary adder trees. Experimentally, this paper finds that the FPCA is faster than both of these alternatives.
B. FPGA Enhancements to Improve Arithmetic Performance
Numerous enhancements for FPGAs have been proposed in the past, particularly to improve arithmetic performance. For example, several researchers have proposed hard IP cores: application-specific integrated circuit (ASIC) components, embedded into the FPGA, that implement common operations. The most prevalent of these IP cores include block memories, DSP/MAC blocks [18], [19], standard I/O interfaces [19], crossbars [20], shifters [21], and floating-point units [21]. Kastner et al. [22] developed techniques to examine a set of applications to find good domain-specific IP core candidates.
Although the FPCA is similar in principle to the IP cores described previously, it is not completely hard: It is programmable and has its own routing network. Although it is intended to implement just one class of circuits (compressor trees), the FPCA is flexible and is not fixed to a specific bitwidth; this distinguishes the FPCA, for example, from the hard multipliers whose bitwidths are fixed. Kuon and Rose [11] have noted that fixed-bitwidth multiplication has some limitations, e.g., it is inefficient to implement a narrow multiplication on a wider fixed-bitwidth multiplier contained in a DSP block.
Cevrero et al. [23] recently proposed an alternative FPCA architecture. Theirs is radically different from the one described here; the most important distinguishing feature is that it uses direct programmable connections but does not employ a global routing network; as such, it offers less flexibility than the architecture proposed here, but with the potential of reduced delay, area, and power consumption due to the absence of global routing. Future work will compare and contrast these two architectures to better understand the differences between them.
C. Carry Chains
Also notable but not directly related to this work are the fast carry chains [24]–[29]: These are used to implement efficient carry-propagate addition within FPGA logic cells. If an FPCA is present, these carry chains can be used to perform the final addition if a hard IP core implementation of a CPA is not present.
Parandeh-Afshar et al. [29] developed a carry chain that allows a logic cell to be configured as a 6:2 compressor, a well-known building block for compressor trees. The FPCA, however, is much more powerful, as its logic cells contain larger GPCs (e.g., with up to 20 inputs). The use of larger and more flexible components reduces the number of logic levels in the compressor tree as well as pressure on the routing network: This is favorable from the perspective of delay, area utilization, and power consumption.
D. Programmable Arrays of Arithmetic Primitives
The FPCA is a homogeneous array of arithmetic primitives connected by a routing network. Many principally similar arithmetic arrays have been proposed in the past, and this similarity is acknowledged. The main difference is that the FPCA is limited in its scope of application (solely compressor trees) and is intended for integration into a larger FPGA, whereas the arrays discussed in this section are standalone devices.
Parhami [30], for example, built an array of bit-serial additive multipliers and used a data-driven control scheme. The advantage of bit-serial arithmetic is that it reduces the wiring requirement for an FPGA: This is significant because wiring can consume up to 70% of on-chip area. Although somewhat beyond the scope of this paper, bit-serial routing networks are an active area of research that is beginning to emerge [31]; the applications for such a device, however, must be able to tolerate high latencies, and it is not immediately clear which applications easily fall into this category.
The reconfigurable arithmetic FPGA (RA-FPGA) [32] is an arithmetic array partitioned into three regions: 1) two’s complement addition; 2) sign/magnitude conversion to two’s complement, and vice versa; and 3) multiplication and division. Traditional FPGA-style logic is also included in order to implement control and general-purpose logic. In principle, such a device could use an FPCA to perform multiplication; however, no RA-FPGA has been produced commercially to the best of our knowledge.
The CHESS reconfigurable array [33], developed at HP Labs, is an array of 4-b arithmetic logic units (ALUs), connected by a bus-based FPGA-style routing network. Each ALU supports 16 arithmetic and logical operations (e.g., ADD, SUB, XOR), along with selection and comparison tests. Neighboring ALUs can be chained together (e.g., to perform 8-b addition), and spatial parallelism is abundant. As the ALU does not support primitives for multiplication or multi-operand addition, the inclusion of an FPCA into the array is certainly plausible.
Several configurable arrays of floating-point units have also been proposed. In 1988, Fiske and Dally [34] introduced the Reconfigurable Arithmetic Processor (RAP), which contains 64 floating-point units connected by a switching network. More recently, Intel’s Teraflops processor [35] connected 80 floating-point MAC units using a high-speed network on chip. Although floating-point units contain integer multipliers (and, hence, compressor trees), it does not appear that there would be any room to incorporate an FPCA into such a chip, because the floating-point units themselves have fixed bitwidths in accordance with IEEE standards.
III. GPCs
A. Overview
Let $B = b_{n-1} b_{n-2} \ldots b_0$ be an $n$-bit binary number, where each $b_i$ is a bit. Let $b_0$ be the least significant bit and $b_{n-1}$ be the most significant bit. The subscript of a bit, $i$ in this case, is called the rank of the bit. Each bit $b_i$ has rank $i$ and contributes a value of $b_i 2^i$ to the total quantity represented by the binary integer.
An $m{:}n$ counter, as described in Section I, assumes that all bits have the same rank when it computes their sum. If all input bits have rank $r$, then the output of the counter is a set of $n$ bits having ranks $r, r+1, \ldots, r+n-1$.
A GPC [13]–[15] is a type of counter that counts bits having different ranks. In fact, an $m{:}n$ counter can implement a GPC, if desired: A bit of rank $i$ must be connected to precisely $2^i$ inputs of the counter. Of course, other methods to build GPCs also exist.
A GPC is defined as a tuple $(K_{h-1}, K_{h-2}, \ldots, K_0; n)$, where $K_i$ is the number of input bits of rank $i$ to sum and $n$ is the number of output bits; the input bits of each rank are independent. For example, a (5, 3; 4) GPC can count up to 5 b of rank-1 and 3 b of rank-0; the maximum output value is $5 \cdot 2^1 + 3 \cdot 2^0 = 13$; therefore, four output bits are required.
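The tuple notation can be made concrete with a small behavioral model in Python. The sketch below is an illustration under stated assumptions: the tuple `K` lists the counts highest rank first, as in the notation above, while `bits_by_rank` (a hypothetical argument name) lists the actual input bits lowest rank first.

```python
def gpc_max_value(K: tuple[int, ...]) -> int:
    """Largest value a (K_{h-1}, ..., K_0; n) GPC can produce (all input bits set)."""
    h = len(K)
    return sum(k << (h - 1 - pos) for pos, k in enumerate(K))


def gpc_output_bits(K: tuple[int, ...]) -> int:
    """Number of output bits needed to represent every possible output value."""
    return max(1, gpc_max_value(K).bit_length())


def gpc_sum(K: tuple[int, ...], bits_by_rank: list[list[int]]) -> int:
    """Weighted sum of the inputs; bits_by_rank[i] holds the bits of rank i (index 0 is rank-0)."""
    h = len(K)
    if len(bits_by_rank) != h:
        raise ValueError("one bit list per rank is required")
    for rank, bits in enumerate(bits_by_rank):
        if len(bits) > K[h - 1 - rank]:
            raise ValueError(f"too many bits of rank {rank}")
    return sum(sum(bits) << rank for rank, bits in enumerate(bits_by_rank))


# The (5, 3; 4) example: three rank-0 bits and five rank-1 bits.
example = gpc_sum((5, 3), [[1, 1, 0], [1, 0, 1, 1, 1]])
```

For the (5, 3; 4) GPC, `gpc_max_value((5, 3))` evaluates to 13 and `gpc_output_bits((5, 3))` to 4.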
Here, we fix the number of input and output bits to be positive constants $m$ and $n$. Given $m$ and $n$, there is a family of GPCs that satisfy these I/O constraints. Clearly, the number of input bits cannot exceed $m$:
$$\sum_{i=0}^{h-1} K_i \le m. \tag{3}$$
Likewise, the maximum allowable output value, which occurs when all input bits are “1,” cannot exceed the maximum integer value that can be expressed with $n$ output bits:
$$\sum_{i=0}^{h-1} K_i 2^i \le 2^n - 1. \tag{4}$$
As an example of such a family, take $m = 10$ and $n = 4$. One GPC in this family is a (5, 5; 4) GPC (see Fig. 2). This GPC has five input bits of rank-0 and five of rank-1. The maximum value that can be counted is $5 \cdot 2^0 + 5 \cdot 2^1 = 15$; four output bits are required to represent a value in the range [0, 15]; clearly, any $(K_1, K_0; 4)$ GPC suffices, under the assumption that $K_1 \le 5$ and $K_0 \le 5$ as well.
At the same time, a (4, 6; 4) GPC is also a member of this family, as it produces an output value in the range [0, 14], as is a (6, 3; 4) GPC, etc. An $m{:}n$ counter is also a degenerate case of a GPC; in this case, a (0, 10; 4) GPC.
Fig. 2. 15:4 counter can implement a (5, 5; 4) GPC.
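The two I/O constraints translate directly into a membership test. The following Python sketch checks whether a tuple belongs to the family for a given input and output budget; the function name is an illustrative assumption, and the example budget of 10 inputs and 4 outputs is inferred from the GPCs named in the text.

```python
def satisfies_io_constraints(K: tuple[int, ...], m: int, n: int) -> bool:
    """True if (K_{h-1}, ..., K_0) has at most m input bits and fits in n output bits."""
    h = len(K)
    total_inputs = sum(K)                                            # input-count limit
    max_value = sum(k << (h - 1 - pos) for pos, k in enumerate(K))   # all inputs set to 1
    return total_inputs <= m and max_value <= (1 << n) - 1


# Candidates checked against a budget of 10 inputs and 4 outputs.
members = [K for K in [(5, 5), (4, 6), (6, 3), (0, 10), (6, 4), (3, 8)]
           if satisfies_io_constraints(K, m=10, n=4)]
```

Of the six candidates, (6, 4) is rejected because its maximum value is 16, and (3, 8) because it needs 11 input bits.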
B. Configurable GPCs
For the FPCA, one fixed GPC does not suffice, because we desire more flexibility. Instead, we build a configurable GPC using an $m{:}n$ counter with a layer of muxes placed on its input. This configuration layer, which is described in detail in Section III-D, allows the configurable GPC to implement a wide variety of GPCs; the user selects the desired GPC to implement and programs the configuration layer accordingly. For example, a programmable GPC with $m = 15$ and $n = 4$ should be able to implement the functionality of both a 15:4 counter and a (5, 5; 4) GPC, among others.
The rank of a GPC is the minimum rank among all of its input bits. Now, suppose that the minimum rank of an input bit to a given GPC is $r$ and that the GPC must add two (or more) bits of ranks $r$ and $r + j$, such that $j > 0$. Each input bit of rank $r$ is connected to one input of the GPC, while each input bit of rank $r + j$ is connected to $2^j$ inputs. Fig. 2, for example, satisfies this property.
Fig. 3 shows the design of an $m$-input $n$-output GPC built from an $m$-input $m$-output configuration layer followed by an $m{:}n$ counter. Each input of the $m{:}n$ counter (output of the configuration layer) is connected to two GPC inputs and is controlled by two configuration bits. The configuration bit on the left selects one of two GPC inputs that are connected to a mux; the configuration bit on the right drives the counter input to 0 if it is not set, which allows the $m{:}n$ counter to implement any $m'{:}n'$ counter for $m' < m$ and $n' \le n$; in this case, the $m'{:}n'$ counter is called a subcounter. For example, a 7:3 counter has five subcounters: 6:3, 5:3, 4:3, 3:2, and 2:2 counters. The definition of subcounters easily extends to GPCs as well.
Fig. 3. Architecture of an $m$-input programmable GPC.
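The replication rule (a bit whose rank exceeds the GPC's rank by $j$ drives $2^j$ counter inputs, and unused counter inputs are driven to 0) can be sketched behaviorally in Python. The function name and the list-of-lists input format are illustrative assumptions; the model ignores the mux-level wiring restrictions of the actual configuration layer.

```python
def gpc_on_counter(bits_by_rank: list[list[int]], m: int) -> int:
    """Emulate a GPC on an m-input counter.

    bits_by_rank[j] holds the input bits whose rank is j above the GPC's rank;
    each such bit is fanned out to 2**j counter inputs.
    """
    counter_inputs: list[int] = []
    for j, bits in enumerate(bits_by_rank):
        for bit in bits:
            counter_inputs.extend([bit] * (1 << j))     # replicate the bit 2**j times
    if len(counter_inputs) > m:
        raise ValueError("the GPC needs more counter inputs than are available")
    counter_inputs.extend([0] * (m - len(counter_inputs)))  # unused inputs driven to 0
    return sum(counter_inputs)                          # the counter simply counts 1s


# A (5, 5; 4) GPC on a 15-input counter, with all ten input bits set.
all_ones = gpc_on_counter([[1] * 5, [1] * 5], m=15)
```

With five rank-0 and five rank-1 bits the replication consumes 5 + 10 = 15 counter inputs, so a 15:4 counter is filled exactly.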
C. Primitive, Covering, and Reasonable GPCs
This section specifies precisely which $m$-input $n$-output GPCs should be implemented by the programmable GPC with $m$ inputs, $n$ outputs, and an $m{:}n$ counter at its core.
A primitive GPC is one that satisfies the I/O constraints. A covering GPC is a primitive GPC that is not a sub-GPC of another primitive GPC. Referring to Fig. 2, the (5, 5; 4) GPC is a covering GPC. If the number of rank-1 inputs is increased to six, then there are only three rank-0 input ports remaining, i.e., $6 \cdot 2^1 + 3 \cdot 2^0 = 15$. Hence, a (6, 3; 4) GPC is also a covering GPC. A (5, 4; 4) GPC, in contrast, is not a covering GPC, because the (5, 5; 4) GPC can implement its functionality by driving one of the rank-0 inputs to zero.
To achieve universal coverage, a configurable GPC only needs to implement the functionality of the covering GPCs that have $m$ inputs and $n$ outputs. Of course, there is no formal mandate that a configurable GPC provide universal coverage; those that do not simply have limited flexibility compared with those that do.
We have identified two classes of unreasonable GPCs, meaning that we can find no rational justification for using them; this is not, however, a formal definition. When designing a configurable GPC, there is no need to add support for unreasonable GPCs, even if they are covering GPCs.
A GPC that has no rank-0 input bits, i.e., $K_0 = 0$, is unreasonable. For example, consider a (7, 0; 4) GPC. The rank-0 output will always be 0. The rank-1, 2, and 3 outputs can be computed by a 7:3 counter. If the rank of this GPC is $r$, it suffices to replace it with a 7:3 counter of rank $r + 1$ instead. Thus, a configurable GPC need not support this type of GPC.
Similarly, a GPC that has one rank-0 input bit, i.e., $K_0 = 1$, is unreasonable. This bit determines whether the output of the GPC is even or odd. The rank-0 input bit is connected directly to the rank-0 output and is not used within the GPC. There is no need to connect this bit to the GPC input; instead, it should be connected to a GPC at a lower level of the compressor tree.
As an example, consider a (7, 1; 4) GPC. The rank-0 output is always equal to the rank-0 input. The rank-1, 2, and 3 outputs can be computed by a traditional 7:3 counter. Suppose that the rank of this GPC is $r$. Then, it suffices to eliminate the rank-0 input bit and propagate it to the next level of the tree; then, the GPC is replaced with a 7:3 counter of rank $r + 1$.
Fig. 4. Example of primitive, covering, and reasonable covering GPCs.
Fig. 4 shows the preceding concepts for a small example with $n = 3$ output bits. There are 23 primitive GPCs, 6 covering GPCs, and 3 reasonable covering GPCs. (0, 3, 1; 3) is unreasonable because $K_0 = 1$. (1, 0, 3; 3) is unreasonable because the rank-0 and rank-1 outputs can be computed by a 3:2 counter, while the rank-2 input connects directly to the rank-2 output; it suffices to replace this GPC with a 3:2 counter of rank $r$ and propagate the rank-2 input bit to the next level of the compressor tree directly. (1, 1, 1; 3) is also unreasonable: Not only is $K_0 = 1$ but also each input bit connects directly to the output of the same rank; it suffices to propagate these bits directly to the next level of the compressor tree, eliminating this GPC altogether.
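The definitions of primitive and covering GPCs lend themselves to brute-force enumeration for small sizes. The Python sketch below is an illustration under several assumptions: tuples are padded to $n$ ranks and written highest rank first; a sub-GPC is taken to mean a tuple that is componentwise no larger than another; and `is_reasonable` combines the two stated classes ($K_0 = 0$ and $K_0 = 1$) with a generalization of the (1, 0, 3; 3) argument, namely that a lone bit of some rank is passed straight to the output whenever the lower ranks can never carry into it. That generalization is ours and is not a definition taken from the text. The budget of 7 inputs and 3 outputs is likewise an assumed example size.

```python
from itertools import product


def max_value(K: tuple[int, ...]) -> int:
    h = len(K)
    return sum(k << (h - 1 - pos) for pos, k in enumerate(K))


def primitive_gpcs(m: int, n: int) -> list[tuple[int, ...]]:
    """All nonzero tuples (K_{n-1}, ..., K_0) with at most m inputs and value <= 2**n - 1."""
    limit = (1 << n) - 1
    ranges = [range(limit // (1 << rank) + 1) for rank in range(n - 1, -1, -1)]
    return [K for K in product(*ranges)
            if 0 < sum(K) <= m and max_value(K) <= limit]


def covering_gpcs(m: int, n: int) -> list[tuple[int, ...]]:
    """Primitive GPCs that are not componentwise dominated by another primitive GPC."""
    prims = primitive_gpcs(m, n)

    def is_sub_gpc(a, b):
        return a != b and all(x <= y for x, y in zip(a, b))

    return [K for K in prims if not any(is_sub_gpc(K, other) for other in prims)]


def is_reasonable(K: tuple[int, ...]) -> bool:
    low_first = K[::-1]
    if low_first[0] < 2:                     # the two stated classes: K_0 = 0 or K_0 = 1
        return False
    below = 0                                # largest value the lower ranks can produce
    for rank, count in enumerate(low_first):
        if rank > 0 and count == 1 and below < (1 << rank):
            return False                     # assumed rule: lone bit, no carry can reach it
        below += count << rank
    return True


covering = covering_gpcs(m=7, n=3)
reasonable = [K for K in covering if is_reasonable(K)]
```

For this assumed budget the sketch finds six covering GPCs, of which three are reasonable, in agreement with the counts quoted for Fig. 4. It does not attempt to reproduce the primitive count, which depends on the input budget and on whether trivial tuples are counted.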
D. Configuration Layer
The configuration layer allows the user to program an $m{:}n$ counter as any $m$-input $n$-output reasonable covering GPC. The circuit shown on the right-hand side of Fig. 3 is placed on each configuration layer output. The configuration layer architecture is defined by a set of connections between input ports and muxes. When the right configuration bit is zero, the corresponding counter input is driven to zero; otherwise, the counter input receives whichever of the inputs connected to the mux is selected.
We make the assumption that $m$ is maximal for a given value of $n$ that satisfies (2), i.e., $m = 2^n - 1$. For example, if $n = 3$, there are 4:3, 5:3, 6:3, and 7:3 counters; based on this assumption, we default to the 7:3 counter.
Let $P = \{p_0, \ldots, p_{m-1}\}$ and $X = \{x_0, \ldots, x_{m-1}\}$ be the sets of input ports and muxes, respectively. A sensible configuration layer architecture satisfies the property that each input port $p$ is connected to $2^k$ muxes, where $k$ is a nonnegative integer; thus, $p$ can be connected to any input bit of rank at most $k$. The value $k$ is called the rank of the input port and is denoted $\mathrm{rank}(p)$. When configuring a GPC, the rank of each input bit connected to input port $p$ cannot exceed $\mathrm{rank}(p)$.
Fig. 5(a) shows an example of a configuration layer (only muxes are shown) for a GPC built from a 15:4 counter. Its input ports are divided into rank-0, rank-1, and rank-2 ports, as labeled in the figure.
The configuration layer can be represented as a configuration graph, a directed bipartite graph $G = (P \cup X, E)$, where $E$ represents the set of connections from input ports to muxes, i.e., there is an edge $(p, x) \in E$ if and only if there is a connection from $p$ to $x$. Fig. 5(b) shows an example corresponding to the configuration layer in Fig. 5(a). In Fig. 5(a), some input ports connect directly to the counter inputs; dummy muxes are shown for them in Fig. 5(b) simply to represent the possible connection; a one-input mux in the configuration graph becomes a direct connection in the configuration layer.
E. Configuring the GPC
The configuration graph represents the set of different input-to-mux connections. A configuration determines which input port is selected by each mux. At most one input port can connect to each mux; if no input port is connected, the circuit shown on the right-hand side of Fig. 3 drives the counter input to zero instead. Specifically, a configuration is a subset of edges $C \subseteq E$ such that each mux $x \in X$ is incident on at most one edge in $C$. An example of a configuration is a set of 15 edges that connects each input port to a distinct mux, which configures the GPC as a 15:4 counter. A set of edges including $(p, x)$ and $(p', x)$ with $p \ne p'$ is not a configuration because two input ports are connected to $x$.
An active input port is incident on at least one edge in a configuration. A configuration is sensible if each active input port is incident on $2^k$ edges in $C$, where $k$ is a nonnegative integer. A configuration in which some input port is connected to exactly three muxes is not sensible, because the number of muxes to which the port is connected is not a power of two.
Let $C$ be a configuration, and let $P_C$ be the set of input ports that are configured to connect to counter inputs. To stay within bandwidth limits, the number of counter inputs consumed by the active input ports, after configuration, cannot exceed $m$, the number of counter inputs; in other words, if each active port $p \in P_C$ is incident on $2^{k_p}$ edges of $C$,
$$\sum_{p \in P_C} 2^{k_p} \le m. \tag{5}$$
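The two conditions on a configuration (every mux selects at most one input port, and every active port drives a power-of-two number of muxes) are easy to check mechanically. The Python sketch below represents edges as `(port, mux)` index pairs; the four-port edge set `E` is a hypothetical toy layer chosen for illustration and does not correspond to Fig. 5.

```python
from collections import Counter

Edge = tuple[int, int]   # (input port index, mux index)


def is_configuration(C: set[Edge], E: set[Edge]) -> bool:
    """C must use only available connections, and no mux may select two ports."""
    if not C <= E:
        return False
    mux_load = Counter(mux for _, mux in C)
    return all(load == 1 for load in mux_load.values())


def is_sensible(C: set[Edge]) -> bool:
    """Every active port must drive 1, 2, 4, ... muxes (a power of two)."""
    port_load = Counter(port for port, _ in C)
    return all(load & (load - 1) == 0 for load in port_load.values())


# Hypothetical 4-port layer: port 0 can reach every mux, port 2 can reach muxes 2 and 3.
E = {(0, 0), (1, 1), (2, 2), (3, 3), (0, 1), (0, 2), (0, 3), (2, 3)}

as_counter = {(0, 0), (1, 1), (2, 2), (3, 3)}     # four rank-0 bits
one_rank1 = {(0, 0), (0, 1), (2, 2), (3, 3)}      # port 0 at rank 1, two rank-0 bits
clash = {(0, 1), (1, 1)}                          # two ports selected by mux 1
three_muxes = {(0, 0), (0, 1), (0, 2)}            # port 0 drives three muxes
```

Because each mux carries at most one edge, the number of edges in a valid configuration, which is the number of counter inputs consumed, can never exceed the number of muxes.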
F. Configuration Layer Design
Recall from Section III-A that a GPC is represented as a tuple $(K_{h-1}, K_{h-2}, \ldots, K_0; n)$, where $K_i$ is the number of input bits of rank $i$ to be summed, and from Section III-C, recall the definitions of reasonable and covering GPCs.
In this section, we outline a method to design a GPC configuration layer systematically. We do not attempt to achieve universal coverage; instead, we restrict the set of GPCs that can be mapped onto our configurable GPC; doing this allows us to implement the configuration layer with one level of muxes, each having at most two inputs, thereby bounding the delay and area overhead of the configuration layer.
The rank variation of a GPC $(K_{h-1}, \ldots, K_0; n)$ is determined by the number of input bit ranks that the GPC supports; in accordance with (5), its input bits can occupy at most $m$ counter inputs in total.
A configuration layer can implement a GPC $(K_{h-1}, \ldots, K_0; n)$ if there is a configuration $C$ such that, for $0 \le i \le h - 1$, $K_i$ input ports are each connected to $2^i$ muxes. This is intuitive: The GPC has $K_i$ input bits of rank $i$, each of which must connect to $2^i$ muxes. The condition ensures that a sufficient supply of input ports with the desired connectivity exists.
Let $\mathcal{G}$ be the set of reasonable covering GPCs that satisfy the I/O constraints $m$ and $n$. A complete configuration layer can implement every GPC in $\mathcal{G}$. To simplify the design of the configuration layer, we have chosen to restrict the set of reasonable covering GPCs to those whose input bits have rank at most 2; this configuration layer is incomplete, meaning that universal coverage is not achieved.
Fig. 5. (a) Configuration layer (muxes only) for a 15-input 4-output GPC. (b) Bipartite graph representation of the configuration layer in (a).
Fig. 6. Configuration graph for an eight-input counter, built incrementally in three steps (a)–(c), each adding the edges needed for the next set of GPCs. In (b), half of the input ports are connected to two muxes. In (c), two of the input ports are connected to four muxes. No mux is connected to more than two input ports.
Let $\mathcal{G}_v$ be the set of reasonable covering GPCs whose rank variation is $v$ for a given $m$ and $n$. The configuration layer described here must implement three such sets: GPCs whose input bits have rank-0 only; rank-0 and rank-1; and rank-0, 1, and 2. Fig. 6 shows the construction method for an eight-input counter through the incremental addition of edges to a configuration layer graph.
The first set contains one GPC: All input bits have rank-0, i.e., an $m{:}n$ counter. Any mapping from input ports to muxes suffices, e.g., a one-to-one mapping that connects each input port to its own mux. Fig. 6(a) shows the initial set of edges.
Now, let us consider the second set. No two rank-1 input ports can connect to the same mux; otherwise, both of these input ports could not be configured as rank-1 at the same time. In Fig. 6(b), edges are added to the configuration graph so that four of the input ports can be configured as either rank-0 or rank-1.
The third set contains GPCs having bits of rank-0, 1, or 2. Like the aforementioned reasoning, each rank-2 input port connects to four muxes; half of the input ports can already be configured as rank-1; thus, it suffices to take half of them and connect them to two additional muxes. In Fig. 6(c), two input ports are extended so that they can be configured as rank-0, 1, or 2. At this point, we stop. In general, there are $m$ input ports in total; $m/4$ connect to four muxes, $m/4$ connect to two, and $m/2$ connect to one.
The basic pattern shown in Fig. 6 is systematic and generalizes to any $m{:}n$ counter. Stopping at this point ensures that the largest mux in the configuration layer has at most two inputs.
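The fan-out pattern described above (a quarter of the ports reaching four muxes, a quarter reaching two, and half reaching one, with no mux fed by more than two ports) can be reproduced with a simple wiring rule. The Python sketch below is a hypothetical construction for $m$ divisible by 4 and at least 8; the specific port-to-mux assignment is our own choice and is not claimed to match the edges drawn in Fig. 6, nor does the sketch prove that every GPC in the restricted set can be configured on this wiring.

```python
def build_configuration_graph(m: int) -> dict[int, list[int]]:
    """Return {port: [mux, ...]}; a port configured at rank k uses its first 2**k muxes."""
    if m < 8 or m % 4:
        raise ValueError("this sketch assumes m >= 8 and m divisible by 4")
    half, quarter = m // 2, m // 4
    fanout = {p: [p] for p in range(m)}               # step (a): port p -> mux p
    for p in range(half):                             # step (b): rank-1-capable ports
        fanout[p].append(p + half)                    # borrow the mux of a rank-0-only port
    for p in range(quarter):                          # step (c): rank-2-capable ports
        fanout[p] += [p + quarter, (p + 1) % quarter] # two muxes that still have one input
    return fanout


def mux_fan_in(fanout: dict[int, list[int]], m: int) -> list[int]:
    """Number of input ports wired to each mux."""
    fan_in = [0] * m
    for muxes in fanout.values():
        for x in muxes:
            fan_in[x] += 1
    return fan_in


graph8 = build_configuration_graph(8)
fan_in8 = mux_fan_in(graph8, 8)
```

For eight inputs the rule yields two ports with four muxes, two with two, and four with one, and every mux ends up with exactly two inputs, consistent with the two-input mux bound stated in the text.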
IV. FPCA ARCHITECTURE
A. FPCA Architecture
The FPCA architecture presented by Brisk et al. [12] is a 2-D lattice of hierarchical $m{:}n$ counters connected through a programmable routing network; it had the same basic structure as an island-style FPGA, but with programmable logic cells replaced by $m{:}n$ counters. The architecture presented here is similar, but programmable GPCs replace the $m{:}n$ counters. The connection boxes that interface each programmable GPC to the adjacent routing channels and the switch boxes that connect intersecting horizontal and vertical routing channels are the same as in an FPGA.
The hierarchical design of an $m{:}n$ counter increases flexibility. For example, suppose that the counter size in an FPCA is 20:5. This 20:5 counter is hierarchically built from smaller counters, e.g., 4:3. If there are only four input bits to sum at a given time, the smaller counter can be used. This reduces the delay of the circuit at stages of a compressor tree where there is a small number of bits to sum; on the other hand, the large number of smaller counters increases the number of output ports, as several 4:3 counters, for example, will be available. The use of a configurable GPC in lieu of a hierarchically designed $m{:}n$ counter offers similar flexibility, but without increasing the number of output ports. When there is a small number of bits available at each rank, a GPC can sum bits having different ranks.
Fig. 7. Flip-flop and mux are placed on each GPC output to allow pipelining.
The programmable GPC described in the preceding section is purely combinational. The user may wish to pipeline the compressor tree in order to increase the clock frequency and throughput. To facilitate this, a flip-flop and mux are placed on each GPC output, as shown in Fig. 7. The same circuit is typically placed on the outputs of FPGA logic blocks (but not on the carry-chain outputs).
In a sense, an FPCA embedded into an FPGA is similar in principle to a cluster of logic cells in a traditional FPGA (e.g., a LAB in an Altera Stratix-series FPGA). A LAB (or group of adjacent LABs) could be replaced with an FPCA. The role of a programmable GPC within an FPCA is analogous to the role of a programmable logic cell (an ALM in an Altera Stratix-series FPGA) within a LAB. The primary difference is that, due to the interconnect structure of compressor trees, the FPCA requires a more flexible routing network, in the style of a global (inter-LAB) rather than a local (intra-LAB) FPGA routing network.
B. FPCA Mapping Heuristic
The algorithm to map a compressor tree onto an FPCA is based on a heuristic developed by Parandeh-Afshar et al. [9] to map compressor trees onto the logic cells (6-LUTs) of high-performance FPGAs. The FPCA is homogeneous, i.e., the counter size $m{:}n$ is the same for all programmable GPCs. For a given $m{:}n$, the available GPCs are the set of reasonable covering GPCs whose rank variation does not exceed 2. No further modifications to the mapping heuristic are required.
C. FPCA Alternatives
Here, we qualitatively explore several alternative methods to integrate counters into an FPGA and explain why we believe that the FPCA is superior.
One possibility would be to add a programmable GPC to a LAB or to replace one of the ALMs with the programmable GPC; however, one GPC is unlikely to be used in isolation: A collection of them is required to construct a compressor tree. Thus, a compressor tree synthesized on this architecture would not be able to take advantage of the fast local connections within the LAB. A better approach is to cluster the GPCs together, as is done by the FPCA. Replacing all of the ALMs in a LAB with programmable GPCs effectively yields an FPCA with a local LAB-style rather than a global routing network; we have opted for a routing network in the global style due to the complex interconnect structure required to construct a compressor tree from GPCs.
A second alternative is to integrate a GPC into an ALM as a programmable type of macroblock, similar in principle to the work by Cong and Huang [36] and Hu et al. [37]; however, this architecture significantly increases the input and output bandwidth of the ALM; it is unlikely that the local routing network within a LAB, as it exists today, could handle this increased I/O bandwidth. We believe that a better approach is to strictly separate the FPCA/GPCs from the LAB/ALMs.
V. EXPERIMENTAL SETUP
A. VPR
Two different versions of the Versatile Place-and-Route (VPR) tool [38], [39] were used to evaluate the FPCA. The most recent version of VPR, version 5.0, was used to compare the performance of an FPGA containing an FPCA against a baseline FPGA containing DSP blocks. The earlier version of VPR does not support DSP blocks or any type of embedded IP core; therefore, the newer version was required to perform this comparison.
An earlier version, version 4.30, was used for a comparison of energy consumption. At present, no power model is available for the newer version of VPR; as discussed in Section V-C, we extended a preexisting power model for the earlier version to compute the energy consumption.
VPR 5.0 provides preconstructed architecture models for different process technologies; VPR 4.30, in contrast, requires the user to provide transistor-level properties of the wires. Details are provided in the following section.
B. Delay and Area Extraction
The FPCA was modeled as a standalone device using VPR 4.30. Each compressor tree in each benchmark was extracted and synthesized on the FPCA. The FPCA was then modeled as an IP core in VPR 5.0; for each benchmark, the delay of each path through the FPCA was taken from VPR 4.30. The complete benchmark was then synthesized on VPR 5.0, with all compressor trees mapped onto FPCAs. The total delay includes both the noncompressor tree logic mapped onto the general logic of the FPGA and the compressor tree delay through the FPCA. To model the FPCA, the traditional FPGA logic blocks in VPR 4.30 were replaced with programmable GPCs. After mapping a compressor tree onto a network of GPCs, VPR was used to place and route the circuit. VPR also reported the critical path delay, which includes both routing and logic delays. The number of GPCs required to synthesize each compressor tree can be determined from the result of the mapping heuristic.
The programmable GPCs described in Section III were modeled in Very High Speed Integrated Circuit Hardware Description Language (VHDL) and synthesized using Synopsys Design Compiler with 90-nm TSMC standard cells. Cadence Silicon Encounter was then used to place and route the designs and extract delay and area estimates. This was done for four different programmable GPCs, with counter sizes of 8:4, 12:4, 16:5, and 20:5. Thus, four different FPCA architectures were studied, as the GPC size is assumed to be homogeneous within an FPCA. A separate VPR architecture description file (ADF) was instantiated for each FPCA. We limited the channel width to 40 segments, VPR’s default value.
For the purpose of comparison, we modeled an island-style FPGA whose logic blocks resemble Altera’s ALM and whose logic clusters resemble Altera’s LAB, but with four blocks per
cluster; the limited number of ALMs per LAB was due to complications involved in modeling carry chains in VPR. Each LAB has two carry chains to support ternary addition. The primary difference between this baseline architecture and the Stratix II and III is that the baseline is island style, while the Stratix II and III organize LABs into columns and employ nonuniform routing in the x and y directions. The GPC mapping heuristic of Parandeh-Afshar et al. [9] was used to synthesize compressor trees onto this FPGA.
To model routing delays, VPR 4.30 requires information such as the per-unit resistance and capacitance of wires. Our experiments used TSMC 90-nm technology, and the per-unit resistance and capacitance for metal 6 were computed and inserted into VPR’s ADF. These values were used to compute the routing delays of the FPCA. Metal 6 was chosen because this metal layer seemed to be a reasonable choice for the wires in the routing network; in practice, the routing network is likely to be realized in several metal layers.
VPR 5.0, in contrast, provides preconstructed architecture models for different transistor technologies. We used an appropriate model, which eliminated the need to explicitly provide the per-unit resistance and capacitance of the wires.
C. VPR Power Model
A power model for VPR was developed by Poon et al. [40] to model traditional island-style FPGAs. Choy and Wilton [41] modified this framework to support power estimation for embedded IP cores, such as DSP blocks. We extended these models to estimate the power consumption of the FPCA. The Activity Estimator [42] estimates the probability of transitions occurring in the circuits mapped to an FPGA; static probability and transition density [43] are used to extract the transition activities of each net. The complexity of this computation is exponential in the number of inputs. This time complexity is suitable for LUT-based logic blocks, where the number of inputs is typically six or less (ignoring carry chains). The programmable GPCs used in our study of FPCAs have up to 20 inputs; for circuits of this size, the exponential runtime of the model becomes a limiting factor.
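To make the scaling concrete, the Python sketch below computes the static probability of a block output by enumerating every input combination. It is a generic illustration of why an exhaustive probabilistic analysis grows exponentially with the number of inputs, not the algorithm of the Activity Estimator [42]; the function names and the assumption of statistically independent inputs are ours.

```python
from itertools import product


def output_static_probability(func, input_probs):
    """Probability that func(*inputs) is 1, found by exhaustive enumeration.

    input_probs[i] is the probability that input i is 1. The inputs are
    assumed to be independent, which is a simplification for this sketch.
    """
    p_high = 0.0
    # The loop visits 2**k input vectors for a block with k inputs.
    for bits in product((0, 1), repeat=len(input_probs)):
        weight = 1.0
        for bit, p in zip(bits, input_probs):
            weight *= p if bit else 1.0 - p
        if func(*bits):
            p_high += weight
    return p_high


def majority(a, b, c):
    """Carry output of a full adder, used here as a small test function."""
    return (a + b + c) >= 2


p = output_static_probability(majority, [0.5, 0.5, 0.5])
rows_6_inputs = 2 ** 6
rows_20_inputs = 2 ** 20
```

A six-input block requires 64 evaluations per output, whereas a 20-input GPC requires 1,048,576, which is the growth that motivates the lookup-based alternative.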
D. FPCA Power Model for VPR
Due to the complexity of the VPR power model described in the preceding section, we developed a more efficient simulation-based power model. Our power model is based on the Lookup technique advocated by Choy and Wilton [41].
The power model consists of an offline power characterization of each GPC under different input switching activity probabilities. The results are collected in a table and fed into an online power estimator that extracts the switching activities via simulation. The simulator dynamically accesses the lookup table to determine the power dissipated given the switching activity at each point in the simulation.
The offline power characterization flow is as follows. First, the programmable GPC is modeled in VHDL and synthesized using Synopsys Design Compiler. Object and node names are extracted; these names are later used in the simulation phase for assigning switching activities. The power of the programmable GPC is estimated using Synopsys PrimePower; the power characteristics of the GPCs are extracted for different transition activity rates. The results are then organized into tables, indexed by the transition probabilities, which are then input into the online flow.
The online power estimator begins with a mapped netlist whose objects and nodes are extracted. The objects are the GPC blocks used for mapping a compressor tree. The transition activities of objects are extracted through the application of stimulus vectors, which are generated randomly. As the accuracy of the power calculations used by VPR depends on the accuracy of the switching activity annotated to the design, it is essential to achieve high coverage during simulation. High coverage is achieved via simulation feedback to the random vector generator. After a set of random vectors with high signal coverage is found, the simulator computes the transition activities of the objects and nets listed in the object list.
Next, a modified version of VPR is used to estimate the power dissipated by the FPCA. VPR’s power model is based on the transition density and the static probability of nets. The transition density of a signal represents the average number of transitions of that signal per unit time; the static probability of a net is the probability that the signal is high at any given time. These two parameters are computed for each net in the design using the simulation output.
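The two parameters can be extracted from a simulated waveform as sketched below in Python. The sketch assumes that each net is sampled once per simulation cycle and expresses the transition density in transitions per cycle; the function name and the sample format are hypothetical and are not taken from VPR.

```python
def net_statistics(samples):
    """Return (static_probability, transition_density) for one net.

    samples holds the value of the net (0 or 1) at consecutive cycles.
    """
    if len(samples) < 2:
        raise ValueError("at least two samples are required")
    # Fraction of cycles in which the net is high.
    static_probability = sum(samples) / len(samples)
    # Number of value changes between consecutive cycles.
    transitions = sum(a != b for a, b in zip(samples, samples[1:]))
    transition_density = transitions / (len(samples) - 1)
    return static_probability, transition_density


sp, td = net_statistics([0, 1, 1, 0, 1, 1, 1, 0])
```

For this eight-cycle trace, the net is high in five cycles and changes value four times over seven cycle boundaries.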
The modified VPR power model is then used to estimate the power of the FPCA. The offline GPC power characteristics are placed as a table in the ADF that describes the FPCA. The table contains the estimated power dissipation for transition activities ranging from 0 to 1 in increments of 0.1; separate tables are instantiated depending on whether or not each output of the GPC is written to its flip-flop.
A second input to the power model is the transition activity of objects extracted in the previous step. VPR reports three different power estimates: the dynamic power dissipated by the GPCs, the dynamic power dissipated by the routing network, and the leakage power. The power consumption of the routing network is estimated using switching activities and switch box and wire parameters specified in the ADF based on the target technology. The GPC power consumption is estimated using the average activity of its inputs and the offline power table in the ADF.
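The table lookup can be sketched in Python as follows. The sketch assumes an 11-entry table for activities 0, 0.1, ..., 1 and linear interpolation between adjacent entries; the interpolation, the function name, and the table contents are assumptions for illustration only (the values below are placeholders, not measured data).

```python
def gpc_power(table, input_activities):
    """Estimate GPC power from a table indexed by activity 0.0, 0.1, ..., 1.0."""
    if len(table) != 11:
        raise ValueError("expected 11 entries for activities 0.0 to 1.0")
    # Average the transition activities of the GPC inputs and clamp to [0, 1].
    avg = sum(input_activities) / len(input_activities)
    avg = min(max(avg, 0.0), 1.0)
    # Locate the two neighboring table entries and interpolate between them.
    pos = avg * 10
    lo = min(int(pos), 9)
    frac = pos - lo
    return table[lo] + frac * (table[lo + 1] - table[lo])


# Placeholder table in arbitrary units; NOT characterization results.
PLACEHOLDER_TABLE = [float(i) for i in range(11)]
power = gpc_power(PLACEHOLDER_TABLE, [0.2, 0.4, 0.6])
```

Because the table is built once offline, each online query costs a constant amount of work regardless of the number of GPC inputs.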
VI. EXPERIMENTAL RESULTS
A. Benchmarks
We selected a set of benchmarks from arithmetic, DSP, and video processing domains where we were able to identify compressor trees. These benchmarks were broadly categorized into multiplier-based and multi-input addition benchmarks.
The multiplier-based benchmarks include g.721 [44], a polynomial function that has been optimized using Horner’s Rule (hpoly), two multipliers, and a video processing application (video mixer [45]). The video mixer converts two channels of red–green–blue video to television standard YIQ signals and then mixes them in an alpha blender to produce a composite output signal.
The multi-input addition benchmarks include the MediaBench application adpcm [44], a 1-D multiplierless discrete cosine transform (dct [8]), three- and six-tap finite-impulse response filters with randomly generated constants (fir3 and fir6 [6]), and an internally developed variable block size motion estimator for H.264/AVC video coding (H.264 ME).
Fig. 8. Critical path delay observed for each benchmark without the transformations of Verma et al. [3] and with multiplication operations synthesized on DSP blocks (DSP); all other synthesis methods applied the transformations. Ternary and GPC synthesize each benchmark wholly on the general logic of the FPGA, while 8:4, 12:4, 16:5, and 20:5 synthesize each compressor tree on an FPCA that is integrated into a larger FPGA.
The multiplier-based benchmarks contain multipliers that can be synthesized on an FPGA using DSP blocks; however, when the transformations of Verma et al. [3] are applied, the compressor trees within the multipliers are merged with other addition operations, rendering the DSP blocks useless. After applying these transformations, the compressor trees within these benchmarks can only be synthesized on the general logic of the FPGA or on an FPCA.
The video mixer contains many disparate compressor trees, even after the transformations of Verma et al. are applied; all other benchmarks contain one compressor tree. H.264 ME contains a set of identical processing elements (PEs), where each PE contains a compressor tree. The number of PEs can vary depending on the needs of the system. We chose to synthesize a four-PE system, ignoring the memory and control logic.
Each benchmark was synthesized six or seven times.
1) DSP: The multiplier-based benchmarks and adpcm were synthesized without applying the transformations of Verma et al. All multipliers in the multiplier-based benchmarks were synthesized on the DSP blocks. adpcm contains three disparate addition operations, but cannot use DSP blocks.
In all subsequent experiments, the transformations of Verma et al. were applied to the multiplier-based benchmarks and to adpcm; the remaining multi-input addition benchmarks were written with compressor trees explicitly exposed. DSP blocks cannot be used for multiplication operations following the transformations.
2) Ternary: Compressor trees are synthesized on ternary adder trees using FPGA logic cells configured as ternary adders.
3) GPC: Compressor trees are synthesized on the general logic of an FPGA using the GPC mapping heuristic of Parandeh-Afshar et al. [9].
4) 8:4, 12:4, 16:5, and 20:5: The compressor tree is synthesized on an FPCA; four different FPCAs with different counter sizes were considered.
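For readers unfamiliar with compressor trees, the Python sketch below shows the principle that all of these synthesis methods target: a set of operands is reduced with carry-save (3:2) compression until two values remain, and a single carry-propagate addition produces the result. It is a word-level illustration for non-negative integers using only 3:2 compression; it does not reproduce the mapping heuristics or the counter sizes evaluated here, and the function names are ours.

```python
def carry_save_add(a, b, c):
    """3:2 compression: reduce three integers to a (sum, carry) pair."""
    s = a ^ b ^ c                                # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1   # carries, moved up one rank
    return s, carry


def compressor_tree(operands):
    """Add non-negative integers with carry-save compression.

    Three values are replaced by two until at most two remain; only the
    final step performs a carry-propagate addition.
    """
    values = list(operands)
    while len(values) > 2:
        a, b, c = values[:3]
        values = values[3:] + list(carry_save_add(a, b, c))
    return sum(values)  # final carry-propagate addition


result = compressor_tree([3, 5, 7, 9, 11])
```

Each compression step is free of carry propagation, so the delay of the tree grows with the number of levels rather than with the operand bitwidth.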
The experiments synthesized purely combinational circuits.
In actuality, the frequency and throughput of a compressor tree
could be increased by registering the output bits of each level
of logic in the tree. The benchmarks that were implemented did
not naturally contain pipelined compressor trees; therefore, this
possibility is not evaluated here.
Lastly, we note that Brisk et al. [12] attempted to synthesize adder trees using the DSP blocks; this approach yielded very slow compressor trees. As the approach is not competitive, these experiments are not repeated here.
B. Results
Fig. 8 shows the critical path delay of each benchmark after synthesis. In all cases other than the two multipliers, the FPCA yields the minimum critical path delay. In particular, the FPCA’s success on the multiplier-based benchmarks compared with that of a DSP is due to its ability to accelerate compressor trees generated by the transformations of Verma et al. The two multipliers do not benefit from these transformations, as their compressor trees are not merged with any other operations.
It should be noted that the two multipliers are worst-case examples for the Altera-style FPGA. The reason is that each half-DSP block contains multipliers of a fixed bitwidth; for example, one of the two benchmarks requires four of them. The gap between the DSP block and the FPCA would be exacerbated for multiplications whose bitwidths match those of the DSP multipliers.
Fig. 9. Average delay of the set of benchmarks decomposed into the delay through DSP blocks (DSP only)/compressor tree logic (all others) and noncompressor tree logic (all synthesis methods) for the (a) multiplier-based and (b) multi-input addition benchmarks.
The benefits of the FPCA are most pronounced for the video mixer because the transformations of Verma et al. are particularly effective for this benchmark: it has many multiplication and addition operations that are merged together by these transformations.
Fig. 8 also shows that the FPCA considerably reduces the critical path delay over Ternary and GPC; this is particularly important for the multi-input addition benchmarks, where the DSP blocks cannot be used.
Among the different FPCA options, none performed uniformly better than the others. In the case of adpcm, all four FPCAs achieved comparable critical path delays. Among the FPCAs, 12:4 had the minimum critical path delay for hpoly and one of the two multipliers; 16:5 had the minimum critical path delay for g721, fir6, and H.264 ME; and 20:5 had the minimum critical path delay for the remaining benchmarks. These results indicate that no FPCA will be ideal for all benchmarks, but the counter size should probably be larger than 8:4.
Fig. 9 shows the average delay across the benchmarks, decomposed into DSP block/compressor tree logic and noncompressor tree logic. Fig. 9(a) shows the multiplier-based benchmarks, where DSP blocks can be used, and Fig. 9(b) shows the results for the multi-input addition benchmarks. Due to its limited functionality, the FPCA only speeds up the compressor tree.
In Fig. 9(a), the FPCA reduces the average compressor tree logic delay by between 30% (8:4) and 47% (20:5); however, the transformations of Verma et al. and the need to synthesize partial-product generators on the FPGA general logic when DSP blocks are not used increase the average noncompressor tree logic delay by 46%. GPC and Ternary increase the average critical path delay compared with DSP; 8:4, 12:4, 16:5, and 20:5 reduce the average critical path delay compared with DSP by 0.2%, 8%, 10%, and 11%, respectively.
In Fig. 9(a), the increase in average noncompressor tree logic delay for all options other than DSP is due to the fact that partial-product generators must be synthesized on general FPGA logic, rather than the DSP blocks. On the other hand, the FPCA considerably reduces the average delay of the resulting compressor trees compared with Ternary and GPC.
DSP blocks cannot be used for the multi-input addition benchmarks in Fig. 9(b); we take GPC as a baseline, as its critical path delay is less than that of Ternary. Since the transformations of Verma et al. are applied to adpcm and the other multi-input addition benchmarks have compressor trees directly exposed, the noncompressor tree logic delay is the same in all cases. Compared with GPC, 8:4, 12:4, 16:5, and 20:5 reduce the overall (compressor tree) delay by 35% (45%), 41% (53%), 43% (56%), and 43% (55%), respectively. As there are no partial-product generators and DSP blocks are not used, these benefits are due solely to critical path reduction within the compressor trees.
Fig. 10 shows the area of each benchmark converted to two-input NAND gate equivalents (GEs); the area includes the computational elements (LUTs, DSP blocks, and GPCs) and does not include any estimates of the utilization of resources in the programmable routing network.
Each DSP block contains eight multipliers and has an area of 10714 GEs. Each multiplier generates many partial products; since each ALM produces only two output bits, a correspondingly large number of ALMs is required to generate them on general logic. In theory, this gives DSP blocks an advantage in terms of area utilization compared with the other methods.
For hpoly and the two multipliers, the FPCA consumed considerably more area than DSP. This is due, primarily, to the fact that partial-product generators must be synthesized on the general logic of the FPGA. It should be noted that one of the multipliers required just one DSP block but used only four of its eight multipliers. In other cases, namely, g721 and the video mixer, the FPCAs had similar area requirements to DSP; however, 12:4 for video mixer was significantly smaller. For this benchmark, GPCs built from 12:4 counters, coincidentally, were perfectly sized building blocks.
Compared with GPC and Ternary, the FPCAs reduced the area requirement; in most cases, the area reduction was marginal; however, it was quite pronounced for the video mixer and fir6. Similar to the critical path delay results reported in Fig. 8, none of the four FPCA options was uniformly better than the others across all benchmarks.
Fig. 10. Area of each benchmark after synthesis. The areas of the DSP blocks, LUTs, and GPCs have been converted to two-input NAND GEs. These area estimates do not account for the programmable routing network.
Fig. 11. Energy consumption of each benchmark (normalized to Ternary).
Fig. 11 shows the normalized energy consumption of each benchmark. As VPR 5.0 does not have a power model, the energy consumption values reported here are only for the compressor trees and were measured using VPR 4.30. No energy consumption for DSP is reported since VPR 4.30 does not support embedded blocks.
Fig. 12 shows the average energy consumption across the set
of benchmarks, decomposed into energy consumed by the logic
elements (ALMs/GPCs) and the routing network.
GPC consumes more energy than the other options. GPC builds a compressor tree using six-input GPCs with three or four outputs; two ALMs per GPC are required. Each ALM in Ternary, in contrast, takes six input bits and produces two output bits (ignoring the carry-out bit, which is propagated to the next ALM in the chain). For this reason, GPC tends to require more ALMs and dissipates more static power.
Fig. 12 shows that the primary advantage of the FPCA comes from its ability to reduce logic delay. In both Ternary and GPC, LUT-based ALMs are used to realize the arithmetic building blocks for compressor trees. The FPCA, in contrast, uses ASIC implementations of these components, which are considerably more efficient. Although Ternary consumes less energy in the routing network than any of the alternatives, the FPCA more than makes up for this in terms of energy savings in the logic. In conclusion, both Figs. 11 and 12 show that the FPCA significantly reduces energy consumption compared with Ternary and GPC.
We suspect that DSP blocks will consume less energy for multiplication operations, because the other methods will need to synthesize the partial-product generators on the general logic of the FPGA and the number of partial products per multiplication operation is quadratic in the operand bitwidth.
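The quadratic growth can be checked with a short Python sketch. It assumes unsigned n-bit operands and partial products formed by ANDing every bit of one operand with every bit of the other; the function name and the (rank, bit) output format are ours and are used only for illustration.

```python
def partial_products(x, y, n):
    """Partial-product bits of an unsigned n-bit by n-bit multiplication.

    Returns (rank, bit) pairs, one for each combination of a bit of x and
    a bit of y, so the list always has n * n entries.
    """
    return [
        (i + j, ((x >> i) & 1) & ((y >> j) & 1))
        for i in range(n)
        for j in range(n)
    ]


pp = partial_products(13, 11, 4)
count = len(pp)
product_value = sum(bit << rank for rank, bit in pp)
```

A 4-bit multiplication yields 16 partial-product bits whose weighted sum equals the product; doubling the bitwidth quadruples the number of bits that must be generated on general logic and then summed by the compressor tree.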
Fig. 12. Average energy consumption due to logic and routing.
VII. CONCLUSION AND FUTURE WORK
The FPCA is a programmable IP core that can accelerate compressor trees on FPGAs. For parallel multiplication, the FPCA retains most of the benefits of the embedded multipliers in the DSP blocks, while providing a variable-bitwidth solution for multiplication operations that do not match the fixed bitwidth of the DSP blocks; however, it suffers a disadvantage in terms of area utilization because the partial products must be synthesized on the FPGA general logic. Moreover, the FPCA can accelerate multi-input addition operations, while the DSP blocks cannot, particularly when used in conjunction with the transformations of Verma et al. [3] to expose compressor trees at the application level. Furthermore, the FPCA reduces the critical path delay and energy consumption compared with the best methods to synthesize compressor trees on the general logic of the FPGA.
The DSP block will generally outperform the FPCA for applications containing many multiplications whose bitwidths match precisely that of the ASIC multipliers in the embedded DSP blocks and for which the transformations of Verma et al. are ineffective. For virtually all other applications that contain compressor trees, whether naturally or via transformation, the FPCA performs significantly better than current FPGAs.
REFERENCES
[1] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Elec
tron. Comput., vol. EC13, no. 6, p. 754, Dec. 1964.
[2] L. Dadda, “Some schemes for parallel multipliers,” Alta Freq., vol. 34,
pp. 349–356, Mar. 1965.
[3] A. K. Verma, P. Brisk, and P. Ienne, “Dataflow transformations to
maximize the use of carrysave representation in arithmetic circuits,”
IEEE Trans. Comput.Aided Design Integr. Circuits Syst., vol. 27, no.
10, pp. 1761–1774, Oct. 2008.
[4] S. Mirzaei, A. Hosangadi, and R. Kastner, “FPGA implementation of
high speed FIR filters using add and shift method,” in Proc. Int. Conf.
Comput. Des., San Jose, CA, Oct. 2006, pp. 308–313.
[5] C.Y. Chen, S.Y. Chien, Y.W. Huang, T.C. Chen, T.C. Wang, and
L.G. Chen, “Analysis and architecture design of variable blocksize
motion estimation for H.264/AVC,” IEEE Trans. Circuits Syst. I, Reg.
Papers, vol. 53, no. 2, pp. 578–593, Feb. 2006.
[6] S. Sriram, K. Brown, R. Defosseux, F. Moerman, O. Paviot, V. Sun
dararajan, and A. Gatherer, “A 64 channel programmable receiver chip
for 3G wireless infrastructure,” in Proc. IEEE Custom Integr. Circuits
Conf., San Jose, CA, Sep. 2005, pp. 59–62.
[7] S. R. Vangal, Y. V. Hoskote, N. Y. Borkar, and A. Alvandpour, “A 6.2-Gflops floating-point multiply-accumulator with conditional normalization,” IEEE J. Solid-State Circuits, vol. 41, no. 10, pp. 2314–2323, Oct. 2006.
[8] A. Shams, W. Pan, A. Chandanandan, and M. Bayoumi, “A highper
formance 1DDCT architecture,” in Proc. IEEE Int. Symp. Circuits
Syst., Geneva, Switzerland, May 2000, vol. 5, pp. 521–524.
[9] H. ParandehAfshar, P. Brisk, and P. Ienne, “Efficient synthesis of
compressor trees on FPGAs,” in Proc. AsiaSouth Pacific Des. Autom.
Conf., Seoul, Korea, Jan. 2008, pp. 138–143.
[10] H. ParandehAfshar, P. Brisk, and P. Ienne, “Improving synthesis of
compressor trees on FPGAs via integer linear programming,” in Proc.
Int. Conf. Des. Autom. Test Eur., Munich, Germany, Mar. 2008, pp.
1256–1261.
[11] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,”
IEEE Trans. Comput.Aided Design Integr. Circuits Syst., vol. 26, no.
2, pp. 203–215, Feb. 2007.
[12] P. Brisk, A. K. Verma, P. Ienne, and H. Parandeh-Afshar, “Enhancing FPGA performance for arithmetic circuits,” in Proc. Des. Autom. Conf., San Diego, CA, Jun. 2007, pp. 404–409.
[13] W. J. Stenzel, W. J. Kubitz, and G. H. Garcia, “A compact highspeed
parallel multiplication scheme,” IEEE Trans. Comput., vol. C26, no.
10, pp. 948–957, Oct. 1977.
[14] S. Dormido and M. A. Canto, “Synthesis of generalized parallel
counters,” IEEE Trans. Comput., vol. C30, no. 9, pp. 699–703,
Sep. 1981.
[15] S. Dormido and M. A. Canto, “An upper bound for the synthesis of
generalized parallel counters,” IEEE Trans. Comput., vol. C31, no. 8,
pp. 802–805, Aug. 1982.
[16] “Stratix III Device Handbook, Vol. 1 and 2” Altera Corporation, San
Jose, CA, Feb. 2009. [Online]. Available: http://www.altera.com/
[17] “Virtex5 User Guide” Xilinx Corporation, San Jose, CA, 2007. [On
line]. Available: http://www.xilinx.com/
[18] “Virtex-5 FPGA XtremeDSP Design Considerations” Xilinx Corporation, San Jose, CA, Jan. 2009. [Online]. Available: http://www.xilinx.com/
[19] P. S. Zuchowski, C. B. Reynolds, R. J. Grupp, S. G. Davis, B. Cremen,
and B. Troxel, “A hybrid ASIC and FPGA architecture,” in Proc. Int.
Conf. Comput.Aided Des., San Jose, CA, Nov. 2002, pp. 187–194.
[20] P. Jamieson and J. Rose, “Architecting hard crossbars on FPGAs and
increasing their areaefficiency with shadow clusters,” in Proc. IEEE
Int. Conf. Field Programmable Technol., Kitakyushu, Japan, Dec.
2007, pp. 57–64.
[21] M. J. Beauchamp, S. Hauck, K. D. Underwood, and K. S. Hemmert, “Architectural modifications to enhance the floating-point performance of FPGAs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 2, pp. 177–187, Feb. 2008.
[22] R. Kastner, A. Kaplan, S. O. Memik, and E. Bozorgzadeh, “Instruc
tion generation for hybridreconfigurable systems,” ACM Trans. Des.
Autom. Electron. Syst., vol. 7, no. 4, pp. 602–627, Oct. 2002.
[23] A. Cevrero, P. Athanasopoulos, H. ParandehAfshar, A. K. Verma, P.
Brisk, F. Gurkaynak, Y. Leblebici, and P. Ienne, “Architecture im
provements for field programmable counter arrays: Enabling synthesis
of fast compressor trees on FPGAs,” in Proc. Int. Symp. FPGAs, Mon
terey, CA, Feb. 2008, pp. 181–190.
[24] D. Cherepacha and D. Lewis, “DPFPGA: An FPGA architecture op
timized for datapaths,” VLSI Des., vol. 4, no. 4, pp. 329–343, 1996.
[25] A. Kaviani, D. Vranseic, and S. Brown, “Computational field pro
grammable architecture,” in Proc. IEEE Custom Integr. Circuits Conf.,
Santa Clara, CA, May 1998, pp. 261–264.
[26] S. Hauck, M. M. Hosler, and T. W. Fry, “Highperformance carry
chains for FPGAs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,
vol. 8, no. 2, pp. 138–147, Apr. 2000.
[27] K. LeijtenNowak and J. L. van Meerbergen, “An FPGA architecture
with enhanced datapath functionality,” in Proc. Int. Symp. Field Pro
grammable Gate Arrays, Monterey, CA, Feb. 2003, pp. 195–204.
[28] M. T. Frederick and A. K. Somani, “Multibit carry chains for
highperformance reconfigurable fabrics,” in Proc. Int. Conf. Field
Programmable Logic Appl., Madrid, Spain, Aug. 2006, pp. 1–6.
[29] H. Parandeh-Afshar, P. Brisk, and P. Ienne, “A novel FPGA logic block for improved arithmetic performance,” in Proc. Int. Symp. Field Programmable Gate Arrays, Monterey, CA, Feb. 2008, pp. 171–180.
[30] B. Parhami, “Configurable arithmetic arrays with datadriven control,”
in Proc. Asilomar Conf. Signals, Syst., Comput., Pacific Grove, CA,
Oct./Nov. 2000, pp. 89–93.
[31] R. Francis, S. Moore, and R. Mullins, “A network of time-division multiplexed wiring for FPGAs,” in Proc. 2nd IEEE Symp. Networks-on-Chip, Newcastle University, U.K., Apr. 2008, pp. 35–44.
[32] N. L. Miller and S. F. Quigley, “A novel field programmable gate array
architecture for high speed arithmetic processing,” in Proc. 8th Int.
Workshop FieldProgrammable Logic Appl., Tallinn, Estonia, Aug./
Sep. 1998, pp. 386–390.
[33] A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin, and B. Hutch
ings, “A reconfigurable arithmetic array for multimedia applications,”
in Proc. Int. Symp. Field Programmable Gate Arrays, Monterey, CA,
Feb. 1999, pp. 135–143.
[34] S. Fiske and W. J. Dally, “The reconfigurable arithmetic processor,” in
Proc. 15th Int. Symp. Comput. Archit., Honolulu, HI, May/Jun. 1988,
pp. 30–36.
[35] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, “A 5GHz
mesh interconnect for a teraflops processor,” IEEE Micro, vol. 27, no.
5, pp. 51–61, Sep./Oct. 2007.
[36] J. Cong and H. Huang, “Technology mapping and architecture evalua
tion for k/mmacrocellbased FPGAs,” ACM Trans. Des. Autom. Elec
tron. Syst., vol. 10, no. 1, pp. 3–23, Jan. 2005.
[37] Y. Hu, S. Das, S. Trimberger, and L. He, “Design, synthesis and eval
uation of heterogeneous FPGA with mixed LUTs and macrogates,”
in Proc. Int. Conf. Comput.Aided Des., San Jose, CA, Nov. 2007, pp.
188–193.
[38] V. Betz and J. Rose, “VPR: A new packing, placement, and routing tool for FPGA research,” in Proc. 7th Int. Workshop Field-Programmable Logic Appl., London, U.K., Sep. 1997, pp. 213–222.
[39] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep
Submicron FPGAs. Norwell, MA: Kluwer, Feb. 1999.
[40] K. K. W. Poon, S. J. E. Wilton, and A. Yan, “A detailed power model for field-programmable gate arrays,” ACM Trans. Des. Autom. Electron. Syst., vol. 10, no. 2, pp. 279–302, Apr. 2005.
[41] N. C. K. Choy and S. J. E. Wilton, “Activitybased power estimation
and characterization of DSP and multiplier blocks in FPGAs,” in Proc.
IEEE Int. Conf. Field Programmable Technol., Bangkok, Thailand,
Dec. 2006, pp. 253–256.
[42] J. Lamoureux and S. J. E. Wilton, “Activity estimation for field pro
grammable gate arrays,” in Proc. IEEE Int. Conf. Field Programmable
Logic Appl., Madrid, Spain, Aug. 2006, pp. 1–8.
[43] F. N. Najm, “A survey of power estimation techniques in VLSI cir
cuits,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 2, no. 4,
pp. 446–455, Dec. 1994.
[44] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, “MediaBench: A
tool for evaluating and synthesizing multimedia and communications
systems,” in Proc. 30th Int. Symp. Microarchitecture, Research
Triangle Park, NC, Dec. 1997, pp. 330–335.
[45] “Creating High-Speed Data Path Components—Application Note,”
Synopsys Corporation, Mountain View, CA, Oct. 1999, ver. 1999.10,
Chapter 1.
Hadi Parandeh-Afshar received the B.S. degree
in computer engineering and the M.S. degree in
computer architecture from the University of Tehran,
Tehran, Iran, in 2001 and 2003, respectively. He has
been working toward the Ph.D. degree in the
Processor Architecture Laboratory, School of Computer
and Communications Sciences, École Polytechnique
Fédérale de Lausanne, Lausanne, Switzerland, since
2007.
His research interests include reconfigurable
computing, computer architecture and arithmetic, and
design automation for embedded systems.
Ajay Kumar Verma received the B.S. degree in
computer science from the Indian Institute of
Technology Kanpur, Kanpur, India, in 2003. He has been
working toward the Ph.D. degree in the Processor
Architecture Laboratory, School of Computer and
Communications Sciences, École Polytechnique
Fédérale de Lausanne, Lausanne, Switzerland, since
2004.
His research interests include logic synthesis,
optimization of arithmetic circuits, and design
automation for application-specific processors.
Mr. Verma was a recipient of the Best Paper Award at the International
Conference on Compilers, Architecture, and Synthesis for Embedded Systems in
2007.
Philip Brisk received the B.S., M.S., and Ph.D.
degrees from the University of California, Los Angeles,
in 2002, 2003, and 2006, respectively, all in computer
science.
Since 2006, he has been a Postdoctoral Scholar
with the Processor Architecture Laboratory, School
of Computer and Communications Sciences, École
Polytechnique Fédérale de Lausanne, Lausanne,
Switzerland. His research interests include
reconfigurable computing, compilers, and design automation
and architecture for application-specific processors.
Dr. Brisk is or has been a member of the program committees of several
international conferences and workshops, including the IEEE Symposium on
Application-Specific Processors, the International Workshop on Software and
Compilers for Embedded Systems, and the Reconfigurable Architecture Workshop.
He was a recipient of the Best Paper Award at the International Conference on
Compilers, Architecture, and Synthesis for Embedded Systems in 2007.
Paolo Ienne (S’94–M’96) received the Dottore
degree in electronic engineering from the Politecnico
di Milano, Milan, Italy, in 1991 and the Ph.D. degree
from the École Polytechnique Fédérale de Lausanne
(EPFL), Lausanne, Switzerland, in 1996.
In December 1996, he joined the Semiconductors
Group, Siemens AG, Munich, Germany (which later
became Infineon Technologies AG), where after
working on data path generation tools, he became
the Head of the Embedded Memory Unit, Design
Libraries Division. Since 2000, he has been with EPFL,
where he is currently a Professor and heads the Processor Architecture
Laboratory, School of Computer and Communications Sciences. His research interests
include various aspects of computer and processor architecture, computer
arithmetic, reconfigurable computing, and multiprocessor systems-on-chip.
Dr. Ienne is or has been a member of the program committees of several
international conferences and workshops, including Design Automation and Test
in Europe, the International Conference on Computer Aided Design, the
International Conference on Compilers, Architecture, and Synthesis for Embedded
Systems (CASES), the International Symposium on Low Power Electronics and
Design, the International Symposium on High-Performance Computer
Architecture, the International Conference on Field Programmable Logic and
Applications, and the IEEE International Symposium on Asynchronous Circuits and
Systems. He was the General Cochair of the Sixth IEEE Symposium on
Application-Specific Processors (SASP’08) and a Guest Editor for a special section
of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS on
Application Specific Processors. He was a recipient of the Design Automation
Conference 2003 and the CASES 2007 Best Paper Awards.