Improving FPGA Performance for Carry-Save Arithmetic
Abstract
The selective use of carry-save arithmetic, where appropriate, can accelerate a variety of arithmetic-dominated circuits. Carry-save arithmetic occurs naturally in a variety of DSP applications, and further opportunities to exploit it can be exposed through systematic data flow transformations that can be applied by a hardware compiler. Field-programmable gate arrays (FPGAs), however, are not particularly well suited to carry-save arithmetic. To address this concern, we introduce the "field programmable counter array" (FPCA), an accelerator for carry-save arithmetic intended for integration into an FPGA as an alternative to DSP blocks. In addition to multiplication and multiply accumulation, the FPCA can accelerate more general carry-save operations, such as multi-input addition (e.g., add k > 2 integers) and multipliers that have been fused with other adders. Our experiments show that the FPCA accelerates a wider variety of applications than DSP blocks and improves performance, area utilization, and energy consumption compared with soft FPGA logic.
578 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 4, APRIL 2010
Hadi Parandeh-Afshar, Ajay Kumar Verma, Philip Brisk, and Paolo Ienne
Index Terms—Carry-save arithmetic, field-programmable gate array (FPGA), generalized parallel counter (GPC).
I. INTRODUCTION
FIELD-PROGRAMMABLE GATE ARRAY (FPGA) performance is lacking for arithmetic circuits. Generally, arithmetic circuits do not map well onto lookup tables (LUTs), the primary building block for general logic in FPGAs. To address this concern, FPGAs offer two solutions: First, LUTs are now tightly integrated with fast carry chains that perform efficient carry-propagate addition; second, FPGAs contain DSP blocks that perform multiplication and multiply accumulation (MAC). Although an improvement over LUTs alone, these enhancements lack generality; specifically, they cannot effectively accelerate carry-save arithmetic.
Carry-save arithmetic is a technique to add sets of numbers that eliminates much of the carry propagation that would otherwise occur. Carry-save arithmetic has been the method of choice for partial-product reduction in parallel multipliers for more than 40 years [1], [2]. More recently, Verma et al. [3] developed a set of arithmetic-oriented data flow transformations that can be applied to a computation in order to maximize the use of carry-save arithmetic. These transformations systematically reorder the operations in a circuit in order to cluster disparate
Manuscript received May 27, 2008; revised September 26, 2008. First published June 16, 2009; current version published March 24, 2010.
The authors are with the Processor Architecture Laboratory, School of Computer and Communications Sciences, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland (e-mail: hparande@gmail.com; ajaykumar.verma@epfl.ch; philip.brisk@epfl.ch; paolo.ienne@epfl.ch).
Digital Object Identifier 10.1109/TVLSI.2009.2014380
adders together and to merge adders with the partial-product-reduction trees of parallel multipliers. Each cluster of adders is then replaced with a compressor tree, i.e., a circuit that reduces k > 2 integers, x_1, ..., x_k, down to two, S (sum) and C (carry), such that

S + C = x_1 + x_2 + ... + x_k.    (1)

A carry-propagate adder (CPA), i.e., a two-input adder, then performs the final addition, S + C, to compute the result. Aside from the transformations of Verma et al. [3], compressor trees occur naturally in a variety of applications [4]–[8].
The arithmetic capabilities of FPGAs are not well attuned to the needs of carry-save arithmetic. Programmable LUTs have been augmented with fast carry chains that are good building blocks for CPAs but cannot be used for carry-save arithmetic. The fastest methods to synthesize compressor trees on FPGA general logic [9], [10] do not use the carry chains except for the final CPA.
FPGAs also integrate DSP blocks, which perform integer multiplication and MAC. Although useful, DSP blocks cannot accelerate multi-input addition; likewise, when the transformations of Verma et al. merge multipliers with adders, the resulting operation can no longer map onto a DSP block. That being said, certain multiplication operations whose bitwidths do not match up well with the bitwidths of the DSP blocks are faster when performed on the general logic of an FPGA [11].
This paper advocates the use of a field programmable counter array (FPCA) for carry-save arithmetic on FPGAs. The FPCA is a programmable accelerator that can be integrated into an FPGA as an alternative to DSP blocks. An early FPCA, introduced by Brisk et al. [12], is a lattice of n:m counters. An n:m counter is a circuit that takes n input bits, counts the number of them that are set to 1, and outputs the result, a value in the range [0, n], as an m-bit unsigned binary number. The number of output bits is

m = ⌈log2(n + 1)⌉.    (2)
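The behavior of an n:m counter, together with (2), can be sketched in a few lines (our illustration; the function name and LSB-first bit ordering are arbitrary choices):

```python
from math import ceil, log2

def counter(bits):
    """n:m counter: count the input bits that are set to 1 and emit the count
    as an m-bit unsigned binary number, least significant bit first."""
    n = len(bits)
    m = ceil(log2(n + 1))      # number of output bits, per (2)
    total = sum(bits)          # a value in the range [0, n]
    return [(total >> j) & 1 for j in range(m)]

out = counter([1, 1, 0, 1, 1, 1, 0])   # a 7:3 counter with five ones
assert out == [1, 0, 1]                # 5 = 0b101, LSB first
```

For example, a 7:3 counter (n = 7) needs m = ⌈log2 8⌉ = 3 output bits, and a 15:4 counter needs four.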
The FPCA described in this paper, in contrast, is built using generalized parallel counters (GPCs) [13]–[15], an extended type of counter that can sum bits having different input ranks; GPCs, which will be defined formally in Section III, are built using n:m counters as building blocks.
Our experiments compare the FPCA with DSP blocks for multiplication-dominated circuits and with the best methods to synthesize compressor trees on general FPGA logic for other circuits that feature carry-save arithmetic. As the DSP blocks are fixed-bitwidth multipliers/MACs, they perform better than
1063-8210/$26.00 © 2010 IEEE
Authorized licensed use limited to: EPFL LAUSANNE. Downloaded on April 08,2010 at 09:28:15 UTC from IEEE Xplore. Restrictions apply.
Fig. 1. Illustration of the methodology underlying the proposed reconfigurable lattice. Using arithmetic transformations [3], a circuit/flow graph is transformed to expose one (or more) compressor trees. The compressor tree is mapped onto a new reconfigurable lattice (FPCA) that is integrated into an FPGA. The final addition is mapped onto a dedicated adder (shown here integrated into the new lattice, but it could easily be implemented using a carry chain instead). The remaining nonarithmetic operations in the circuit/flow graph are mapped onto the FPGA logic blocks.
the FPCA for those operations and bitwidths; however, the FPCA retains an advantage when bitwidth mismatches occur. The FPCA also benefits from the transformations of Verma et al. [3], whereas the DSP blocks do not. The FPCA offers advantages over DSP blocks and FPGA logic in terms of critical path delay, area utilization, and energy consumption.
In our experiments, we considered GPCs built from four parallel counter sizes: 8:4, 12:4, 16:5, and 20:5; although no counter size was uniformly better than the others across all benchmarks, our results suggest that increasing the counter size beyond 12:4 yields diminishing returns. We conclude that GPCs built from 12:4 counters are the ideal choice for our specific set of benchmarks.
A. Illustrative Example
Fig. 1 shows our approach. A circuit transformed as described previously is partitioned into a (set of) compressor tree(s) with corresponding CPA(s) and a set of nonarithmetic operations. The compressor tree is mapped onto an FPCA, which is embedded within a larger FPGA. Fig. 1 assumes that a dedicated CPA is integrated into the FPCA; alternatively, the carry chains in the logic-block structure of the FPGA could be used to perform the final CPA. The nonarithmetic portions of the circuit are mapped onto the FPGA. Following the lead of Xilinx and Altera, the FPGA shown in Fig. 1 is organized into columns. Each column contains a set of logic clusters [e.g., the Altera Logic Array Block (LAB)], which contain several logic blocks [e.g., the Altera Adaptive Logic Module (ALM)] connected by local routing. A global routing network connects the different logic clusters. Due to the column structure, the horizontal and vertical routing channels are nonuniform.
B. Paper Organization
The paper is organized as follows. Section II summarizes related work, Section III introduces GPCs, Section IV presents the FPCA architecture, Sections V and VI present the experimental framework and results, and lastly, Section VII concludes the paper.
II. RELATED WORK
A. Commercial FPGA Architectures and Mapping
This section summarizes the arithmetic features in the Altera Stratix III [16] and Xilinx Virtex-5 [17] FPGAs, both of which are high-end FPGAs realized in 65-nm CMOS technology. The logic architectures of both of these FPGAs feature six-input LUTs with carry chains that perform efficient carry-propagate addition without using the routing network. The Stratix III carry chain is a ripple-carry adder; the Virtex-5 carry chain includes an XOR gate and a multiplexor (mux) which enable carry-lookahead addition.
Stratix II introduced a method to combine the LUTs with the carry chain to perform ternary (three-input) addition, which remained in place for the Stratix III; the Virtex-5 similarly supports ternary addition.
Due to the peculiar nature of FPGA architectures, it has long been thought that multi-input addition is best realized using trees of adders rather than compressor trees. The use of ternary adders rather than binary (two-input) adders could reduce the height of the trees, thereby reducing delay and/or pipeline depth. Parandeh-Afshar et al. [9], [10], however, showed that compressor trees could be synthesized on FPGAs using GPCs (see Section III), significantly reducing the delay compared to ternary adder trees. Experimentally, this paper finds that the FPCA is faster than both of these alternatives.
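The height argument can be checked with a few lines of arithmetic (our illustration, not code from the paper): a balanced tree of fixed-arity adders summing k operands has depth ⌈log_arity k⌉ levels.

```python
from math import ceil

def tree_depth(k, arity):
    """Depth of a balanced adder tree that sums k operands with fixed-arity adders."""
    depth = 0
    while k > 1:
        k = ceil(k / arity)   # each level merges groups of `arity` operands
        depth += 1
    return depth

# Summing 16 operands:
assert tree_depth(16, 2) == 4   # binary adder tree: 16 -> 8 -> 4 -> 2 -> 1
assert tree_depth(16, 3) == 3   # ternary adder tree: 16 -> 6 -> 2 -> 1
```

The ternary tree saves one full level of carry-propagate adders in this case, which is the advantage the carry-chain-based ternary adders exploit.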
B. FPGA Enhancements to Improve Arithmetic Performance
Numerous enhancements for FPGAs have been proposed in the past, particularly to improve arithmetic performance. For example, several researchers have proposed hard IP cores: application-specific integrated circuit (ASIC) components, embedded into the FPGA, that implement common operations. The most prevalent of these IP cores include block memories, DSP/MAC blocks [18], [19], standard I/O interfaces [19], crossbars [20], shifters [21], and floating-point units [21]. Kastner et al. [22] developed techniques to examine a set of applications to find good domain-specific IP core candidates.
Although the FPCA is similar in principle to the IP cores described previously, it is not completely hard: It is programmable and has its own routing network. Although it is intended to implement just one class of circuits (compressor trees), the FPCA
is flexible and is not fixed to a specific bitwidth; this distinguishes the FPCA, for example, from the hard multipliers whose bitwidths are fixed. Kuon and Rose [11] have noted that fixed-bitwidth multiplication has some limitations, e.g., it is inefficient to implement a narrow multiplication on the wider fixed-width multiplier contained in a DSP block.
Cevrero et al. [23] recently proposed an alternative FPCA architecture. Theirs is radically different from the one described here; the most important distinguishing feature is that it uses direct programmable connections but does not employ a global routing network; as such, it offers less flexibility than the architecture proposed here, but with the potential of reduced delay, area, and power consumption due to the absence of global routing. Future work will compare and contrast these two architectures to better understand the differences between them.
C. Carry Chains
Also notable, but not directly related to this work, are the fast carry chains [24]–[29]: These are used to implement efficient carry-propagate addition within FPGA logic cells. If an FPCA is present, these carry chains can be used to perform the final addition if a hard IP core implementation of a CPA is not present.
Parandeh-Afshar et al. [29] developed a carry chain that allows a logic cell to be configured as a 6:2 compressor, a well-known building block for compressor trees. The FPCA, however, is much more powerful, as its logic cells contain larger GPCs (e.g., with up to 20 inputs). The use of larger and more flexible components reduces the number of logic levels in the compressor tree as well as pressure on the routing network: This is favorable from the perspective of delay, area utilization, and power consumption.
D. Programmable Arrays of Arithmetic Primitives
The FPCA is a homogeneous array of arithmetic primitives connected by a routing network. Many principally similar arithmetic arrays have been proposed in the past, and this similarity is acknowledged. The main difference is that the FPCA is limited in its scope of application (solely compressor trees) and is intended for integration into a larger FPGA, whereas the arrays discussed in this paper are standalone devices.
Parhami [30], for example, built an array of bit-serial additive multipliers and used a data-driven control scheme. The advantage of bit-serial arithmetic is that it reduces the wiring requirement for an FPGA: This is significant because wiring can consume up to 70% of on-chip area. Although somewhat beyond the scope of this paper, bit-serial routing networks are an emerging and active area of research [31]; the applications for such a device, however, must be able to tolerate high latencies, and it is not immediately clear which applications easily fall into this category.
The reconfigurable arithmetic FPGA (RA-FPGA) [32] is an arithmetic array partitioned into three regions: 1) two's complement addition; 2) sign/magnitude conversion to two's complement, and vice versa; and 3) multiplication and division. Traditional FPGA-style logic is also included in order to implement control and general-purpose logic. In principle, such a device could use an FPCA to perform multiplication; however, no RA-FPGA has been produced commercially to the best of our knowledge.
The CHESS reconfigurable array [33], developed at HP Labs, is an array of 4-b arithmetic logic units (ALUs) connected by a bus-based FPGA-style routing network. Each ALU supports 16 arithmetic and logical operations (e.g., ADD, SUB, XOR), along with selection and comparison tests. Neighboring ALUs can be chained together (e.g., to perform 8-b addition), and spatial parallelism is abundant. As the ALU does not support primitives for multiplication or multioperand addition, the inclusion of an FPCA into the array is certainly plausible.
Several configurable arrays of floating-point units have also been proposed. In 1988, Fiske and Dally [34] introduced the Reconfigurable Arithmetic Processor (RAP), which contains 64 floating-point units connected by a switching network. More recently, Intel's Teraflops processor [35] connected 80 floating-point MAC units using a high-speed network on chip. Although floating-point units contain integer multipliers (and, hence, compressor trees), it does not appear that there would be any room to incorporate an FPCA into such a chip, because the floating-point units themselves have fixed bitwidths in accordance with IEEE standards.
III. GPCS
A. Overview
Let X = x_{n−1} x_{n−2} ... x_1 x_0 be an n-bit binary number, where each x_i, 0 ≤ i < n, is a bit. Let x_0 be the least significant bit and x_{n−1} be the most significant bit. The subscript of a bit, in this case, is called the rank of the bit. Each bit has rank i and contributes a value of x_i 2^i to the total quantity represented by the binary integer.
An n:m counter, as described in the preceding section, assumes that all bits have the same rank when it computes their sum. If all input bits have rank j, then the output of the counter is a set of m bits having ranks j, j + 1, ..., j + m − 1.
A GPC [13]–[15] is a type of counter that counts bits having different ranks. In fact, an n:m counter can implement a GPC, if desired: A bit of rank r must be connected to precisely 2^r inputs of the counter. Of course, other methods to build GPCs also exist.
A GPC is defined as a tuple (k_{e−1}, ..., k_1, k_0; m), where k_i is the number of input bits of rank i to sum and m is the number of output bits; the input bits of each rank are independent. For example, a (5, 3; 4) GPC can count up to 5 b of rank-1 and 3 b of rank-0; the maximum output value is 5 · 2 + 3 = 13; therefore, four output bits are required.
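A GPC's arithmetic function is simply a weighted population count, which the following sketch (ours, not code from the paper) makes concrete:

```python
def gpc_value(ranked_bits):
    """Evaluate a GPC: ranked_bits[i] is the list of input bits of rank i.
    The output is the weighted count sum_i 2^i * (number of ones of rank i)."""
    return sum((2 ** r) * sum(bits) for r, bits in enumerate(ranked_bits))

# A (5, 3; 4) GPC: up to 3 bits of rank-0 and 5 bits of rank-1, all driven high.
v = gpc_value([[1, 1, 1], [1, 1, 1, 1, 1]])
assert v == 13   # 3 * 1 + 5 * 2; a value that fits in 4 output bits
```

Driving every input high yields the GPC's maximum output value, which is exactly the quantity bounded by constraint (4) below.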
Here, we fix the number of input and output bits to be positive constants p and q. Given p and q, there is a family of GPCs that satisfy these I/O constraints. Clearly, the number of input bits cannot exceed p

k_{e−1} + ... + k_1 + k_0 ≤ p.    (3)
Authorized licensed use limited to: EPFL LAUSANNE. Downloaded on April 08,2010 at 09:28:15 UTC from IEEE Xplore. Restrictions apply.
PARANDEHAFSHAR et al.: IMPROVING FPGA PERFORMANCE FOR CARRYSAVE ARITHMETIC 581
Fig. 2. 15:4 counter can implement a (5, 5; 4) GPC.
Likewise, the maximum allowable output value, which occurs when all input bits are "1," cannot exceed the maximum integer value that can be expressed with q output bits

k_{e−1} 2^{e−1} + ... + k_1 · 2 + k_0 ≤ 2^q − 1.    (4)
For a given p and q, there is a family of GPCs that satisfy these I/O constraints. As an example, take p = 10 and q = 4. One GPC in this family is a (5, 5; 4) GPC (see Fig. 2).
This GPC has five input bits of rank-0 and five of rank-1. The maximum value that can be counted is 5 · 2 + 5 = 15; four output bits are required to represent a value in the range [0, 15]; clearly, any (k_1, k_0; 4) GPC suffices, under the assumption that k_1 ≤ 5 and k_0 ≤ 5 as well.
At the same time, a (4, 6; 4) GPC is also a member of this family, as it produces an output value in the range [0, 14], as is a (6, 3; 4) GPC, etc. An n:m counter is also a degenerate case of a GPC; in this case, a (0, 10; 4) GPC.
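Under our reading of constraints (3) and (4), membership in the p = 10, q = 4 family can be checked mechanically. The sketch below is ours (`satisfies_io` is a hypothetical helper, not from the paper), but the four example GPCs it checks are the ones named in the text:

```python
def satisfies_io(k, p, q):
    """Check a GPC tuple k = (k_{e-1}, ..., k_1, k_0) against the I/O constraints:
    total input bits <= p, per (3); maximum output value <= 2^q - 1, per (4)."""
    ks = list(reversed(k))                            # ks[i] = bits of rank i
    inputs = sum(ks)
    max_val = sum(ki * (2 ** i) for i, ki in enumerate(ks))
    return inputs <= p and max_val <= 2 ** q - 1

# The p = 10, q = 4 family discussed in the text:
for gpc in [(5, 5), (4, 6), (6, 3), (0, 10)]:
    assert satisfies_io(gpc, p=10, q=4)
assert not satisfies_io((6, 5), p=10, q=4)            # 11 inputs and value 17
```

Both constraints must hold simultaneously: (4, 6; 4) passes with a maximum value of 14, while (6, 5; 4) violates both bounds.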
B. Configurable GPCs
For the FPCA, one fixed GPC does not suffice, because we desire more flexibility. Instead, we build a configurable GPC using an n:m counter with a layer of muxes placed on its input. This configuration layer, which is described in detail in Section III-D, allows the configurable GPC to implement a wide variety of GPCs; the user selects the desired GPC to implement and programs the configuration layer accordingly. For example, a programmable GPC with n = 15 and m = 4 should be able to implement the functionality of both a 15:4 counter and a (5, 5; 4) GPC, among others.
The rank of a GPC is the minimum rank among all of its input bits. Now, suppose that the minimum rank of an input bit to a given GPC is r and that the GPC must add two (or more) bits of ranks r and s, such that s > r. Each input bit of rank r is connected to one input of the GPC, while each input bit of rank s is connected to 2^{s−r} inputs. Fig. 2, for example, satisfies this property.
Fig. 3 shows the design of a p-input q-output GPC built from a p-input n-output configuration layer followed by an n:m counter. Each input of the counter (output of the configuration layer) is connected to two GPC inputs and is controlled by two configuration bits. The configuration bit on the left selects one of two GPC inputs that are connected to a mux; the configuration bit on the right drives the counter input to 0 if it is not set, which allows the n:m counter to implement any n′:m′ counter for n′ < n and m′ ≤ m; in this case, the n′:m′ counter is called a subcounter. For example, a 7:3 counter has five subcounters: 6:3, 5:3, 4:3, 3:2, and 2:2 counters. The definition of subcounters easily extends to GPCs as well.
Fig. 3. Architecture of a p-input programmable GPC.
C. Primitive, Covering, and Reasonable GPCs
This section specifies precisely which p-input q-output GPCs should be implemented by the programmable GPC with p inputs, q outputs, and an n:m counter at its core.
A primitive GPC is one that satisfies the I/O constraints. A covering GPC is a primitive GPC that is not a sub-GPC of another primitive GPC. Referring to Fig. 2, the (5, 5; 4) GPC is a covering GPC. If the number of rank-1 inputs is increased to six, then there are only three rank-0 input ports remaining, i.e., 15 − 6 · 2 = 3. Hence, a (6, 3; 4) GPC is also a covering GPC. A (5, 4; 4) GPC, in contrast, is not a covering GPC, because the (5, 5; 4) GPC can implement its functionality by driving one of the rank-0 inputs to zero.
To achieve universal coverage, a configurable GPC only needs to implement the functionality of the covering GPCs that have p inputs and q outputs. Of course, there is no formal mandate that a configurable GPC provide universal coverage; those that do not simply have limited flexibility compared with those that do.
We have identified two classes of unreasonable GPCs, meaning that we can find no rational justification for using them; this is not, however, a formal definition. When designing a configurable GPC, there is no need to add support for unreasonable GPCs, even if they are covering GPCs.
A GPC that has no rank-0 input bits, i.e., k_0 = 0, is unreasonable. For example, consider a (7, 0; 4) GPC. The rank-0 output will always be 0. The rank-1, -2, and -3 outputs can be computed by a 7:3 counter. If the rank of this GPC is r, it suffices to replace it with a 7:3 counter of rank r + 1 instead. Thus, a configurable GPC need not support this type of GPC.
Similarly, a GPC that has one rank-0 input bit, i.e., k_0 = 1, is unreasonable. This bit determines whether the output of the GPC is even or odd. The rank-0 input bit is connected directly to the rank-0 output and is not used within the GPC. There is no need to connect this bit to the GPC input; instead, it should be connected to a GPC at a lower level of the compressor tree.
As an example, consider a (7, 1; 4) GPC. The rank-0 output is always equal to the rank-0 input. The rank-1, -2, and -3 outputs can be computed by a traditional 7:3 counter. Suppose that the rank of this GPC is r. Then, it suffices to eliminate the rank-0 input bit and propagate it to the next level of the tree; then, the GPC is replaced with a 7:3 counter of rank r + 1.
Fig. 4. Example of primitive, covering, and reasonable covering GPCs.
Fig. 4 shows the preceding concepts for p = 5 and q = 3. There are 23 primitive GPCs, 6 covering GPCs, and 3 reasonable covering GPCs. (0, 3, 1; 3) is unreasonable because k_0 = 1. (1, 0, 3; 3) is unreasonable because the rank-0 and -1 outputs can be computed by a 3:2 counter, while the rank-2 input connects directly to the rank-2 output; it suffices to replace this GPC with a 3:2 counter of rank r and propagate the rank-2 input bit to the next level of the compressor tree directly. (1, 1, 1; 3) is also unreasonable: Not only is k_0 = 1 but also each input bit connects directly to the corresponding output; it suffices to propagate these bits directly to the next level of the compressor tree, eliminating this GPC altogether.
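The counts quoted above can be reproduced by brute force, assuming the Fig. 4 example uses p = 5 and q = 3 (our reconstruction of the constants; it is the choice consistent with 23 primitive and 6 covering GPCs). The three unreasonable GPCs are the ones named in the text:

```python
from itertools import product

P, Q, RANKS = 5, 3, 3   # I/O constraints of the Fig. 4 example (reconstructed)

def primitive(k):
    """k = (k_2, k_1, k_0); constraints (3) and (4): inputs <= P, value <= 2^Q - 1."""
    value = sum(ki << (RANKS - 1 - i) for i, ki in enumerate(k))
    return sum(k) <= P and value <= 2 ** Q - 1

prims = [k for k in product(range(P + 1), repeat=RANKS) if primitive(k)]

# A covering GPC is a primitive GPC that is not a sub-GPC of another primitive GPC,
# i.e., it is maximal under component-wise comparison of the input counts.
covering = [a for a in prims
            if not any(a != b and all(x <= y for x, y in zip(a, b)) for b in prims)]

unreasonable = {(0, 3, 1), (1, 0, 3), (1, 1, 1)}   # named in the text
reasonable_covering = [k for k in covering if k not in unreasonable]

assert len(prims) == 23
assert len(covering) == 6
assert len(reasonable_covering) == 3
```

The enumeration counts every tuple satisfying (3) and (4), including degenerate ones; the three surviving reasonable covering GPCs are exactly those with at least two rank-0 inputs and no pass-through input bit.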
D. Configuration Layer
The configuration layer allows the user to program an n:m counter as any p-input q-output reasonable covering GPC. The circuit shown on the right-hand side of Fig. 3 is placed on each configuration layer output. The configuration layer architecture is defined by a set of connections between input ports and muxes. When the right configuration bit is zero, the corresponding counter input is driven to zero; otherwise, it selects one of the inputs connected to the mux.
We make the assumption that n is maximal for a given value of m that satisfies (2), i.e., n = 2^m − 1. For example, if m = 3, there are 4:3, 5:3, 6:3, and 7:3 counters; based on this assumption, we default to the 7:3 counter.
Let I = {I_0, I_1, ..., I_{p−1}} and M = {M_0, M_1, ..., M_{n−1}} be the sets of input ports and muxes. A sensible configuration layer architecture satisfies the property that each input port I_j is connected to 2^i muxes, where i is a nonnegative integer; thus, I_j can be connected to any input bit of rank at most i. i is called the rank of the input port and is denoted rank(I_j). When configuring a GPC, the rank of each input bit connected to input port I_j cannot exceed rank(I_j).
Fig. 5(a) shows an example of a configuration layer (only muxes are shown) for a GPC built from a 15:4 counter. Eight input ports have rank-0, four input ports have rank-1, and the remaining three ports have rank-2.
The configuration layer can be represented as a configuration graph, a directed bipartite graph G = (I ∪ M, E), where E ⊆ I × M represents the set of connections from input ports to muxes, i.e., there is an edge (I_j, M_k) if and only if there is a connection from I_j to M_k. Fig. 5(b) shows an example corresponding to the configuration layer in Fig. 5(a). In Fig. 5(a), two of the input ports connect directly to the counter inputs; dummy muxes are shown in Fig. 5(b) simply to represent the possible connection; a one-input mux in the configuration graph becomes a direct connection in the configuration layer.
E. Configuring the GPC
The configuration graph represents the set of different input-to-mux connections. A configuration determines which input port is selected by each mux. At most one input port can connect to each mux; if no input port is connected, the circuit shown on the right-hand side of Fig. 3 drives the counter input to zero instead. Specifically, a configuration is a subset of edges C ⊆ E such that each mux is incident on at most one edge in C. An example of a configuration is the set of edges that connects each input port to a distinct mux, which configures the GPC as a 15:4 counter. A set of edges in which two input ports are connected to the same mux is not a configuration.
An active input port is incident on at least one edge in a configuration. A configuration is sensible if each active input port I_j is incident on 2^i edges in C, where i ≤ rank(I_j). A configuration in which an active input port is incident on, say, three edges is not sensible, because the number of muxes to which the port is connected is not a power of two.
Let C be a configuration, and let A ⊆ I be the set of input ports that are configured to connect to counter inputs. To stay within bandwidth limits, the total number of counter inputs consumed by the configured ports cannot exceed n, the number of counter inputs; in other words

Σ_{I_j ∈ A} 2^{rank_C(I_j)} ≤ n    (5)

where rank_C(I_j) is the rank at which I_j is configured in C.
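Reading (5) as "a port configured at rank r consumes 2^r counter inputs," the bandwidth check becomes a one-line test (our sketch, not code from the paper):

```python
def within_bandwidth(configured_ranks, n):
    """Constraint (5): a port configured at rank r consumes 2^r counter inputs,
    so the total consumption of the active ports cannot exceed n."""
    return sum(2 ** r for r in configured_ranks) <= n

# A 15-input counter configured as a (5, 5; 4) GPC: five rank-1 ports and
# five rank-0 ports consume exactly 5*2 + 5*1 = 15 counter inputs.
assert within_bandwidth([1] * 5 + [0] * 5, n=15)
assert not within_bandwidth([2] * 4, n=15)   # 4*4 = 16 > 15
```

The (5, 5; 4) case saturates the 15:4 counter exactly, which is why it is a covering GPC for this counter size.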
F. Configuration Layer Design
Recall from Section III-A that a GPC is represented as a tuple (k_{e−1}, ..., k_1, k_0; m), where k_i is the number of input bits of rank i to be summed, and from Section III-C, recall the definitions of reasonable and covering GPCs.
In this section, we outline a method to design a GPC configuration layer systematically. We do not attempt to achieve universal coverage; instead, we restrict the set of GPCs that can be mapped onto our configurable GPC; doing this allows us to implement the configuration layer with one level of muxes, each having at most two inputs, thereby bounding the delay and area overhead of the configuration layer.
The rank variation v of a GPC is the number of distinct input bit ranks supported by the GPC, e.g., v = 2 for the (5, 5; 4) GPC aforementioned; then, the supported ranks are 0, 1, ..., v − 1, and Σ_{i=0}^{v−1} k_i 2^i ≤ n in accordance with (5).
A configuration layer can implement a GPC if there is a sensible configuration C in which, for each rank i, 0 ≤ i < v, at least k_i input ports of rank at least i remain available after ports have been assigned to the higher ranks. This is intuitive: The GPC has k_i input bits of rank i, each of which must connect to 2^i muxes. The condition ensures that a sufficient supply of input ports with the desired connectivity exists.
Let G(p, q) be the set of reasonable covering GPCs that satisfy I/O constraints p and q. A complete configuration layer can implement every GPC in G(p, q). To simplify the design of the configuration layer, we have chosen to restrict the set of reasonable
Fig. 5. (a) Configuration layer (muxes only) for a 15-input 4-output GPC. (b) Bipartite graph representation of the configuration layer in (a).
Fig. 6. Configuration graph for (a) G_1, (b) G_1 and G_2, and (c) G_1, G_2, and G_3. In (b), half of the input ports are connected to two muxes. In (c), two of the input ports are connected to four muxes. No mux is connected to more than two input ports.
covering GPCs to those whose input bit ranks are at most 2; this configuration layer is incomplete, meaning that universal coverage is not achieved.
Let G_v ⊆ G(p, q) be the set of reasonable covering GPCs whose rank variation is v, for a given v. The configuration layer described here must implement G_1, G_2, and G_3. Fig. 6 shows the construction method for an eight-input counter through the incremental addition of edges to a configuration layer graph.
G_1 contains one GPC: All input bits have rank-0, i.e., an n:m counter. Any one-to-one mapping from input ports to muxes suffices, e.g., connecting I_j to M_j for each j. Fig. 6(a) shows the initial set of edges.
Now, let us consider G_2. No two rank-1 input ports can connect to the same mux; otherwise, both of these input ports could not be configured as rank-1 at the same time. In Fig. 6(b), edges are added to the configuration graph so that input ports I_4, I_5, I_6, and I_7 can be configured as either rank-0 or rank-1.
G_3 contains GPCs having bits of rank-0, 1, or 2. By the same reasoning as before, each rank-2 input port must connect to four muxes; half of the input ports can already be configured as rank-1; thus, it suffices to take half of them and connect them to two additional muxes. In Fig. 6(c), input ports I_6 and I_7 are extended so that they can be configured as rank-0, 1, or 2. At this point, we stop. In general, there are n input ports in total; n/4 connect to four counter inputs, n/4 connect to two, and n/2 connect to one.
The basic pattern shown in Fig. 6 is systematic and generalizes to any n:m counter. Stopping at G_3 ensures that the largest mux in the configuration layer has at most two inputs.
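The construction can be sketched programmatically. The per-rank port counts and fan-outs below follow the text; the particular assignment of extra edges to muxes is our own illustrative choice, since the text fixes only the counts:

```python
from collections import Counter

def build_config_graph(n):
    """Edge list (port, mux) for an n-input counter, following the Fig. 6 pattern:
    n/2 rank-0 ports (one mux each), n/4 rank-1 ports (two muxes each), and
    n/4 rank-2 ports (four muxes each)."""
    assert n % 4 == 0
    rank = [0] * n
    for j in range(n // 2, n):
        rank[j] = 1                        # upper half: configurable up to rank-1
    for j in range(3 * n // 4, n):
        rank[j] = 2                        # upper quarter: configurable up to rank-2
    edges = [(j, j) for j in range(n)]     # base one-to-one connections I_j -> M_j
    free = list(range(n))                  # each mux accepts at most one extra port
    for j in sorted(range(n), key=lambda x: -rank[x]):
        for _ in range(2 ** rank[j] - 1):  # a rank-r port needs 2^r muxes in total
            m = next(x for x in free if x != j)
            free.remove(m)
            edges.append((j, m))
    return edges, rank

edges, rank = build_config_graph(8)
fan_out = Counter(p for p, _ in edges)
fan_in = Counter(m for _, m in edges)
assert all(fan_out[j] == 2 ** rank[j] for j in range(8))
assert max(fan_in.values()) <= 2           # two-input muxes suffice
```

Assigning higher-rank ports first guarantees that every port finds enough free mux slots while no mux ever exceeds two inputs, which is exactly the property the G_3 stopping rule is meant to preserve.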
IV. FPCA ARCHITECTURE
A. FPCA Architecture
The FPCA architecture presented by Brisk et al. [12] is a 2-D lattice of hierarchical n:m counters connected through a programmable routing network; it had the same basic structure as an island-style FPGA, but with the programmable logic cells replaced by n:m counters. The architecture presented here is similar, but programmable GPCs replace the n:m counters. The connection boxes that interface each programmable GPC to the adjacent routing channels and the switch boxes that connect intersecting horizontal and vertical routing channels are the same as in an FPGA.
The hierarchical design of an n:m counter increases flexibility. For example, suppose that the counter size in an FPCA is 20:5. This 20:5 counter is hierarchically built from smaller counters, e.g., 4:3. If there are only four input bits to sum at a given time, the smaller counter can be used. This reduces the delay of the circuit at stages of a compressor tree where there is a small number of bits to sum; on the other hand, the large number of smaller counters increases the number of output ports, as several 4:3 counters, for example, will be available. The use of a configurable GPC in lieu of a hierarchically designed n:m counter offers similar flexibility, but without increasing the number of output ports. When there is a small number of bits available at each rank, a GPC can sum bits having different ranks.
Fig. 7. A flip-flop and mux are placed on each GPC output to allow pipelining.
The programmable GPC described in the preceding section is purely combinational. The user may wish to pipeline the compressor tree in order to increase the clock frequency and throughput. To facilitate this, a flip-flop and mux are placed on each GPC output, as shown in Fig. 7. The same circuit is typically placed on the outputs of FPGA logic blocks (but not on the carry-chain outputs).
In a sense, our intention to embed an FPCA into an FPGA is similar in principle to a cluster of logic cells in a traditional FPGA (e.g., a LAB in an Altera Stratix-series FPGA). A LAB (or group of adjacent LABs) could be replaced with an FPCA. The role of a programmable GPC within an FPCA is analogous to the role of a programmable logic cell (an ALM in an Altera Stratix-series FPGA) within a LAB. The primary difference is that, due to the interconnect structure of compressor trees, a more flexible routing network, in the style of a global (inter-LAB), rather than a local (intra-LAB), FPGA routing network is required for the FPCA.
B. FPCA Mapping Heuristic
The algorithm to map a compressor tree onto an FPCA is based on a heuristic developed by Parandeh-Afshar et al. [9] to map compressor trees onto the logic cells (six-input LUTs) of high-performance FPGAs. The FPCA is homogeneous, i.e., the counter size m:n is the same for all programmable GPCs. For a given m:n, the available GPCs are the set of reasonable covering GPCs whose rank variation does not exceed 2. No further modifications to the mapping heuristic are required.
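The rank-variation constraint on candidate GPCs can be sketched as follows (a minimal illustration; the function name and admissibility examples are ours, and only the bound of 2 comes from the text above):

```python
def rank_variation_ok(input_ranks, max_variation=2):
    """Covering constraint used when enumerating candidate GPCs: the
    spread between the highest and lowest input rank must not exceed
    the bound (2 in the heuristic described above)."""
    return max(input_ranks) - min(input_ranks) <= max_variation

# A GPC covering bits of ranks {3, 4, 5} is admissible; ranks {3, 6} are not.
admissible = rank_variation_ok([3, 4, 5])
rejected = rank_variation_ok([3, 6])
```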
C. FPCA Alternatives
Here, we qualitatively explore several alternative methods to integrate m:n counters into an FPGA and explain why we believe that the FPCA is superior.
One possibility would be to add a programmable GPC to a LAB or to replace one of the ALMs with the programmable GPC; however, one GPC is unlikely to be used in isolation: a collection of them is required to construct a compressor tree. Thus, a compressor tree synthesized on this architecture would not be able to take advantage of the fast local connections within the LAB. A better approach is to cluster the GPCs together, as is done by the FPCA. Replacing all of the ALMs in a LAB with programmable GPCs effectively yields an FPCA with a local LAB-style rather than a global routing network; we have opted for a routing network in the global style due to the complex interconnect structure required to construct a compressor tree from GPCs.
A second alternative is to integrate a GPC into an ALM as a programmable type of macroblock, similar in principle to the work of Cong and Huang [36] and Hu et al. [37]; however, this architecture significantly increases the input and output bandwidth of the ALM, and it is unlikely that the local routing network within a LAB could handle this increased I/O bandwidth as it exists today. We believe that a better approach is to strictly separate the FPCA/GPCs from the LAB/ALMs.
V. EXPERIMENTAL SETUP
A. VPR
Two different versions of the Versatile Place-and-Route (VPR) tool [38], [39] were used to evaluate the FPCA. The most recent version, VPR 5.0, was used to compare the performance of an FPGA containing an FPCA against a baseline FPGA containing DSP blocks; the earlier version of VPR does not support DSP blocks or any type of embedded IP core, so the newer version was required for this comparison.
The earlier version, VPR 4.30, was used for the comparison of energy consumption. No power model is currently available for the newer version of VPR; as discussed in Section V-C, we extended a preexisting power model for the earlier version to compute the energy consumption.

VPR 5.0 provides preconstructed architecture models for different process technologies; VPR 4.30, in contrast, requires the user to provide transistor-level properties of the wires. Details are provided in the following section.
B. Delay and Area Extraction
The FPCA was modeled as a standalone device using VPR 4.30. Each compressor tree in each benchmark was extracted and synthesized on the FPCA. The FPCA was then modeled as an IP core in VPR 5.0; for each benchmark, the delay of each path through the FPCA was taken from VPR 4.30. The complete benchmark was then synthesized with VPR 5.0, with all compressor trees mapped onto FPCAs. The total delay includes both the noncompressor tree logic mapped onto the general logic of the FPGA and the compressor tree delay through the FPCA. To model the FPCA, the traditional FPGA logic blocks in VPR 4.30 were replaced with programmable GPCs. After mapping a compressor tree onto a network of GPCs, VPR was used to place and route the circuit. VPR also reported the critical path delay, which includes both routing and logic delays. The number of GPCs required to synthesize each compressor tree can be determined from the result of the mapping heuristic.
The programmable GPCs described in Section III were modeled in Very High Speed Integrated Circuit Hardware Description Language (VHDL) and synthesized using Synopsys Design Compiler with 90-nm TSMC standard cells. Cadence Silicon Encounter was then used to place and route the designs and extract delay and area estimates. This was done for four different programmable GPCs, with counter sizes 8:4, 12:4, 16:5, and 20:5. Thus, four different FPCA architectures were studied, as the GPC size is assumed to be homogeneous within an FPCA. A separate VPR architecture description file (ADF) was instantiated for each FPCA. We limited the channel width to 40 segments, VPR's default value.
For the purpose of comparison, we modeled an island-style FPGA whose logic blocks resemble Altera's ALM and whose logic clusters resemble Altera's LAB, but with four blocks per cluster; the limited number of ALMs per LAB was due to complications involved in modeling carry chains in VPR. Each LAB has two carry chains to support ternary addition. The primary difference between this baseline architecture and the Stratix II and III is that the baseline is island style, while the Stratix II and III organize LABs into columns and employ nonuniform routing in the x- and y-directions. The GPC mapping heuristic of Parandeh-Afshar et al. [9] was used to synthesize compressor trees onto this FPGA.
To model routing delays, VPR 4.30 requires information such as the per-unit resistance and capacitance of wires. Our experiments used TSMC 90-nm technology, and the per-unit resistance and capacitance of metal-6 were computed and inserted into VPR's ADF. These values were used to compute the routing delays of the FPCA. Metal-6 was chosen because this layer seemed to be a reasonable choice for the wires in the routing network; in practice, the routing network is likely to be realized in several metal layers.

VPR 5.0, in contrast, provides preconstructed architecture models for different transistor technologies. We used an appropriate model, which eliminated the need to explicitly provide the per-unit resistance and capacitance of the wires.
C. VPR Power Model
A power model for VPR was developed by Poon et al. [40] to model traditional island-style FPGAs. Choy and Wilton [41] modified this framework to support power estimation for embedded IP cores, such as DSP blocks. We extended these models to estimate the power consumption of the FPCA. The Activity Estimator [42] estimates the probability of transitions occurring in the circuits mapped to an FPGA; static probability and transition density [43] are used to extract the transition activities of each net. The complexity of this computation is O(2^n), where n is the number of inputs. This time complexity is suitable for LUT-based logic blocks, where the number of inputs is typically six or less (ignoring carry chains). The programmable GPCs used in our study of FPCAs have up to 20 inputs; for circuits of this size, the exponential runtime of the model becomes a limiting factor.
D. FPCA Power Model for VPR
Due to the complexity of the VPR power model described in the preceding section, we developed a more efficient simulation-based power model, based on the lookup technique advocated by Choy and Wilton [41].

The power model consists of an offline power characterization of each GPC under different input switching activity probabilities. The results are collected in a table and fed into an online power estimator that extracts the switching activities via simulation. The simulator dynamically accesses the table to determine the power dissipated given the switching activity at each point in the simulation.
The offline power characterization flow is as follows. First, the programmable GPC is modeled in VHDL and synthesized using Synopsys Design Compiler. Object and node names are extracted; these names are later used in the simulation phase for assigning switching activities. The power of the programmable GPC is estimated using Synopsys PrimePower; the power characteristics of the GPCs are extracted under different transition activity rates. These rates are then organized into tables, indexed by the transition probabilities, which are then input to the online flow.
The online power estimator begins with a mapped netlist whose objects and nodes are extracted. The objects are the GPC blocks used for mapping a compressor tree. The transition activities of objects are extracted through the application of randomly generated stimulus vectors. As the accuracy of the power calculations used by VPR depends on the accuracy of the switching activity annotated to the design, it is essential to achieve high coverage during simulation. High coverage is achieved via simulation feedback to the random vector generator. After a set of random vectors with high signal coverage is found, the simulator computes the transition activities of the objects and nets listed in the object list.
Next, a modified version of VPR is used to estimate the power dissipated by the FPCA. VPR's power model is based on the transition density and the static probability of nets. The transition density of a signal represents the average number of transitions of that signal per unit time; the static probability of a net is the probability that the signal is high at any given time. These two parameters are computed for each net in the design using the simulation output.
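For concreteness, the two parameters can be computed from a sampled 0/1 trace as follows (an illustrative sketch assuming one sample per clock period; the function name is ours):

```python
def signal_stats(trace, period=1.0):
    """Compute transition density (average transitions per unit time) and
    static probability (fraction of samples in which the signal is high)
    from a sampled 0/1 signal trace."""
    transitions = sum(1 for a, b in zip(trace, trace[1:]) if a != b)
    density = transitions / (len(trace) * period)
    static_prob = sum(trace) / len(trace)
    return density, static_prob

# 5 transitions over 8 samples; high in 4 of 8 samples.
density, static_prob = signal_stats([0, 1, 1, 0, 1, 0, 0, 1])
```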
This framework yields the power model for the FPCA. The offline GPC power characteristics are placed as a table in the ADF that describes the FPCA. The table contains the estimated power dissipation for transition activities ranging from 0 to 1 in increments of 0.1; separate tables are instantiated depending on whether or not each output of the GPC is written to its flip-flop.
A second input to the power model is the transition activity of the objects extracted in the previous step. VPR reports three different power estimates: the dynamic power dissipated by the GPCs, the dynamic power dissipated by the routing network, and the leakage power. The power consumption of the routing network is estimated using the switching activities and the switch box and wire parameters specified in the ADF for the target technology. The GPC power consumption is estimated using the average activity of its inputs and the offline power table in the ADF.
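The table lookup itself is straightforward; a sketch of the online estimator's access to the offline table might look as follows (the table values here are made up for illustration, and the rounding policy is our assumption):

```python
def gpc_power(table, avg_activity):
    """Look up the estimated GPC power from the offline characterization
    table, which holds 11 entries for transition activities 0.0-1.0 in
    increments of 0.1; the average input activity is rounded to the
    nearest table index."""
    idx = min(10, max(0, round(avg_activity * 10)))
    return table[idx]

# Hypothetical characterization: power grows linearly with activity.
power_table = [0.10 + 0.05 * i for i in range(11)]
estimate = gpc_power(power_table, 0.34)  # uses the 0.3 entry
```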
VI. EXPERIMENTAL RESULTS
A. Benchmarks
We selected a set of benchmarks from the arithmetic, DSP, and video processing domains in which we were able to identify compressor trees. These benchmarks are broadly categorized into multiplier-based and multi-input addition benchmarks.
The multiplier-based benchmarks include g.721 [44], a polynomial function that has been optimized using Horner's rule (hpoly), two parallel multipliers of different bit-widths, and a video processing application (video mixer [45]). The video mixer converts two channels of red-green-blue video to television-standard YIQ signals and then mixes them in an alpha blender to produce a composite output signal.
The multi-input addition benchmarks include the MediaBench application adpcm [44], a 1-D multiplierless discrete cosine transform (dct [8]), three- and six-tap finite-impulse response filters with randomly generated constants (fir3 and fir6 [6]), and an internally developed variable block size motion estimator for H.264/AVC video coding (H.264 ME).

Fig. 8. Critical path delay observed for each benchmark without the transformations of Verma et al. [3] and with multiplication operations synthesized on DSP blocks (DSP); all other synthesis methods applied the transformations. Ternary and GPC synthesize each benchmark wholly on the general logic of the FPGA, while 8:4, 12:4, 16:5, and 20:5 synthesize each compressor tree on an FPCA that is integrated into a larger FPGA.
The multiplier-based benchmarks contain multipliers that can be synthesized on an FPGA using DSP blocks; however, when the transformations of Verma et al. [3] are applied, the compressor trees within the multipliers are merged with other addition operations, rendering the DSP blocks useless. After applying these transformations, the compressor trees within these benchmarks can only be synthesized on the general logic of the FPGA or on an FPCA.

The video mixer contains many disparate compressor trees, even after the transformations of Verma et al. are applied; all other benchmarks contain one compressor tree. H.264 ME contains a set of identical processing elements (PEs), where each PE contains a compressor tree. The number of PEs can vary depending on the needs of the system. We chose to synthesize a four-PE system, ignoring the memory and control logic.
Each benchmark was synthesized six or seven times.
1) DSP: The multiplier-based benchmarks and adpcm were synthesized without applying the transformations of Verma et al. All multipliers in the multiplier-based benchmarks were synthesized on the DSP blocks. adpcm contains three disparate addition operations, but cannot use DSP blocks. In all subsequent experiments, the transformations of Verma et al. were applied to the multiplier-based benchmarks and to adpcm; the remaining multi-input addition benchmarks were written with compressor trees explicitly exposed. DSP blocks cannot be used for multiplication operations following the transformations.
2) Ternary: Compressor trees are synthesized as ternary adder trees using FPGA logic cells configured as ternary adders.
3) GPC: Compressor trees are synthesized on the general logic of an FPGA using the GPC mapping heuristic of Parandeh-Afshar et al. [9].
4) 8:4, 12:4, 16:5, and 20:5: The compressor tree is synthesized on an FPCA; four FPCAs with different counter sizes were considered.
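As background for these synthesis options, the carry-save reduction performed by a compressor tree can be sketched on word-level operands (an illustrative model built from 3:2 compressors, i.e., full-adder arrays; it is not the GPC-based mapping itself):

```python
def compress_3_2(a, b, c):
    """One 3:2 compressor level (a full-adder array): reduce three operands
    to a sum word and a carry word without propagating carries; the
    ordinary sum of the two results equals a + b + c."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word

def multi_add(operands):
    """Reduce k > 2 operands to two using 3:2 compressors, then perform a
    single carry-propagate addition: the structure of a compressor tree."""
    ops = list(operands)
    while len(ops) > 2:
        a, b, c = ops.pop(), ops.pop(), ops.pop()
        ops.extend(compress_3_2(a, b, c))
    return sum(ops)

total = multi_add([13, 7, 22, 5])  # 47
```

Each 3:2 level preserves the arithmetic sum while shrinking the operand count, which is why only one slow carry-propagate addition is needed at the end.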
The experiments synthesized purely combinational circuits. In actuality, the frequency and throughput of a compressor tree could be increased by registering the output bits of each level of logic in the tree. The benchmarks that were implemented did not naturally contain pipelined compressor trees; therefore, this possibility is not evaluated here.

Lastly, we note that Brisk et al. [12] attempted to synthesize adder trees using the DSP blocks; this approach yielded very slow compressor trees, so those experiments are not repeated here, as the approach is not competitive.
B. Results
Fig. 8 shows the critical path delay of each benchmark after synthesis. In all cases other than the two multiplier benchmarks, the FPCA yields the minimum critical path delay. In particular, the FPCA's success on the multiplier-based benchmarks compared with DSP is due to its ability to accelerate the compressor trees generated by the transformations of Verma et al. The two multiplier benchmarks do not benefit from these transformations, as their compressor trees are not merged with any other operations.
It should be noted that the two multiplier benchmarks are worst-case examples for the Altera-style FPGA. The reason is that each half-DSP block contains a fixed-size multiplier; for example, four such multipliers are required for the larger of the two multiplier benchmarks. The gap between the DSP block and the FPCA would be exacerbated for wider multiplications.
Fig. 9. Average delay of the set of benchmarks decomposed into the delay through DSP blocks (DSP only)/compressor tree logic (all others) and noncompressor tree logic (all synthesis methods) for the (a) multiplier-based and (b) multi-input addition benchmarks.
The benefits of the FPCA are amplified for the video mixer because Verma et al.'s transformations are particularly effective for this benchmark: it has many multiplication and addition operations that are merged together by these transformations.
Fig. 8 also shows that the FPCA considerably reduces the critical path delay relative to Ternary and GPC; this is particularly important for the multi-input addition benchmarks, where DSP blocks cannot be used.
Among the different FPCA options, none performed uniformly better than the others. In the case of adpcm, all four FPCAs achieved comparable critical path delays. Among the FPCAs, 12:4 had the minimum critical path delay for hpoly and one of the multiplier benchmarks; 16:5 had the minimum critical path delay for g721, fir6, and H.264 ME; and 20:5 had the minimum critical path delay for the remaining benchmarks. These results indicate that no single FPCA will be ideal for all benchmarks, but the counter size should probably be larger than 8:4.
Fig. 9 shows the delay achieved by synthesizing each benchmark, decomposed into DSP block/compressor tree logic and noncompressor tree logic. Fig. 9(a) shows the multiplier-based benchmarks, where DSP blocks can be used, and Fig. 9(b) shows the results for the multi-input addition benchmarks. Due to its limited functionality, the FPCA only reduces the delay of the compressor tree.
In Fig. 9(a), the FPCA reduces the average compressor tree logic delay by 30% (8:4) to 47% (20:5); however, the transformations of Verma et al. and the need to synthesize partial product generators on the FPGA general logic when DSP blocks are not used increase the average noncompressor tree logic delay by 46%. GPC and Ternary increase the average critical path delay compared with DSP; 8:4, 12:4, 16:5, and 20:5 reduce the average critical path delay compared with DSP by 0.2%, 8%, 10%, and 11%, respectively.
In Fig. 9(a), the increase in average noncompressor tree logic delay for all options other than DSP is due to the fact that partial-product generators must be synthesized on general FPGA logic rather than on the DSP blocks. On the other hand, the FPCA reduces the average delay of the resulting compressor trees considerably compared with Ternary and GPC.
DSP blocks cannot be used for the multi-input addition benchmarks in Fig. 9(b); we take GPC as the baseline, as its critical path delay is less than Ternary's. Since Verma et al.'s transformations are applied to adpcm and the other multi-input addition benchmarks have compressor trees directly exposed, the noncompressor tree logic delay is the same in all cases. Compared with GPC, 8:4, 12:4, 16:5, and 20:5 reduce the overall (compressor tree) delay by 35% (45%), 41% (53%), 43% (56%), and 43% (55%), respectively. As there are no partial-product generators and DSP blocks are not used, these benefits are due solely to critical path reduction within the compressor trees.
Fig. 10 shows the area of each benchmark converted to two-input NAND gate equivalents (GEs); the area includes the computational elements (LUTs, DSP blocks, and GPCs) and does not include any estimate of the utilization of resources in the programmable routing network.

Each DSP block contains eight 9 × 9-b multipliers and has an area of 10 714 GEs. An n × n-b multiplier generates n² partial product bits. Since each ALM produces two output bits, n²/2 ALMs are required. In theory, this gives DSP blocks an advantage in terms of area utilization compared with the other methods.
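The area argument above can be made concrete with a small helper (illustrative; the rounding up for odd n² is our assumption):

```python
def alm_count(n):
    """ALMs needed for partial-product generation in the area argument
    above: an n x n-b multiplier generates n*n partial-product bits, and
    each ALM produces two output bits."""
    return (n * n + 1) // 2  # round up when n*n is odd

# e.g., a 16 x 16-b multiplier needs about 128 ALMs for its partial products.
alms_16 = alm_count(16)
```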
For hpoly and the two multiplier benchmarks, the FPCA consumed considerably more area than DSP. This is due primarily to the fact that partial-product generators must be synthesized on the general logic of the FPGA. It should be noted that one of the multiplier benchmarks required just one DSP block but used only four of its eight multipliers. In other cases, namely, g721 and the video mixer, the FPCAs had area requirements similar to DSP; however, 12:4 was significantly smaller for the video mixer. For this benchmark, GPCs built from 12:4 counters were, coincidentally, the perfect-sized building blocks.
Compared with GPC and Ternary, the FPCAs reduced the area requirement; in most cases the reduction was marginal, but it was quite pronounced for the video mixer and fir6. As with the critical path delay results reported in Fig. 8, none of the four FPCA options was uniformly better than the others across all benchmarks.
Fig. 10. Area of each benchmark after synthesis. The areas of the DSP blocks, LUTs, and GPCs have been converted to two-input NAND GEs. These area estimates do not account for the programmable routing network.
Fig. 11. Energy consumption of each benchmark (normalized to Ternary).
Fig. 11 shows the normalized energy consumption of each benchmark. As VPR 5.0 does not have a power model, the energy consumption figures reported here are for the compressor trees only and were measured using VPR 4.30. No energy consumption is reported for DSP, since VPR 4.30 does not support embedded blocks.
Fig. 12 shows the average energy consumption across the set
of benchmarks, decomposed into energy consumed by the logic
elements (ALMs/GPCs) and the routing network.
GPC consumes more energy than the other options. GPC builds a compressor tree using six-input GPCs with three or four outputs; two ALMs per GPC are required. Each ALM in Ternary, in contrast, takes six input bits and produces two output bits (ignoring the carry-out bit, which is propagated to the next ALM in the chain). For this reason, GPC tends to require more ALMs and dissipates more static power.
Fig. 12 shows that the primary advantage of the FPCA comes from its ability to reduce the energy dissipated in the logic. In both Ternary and GPC, LUT-based ALMs are used to realize the arithmetic building blocks of compressor trees. The FPCA, in contrast, uses ASIC implementations of these components, which are considerably more efficient. Although Ternary consumes less energy in the routing network than any of the alternatives, the FPCA more than makes up for this through energy savings in the logic. In conclusion, both Figs. 11 and 12 show that the FPCA significantly reduces energy consumption compared with Ternary and GPC.
We suspect that DSP blocks will consume less energy for multiplication operations, because the other methods must synthesize the partial-product generators on the general logic of the FPGA, and the number of partial products per multiplication operation is quadratic in the bit-width.
Fig. 12. Average energy consumption due to logic and routing.
VII. CONCLUSION AND FUTURE WORK
The FPCA is a programmable IP core that can accelerate compressor trees on FPGAs. For parallel multiplication, the FPCA retains most of the benefits of the embedded multipliers in the DSP blocks and provides a variable-bit-width solution for multiplication operations that do not match the fixed bit-width of the DSP blocks; however, it suffers a disadvantage in terms of area utilization because the partial products must be synthesized on the FPGA general logic. Moreover, the FPCA can accelerate multi-input addition operations, while the DSP blocks cannot, particularly when used in conjunction with the transformations of Verma et al. [3] to expose compressor trees at the application level. Furthermore, the FPCA reduces the critical path delay and energy consumption compared with the best methods for synthesizing compressor trees on the FPGA.

The DSP block will generally outperform the FPCA for applications containing many multiplications whose bit-widths precisely match those of the ASIC multipliers in the embedded DSP blocks and for which the transformations of Verma et al. are ineffective. For virtually all other applications that contain compressor trees, whether naturally or via transformation, the FPCA performs significantly better than current FPGAs.
REFERENCES
[1] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Trans. Electron. Comput., vol. EC-13, no. 6, p. 754, Dec. 1964.
[2] L. Dadda, “Some schemes for parallel multipliers,” Alta Freq., vol. 34,
pp. 349–356, Mar. 1965.
[3] A. K. Verma, P. Brisk, and P. Ienne, "Data-flow transformations to maximize the use of carry-save representation in arithmetic circuits," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no. 10, pp. 1761–1774, Oct. 2008.
[4] S. Mirzaei, A. Hosangadi, and R. Kastner, “FPGA implementation of
high speed FIR ﬁlters using add and shift method,” in Proc. Int. Conf.
Comput. Des., San Jose, CA, Oct. 2006, pp. 308–313.
[5] C.Y. Chen, S.Y. Chien, Y.W. Huang, T.C. Chen, T.C. Wang, and
L.G. Chen, “Analysis and architecture design of variable blocksize
motion estimation for H.264/AVC,” IEEE Trans. Circuits Syst. I, Reg.
Papers, vol. 53, no. 2, pp. 578–593, Feb. 2006.
[6] S. Sriram, K. Brown, R. Defosseux, F. Moerman, O. Paviot, V. Sun
dararajan, and A. Gatherer, “A 64 channel programmable receiver chip
for 3G wireless infrastructure,” in Proc. IEEE Custom Integr. Circuits
Conf., San Jose, CA, Sep. 2005, pp. 59–62.
[7] S. R. Vangal, Y. V. Hoskote, N. Y. Borkar, and A. Alvandpour, “A 6.2
Gﬂops ﬂoatingpoint multiplyaccumulator with conditional normal
ization,” IEEE J. SolidState Circuits, vol. 41, no. 10, pp. 2314–2323,
Oct. 2006.
[8] A. Shams, W. Pan, A. Chandanandan, and M. Bayoumi, “A highper
formance 1DDCT architecture,” in Proc. IEEE Int. Symp. Circuits
Syst., Geneva, Switzerland, May 2000, vol. 5, pp. 521–524.
[9] H. ParandehAfshar, P. Brisk, and P. Ienne, “Efﬁcient synthesis of
compressor trees on FPGAs,” in Proc. AsiaSouth Paciﬁc Des. Autom.
Conf., Seoul, Korea, Jan. 2008, pp. 138–143.
[10] H. ParandehAfshar, P. Brisk, and P. Ienne, “Improving synthesis of
compressor trees on FPGAs via integer linear programming,” in Proc.
Int. Conf. Des. Autom. Test Eur., Munich, Germany, Mar. 2008, pp.
1256–1261.
[11] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,”
IEEE Trans. Comput.Aided Design Integr. Circuits Syst., vol. 26, no.
2, pp. 203–215, Feb. 2007.
[12] P. Brisk, A. K. Verma, P. Ienne, and H. ParandehAfshar, “Enhancing
FPGA performance for arithmetic circuits,” in Proc. Des. Autom. Conf.,
San Diego, CA, Jun. 2007, pp. 404–409.
[13] W. J. Stenzel, W. J. Kubitz, and G. H. Garcia, “A compact highspeed
parallel multiplication scheme,” IEEE Trans. Comput., vol. C26, no.
10, pp. 948–957, Oct. 1977.
[14] S. Dormido and M. A. Canto, “Synthesis of generalized parallel
counters,” IEEE Trans. Comput., vol. C30, no. 9, pp. 699–703,
Sep. 1981.
[15] S. Dormido and M. A. Canto, “An upper bound for the synthesis of
generalized parallel counters,” IEEE Trans. Comput., vol. C31, no. 8,
pp. 802–805, Aug. 1982.
[16] “Stratix III Device Handbook, Vol. 1 and 2” Altera Corporation, San
Jose, CA, Feb. 2009. [Online]. Available: http://www.altera.com/
[17] “Virtex5 User Guide” Xilinx Corporation, San Jose, CA, 2007. [On
line]. Available: http://www.xilinx.com/
[18] “Virtex5 FPGA Xtreme DSP Design Considerations” Xilinx Corpora
tion, San Jose, CA, Jan. 2009. [Online]. Available: http://www.xilinx.
com/
[19] P. S. Zuchowski, C. B. Reynolds, R. J. Grupp, S. G. Davis, B. Cremen,
and B. Troxel, “A hybrid ASIC and FPGA architecture,” in Proc. Int.
Conf. Comput.Aided Des., San Jose, CA, Nov. 2002, pp. 187–194.
[20] P. Jamieson and J. Rose, “Architecting hard crossbars on FPGAs and
increasing their areaefﬁciency with shadow clusters,” in Proc. IEEE
Int. Conf. Field Programmable Technol., Kitakyushu, Japan, Dec.
2007, pp. 57–64.
[21] M. J. Beauchamp, S. Hauck, K. D. Underwood, and K. S. Hemmert,
“Architectural modiﬁcations to enhance the ﬂoatingpoint performance
of FPGAs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16,
no. 2, pp. 177–187, Feb. 2008.
[22] R. Kastner, A. Kaplan, S. O. Memik, and E. Bozorgzadeh, “Instruc
tion generation for hybridreconﬁgurable systems,” ACM Trans. Des.
Autom. Electron. Syst., vol. 7, no. 4, pp. 602–627, Oct. 2002.
[23] A. Cevrero, P. Athanasopoulos, H. ParandehAfshar, A. K. Verma, P.
Brisk, F. Gurkaynak, Y. Leblebici, and P. Ienne, “Architecture im
provements for ﬁeld programmable counter arrays: Enabling synthesis
of fast compressor trees on FPGAs,” in Proc. Int. Symp. FPGAs, Mon
terey, CA, Feb. 2008, pp. 181–190.
[24] D. Cherepacha and D. Lewis, “DPFPGA: An FPGA architecture op
timized for datapaths,” VLSI Des., vol. 4, no. 4, pp. 329–343, 1996.
[25] A. Kaviani, D. Vranseic, and S. Brown, “Computational ﬁeld pro
grammable architecture,” in Proc. IEEE Custom Integr. Circuits Conf.,
Santa Clara, CA, May 1998, pp. 261–264.
[26] S. Hauck, M. M. Hosler, and T. W. Fry, “Highperformance carry
chains for FPGAs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,
vol. 8, no. 2, pp. 138–147, Apr. 2000.
[27] K. LeijtenNowak and J. L. van Meerbergen, “An FPGA architecture
with enhanced datapath functionality,” in Proc. Int. Symp. Field Pro
grammable Gate Arrays, Monterey, CA, Feb. 2003, pp. 195–204.
[28] M. T. Frederick and A. K. Somani, “Multibit carry chains for
highperformance reconﬁgurable fabrics,” in Proc. Int. Conf. Field
Programmable Logic Appl., Madrid, Spain, Aug. 2006, pp. 1–6.
[29] H. ParandehAfshar, P. Brisk, and P. Ienne, “A novel FPGA logic block
for improved arithmetic performance,” in Proc. Int. Symp. Field Pro
grammable Gate Arrays, Monterey, CA, Feb. 2008, pp. 171–180.
[30] B. Parhami, “Conﬁgurable arithmetic arrays with datadriven control,”
in Proc. Asilomar Conf. Signals, Syst., Comput., Paciﬁc Grove, CA,
Oct./Nov. 2000, pp. 89–93.
[31] R. Francis, S. Moore, and R. Mullins, "A network of time-division multiplexed wiring for FPGAs," in Proc. 2nd IEEE Symp. Networks-on-Chip, Newcastle University, U.K., Apr. 2008, pp. 35–44.
[32] N. L. Miller and S. F. Quigley, “A novel ﬁeld programmable gate array
architecture for high speed arithmetic processing,” in Proc. 8th Int.
Workshop FieldProgrammable Logic Appl., Tallinn, Estonia, Aug./
Sep. 1998, pp. 386–390.
[33] A. Marshall, T. Stansﬁeld, I. Kostarnov, J. Vuillemin, and B. Hutch
ings, “A reconﬁgurable arithmetic array for multimedia applications,”
in Proc. Int. Symp. Field Programmable Gate Arrays, Monterey, CA,
Feb. 1999, pp. 135–143.
[34] S. Fiske and W. J. Dally, “The reconﬁgurable arithmetic processor,” in
Proc. 15th Int. Symp. Comput. Archit., Honolulu, HI, May/Jun. 1988,
pp. 30–36.
[35] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, “A 5GHz
mesh interconnect for a teraﬂops processor,” IEEE Micro, vol. 27, no.
5, pp. 51–61, Sep./Oct. 2007.
[36] J. Cong and H. Huang, “Technology mapping and architecture evalua
tion for k/mmacrocellbased FPGAs,” ACM Trans. Des. Autom. Elec
tron. Syst., vol. 10, no. 1, pp. 3–23, Jan. 2005.
[37] Y. Hu, S. Das, S. Trimberger, and L. He, “Design, synthesis and eval
uation of heterogeneous FPGA with mixed LUTs and macrogates,”
in Proc. Int. Conf. Comput.Aided Des., San Jose, CA, Nov. 2007, pp.
188–193.
[38] V. Betz and J. Rose, “VPR: A new packing, placement, and routing tool
for FPGA research,” in Proc. 7th Int. Workshop FieldProgrammable
Logic Appl., London, U.K., Sep. 1997, pp. 213–222.
[39] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs. Norwell, MA: Kluwer, Feb. 1999.
[40] K. K. W. Poon, S. J. E. Wilton, and A. Yan, “A detailed power model for
ﬁeldprogrammable gate arrays,” ACM Trans. Des. Autom. Electron.
Syst., vol. 10, no. 2, pp. 279–302, Apr. 2005.
[41] N. C. K. Choy and S. J. E. Wilton, “Activitybased power estimation
and characterization of DSP and multiplier blocks in FPGAs,” in Proc.
IEEE Int. Conf. Field Programmable Technol., Bangkok, Thailand,
Dec. 2006, pp. 253–256.
[42] J. Lamoureux and S. J. E. Wilton, “Activity estimation for ﬁeld pro
grammable gate arrays,” in Proc. IEEE Int. Conf. Field Programmable
Logic Appl., Madrid, Spain, Aug. 2006, pp. 1–8.
[43] F. N. Najm, “A survey of power estimation techniques in VLSI cir
cuits,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 2, no. 4,
pp. 446–455, Dec. 1994.
[44] C. Lee, M. Potkonjak, and W. H. MangioneSmith, “MediaBench: A
tool for evaluating and synthesizing multimedia and communications
systems,” in Proc. 30th Int. Symp. Microarchitecture, Research Tri
angle Park, NC, Dec. 1997, pp. 330–335.
[45] "Creating High-Speed Data Path Components, Application Note," Synopsys Corporation, Mountain View, CA.