Improving FPGA Performance for Carry-Save Arithmetic
578 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 4, APRIL 2010
Hadi Parandeh-Afshar, Ajay Kumar Verma, Philip Brisk, and Paolo Ienne
Abstract—The selective use of carry-save arithmetic, where
appropriate, can accelerate a variety of arithmetic-dominated
circuits. Carry-save arithmetic occurs naturally in a variety of
DSP applications, and further opportunities to exploit it can be
exposed through systematic data ﬂow transformations that can
be applied by a hardware compiler. Field-programmable gate
arrays (FPGAs), however, are not particularly well suited to
carry-save arithmetic. To address this concern, we introduce the
“ﬁeld programmable counter array” (FPCA), an accelerator for
carry-save arithmetic intended for integration into an FPGA as
an alternative to DSP blocks. In addition to multiplication and
multiply accumulation, the FPCA can accelerate more general
carry-save operations, such as multi-input addition (e.g., add k > 2 integers) and multipliers that have been fused with other adders. Our experiments show that the FPCA accelerates a wider variety of applications than DSP blocks and improves performance, area utilization, and energy consumption compared with soft FPGA logic.
Index Terms—Carry-save arithmetic, ﬁeld-programmable gate
array (FPGA), generalized parallel counter (GPC).
I. INTRODUCTION
FPGA performance is lacking for arithmetic circuits. Generally, arithmetic circuits do not map well onto lookup tables (LUTs),
the primary building block for general logic in FPGAs. To
address this concern, FPGAs offer two solutions: First, LUTs
are now tightly integrated with fast carry chains that perform
efﬁcient carry-propagate addition; second, FPGAs contain DSP
blocks that perform multiplication and multiply accumulation
(MAC). Although an improvement over LUTs alone, these
enhancements lack generality; speciﬁcally, they cannot effec-
tively accelerate carry-save arithmetic.
Carry-save arithmetic is a technique to add sets of numbers that eliminates much of the carry propagation that would otherwise occur. Carry-save arithmetic has been the method of choice for partial-product reduction in parallel multipliers for more than 40 years. More recently, Verma et al. developed a set of arithmetic-oriented data flow transformations
that can be applied to a computation in order to maximize the use
of carry-save arithmetic. These transformations systematically
reorder the operations in a circuit in order to cluster disparate
Manuscript received May 27, 2008; revised September 26, 2008. First published June 16, 2009; current version published March 24, 2010.
The authors are with the Processor Architecture Laboratory, School of Computer and Communications Sciences, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland (e-mail: email@example.com; ajaykumar.verma@epfl.ch; philip.brisk@epfl.ch; paolo.ienne@epfl.ch).
Digital Object Identifier 10.1109/TVLSI.2009.2014380
adders together and to merge adders with the partial-product-reduction trees of parallel multipliers. Each cluster of adders is then replaced with a compressor tree, i.e., a circuit that reduces k integers, X_1, X_2, ..., X_k, down to two, S (sum) and C (carry), such that

S + C = X_1 + X_2 + ... + X_k.

A carry-propagate adder (CPA), i.e., a two-input adder, then performs the final addition, S + C, to compute the result. Aside from the transformations of Verma et al., compressor trees occur naturally in a variety of applications.
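To make the carry-save idea concrete, the reduction can be sketched in a few lines of Python (the function names are ours, purely for illustration): each 3:2 compression step is a bitwise full adder that produces a sum word and a carry word with no carry propagation between bit positions; only the single final addition propagates carries.

```python
def compress_3_to_2(a, b, c):
    # Bitwise full adder: reduces three integers to two (sum, carry)
    # with no carry propagation between bit positions.
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def multi_input_add(xs):
    # Repeatedly apply 3:2 compression until two words remain,
    # then perform the single final carry-propagate addition.
    xs = list(xs)
    while len(xs) > 2:
        a, b, c, *rest = xs
        xs = list(compress_3_to_2(a, b, c)) + rest
    return xs[0] + (xs[1] if len(xs) > 1 else 0)
```

The invariant is that each compression step preserves the total sum while shrinking the operand count by one, which is exactly what a compressor tree does in hardware.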
The arithmetic capabilities of FPGAs are not well attuned to
the needs of carry-save arithmetic. Programmable LUTs have
been augmented with fast carry chains that are good building
blocks for CPAs but cannot be used for carry-save arithmetic.
The fastest methods to synthesize compressor trees on FPGA general logic do not use the carry chains except for the final carry-propagate addition.
FPGAs also integrate DSP blocks, which perform integer
multiplication and MAC. Although useful, DSP blocks cannot
accelerate multi-input addition; likewise, when the transfor-
mations of Verma et al. merge multipliers with adders, the
resulting operation can no longer map onto a DSP block. That
being said, certain multiplication operations whose bitwidths
do not match up well with the bitwidths of the DSP blocks are
faster when performed on the general logic of an FPGA.
This paper advocates the use of a ﬁeld programmable counter
array (FPCA) for carry-save arithmetic on FPGAs. The FPCA is
a programmable accelerator that can be integrated into an FPGA
as an alternative to DSP blocks. An early FPCA, introduced by Brisk et al., is a lattice of m:n counters. An m:n counter is a circuit that takes m input bits, counts the number of them that are set to 1, and outputs the result, a value in the range [0, m], as an n-bit unsigned binary number. The number of output bits is n = ⌈log2(m + 1)⌉.
The FPCA described in this paper, in contrast, is built using generalized parallel counters (GPCs), an extended type of counter that can sum bits having different input ranks; GPCs, which will be defined formally in Section III, are built using m:n counters as building blocks.
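As an illustration of the counter definition, its behavior can be sketched in Python (the function name is ours, not from the paper):

```python
import math

def mn_counter(bits):
    # An m:n counter: counts how many of the m input bits are 1 and
    # returns the count as an n-bit unsigned value, LSB first,
    # where n = ceil(log2(m + 1)).
    m = len(bits)
    n = math.ceil(math.log2(m + 1))
    count = sum(bits)
    return [(count >> i) & 1 for i in range(n)]
```

For example, a 15:4 counter has m = 15 and n = ⌈log2(16)⌉ = 4 output bits.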
Our experiments compare the FPCA with DSP blocks for
multiplication-dominated circuits and with the best methods to
synthesize compressor trees on general FPGA logic for other
circuits that feature carry-save arithmetic. As the DSP blocks
are ﬁxed-bitwidth multipliers/MACs, they perform better than
1063-8210/$26.00 © 2010 IEEE
Fig. 1. Illustration of the methodology underlying the proposed reconfigurable lattice. Using arithmetic transformations, a circuit/flow graph is transformed to expose one (or more) compressor trees. The compressor tree is mapped onto a new reconfigurable lattice (FPCA) that is integrated into an FPGA. The final addition is mapped onto a dedicated adder (shown here integrated into the new lattice, but it could easily be implemented using a carry chain instead). The nonarithmetic operations in the circuit/flow graph are mapped onto the FPGA logic blocks.
the FPCA for those operations and bitwidths; however, the
FPCA retains an advantage when bitwidth mismatches occur.
The FPCA also benefits from the transformations of Verma et al., whereas the DSP blocks do not. The FPCA offers
advantages over DSP blocks and FPGA logic in terms of critical
path delay, area utilization, and energy consumption.
In our experiments, we considered GPCs built from four par-
allel counter sizes: 8:4, 12:4, 16:5, and 20:5; although no counter
size was uniformly better than the others across all benchmarks,
our results suggest that increasing the counter size beyond 12:4
yields diminishing returns. We conclude that GPCs built from 12:4 counters are the ideal choice for our specific set of benchmarks.
A. Illustrative Example
Fig. 1 shows our approach. A circuit transformed as described
previously is partitioned into a (set of) compressor tree(s) with
corresponding CPA(s) and a set of nonarithmetic operations.
The compressor tree is mapped onto an FPCA, which is em-
bedded within a larger FPGA. Fig. 1 assumes that a dedicated
CPA is integrated into the FPCA; alternatively, the carry chains
in the logic-block structure of the FPGA could be used to per-
form the ﬁnal CPA. The nonarithmetic portions of the circuit are
mapped onto the FPGA. Following the lead of Xilinx and Al-
tera, the FPGA shown in Fig. 1 is organized into columns. Each
column contains a set of logic clusters [e.g., the Altera Logic
Array Block (LAB)], which contain several logic blocks [e.g.,
the Altera Adaptive Logic Module (ALM)] connected by local
routing. A global routing network connects the different logic
clusters. Due to the column structure, the horizontal and ver-
tical routing channels are nonuniform.
B. Paper Organization
The paper is organized as follows. Section II summarizes re-
lated work, Section III introduces GPCs, Section IV presents
the FPCA architecture, Sections V and VI present the experimental framework and results, and lastly, Section VII concludes the paper.
II. RELATED WORK
A. Commercial FPGA Architectures and Mapping
This section summarizes the arithmetic features in the Altera
Stratix III and Xilinx Virtex-5 FPGAs, both of which
are high-end FPGAs realized in 65-nm CMOS technology. The
logic architectures of both of these FPGAs feature six-input
LUTs with carry chains that perform efﬁcient carry-propagate
addition without using the routing network. The Stratix III carry chain is a ripple-carry adder; the Virtex-5 carry chain includes an XOR gate and a multiplexor (mux), which enable carry-lookahead addition.
Stratix II introduced a method to combine the LUTs with the carry chain to perform ternary (three-input) addition, which remained in place for the Stratix III; the Virtex-5 similarly supports ternary addition.
Due to the peculiar nature of FPGA architectures, it has long
been thought that multi-input addition is best realized using
trees of adders rather than compressor trees. The use of ternary
adders rather than binary (two-input) adders could reduce the
height of the trees, thereby reducing delay and/or pipeline
depth. Parandeh-Afshar et al., however, showed that
compressor trees could be synthesized on FPGAs using GPCs
(see Section III), signiﬁcantly reducing the delay compared to
ternary adder trees. Experimentally, this paper ﬁnds that the
FPCA is faster than both of these alternatives.
B. FPGA Enhancements to Improve Arithmetic Performance
Numerous enhancements for FPGAs have been proposed in
the past, particularly to improve arithmetic performance. For ex-
ample, several researchers have proposed hard IP cores: appli-
cation-speciﬁc integrated circuit (ASIC) components that im-
plement common operations that are embedded into the FPGA.
The most prevalent of these IP cores include block memories,
DSP/MAC blocks, standard I/O interfaces, crossbars, shifters, and floating-point units. Kastner et al. developed techniques to examine a set of applications
to ﬁnd good domain-speciﬁc IP core candidates.
Although the FPCA is similar in principle to the IP cores de-
scribed previously, it is not completely hard: It is programmable
and has its own routing network. Although it is intended to implement just one class of circuits (compressor trees), the FPCA
is flexible and is not fixed to a specific bitwidth; this distinguishes the FPCA, for example, from the hard multipliers whose bitwidths are fixed. Kuon and Rose have noted that fixed-bitwidth multiplication has some limitations, e.g., it is inefficient to implement a multiplication whose bitwidth does not match that of the multiplier contained in a DSP block.
Cevrero et al. recently proposed an alternative FPCA
architecture. Theirs is radically different than the one described
here; the most important distinguishing feature is that it uses
direct programmable connections but does not employ a global
routing network; as such, it offers less ﬂexibility than the
architecture proposed here, but with the potential of reduced
delay, area, and power consumption due to the absence of
global routing. Future work will compare and contrast these
two architectures to better understand the differences between them.
C. Carry Chains
Also notable but not directly related to this work are the fast carry chains: These are used to implement efficient
carry-propagate addition within FPGA logic cells. If an FPCA
is present, these carry chains can be used to perform the ﬁnal ad-
dition if a hard IP core implementation of a CPA is not present.
Parandeh-Afshar et al. developed a carry chain that allows a logic cell to be configured as a 6:2 compressor, a well-
known building block for compressor trees. The FPCA, how-
ever, is much more powerful, as its logic cells contain larger
GPCs (e.g., with up to 20 inputs). The use of larger and more
ﬂexible components reduces the number of logic levels in the
compressor tree as well as pressure on the routing network: This
is favorable from the perspective of delay, area utilization, and energy consumption.
D. Programmable Arrays of Arithmetic Primitives
The FPCA is a homogeneous array of arithmetic primitives
connected by a routing network. Many principally similar arith-
metic arrays have been proposed in the past, and this similarity
is acknowledged. The main difference is that the FPCA is lim-
ited in its scope of application (solely compressor trees) and is
intended for integration into a larger FPGA, whereas the arrays
discussed in this paper are stand-alone devices.
Parhami, for example, built an array of bit-serial addi-
tive multipliers and used a data-driven control scheme. The ad-
vantage of bit-serial arithmetic is that it reduces the wiring re-
quirement for an FPGA: This is signiﬁcant because wiring can
consume up to 70% of on-chip area. Although somewhat be-
yond the scope of this paper, bit-serial routing networks are an
active area of research that is beginning to emerge; the ap-
plications for such a device, however, must be able to tolerate
high latencies, and it is not immediately clear which applica-
tions easily fall into this category.
The reconfigurable arithmetic FPGA (RA-FPGA) is an
arithmetic array partitioned into three regions: 1) two’s comple-
ment addition; 2) sign/magnitude conversion to two’s comple-
ment, and vice versa; and 3) multiplication and division. Tra-
ditional FPGA-style logic is also included in order to imple-
ment control and general-purpose logic. In principle, such a de-
vice could use an FPCA to perform multiplication; however, no
RA-FPGA has been produced commercially to the best of our knowledge.
The CHESS reconfigurable array, developed at HP Labs,
is an array of 4-b arithmetic logic units (ALUs), connected by a
bus-based FPGA-style routing network. Each ALU supports 16
arithmetic and logical operations (e.g.,
ADD, SUB, XOR), along
with selection and comparison tests. Neighboring ALUs can be
chained together (e.g., to perform 8-b addition), and spatial par-
allelism is abundant. As the ALU does not support primitives
for multiplication or multioperand addition, the inclusion of an
FPCA into the array is certainly plausible.
Several conﬁgurable arrays of ﬂoating-point units have also
been proposed. In 1988, Fiske and Dally  introduced the
Reconﬁgurable Arithmetic Processor (RAP), which contains 64
ﬂoating units connected by a switching network. More recently,
Intel’s Teraﬂops processor  connected 80 ﬂoating-point
MAC units using a high-speed network on chip. Although
ﬂoating-point units contain integer multipliers (and, hence,
compressor trees), it does not appear that there would be any
room to incorporate an FPCA into such a chip, because the
ﬂoating-point units themselves have ﬁxed bitwidths in accor-
dance with IEEE standards.
III. GENERALIZED PARALLEL COUNTERS
A. Definitions
Let X = x_{n-1} x_{n-2} ... x_1 x_0 be an n-bit binary number, where each x_i, 0 ≤ i ≤ n-1, is a bit. Let x_0 be the least significant bit and x_{n-1} be the most significant bit. The subscript of a bit, in this case, is called the rank of the bit. Each bit x_i has rank i and contributes a value of x_i · 2^i to the total quantity represented
An m:n counter, as described in the preceding section, assumes that all bits have the same rank when it computes their sum. If all input bits have rank j, then the output of the counter is a set of bits having ranks j, j+1, ..., j+n-1.
A GPC is a type of counter that counts bits having different ranks. In fact, an m:n counter can implement a GPC, if desired: A bit of rank k must be connected to precisely 2^k inputs of the counter. Of course, other methods to build GPCs also exist.
A GPC is defined as a tuple (k_{r-1}, ..., k_1, k_0; n), where k_i is the number of input bits of rank i to sum and n is the number of output bits; the input bits of each rank are independent. For example, a (5, 3; 4) GPC can count up to 5 b of rank-1 and 3 b of rank-0; the maximum output value is 5 · 2 + 3 = 13; therefore, four output bits are required.
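The tuple notation can be read as a weighted population count; a small Python sketch (the names are ours, not from the paper):

```python
def gpc_value(bits_by_rank):
    # bits_by_rank[i] is the list of input bits of rank i;
    # each rank-i bit contributes 2^i to the output value.
    return sum(sum(bits) << rank for rank, bits in enumerate(bits_by_rank))

# A (5, 3; 4) GPC sums 3 bits of rank-0 and 5 bits of rank-1;
# its maximum output value is 3 * 1 + 5 * 2 = 13, needing 4 output bits.
```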
Here, we fix the number of input and output bits to be positive integers M and N. Given M and N, there is a family of GPCs that satisfy these I/O constraints. Clearly, the number of input bits cannot exceed M:

k_{r-1} + ... + k_1 + k_0 ≤ M.    (1)
Fig. 2. 15:4 counter can implement a (5, 5; 4) GPC.
Likewise, the maximum allowable output value, which occurs when all input bits are "1," cannot exceed the maximum integer value that can be expressed with N bits:

k_{r-1} · 2^{r-1} + ... + k_1 · 2 + k_0 ≤ 2^N - 1.    (2)

For a given M and N, there is a family of GPCs that satisfy these I/O constraints. As an example, take M = 15 and N = 4. One GPC in this family is a (5, 5; 4) GPC (see Fig. 2). This GPC has five input bits of rank-0 and five of rank-1. The maximum value that can be counted is 5 · 2 + 5 = 15; four output bits are required to represent a value in the range [0, 15]; clearly, any (i, j; 4) GPC suffices, under the assumption that i ≤ 5 and j ≤ 5 as well.
At the same time, a (4, 6; 4) GPC is also a member of this family, as it produces an output value in the range [0, 14], as is a (6, 3; 4) GPC, etc. An m:n counter is also a degenerate case of a GPC; in this case, a (0, 15; 4) GPC.
B. Conﬁgurable GPCs
For the FPCA, one fixed GPC does not suffice, because we desire more flexibility. Instead, we build a configurable GPC from an m:n counter with a layer of muxes placed on its input. This configuration layer, which is described in detail in Section III-D, allows the configurable GPC to implement a wide variety of GPCs; the user selects the desired GPC to implement and programs the configuration layer accordingly. For example, a programmable GPC with M = 15 and N = 4 should be able to implement the functionality of both a 15:4 counter and a (5, 5; 4) GPC, among others.
The rank of a GPC is the minimum rank among all of its input bits. Now, suppose that the minimum rank of an input bit to a given GPC is j and that the GPC must add two (or more) bits of ranks j and k, such that k > j. Each input bit of rank j is connected to one input of the GPC, while each input bit of rank k is connected to 2^{k-j} inputs. Fig. 2, for example, satisfies this property.
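This replication rule can be checked in software; a minimal Python sketch (names are ours), assuming a minimum rank of 0 and the 15:4 counter of Fig. 2: each rank-1 bit is fed to two counter inputs, so the counter's plain population count equals the weighted GPC sum.

```python
def gpc_on_counter(rank0_bits, rank1_bits, m=15):
    # Each rank-0 bit drives one counter input; each rank-1 bit
    # drives 2^1 = 2 counter inputs, so the population count of the
    # counter equals the weighted sum of the GPC inputs.
    inputs = list(rank0_bits) + [b for b in rank1_bits for _ in (0, 1)]
    assert len(inputs) <= m  # must fit the counter's input bandwidth
    return sum(inputs)
```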
Fig. 3 shows the design of an M-input N-output GPC built from an M-input m-output configuration layer followed by an m:n counter. Each input of the counter (output of the configuration layer) is connected to two GPC inputs and is controlled by two configuration bits. The configuration bit on the left selects one of two GPC inputs that are connected to a mux; the configuration bit on the right drives the counter input to 0 if it is not set, which allows the m:n counter to implement a j:k counter for j < m and k ≤ n; in this case, the j:k counter is called a subcounter. For example, a 7:3
Fig. 3. Architecture of an M-input programmable GPC.
counter has ﬁve subcounters: 6:3, 5:3, 4:3, 3:2, and 2:2 counters.
The deﬁnition of subcounters easily extends to GPCs as well.
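The subcounter list quoted above can be reproduced mechanically; a Python sketch (the function name is ours):

```python
import math

def subcounters(m):
    # A j:k counter is a subcounter of an m:n counter for j < m,
    # with k = ceil(log2(j + 1)) output bits.
    return [(j, math.ceil(math.log2(j + 1))) for j in range(m - 1, 1, -1)]
```

`subcounters(7)` yields the five subcounters 6:3, 5:3, 4:3, 3:2, and 2:2 listed above.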
C. Primitive, Covering, and Reasonable GPCs
This section specifies precisely which (k_{r-1}, ..., k_1, k_0; n) GPCs should be implemented by the programmable GPC with M inputs, N outputs, and an m:n counter at its core.
A primitive GPC is one that satisﬁes the I/O constraints.
A covering GPC is a primitive GPC that is not a sub-GPC of
another primitive GPC. Referring to Fig. 2, the (5, 5; 4) GPC
is a covering GPC. If the number of rank-1 inputs is increased to six, then there are only three rank-0 input ports remaining, as 6 · 2 + 3 = 15. Hence, a (6, 3; 4) GPC is also a
covering GPC. A (5, 4; 4) GPC, in contrast, is not a covering
GPC, because the (5, 5; 4) GPC can implement its functionality
by driving one of the rank-0 inputs to zero.
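For rank variation 2 and the 15:4 counter of Fig. 2, the primitive and covering GPCs can be enumerated by brute force; a Python sketch under our reading of the constraints (rank-1 bits consume two of the counter's 15 inputs each, and the maximum output value must fit in 4 bits):

```python
def primitive_gpcs(m=15, n=4):
    # (k1, k0; n) GPCs with rank variation 2 that fit a 15:4 counter:
    # each rank-1 bit uses two counter inputs, and the maximum output
    # value k1*2 + k0 must fit in n bits.
    return [(k1, k0)
            for k1 in range(m + 1) for k0 in range(m + 1)
            if 0 < 2 * k1 + k0 <= 2 ** n - 1 and 2 * k1 + k0 <= m]

def covering_gpcs(gpcs):
    # A covering GPC is a primitive GPC that is not a sub-GPC of
    # (i.e., not componentwise dominated by) another primitive GPC.
    return [g for g in gpcs
            if not any(h != g and h[0] >= g[0] and h[1] >= g[1]
                       for h in gpcs)]
```

Consistent with the text, the enumeration finds (5, 5; 4) and (6, 3; 4) to be covering, while (5, 4; 4) is not.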
To achieve universal coverage, a configurable GPC only needs to implement the functionality of the covering GPCs that have M inputs and N outputs. Of course, there is no formal
mandate that a conﬁgurable GPC provide universal coverage;
those that do not simply have limited ﬂexibility compared with
those that do.
We have identiﬁed two classes of unreasonable GPCs,
meaning that we can ﬁnd no rational justiﬁcation for using
them; this is not, however, a formal deﬁnition. When designing
a conﬁgurable GPC, there is no need to add support for unrea-
sonable GPCs, even if they are covering GPCs.
A GPC that has no rank-0 input bits, i.e., k_0 = 0, is unreasonable. For example, consider a (7, 0; 4) GPC. The rank-0 output will always be 0. The rank-1, -2, and -3 outputs can be computed by a 7:3 counter. If the rank of this GPC is j, it suffices to replace it with a 7:3 counter of rank j + 1; a configurable GPC need not support this type of GPC.
Similarly, a GPC that has one rank-0 input bit, i.e., k_0 = 1,
is unreasonable. This bit determines whether the output of the
GPC is even or odd. The rank-0 input bit is connected directly
to the rank-0 output and is not used within the GPC. There is no
need to connect this bit to the GPC input; instead, it should be
connected to a GPC at a lower level of the compressor tree.
As an example, consider a (7, 1; 4) GPC. The rank-0 output is
always equal to the rank-0 input. The rank-1, -2, and -3 outputs
can be computed by a traditional 7:3 counter. Suppose that the rank of this GPC is j. Then, it suffices to eliminate the rank-0 input bit and propagate it to the next level of the tree; then, the GPC is replaced with a 7:3 counter of rank j + 1.
Fig. 4. Example of primitive, covering, and reasonable covering GPCs.
Fig. 4 shows the preceding concepts for a particular choice of M and N. There are 23 primitive GPCs, 6 covering GPCs, and 3 reasonable covering GPCs. (0, 3, 1; 3) is unreasonable because k_0 = 1. (1, 0, 3; 3) is unreasonable because the rank-0 and -1 outputs can be computed by a 3:2 counter, while the rank-2 input connects directly to the rank-2 output; it suffices to replace this GPC with a 3:2 counter of the same rank and propagate the rank-2 input bit to the next level of the compressor tree directly. (1, 1, 1; 3) is also unreasonable: Not only is k_0 = 1, but also each input bit connects directly to each output; it suffices to propagate these bits directly to the next level of the compressor tree, eliminating this GPC altogether.
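The two unreasonable classes translate to a one-line test; a Python sketch (naming is ours):

```python
def is_reasonable(gpc):
    # gpc = (k_{r-1}, ..., k1, k0). A GPC with k0 == 0 reduces to a
    # smaller counter of higher rank, and one with k0 == 1 merely
    # passes its rank-0 bit through, so both are "unreasonable".
    k0 = gpc[-1]
    return k0 >= 2
```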
D. Conﬁguration Layer
The configuration layer allows the user to program an m:n counter as any M-input N-output reasonable covering GPC.
The circuit shown on the right-hand side of Fig. 3 is placed on
each conﬁguration layer output. The conﬁguration layer archi-
tecture is deﬁned by a set of connections between input ports
and muxes. When the right conﬁguration bit is zero, the corre-
sponding counter input is driven to zero; otherwise, it selects
one of the inputs connected to the mux.
We make the assumption that m is maximal for a given value n that satisfies (2), i.e., m = 2^n - 1. For example, if n = 3, there are 4:3, 5:3, 6:3, and 7:3 counters; based on this assumption, we default to the 7:3 counter.
Let I = {I_1, I_2, ..., I_M} be the set of input ports and X = {x_1, x_2, ..., x_m} be the muxes. A sensible configuration layer architecture satisfies the property that each input port I_j connects to 2^p muxes, where p is a nonnegative integer; thus, I_j can be connected to any input bit of rank at most p. p is called the maximum rank of the input port and is denoted rmax(I_j). When configuring a GPC, the rank of each input bit connected to I_j cannot exceed rmax(I_j).
Fig. 5(a) shows an example of a configuration layer (only muxes are shown) for a GPC built from a 15:4 counter. Several of the input ports have maximum rank-0; four input ports have maximum rank-1, and three ports have maximum rank-2.
The configuration layer can be represented as a configuration graph, a directed bipartite graph G = (I ∪ X, E), where E represents the set of connections from input ports to muxes, i.e., there is an edge (I_j, x_k) if and only if there is a connection from I_j to x_k. Fig. 5(b) shows an example corresponding to the configuration layer in Fig. 5(a). In Fig. 5(a), some input ports connect directly to the counter inputs; the corresponding one-input muxes are shown in Fig. 5(b) simply to represent the possible connection; a one-input mux in the configuration graph becomes a direct connection in the configuration layer.
E. Conﬁguring the GPC
The configuration graph represents the set of different input-to-mux connections. A configuration determines which input port is selected by each mux. At most, one input port can connect to each mux; if no input port is connected, the circuit shown on the right-hand side of Fig. 3 drives the counter input to zero instead. Specifically, a configuration C is a subset of edges C ⊆ E such that each mux is incident on at most one edge in C. An example of a configuration is one that connects each input port to exactly one mux, which configures the GPC as a 15:4 counter. A set of edges in which two input ports are connected to the same mux is not a configuration.
An active input port is incident on at least one edge in a configuration. A configuration is sensible if each active input port is incident on 2^p edges in C, where p is at most the maximum rank of the port. A configuration in which an active input port is connected to, say, three muxes is not sensible, because the number of muxes to which the port is connected is not a power of two.
Let C be a configuration, and let A be the set of input ports that are configured to connect to counter inputs. To stay within bandwidth limits, the total number of counter inputs driven by the ports, after configuration, cannot exceed m, the number of counter inputs; in other words, the sum over all ports I_j in A of 2^{rank(I_j)} must be at most m.
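These rules are easy to state as predicates; a Python sketch (our own data representation: a configuration maps each active input port to the number of counter inputs, i.e., muxes, that it drives):

```python
def is_sensible(config):
    # Every active input port must drive a power-of-two number of
    # counter inputs (1, 2, 4, ...).
    return all(n > 0 and n & (n - 1) == 0 for n in config.values())

def within_bandwidth(config, m):
    # A rank-p port consumes 2^p of the m counter inputs, so the
    # total number of driven inputs cannot exceed m.
    return sum(config.values()) <= m
```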
F. Conﬁguration Layer Design
Recall from Section III-A that a GPC is represented as a tuple (k_{r-1}, ..., k_1, k_0; n), where k_i is the number of input bits of rank i to be summed, and from Section III-C, recall
the deﬁnitions of reasonable and covering GPCs.
In this section, we outline a method to design a GPC conﬁg-
uration layer systematically. We do not attempt to achieve uni-
versal coverage; instead, we restrict the set of GPCs that can be
mapped onto our conﬁgurable GPC; doing this allows us to im-
plement the conﬁguration layer with one level of muxes, each
having at most two inputs, thereby bounding the delay and area
overhead of the conﬁguration layer.
The rank variation of a GPC (k_{r-1}, ..., k_1, k_0; n) is the number of input bit ranks supported by the GPC; e.g., the rank variation of the (5, 5; 4) GPC aforementioned is 2.
A configuration layer can implement a GPC if, for each rank i, there are at least k_i input ports whose maximum rank is at least i. This is intuitive: The GPC has k_i input bits of rank i, each of which must connect to 2^i muxes. The condition ensures that a sufficient supply of input ports with the desired maximum rank exists.
Let S be the set of reasonable covering GPCs that satisfy the I/O constraints. A complete configuration layer can implement every GPC in S. To simplify the design of the configuration layer, we have chosen to restrict the set of reasonable
Fig. 5. (a) Conﬁguration layer (muxes only) for a 15-input 4-output GPC. (b) Bipartite graph representation of the conﬁguration layer in (a).
Fig. 6. Configuration graph for (a) GPCs with rank variation 1, (b) rank variations 1 and 2, and (c) rank variations 1, 2, and 3. In (b), half of the input ports are connected to two muxes. In (c), two of the input ports are connected to four muxes. No mux is connected to more than two input ports.
covering GPCs to those whose rank is at most 2; this configuration layer is incomplete, meaning that universal coverage is not achieved.
Let G_v be the set of reasonable covering GPCs whose rank variation is v for a given M. The configuration layer described here must implement G_1, G_2, and G_3. Fig. 6 shows the construction method for an
eight-input counter through the incremental addition of edges
to a conﬁguration layer graph.
The set with rank variation 1 contains one GPC: All input bits have rank-0, i.e., an m:n counter. Any mapping from input ports to muxes suffices. Fig. 6(a) shows the initial set of edges.
Now, let us consider rank variation 2. No two rank-1 input ports can connect to the same mux; otherwise, both of these input ports could not be configured as rank-1 at the same time. In Fig. 6(b), edges are added to the configuration graph so that half of the input ports can be configured as either rank-0 or rank-1.
The rank-variation-3 set contains GPCs having bits of rank-0, -1, or -2. Like the aforementioned reasoning, each rank-2 input port connects to four muxes. Half of the input ports can already be configured as rank-1; thus, it suffices to take half of them and connect them to two additional muxes. In Fig. 6(c), two of the input ports are extended so that they can be configured as rank-0, -1, or -2. At this point, we stop. In general, there are m input ports in total; m/4 connect to four GPC inputs, m/4 connect to two, and m/2 connect to one.
The basic pattern shown in Fig. 6 is systematic and generalizes to any m:n counter. Stopping at rank variation 3 ensures that the largest
mux in the conﬁguration layer has at most two inputs.
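The general pattern can be written down directly; a Python sketch, assuming m divisible by 4 as in the eight-input example:

```python
def config_layer_fanout(m):
    # Fig. 6 pattern for an m-input counter: half the input ports
    # drive one counter input (max rank 0), a quarter drive two
    # (max rank 1), and a quarter drive four (max rank 2).
    fanout = [1] * (m // 2) + [2] * (m // 4) + [4] * (m // 4)
    # Total edges are 2m, so no mux needs more than two port inputs.
    assert sum(fanout) == 2 * m
    return fanout
```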
IV. FPCA ARCHITECTURE AND MAPPING
A. FPCA Architecture
The FPCA architecture presented by Brisk et al. is a 2-D lattice of hierarchical m:n counters connected through a programmable routing network; it had the same basic structure as an island-style FPGA, but with programmable logic cells replaced by m:n counters. The architecture presented here is similar, but programmable GPCs replace the m:n counters. The connection boxes that interface each programmable GPC to the adjacent routing channels and switch boxes that connect intersecting horizontal and vertical routing channels are the same as in an FPGA.
The hierarchical design of an m:n counter increases flexibility. For example, suppose that the counter size in an FPCA is 20:5. This 20:5 counter is hierarchically built from smaller counters, e.g., 4:3. If there are only four input bits to sum at a given time, the smaller counter can be used. This reduces the delay of the circuit at stages of a compressor tree where there is a small number of bits to sum; on the other hand, the large number of smaller counters increases the number of output ports, as several 4:3 counters, for example, will be available. The use of a configurable GPC in lieu of a hierarchically designed m:n counter offers similar flexibility, but without increasing the number of output ports. When there is a small number of bits available at each rank, a GPC can sum bits having different ranks.
Fig. 7. Flip-ﬂop and mux are placed on each GPC output to allow pipelining.
The programmable GPC described in the preceding section
is purely combinational. The user may wish to pipeline the
compressor tree in order to increase the clock frequency and
throughput. To facilitate this, a ﬂip-ﬂop and mux are placed
on each GPC output, as shown in Fig. 7. The same circuit is
typically placed on the outputs of FPGA logic blocks (but not
on the carry-chain outputs).
In a sense, our intention to embed an FPCA into an FPGA
is similar in principle to a cluster of logic cells in a traditional
FPGA (e.g., a LAB in an
Altera Stratix-series FPGA). A LAB
(or group of adjacent LABs) could be replaced with an FPCA.
The role of a programmable GPC within an FPCA is analogous
to the role of a programmable logic cell (an ALM in an Altera
Stratix-series FPGA) within a LAB. The primary difference is
that due to the interconnect structure of compressor trees, a more
ﬂexible routing network, in the style of a global (inter-LAB),
rather than a local (intra-LAB), FPGA routing network is re-
quired for the FPCA.
B. FPCA Mapping Heuristic
The algorithm to map a compressor tree onto an FPCA is
based on a heuristic developed by Parandeh-Afshar et al. to map compressor trees onto the logic cells (six-input LUTs) of high-performance FPGAs. The FPCA is homogeneous, i.e., M and N are the same for all programmable GPCs. For a given M and N, the available GPCs are the set of reasonable covering GPCs whose
rank variation does not exceed 2. No further modiﬁcations to the
mapping heuristic are required.
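To illustrate the flavor of such a covering heuristic (a simplified sketch, not the exact algorithm of Parandeh-Afshar et al.), the code below greedily applies a homogeneous m-input counter to the fullest bit column until every column holds at most two bits, ready for a final two-input adder. It simplifies by letting each counter take bits of a single rank and emit an exact binary count.

```python
# Hedged sketch of a greedy covering pass: repeatedly apply an m-input
# counter to the bit column with the most pending bits. Each counter
# replaces `taken` bits of one rank with the binary count of those bits,
# spread over successive ranks. Simplified relative to the paper's GPC
# heuristic, which also mixes bits of adjacent ranks.

def map_counters(col_heights, m=8):
    cols = dict(col_heights)                 # rank -> number of bits to sum
    used = 0                                 # counters instantiated so far
    while any(h > 2 for h in cols.values()):
        rank = max(cols, key=cols.get)       # fullest column first
        taken = min(m, cols[rank])
        cols[rank] -= taken
        for i in range(taken.bit_length()):  # count of `taken` ones occupies
            cols[rank + i] = cols.get(rank + i, 0) + 1   # these output ranks
        used += 1
    return used, cols
```

On a column profile such as that of adding eight pairs of bits, `map_counters({0: 8, 1: 8})` finishes after two 8-input counters.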
C. FPCA Alternatives
Here, we qualitatively explore several different alternative
methods to integrate
counters into an FPGA and explain
why we believe that the FPCA is superior.
One possibility would be to add a programmable GPC to a
LAB or to replace one of the ALMs with the programmable
GPC; however, one GPC is unlikely to be used in isolation: A
collection of them is required to construct a compressor tree.
Thus, a compressor tree synthesized on this architecture would
not be able to take advantage of the fast local connections within
the LAB. A better approach is to cluster the GPCs together, as
is done by the FPCA. Replacing all of the ALMs in a LAB with
programmable GPCs effectively yields an FPCA with a local
LAB style rather than a global routing network; we have opted
for a routing network in the global style due to the complex
interconnect structure required to construct a compressor tree.
A second alternative is to integrate a GPC into an ALM as
a programmable type of macroblock, similar in principle to the
work by Cong and Huang and Hu et al.; however, this
architecture signiﬁcantly increases the input and output band-
width of the ALM; it is unlikely that the local routing network
within a LAB could handle this increased I/O bandwidth as it
exists today. We believe that a better approach is to strictly sep-
arate the FPCA/GPCs from the LAB/ALMs.
V. EXPERIMENTAL METHODOLOGY
A. VPR Versions
Two different versions of the Versatile Place-and-Route (VPR) tool were used to evaluate the FPCA. The
most recent version of VPR, version 5.0, was used to compare
the performance advantages of an FPGA containing an FPCA
against an FPGA containing DSP blocks as a baseline. The
earlier version of VPR does not support DSP blocks or any type
of embedded IP core; therefore, the newer version was required
to perform this comparison.
An earlier version, version 4.30, was used for a comparison
of energy consumption. No power model is currently available for the newer version of VPR; as discussed in
Section V-C, we extended a preexisting power model for the
earlier version to compute the energy consumption.
VPR 5.0 provides preconstructed architecture models for dif-
ferent process technologies; VPR 4.30, in contrast, requires the
user to provide transistor-level properties of the wires. Details
will be provided in the following section.
B. Delay and Area Extraction
The FPCA was modeled as a stand-alone device using VPR
4.30. Each compressor tree in each benchmark was extracted
and synthesized on the FPCA. The FPCA was then modeled as
an IP core in VPR 5.0; for each benchmark, the delay through
each path through the FPCA was taken from VPR 4.30. The
complete benchmark was then synthesized on VPR 5.0, with all
compressor trees mapped onto FPCAs. The total delay includes
both non-compressor tree logic mapped onto the general logic
of the FPGA along with the compressor tree delay through the
FPCA. To model the FPCA, the traditional FPGA logic blocks in
VPR 4.30 were replaced with programmable GPCs. After map-
ping a compressor tree onto a network of GPCs, VPR was used
to place-and-route the circuit. VPR also reported the critical path
delay, which includes both routing and logic delays. The number
of GPCs required to synthesize each compressor tree can be de-
termined from the result of the mapping heuristic.
The programmable GPCs described in Section III were mod-
eled in Very High Speed Integrated Circuit Hardware Descrip-
tion Language (VHDL) and synthesized using Synopsys Design
Compiler with 90-nm TSMC standard cells. Cadence Silicon
Encounter was then used to place and route the designs and ex-
tract delay and area estimates. This was done for four different
programmable GPCs, with sizes 8:4, 12:4, 16:5, and
20:5. Thus, four different FPCA architectures were studied, as
the GPC size is assumed to be homogeneous within an FPCA.
A separate VPR architecture description ﬁle (ADF) was instan-
tiated for each FPCA. We limited the channel width to 40 seg-
ments, VPR’s default value.
For the purpose of comparison, we modeled an island-style
FPGA whose logic blocks resemble Altera’s ALM and whose
logic clusters resemble Altera’s LAB, but with four blocks per
cluster; the limited number of ALMs per LAB was due to com-
plications involved in modeling carry chains in VPR. Each LAB
has two carry chains to support ternary addition. The primary
difference between this baseline architecture and the Stratix II
and III is that the baseline is island style, while the Stratix II
and III organize LABs into columns and employ nonuniform
routing in the x- and y-directions. The GPC mapping heuristic
of Parandeh-Afshar et al.  was used to synthesize compressor
trees onto this FPGA.
To model routing delays, VPR 4.30 requires information such
as the per-unit resistance and capacitance of wires. Our experi-
ments used TSMC 90-nm technology, and the per-unit resistance
and capacitance for metal-6 were computed and inserted into
VPR’s ADF. These values were used to compute the routing
delays of the FPCA. The per-unit resistance and capacitance of
metal-6 were chosen, as this metal layer seemed to be a reason-
able choice for the wires in the routing network; in practice, the
routing network is likely to be realized in several metal layers.
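As a rough illustration of how per-unit resistance and capacitance translate into routing delay, a uniform wire modeled as a distributed RC line has an Elmore delay of about 0.5·r·c·L², so delay grows quadratically with unbuffered wire length. The numeric values below are placeholders, not the extracted TSMC 90-nm metal-6 parameters.

```python
# First-order routing-delay estimate of the kind VPR derives from the
# per-unit R and C in its architecture description: a uniform,
# unbuffered wire behaves as a distributed RC line with Elmore delay
# ~ 0.5 * r * c * L^2. Parameter values here are placeholders only.

def elmore_wire_delay(r_per_um, c_per_um, length_um):
    """Elmore delay (seconds) of a distributed RC wire."""
    return 0.5 * r_per_um * c_per_um * length_um ** 2

# Doubling the wire length quadruples the delay:
short = elmore_wire_delay(0.1, 2e-16, 100.0)   # ohm/um, F/um, um
long_ = elmore_wire_delay(0.1, 2e-16, 200.0)
```

This quadratic growth is why routers insert buffers and why the choice of metal layer for the FPCA's routing network matters.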
VPR 5.0, in contrast, provides preconstructed architecture
models for different transistor technologies. We used an appro-
priate model, which eliminated the need to explicitly provide
per-unit resistance and capacitances of the wires.
C. VPR Power Model
A power model for VPR was developed by Poon et al. 
to model traditional island-style FPGAs. Choy and Wilton 
modiﬁed this framework to support power estimation for em-
bedded IP cores, such as DSP blocks. We extended these models
to estimate the power consumption of the FPCA. The Activity
Estimator  estimates the probability of transitions occurring
in the circuits mapped to an FPGA; static probability and transition density are used to extract the transition activities of each net. The complexity of this computation is exponential in the number of inputs. This time complexity is suitable for
LUT-based logic blocks, where the number of inputs is typically
six or less (ignoring carry chains). The programmable GPCs
used in our study of FPCAs have up to 20 inputs; for circuits
of this size, the exponential runtime of the model becomes a bottleneck.
D. FPCA Power Model for VPR
Due to the complexity of the VPR power model described in
the preceding section, we developed a more efﬁcient simulation-
based power model. Our power model is based on the lookup technique advocated by Choy and Wilton.
The power model consists of an ofﬂine power characteriza-
tion of each GPC under different input switching activity proba-
bilities. The results are collected in a table and fed into an online
power estimator that extracts the switching activities via simula-
tion. The simulator dynamically accesses the LUT to determine
the power dissipated given the switching activity at each point
in the simulation.
The ofﬂine power characterization ﬂow is described as fol-
lows. First, the programmable GPC is modeled in VHDL and
synthesized using Synopsys Design Compiler. Object and node
names are extracted; these names are later used in the simulation
phase for assigning switching activities. The power of the pro-
grammable GPC is estimated using Synopsys PrimePower; the
power characteristics of the GPCs are extracted with different
transition activity rates. These rates are then organized into ta-
bles, indexed by the transition probabilities, which are then input
into the online ﬂow.
The online power estimator begins with a mapped netlist
whose objects and nodes are extracted. The objects are the
GPC blocks used for mapping a compressor tree. The transition
activities of objects are extracted through the application of
stimulus vectors, which are generated randomly. As the accu-
racy of the power calculations used by VPR depends on the
accuracy of the switching activity annotated to the design, it
is essential to achieve high coverage during simulation. High
coverage is achieved via simulation feedback to the random
vector generator. After a set of random vectors with high
signal coverage is found, the simulator computes the activity
transitions of objects and nets listed in the object list.
Next, a modiﬁed version of VPR is used to estimate the power
dissipated by the FPCA. VPR’s power model is based on tran-
sition density and the static probability of nets. The transition
density of a signal represents the average number of transitions
of that signal per unit time; the static probability of a net is the
probability that the signal is high at any given time. These two
parameters are computed for each net in the design using the simulator described above. These statistics feed the power model for the FPCA. The offline GPC power characteristics are placed as a
table in the ADF that describes the FPCA. The table contains
the estimated power dissipation for transition activities ranging
from 0 to 1 by increments of 0.1; separate tables are instantiated
depending on whether or not each output of the GPC is written
to its ﬂip-ﬂop.
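A minimal sketch of this lookup-based flow, with placeholder table values rather than measured GPC characterization data: a net's static probability and transition density are first derived from a simulated trace, and the power table, indexed by transition activity in steps of 0.1, is then linearly interpolated.

```python
# Sketch of the lookup-based power model described above. Step 1:
# derive a net's static probability (fraction of time high) and
# transition density (transitions per cycle) from a 0/1 trace. Step 2:
# linearly interpolate a characterization table holding power values at
# activities 0.0, 0.1, ..., 1.0. The table contents are placeholders.

def net_statistics(trace):
    static_prob = sum(trace) / len(trace)
    transitions = sum(a != b for a, b in zip(trace, trace[1:]))
    density = transitions / (len(trace) - 1)
    return static_prob, density

def gpc_power(activity, table):
    """table: 11 power values at activities 0.0, 0.1, ..., 1.0."""
    i = min(int(activity * 10), 9)          # lower table index
    frac = activity * 10 - i                # position between entries
    return table[i] + frac * (table[i + 1] - table[i])
```

The linear interpolation replaces the exponential per-input computation of the generic VPR activity model with a constant-time table access per GPC.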
A second input to the power model is the transition activity
of objects extracted in the previous step. VPR reports three different power estimates: the dynamic power dissipated by the GPCs, the dynamic power dissipated by the routing network, and the leakage power. The
power consumption of the routing network is estimated using
switching activities and switch box and wire parameters speci-
ﬁed in the ADF based on the target technology. The GPC power
consumption is estimated using the average activity of its inputs
and the ofﬂine power table in the ADF.
VI. EXPERIMENTAL RESULTS
We selected a set of benchmarks from arithmetic, DSP, and
video processing domains where we were able to identify com-
pressor trees. These benchmarks were broadly categorized into
multiplier-based and multi-input addition benchmarks.
The multiplier-based benchmarks include g.721, a polynomial function that has been optimized using Horner’s Rule (hpoly), two stand-alone multipliers of different bit widths, and a video processing application (video mixer). The video mixer converts two channels of red–green–blue
video to television-standard YIQ signals and then mixes them in
an alpha blender to produce a composite output signal.
The multi-input addition benchmarks include the Media-
Bench application adpcm , a 1-D multiplierless discrete
cosine transform (dct ), three- and six-tap ﬁnite-impulse
Fig. 8. Critical path delay observed for each benchmark without the transformations of Verma and Ienne  and with multiplication operations synthesized on
DSP blocks (DSP); all other synthesis methods applied the transformations. Ternary and GPC synthesize each benchmark wholly on the general logic of the FPGA,
while 8:4, 12:4, 16:5, and 20:5 synthesize each compressor tree on an FPCA that is integrated into a larger FPGA.
response ﬁlters with randomly generated constants (ﬁr3 and
ﬁr6 ), and an internally developed variable block size motion
estimator for H.264/AVC video coding (H.264 ME).
The multiplier-based benchmarks contain multipliers that can
be synthesized on an FPGA using DSP blocks; however, when
the transformations of Verma et al.  are applied, the com-
pressor trees within the multipliers are merged with other ad-
dition operations, rendering the DSP blocks useless. After ap-
plying these transformations, the compressor trees within these
benchmarks can only be synthesized on the general logic of the
FPGA or on an FPCA.
The video mixer contains many disparate compressor trees,
even after the transformations of Verma et al. are applied; all
other benchmarks contain one compressor tree. H.264 ME con-
tains a set of identical processing elements (PEs), where each
PE contains a compressor tree. The number of PEs can vary de-
pending on the needs of the system. We chose to synthesize a
four-PE system, ignoring the memory and control logic.
Each benchmark was synthesized six or seven times.
1) DSP: The multiplier-based benchmarks and adpcm were
synthesized without applying the transformations of Verma
et al. All multipliers in the multiplier-based benchmarks
were synthesized on the DSP blocks. adpcm contains three
disparate addition operations, but cannot use DSP blocks.
In all subsequent experiments, the transformations of
Verma et al. were applied to the multiplier-based bench-
marks and to adpcm; the remaining multi-input addition
benchmarks were written with compressor trees explicitly
exposed. DSP blocks cannot be used for multiplication
operations following the transformations.
2) Ternary: Compressor trees are synthesized on ternary
adder trees using FPGA logic cells configured as ternary adders.
3) GPC: Compressor trees are synthesized on the general
logic of an FPGA using the GPC mapping heuristic of
Parandeh-Afshar et al.
4) 8:4, 12:4, 16:5, and 20:5: The compressor tree is synthe-
sized on an FPCA; four different FPCAs with different
counter sizes were considered.
The experiments synthesized purely combinational circuits.
In actuality, the frequency and throughput of a compressor tree
could be increased by registering the output bits of each level
of logic in the tree. The benchmarks that were implemented did
not naturally contain pipelined compressor trees; therefore, this
possibility is not evaluated here.
Lastly, we note that Brisk et al.  attempted to synthesize
adder trees using the DSP blocks; this approach yielded very
slow compressor trees; these experiments are not repeated here, as the approach is not competitive.
Fig. 8 shows the critical path delay of each benchmark after
synthesis. In all cases, other than the two stand-alone multipliers, the FPCA yields the minimum critical path
delay. In particular, the FPCA’s success on the multiplier-based
benchmarks compared with that of a DSP is due to its ability to
accelerate compressor trees generated by the transformations of
Verma et al.
The two stand-alone multipliers do not benefit from these transformations, as their compressor trees are not merged with
any other operations.
It should be noted that the two stand-alone multipliers are worst-case examples for the Altera-style FPGA. The reason is that each half-DSP block contains a fixed-width multiplier; for example, four such multipliers are required for the larger of the two benchmarks. The gap between the DSP block and the FPCA would be exacerbated for larger multipliers.
Fig. 9. Average delay of the set of benchmarks decomposed into the delay through DSP blocks (DSP only)/compressor tree logic (all others) and non-comp
tree logic (all synthesis methods) for the (a) multiplier-based and (b) multi-input addition benchmarks.
The benefits of the FPCA are especially pronounced for the video mixer
because Verma et al.’s transformations are particularly effective
for this benchmark: it has many multiplication and addition op-
erations that are merged together by these transformations.
Fig. 8 also shows that the FPCA considerably reduces the
critical path delay over Ternary and GPC; this is particularly
important for the multi-input addition benchmarks, where the
DSP blocks cannot be used.
Among the different FPCA options, none performed uni-
formly better than the others. In the case of adpcm, all four
FPCAs achieved comparable critical path delays. Among the
FPCAs, 12:4 had the minimum critical path delay for hpoly and one of the stand-alone multipliers; 16:5 had the minimum critical path delay for g721,
ﬁr6, and H.264 ME; and 20:5 had the minimum critical path
delay for the remaining benchmarks. These results indicate that
no FPCA will be ideal for all benchmarks, but the counter size
should probably be larger than 8:4.
Fig. 9 shows the delay achieved by synthesizing each bench-
mark into DSP block/compressor tree logic and non-compressor
tree logic. Fig. 9(a) shows the multiplier-based benchmarks,
where DSP blocks can be used, and Fig. 9(b) shows the results
for the multi-input addition benchmarks. Due to its limited func-
tionality, the FPCA only speeds up the delay of the compressor tree logic.
In Fig. 9(a), the FPCA reduces the average compressor tree
logic delay from 30% (8:4) to 47% (20:5); however, the trans-
formations of Verma et al. and the need to synthesize partial-
product generators on the FPGA general logic when DSP blocks
are not used increase the average non-compressor tree logic
delay by 46%. GPC and Ternary increase the average critical
path delay compared to DSP; 8:4, 12:4, 16:5, and 20:5 reduce
the average critical path delay compared with DSP by 0.2%, 8%,
10%, and 11%, respectively.
In Fig. 9(a), the increase in average non-compressor tree logic
delay for all options other than DSP is due to the fact that par-
tial-product generators must be synthesized on general FPGA
logic, rather than the DSP blocks. On the other hand, the FPCA considerably reduces the average delay of the resulting compressor trees compared with Ternary and GPC.
DSP blocks cannot be used for the multi-input addition
benchmarks in Fig. 9(b); we take GPC as a baseline, as its critical path delay is less than that of Ternary. Since Verma and Ienne’s
transformations are applied to adpcm and the other multi-input
addition benchmarks have compressor trees directly exposed,
the non-compressor tree logic delay is the same in all cases.
Compared with GPC, 8:4, 12:4, 16:5, and 20:5 reduce the
overall (compressor tree) delay by 35% (45%), 41% (53%),
43% (56%), and 43% (55%), respectively. As there are no
partial-product generators and DSP blocks are not used, these
benefits are due solely to critical path reduction within the compressor trees.
Fig. 10 shows the area of each benchmark converted to two-input NAND gate equivalents (GEs); the area includes the computational elements (LUTs, DSP blocks, and GPCs) and does
not include any estimates of the utilization of resources in the
programmable routing network.
Each DSP block contains eight embedded multipliers and has an area of 10 714 GEs. An n-bit multiplier generates roughly n^2 partial-product bits; since each ALM produces two output bits, roughly n^2/2 ALMs are required. In theory, this gives DSP blocks an advantage in terms of area utilization compared with the other synthesis methods. For hpoly and the two stand-alone multipliers, the FPCA consumed
considerably more area than DSP. This is due, primarily, to the
fact that partial-product generators must be synthesized on the
general logic of the FPGA. It should be noted that one of the stand-alone multipliers required just one DSP block but used only four of the eight multipliers. In other cases, namely, g721 and the video mixer, the
FPCAs had similar area requirements to DSP; however, 12:4
for video mixer was signiﬁcantly smaller. For this benchmark,
GPCs built from 12:4 counters, coincidentally, were the per-
fect-sized building blocks.
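The area arithmetic above can be made concrete. The DSP-block area (10 714 GEs) comes from the text; the per-ALM gate-equivalent cost and the helper names in the sketch below are placeholder assumptions.

```python
import math

# Worked sketch of the area comparison: an n-bit multiplier generates
# roughly n^2 partial-product bits, and with two output bits per ALM
# roughly n^2 / 2 ALMs are needed to produce them on general logic.
# ALM_GE is a placeholder assumption; DSP_BLOCK_GE is from the text.

ALM_GE = 30            # hypothetical gate equivalents per ALM
DSP_BLOCK_GE = 10714   # area of one DSP block, from the text

def soft_multiplier_alms(n):
    """ALMs for the partial-product generator of an n-bit multiplier."""
    return math.ceil(n * n / 2)

def soft_multiplier_ge(n):
    return soft_multiplier_alms(n) * ALM_GE
```

Under this placeholder cost, a 16-bit partial-product generator alone needs 128 ALMs, which illustrates why the FPCA loses area to DSP blocks on pure multipliers.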
Compared to GPC and Ternary, the FPCAs reduced the area
requirement; in most cases, the area reduction was marginal;
however, it was quite pronounced for the video mixer and ﬁr6.
Similar to the critical path delay results reported in Fig. 8, none
of the four FPCA options was uniformly better than the others
across all benchmarks.
Fig. 10. Area of each benchmark after synthesis. The areas of the DSP blocks, LUTs, and GPCs have been converted to two-input NAND GEs. These area estimates
do not account for the programmable routing network.
Fig. 11. Energy consumption of each benchmark (normalized to Ternary).
Fig. 11 shows the normalized energy consumption of each
benchmark. As VPR 5.0 does not have a power model, the energy consumption reported here is only for the compressor trees and was measured using VPR 4.30. No energy consumption for DSP is reported, since VPR 4.30 does not support embedded IP cores.
Fig. 12 shows the average energy consumption across the set
of benchmarks, decomposed into energy consumed by the logic
elements (ALMs/GPCs) and the routing network.
GPC consumes more energy than the other options. GPC
builds a compressor tree using six-input GPCs with three or
four outputs; two ALMs per GPC are required. Each ALM in
Ternary, in contrast, takes six input bits and produces two output
bits (ignoring the carry-out bit, which is propagated to the next
ALM in the chain). For this reason, GPC tends to require more
ALMs and dissipates more static power.
Fig. 12 shows that the primary advantage of the FPCA comes
from its ability to reduce logic delay. In both Ternary and GPC,
LUT-based ALMs are used to realize the arithmetic building
blocks for compressor trees. The FPCA, in contrast, uses ASIC
implementations of these components, which is considerably
more efﬁcient. Although Ternary consumes less energy in the
routing network than any of the alternatives, the FPCA more
than makes up for this in terms of energy savings in the logic.
In conclusion, both Figs. 11 and 12 show that the FPCA signif-
icantly reduces energy consumption compared to Ternary and GPC.
We suspect that DSP blocks will consume less energy for
multiplication operations, because the other methods will need
to synthesize the partial-product generators on the general logic of the FPGA, and the number of partial products per multiplication operation is quadratic in the bitwidth.
Fig. 12. Average energy consumption due to logic and routing.
VII. CONCLUSION AND FUTURE WORK
The FPCA is a programmable IP core that can accelerate compressor trees on FPGAs. For parallel multiplication, the FPCA retains most of the benefits of the embedded multipliers in the DSP blocks, while providing a variable-bitwidth solution for multiplication operations that do not match the fixed bitwidth of the DSP blocks; however, it suffers a disadvantage in terms of area utilization because the partial products must be synthesized on the FPGA general logic. Moreover, the
FPCA can accelerate multi-input addition operations, while the
DSP blocks cannot, particularly when used in conjunction with
transformations by Verma
et al.  to expose compressor trees
at the application level. Furthermore, the FPCA reduces the
critical path delay and energy consumption compared to the best methods to synthesize compressor trees on the general logic of the FPGA.
The DSP block will generally outperform the FPCA for appli-
cations containing many multiplications whose bitwidths match
precisely that of the ASIC multipliers in the embedded DSP
blocks and for which the transformations of Verma et al. are in-
effective. For virtually all other applications that contain com-
pressor trees—naturally or via transformation, the FPCA per-
forms signiﬁcantly better than current FPGAs.
REFERENCES
C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Elec-
tron. Comput., vol. EC-13, no. 6, p. 754, Dec. 1964.
 L. Dadda, “Some schemes for parallel multipliers,” Alta Freq., vol. 34,
pp. 349–356, Mar. 1965.
 A. K. Verma, P. Brisk, and P. Ienne, “Data-ﬂow transformations to
maximize the use of carry-save representation in arithmetic circuits,”
IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no.
10, pp. 1761–1774, Oct. 2008.
 S. Mirzaei, A. Hosangadi, and R. Kastner, “FPGA implementation of
high speed FIR ﬁlters using add and shift method,” in Proc. Int. Conf.
Comput. Des., San Jose, CA, Oct. 2006, pp. 308–313.
 C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, T.-C. Wang, and
L.-G. Chen, “Analysis and architecture design of variable block-size
motion estimation for H.264/AVC,” IEEE Trans. Circuits Syst. I, Reg.
Papers, vol. 53, no. 2, pp. 578–593, Feb. 2006.
 S. Sriram, K. Brown, R. Defosseux, F. Moerman, O. Paviot, V. Sun-
dararajan, and A. Gatherer, “A 64 channel programmable receiver chip
for 3G wireless infrastructure,” in Proc. IEEE Custom Integr. Circuits
Conf., San Jose, CA, Sep. 2005, pp. 59–62.
 S. R. Vangal, Y. V. Hoskote, N. Y. Borkar, and A. Alvandpour, “A 6.2-
Gﬂops ﬂoating-point multiply-accumulator with conditional normal-
ization,” IEEE J. Solid-State Circuits, vol. 41, no. 10, pp. 2314–2323,
 A. Shams, W. Pan, A. Chandanandan, and M. Bayoumi, “A high-per-
formance 1D-DCT architecture,” in Proc. IEEE Int. Symp. Circuits
Syst., Geneva, Switzerland, May 2000, vol. 5, pp. 521–524.
 H. Parandeh-Afshar, P. Brisk, and P. Ienne, “Efﬁcient synthesis of
compressor trees on FPGAs,” in Proc. Asia-South Paciﬁc Des. Autom.
Conf., Seoul, Korea, Jan. 2008, pp. 138–143.
 H. Parandeh-Afshar, P. Brisk, and P. Ienne, “Improving synthesis of
compressor trees on FPGAs via integer linear programming,” in Proc.
Int. Conf. Des. Autom. Test Eur., Munich, Germany, Mar. 2008, pp.
 I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,”
IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no.
2, pp. 203–215, Feb. 2007.
 P. Brisk, A. K. Verma, P. Ienne, and H. Parandeh-Afshar, “Enhancing
FPGA performance for arithmetic circuits,” in Proc. Des. Autom. Conf.,
San Diego, CA, Jun. 2007, pp. 404–409.
 W. J. Stenzel, W. J. Kubitz, and G. H. Garcia, “A compact high-speed
parallel multiplication scheme,” IEEE Trans. Comput., vol. C-26, no.
10, pp. 948–957, Oct. 1977.
 S. Dormido and M. A. Canto, “Synthesis of generalized parallel
counters,” IEEE Trans. Comput., vol. C-30, no. 9, pp. 699–703,
 S. Dormido and M. A. Canto, “An upper bound for the synthesis of
generalized parallel counters,” IEEE Trans. Comput., vol. C-31, no. 8,
pp. 802–805, Aug. 1982.
 “Stratix III Device Handbook, Vol. 1 and 2” Altera Corporation, San
Jose, CA, Feb. 2009. [Online]. Available: http://www.altera.com/
 “Virtex-5 User Guide” Xilinx Corporation, San Jose, CA, 2007. [On-
line]. Available: http://www.xilinx.com/
 “Virtex-5 FPGA Xtreme DSP Design Considerations” Xilinx Corpora-
tion, San Jose, CA, Jan. 2009. [Online]. Available: http://www.xilinx.
 P. S. Zuchowski, C. B. Reynolds, R. J. Grupp, S. G. Davis, B. Cremen,
and B. Troxel, “A hybrid ASIC and FPGA architecture,” in Proc. Int.
Conf. Comput.-Aided Des., San Jose, CA, Nov. 2002, pp. 187–194.
 P. Jamieson and J. Rose, “Architecting hard crossbars on FPGAs and
increasing their area-efﬁciency with shadow clusters,” in Proc. IEEE
Int. Conf. Field Programmable Technol., Kitakyushu, Japan, Dec.
2007, pp. 57–64.
 M. J. Beauchamp, S. Hauck, K. D. Underwood, and K. S. Hemmert,
“Architectural modiﬁcations to enhance the ﬂoating-point performance
of FPGAs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16,
no. 2, pp. 177–187, Feb. 2008.
 R. Kastner, A. Kaplan, S. O. Memik, and E. Bozorgzadeh, “Instruc-
tion generation for hybrid-reconﬁgurable systems,” ACM Trans. Des.
Autom. Electron. Syst., vol. 7, no. 4, pp. 602–627, Oct. 2002.
 A. Cevrero, P. Athanasopoulos, H. Parandeh-Afshar, A. K. Verma, P.
Brisk, F. Gurkaynak, Y. Leblebici, and P. Ienne, “Architecture im-
provements for ﬁeld programmable counter arrays: Enabling synthesis
of fast compressor trees on FPGAs,” in Proc. Int. Symp. FPGAs, Mon-
terey, CA, Feb. 2008, pp. 181–190.
 D. Cherepacha and D. Lewis, “DP-FPGA: An FPGA architecture op-
timized for datapaths,” VLSI Des., vol. 4, no. 4, pp. 329–343, 1996.
 A. Kaviani, D. Vranseic, and S. Brown, “Computational ﬁeld pro-
grammable architecture,” in Proc. IEEE Custom Integr. Circuits Conf.,
Santa Clara, CA, May 1998, pp. 261–264.
 S. Hauck, M. M. Hosler, and T. W. Fry, “High-performance carry
chains for FPGAs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,
vol. 8, no. 2, pp. 138–147, Apr. 2000.
 K. Leijten-Nowak and J. L. van Meerbergen, “An FPGA architecture
with enhanced datapath functionality,” in Proc. Int. Symp. Field Pro-
grammable Gate Arrays, Monterey, CA, Feb. 2003, pp. 195–204.
 M. T. Frederick and A. K. Somani, “Multi-bit carry chains for
high-performance reconﬁgurable fabrics,” in Proc. Int. Conf. Field
Programmable Logic Appl., Madrid, Spain, Aug. 2006, pp. 1–6.
 H. Parandeh-Afshar, P. Brisk, and P. Ienne, “A novel FPGA logic block
for improved arithmetic performance,” in Proc. Int. Symp. Field Pro-
grammable Gate Arrays, Monterey, CA, Feb. 2008, pp. 171–180.
 B. Parhami, “Conﬁgurable arithmetic arrays with data-driven control,”
in Proc. Asilomar Conf. Signals, Syst., Comput., Paciﬁc Grove, CA,
Oct./Nov. 2000, pp. 89–93.
 R. Francis, S. Moore, and R. Mullins, “A network of time-division mul-
tiplexed wiring for FPGAs,” in Proc. 2nd IEEE Symp. Networks-on-Chip, Newcastle University, U.K., Apr. 2008, pp. 35–44.
 N. L. Miller and S. F. Quigley, “A novel ﬁeld programmable gate array
architecture for high speed arithmetic processing,” in Proc. 8th Int.
Workshop Field-Programmable Logic Appl., Tallinn, Estonia, Aug./
Sep. 1998, pp. 386–390.
 A. Marshall, T. Stansﬁeld, I. Kostarnov, J. Vuillemin, and B. Hutch-
ings, “A reconﬁgurable arithmetic array for multimedia applications,”
in Proc. Int. Symp. Field Programmable Gate Arrays, Monterey, CA,
Feb. 1999, pp. 135–143.
 S. Fiske and W. J. Dally, “The reconﬁgurable arithmetic processor,” in
Proc. 15th Int. Symp. Comput. Archit., Honolulu, HI, May/Jun. 1988,
 Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, “A 5-GHz
mesh interconnect for a teraﬂops processor,” IEEE Micro, vol. 27, no.
5, pp. 51–61, Sep./Oct. 2007.
 J. Cong and H. Huang, “Technology mapping and architecture evalua-
tion for k/m-macrocell-based FPGAs,” ACM Trans. Des. Autom. Elec-
tron. Syst., vol. 10, no. 1, pp. 3–23, Jan. 2005.
 Y. Hu, S. Das, S. Trimberger, and L. He, “Design, synthesis and eval-
uation of heterogeneous FPGA with mixed LUTs and macro-gates,”
in Proc. Int. Conf. Comput.-Aided Des., San Jose, CA, Nov. 2007, pp.
 V. Betz and J. Rose, “VPR: A new packing, placement, and routing tool
for FPGA research,” in Proc. 7th Int. Workshop Field-Programmable
Logic Appl., London, U.K., Sep. 1997, pp. 213–222.
V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep Submicron FPGAs. Norwell, MA: Kluwer, Feb. 1999.
 K. K. W. Poon, S. J. E. Wilton, and A. Yan, “A detailed power model for
ﬁeld-programmable gate arrays,” ACM Trans. Des. Autom. Electron.
Syst., vol. 10, no. 2, pp. 279–302, Apr. 2005.
 N. C. K. Choy and S. J. E. Wilton, “Activity-based power estimation
and characterization of DSP and multiplier blocks in FPGAs,” in Proc.
IEEE Int. Conf. Field Programmable Technol., Bangkok, Thailand,
Dec. 2006, pp. 253–256.
 J. Lamoureux and S. J. E. Wilton, “Activity estimation for ﬁeld pro-
grammable gate arrays,” in Proc. IEEE Int. Conf. Field Programmable
Logic Appl., Madrid, Spain, Aug. 2006, pp. 1–8.
 F. N. Najm, “A survey of power estimation techniques in VLSI cir-
cuits,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 2, no. 4,
pp. 446–455, Dec. 1994.
 C. Lee, M. Potkonjak, and W. H. Mangione-Smith, “MediaBench: A
tool for evaluating and synthesizing multimedia and communications
systems,” in Proc. 30th Int. Symp. Microarchitecture, Research Tri-
angle Park, NC, Dec. 1997, pp. 330–335.
“Creating High-Speed Data Path Components—Application Note,” Synopsys Corporation, Mountain View, CA.