Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication
Peipei Zhou, Hyunseok Park, Zhenman Fang, Jason Cong, André DeHon
UCLA, Dept. of Computer Science, Los Angeles, CA 90095
University of Pennsylvania, Dept. of ESE, 200 S. 33rd Street, Philadelphia, PA 19104
Email: {memoryzpp,parkhyu,zhenman,cong}@cs.ucla.edu, andre@ieee.org
Abstract—Customized pipeline designs that minimize the
pipeline initiation interval (II) maximize the throughput of
FPGA accelerators designed with high-level synthesis (HLS).
What is the impact of minimizing II on energy efficiency? Using
a matrix-multiply accelerator, we show that matrix multiplies
with II>1 can sometimes reduce dynamic energy below II=1
due to interconnect savings, but II=1 always achieves energy
close to the minimum. We also identify sources of inefficient
mapping in the commercial tool flow.
1. Introduction
To meet the ever-increasing demand for high computation performance and energy efficiency, numerous commodity acceleration platforms have been proposed and developed, including the well-known many integrated cores (MICs, or Intel Xeon Phi processors), graphical processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs) [1, 2, 3]. The utilization wall [4] has stimulated interest in FPGAs, since FPGAs provide both low power and customization capability to accelerate different applications (more flexible than ASICs).
Compared to the general-purpose MICs and GPUs, FPGAs allow designers to look beyond parallelization and customize accelerators. The customized pipeline design has been one of the most successful and widely used optimizations to improve the performance of FPGA accelerators [5, 6, 7]. At the same time, the recent success of commercial HLS tools like Xilinx Vivado HLS [8] has made design space exploration for a customized pipeline easier compared to conventional register transfer level (RTL) designs.
Among various tunable parameters in such pipeline customization, the pipeline initiation interval (II), defined as the number of cycles between two consecutive pipeline iterations [9], is one of the most important customization parameters, since it reflects the throughput of the pipeline design and has been widely studied (e.g., [5, 6, 7]). Most prior studies, except [5], have focused on minimizing the pipeline II so as to maximize the throughput of FPGA accelerators. Meanwhile, there are also examples [6] indicating that a smaller pipeline II can reduce the energy consumption of FPGA logic gates. Motivated by these studies, this paper begins to explore a key question: Does a customized pipeline with the minimum II always minimize energy? If not, how does the pipeline II affect the energy consumption?
To get initial insight into this question, we focus on the classical matrix-multiplication algorithm specified in C for HLS. We build an analytical model of the energy consumption for the kernel as a function of matrix dimension, N, and pipeline II, including the effects of computational, interconnect, memory, and leakage energy. This allows us to identify how the energy should scale with problem size and II. We synthesize the HLS kernel with Vivado HLS and fit constants within the analytical model. Along the way, we identify sources of inefficiency in the commercial tool flow that can cause the HLS solution to diverge from the ideal scaling for the matrix-multiply kernel, which include:
1. Missing opportunities for register sharing.
2. Missing opportunities for address generator sharing.
3. Lack of power-gating for unused memory banks.
We find that the logic component of energy remains flat, while the memory and leakage components increase with II, but interconnect energy can decrease with increasing II. The interconnect savings are large enough that we can identify cases where II > 1 minimizes energy. Nonetheless, the increasing energy term is modest, such that the II = 1 case is always within a few percent of the minimum for matrix multiplies that can fit on a single FPGA today. The energy framework identified here should translate to other HLS kernels, but those will have different compute and interconnect scaling that should be characterized and better understood in future work.
2. Related Work in Energy Modeling
FPGA energy models have been widely used to provide guidance in design space exploration. A recent study [10] developed analytical models to characterize the energy consumption of designs ranging from a sequential design (processor) to a spatial design (FPGA), using Rent's rule [11] as a modeling tool. Earlier works [12, 13, 14] employed models to provide in-depth analysis of FPGA power decomposition and the impact of look-up table (LUT) size, cluster size, and segment lengths on power consumption. Recent work introduced FPGA memory models to analyze the effect of the memory architecture (including block size, banking, and physical spacing) and parallelism on an application's energy efficiency [15]. While these works present detailed energy models, they do not directly address the microarchitectural structure that results from tuning the II in HLS designs. To the best of our knowledge, this work is the first to model the impact of the pipeline II on the energy consumption of FPGA accelerators from a high-level perspective.
3. Matrix-Multiplication Kernel
void matrix_multiply(float a[N][N], float b[N][N], float c[N][N]) {
    int i, j, k, p;
    k_loop: for (k = 0; k < N; k++) {
        i_loop: for (i = 0; i < N; i++) {
            // i_loop PIPELINE II = II_i
            p_loop: for (p = 0; p < N; p += N / II_i) {
#pragma HLS PIPELINE II=1
                j_loop: for (j = 0; j < N / II_i; j++) {
#pragma HLS UNROLL
                    c[i][p + j] += a[i][k] * b[k][p + j];
                }
            }
        }
    }
}
Listing 1: Pseudo code of square matrix multiplication
Figure 1: Architecture for N = 6, II = 1
Figure 2: Architecture for N = 6, II = N = 6
To make the model easier to understand, we use square N×N matrix multiplication as an example to demonstrate the energy model for mapping applications with perfectly shareable processing elements (PEs) onto a commodity FPGA. As shown in Listing 1, to pipeline the i_loop with a specific II (e.g., II_i = 1), we change the increment value of the p index in the p_loop and apply the HLS PIPELINE pragma. Consequently, the inner-most j_loop will be unrolled automatically (the HLS UNROLL pragma is shown to illustrate this unrolling but is not needed explicitly). In the j_loop, N/II_i elements within one row of the c matrix, c[i][p], c[i][p+1], ..., c[i][p+N/II_i−1], are updated using a[i][k] and N/II_i elements within one row of the b matrix, b[k][p], b[k][p+1], ..., b[k][p+N/II_i−1]. The PIPELINE II of the i_loop (i.e., II_i; for simplicity, we will write II for II_i in the rest of the paper) determines the throughput, resource utilization, and dynamic energy of this matrix-multiplication kernel. Please note that the most straightforward matrix-multiply algorithm would not have had separate p_loop and j_loop. However, when we initially used that description, Vivado HLS did not share the local registers optimally when we increased II. Fig. 1 and Fig. 2 show the architecture for N = 6, II = 1 and II = N = 6, respectively, when registers are perfectly shared.
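For readers who want to sanity-check the loop restructuring, the following plain-C sketch (outside of HLS, with a hypothetical compile-time constant II_I standing in for II_i) compares the restructured loop nest of Listing 1 against the canonical triple loop; it is an illustrative check, not part of the synthesized design.

#include <math.h>
#include <stdio.h>

#define N    6
#define II_I 2   /* hypothetical II_i for illustration; must divide N */

/* Restructured loop nest from Listing 1 (HLS pragmas omitted). */
static void matrix_multiply(float a[N][N], float b[N][N], float c[N][N]) {
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int p = 0; p < N; p += N / II_I)     /* p_loop: II_I iterations */
                for (int j = 0; j < N / II_I; j++)    /* j_loop: unrolled by HLS */
                    c[i][p + j] += a[i][k] * b[k][p + j];
}

/* Canonical triple loop, used as a reference. */
static void matrix_multiply_ref(float a[N][N], float b[N][N], float c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

int main(void) {
    float a[N][N], b[N][N], c0[N][N] = {{0}}, c1[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (float)(i + j);
            b[i][j] = (float)(i - j);
        }
    matrix_multiply(a, b, c0);
    matrix_multiply_ref(a, b, c1);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (fabsf(c0[i][j] - c1[i][j]) > 1e-3f) { puts("mismatch"); return 1; }
    puts("restructured and reference loops agree");
    return 0;
}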
The resources and cycles to finish the matrix-multiplication kernel can be generalized in terms of the problem size N and the PIPELINE II of the i_loop:
1. There are N/II multiplier(s) and N/II adder(s), and each computes II elements within one row. The number of B or C input/output registers and temporary registers between multipliers and adders is N/II.
2. There are N/II independent memory bank(s) for the b matrix, each with II column(s). The same holds for the c matrix. Only one memory bank is needed for the a matrix, since a[i][k] is shared within one i_loop iteration.
3. The number of cycles to finish the kernel is N^2 × II.
TABLE 1: HLS-reported resource usage for multiplier and adder under different IIs, N = 24

II    1     2     3     4     6     8     12    24
DSP   120   60    40    30    20    15    10    5
FF    8520  4260  2840  2130  1420  1065  710   355
LUT   8376  4188  2792  2094  1396  1047  698   349
4. Energy Model
4.1. Computation Energy
The computation energy includes arithmetic energy (multiply-add operations) and register energy for holding the inputs, outputs, and temporaries. We can perfectly share these PEs and registers by a factor of II, making the total PE and register energy consumption:

$E_{compute} \propto \frac{N}{II} \times N^2 \times II = N^3$    (1)

For Xilinx 7 Series FPGAs, each multiply-add needs three DSP48E [16] for the floating-point multiplier and two DSP48E for the adder, along with a fixed number of LUTs and FFs. Table 1 shows the resource usage for multipliers and adders decreasing as 1/II, since PEs are perfectly shared.
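As a quick check of this scaling, the sketch below tabulates the resource and cycle counts implied by the model for N = 24, taking 5 DSP48E per multiply-add PE as stated above and reading the per-PE FF/LUT constants off the II = 24 (single-PE) column of Table 1; it reproduces Table 1 from the 1/II scaling rather than reporting new tool results.

#include <stdio.h>

/* Per-PE costs: 3 DSP48E (multiplier) + 2 DSP48E (adder) as stated above;
 * FF/LUT per PE taken from the II = 24 (single-PE) column of Table 1. */
#define DSP_PER_PE 5
#define FF_PER_PE  355
#define LUT_PER_PE 349

int main(void) {
    const int N = 24;
    const int iis[] = {1, 2, 3, 4, 6, 8, 12, 24};
    printf("II  PEs  DSP   FF    LUT   cycles\n");
    for (unsigned t = 0; t < sizeof(iis) / sizeof(iis[0]); t++) {
        int ii = iis[t];
        int pes = N / ii;                 /* N/II multipliers and adders */
        long cycles = (long)N * N * ii;   /* N^2 * II cycles for the kernel */
        printf("%-3d %-4d %-5d %-5d %-5d %ld\n",
               ii, pes, pes * DSP_PER_PE, pes * FF_PER_PE, pes * LUT_PER_PE, cycles);
    }
    return 0;
}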
4.2. Memory Energy
In order to fully pipeline the matrix multiplication, each PE needs to access each column of the b and c matrices simultaneously. HLS provides comprehensive partition pragmas [8] to easily partition an array into individual memory banks. For example, we use the complete partition pragma to partition b along the column direction; each column, b[..][N], becomes an individual memory block.

As II increases, the number of simultaneous accesses to the b matrix decreases, which means more columns can be placed in the same memory bank of size N×II. In this design, the cyclic partition pragma is applied to the b and c matrices to automatically split the memory along the column direction into N/II equally sized blocks that interleave the original array.
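A minimal sketch of how this banking could be expressed in Vivado HLS is shown below, assuming a hypothetical design point of N = 64 and II_i = 4 (so the cyclic factor is N/II_i = 16); the exact pragma spelling should be checked against the tool version in use.

#define N    64
#define II_I 4   /* hypothetical II_i; the cyclic factor below is N/II_I = 16 */

void matrix_multiply(float a[N][N], float b[N][N], float c[N][N]) {
    /* One bank suffices for a: a[i][k] is shared by all PEs in an iteration.
     * b and c: split the column dimension into N/II_I interleaved banks,
     * each holding II_I columns. */
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=16 dim=2
#pragma HLS ARRAY_PARTITION variable=c cyclic factor=16 dim=2
    k_loop: for (int k = 0; k < N; k++) {
        i_loop: for (int i = 0; i < N; i++) {
            p_loop: for (int p = 0; p < N; p += N / II_I) {
#pragma HLS PIPELINE II=1
                j_loop: for (int j = 0; j < N / II_I; j++) {
                    c[i][p + j] += a[i][k] * b[k][p + j];
                }
            }
        }
    }
}

With this cyclic factor, the N/II_i consecutive columns accessed in one j_loop iteration land in distinct banks, matching the banking described above.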
In general, there are N/II b (or c) banks; within each, II columns of data are stored. We need to consider the total memory energy when accessing these banks. In total, there are N^3 b-memory reads, N^3 c-memory reads and writes, and N^2 a-memory reads, which we can safely ignore when N is large enough. Each memory access is from a logical memory bank of size N×II. On Xilinx 7 Series FPGAs, all the logical memory banks are constructed using the embedded BRAM18K memory banks on the chip [17].

If we activate a single BRAM for each read within a bank, the total energy for reading from BRAMs is constant at:

$E_{mem} \propto \frac{N}{II} \times N^2 \times II = N^3$    (2)

When the logical memory bank size is larger than the physical BRAM18K bank size, the bank must be constructed using multiple BRAM18K banks, and the area of each logical memory bank increases. This impacts the wiring, as we see in the next section.
4.3. Interconnect Wire Energy
The wire energy can be decomposed into wires within PEs and wires connecting PEs and memory. Wiring within the PE is fixed and will scale with the compute energy:

$E_{wire.in.pe} \propto \frac{N}{II} \times N^2 \times II = N^3$    (3)
Figure 3: Routing of broadcasting a[i][k] to all 24 multipliers, N = 24, II = 1
Wire transferring broadcast data. In this matrix-multiply algorithm, broadcast wires must transfer a[i][k] from the memory bank storing the a matrix to the multipliers, as shown in Fig. 3. The BRAM blocks storing the a matrix are close to the input register, Ain, and close to one multiplier. This broadcast should take energy proportional to the total area of all the PEs it is feeding.¹ As II increases, the total PE area scales as N/II, until we can no longer fit N×II elements of the b matrix into a single BRAM. Thus the total energy for broadcasting a[i][k] is:

$E_{wire.share.A} \propto \frac{N}{II} \times N^2 = \frac{N^3}{II}$    (4)

After N×II > BRAM18K, this scaling changes in interesting ways. At this point, the total layout area for all the PEs becomes dictated by BRAMs, not DSPs, and the area does not change with II. If we must broadcast to all the BRAMs, this means the broadcast energy does not shrink with II:

$E_{wire.share.A} \propto N^2 \times N^2 = N^4$    (5)

However, we really only need to broadcast to a few BRAMs within a PE, allowing the broadcast energy to continue to shrink with increasing II. Current synthesis tools do not exploit this opportunity.
Wire transferring private data. When the logical memory bank size N×II is smaller than the size of the physical BRAM18K bank (18432 bits, or 576 floating-point numbers), the wiring between the private c and b matrix memory banks and the PE logic is constant, so the total wiring energy also scales proportionally, independent of II:

$E_{wire.priv.B,C} \propto \frac{N}{II} \times N^2 \times II = N^3$    (6)

When the logical memory bank size is larger than the physical BRAM18K bank size, the wiring between the private b and c memory banks and the PE logic grows as the square root of the memory capacity, N×II. Thus, the total energy for routing memory is:

$E_{wire.priv.B,C} \propto \frac{N}{II} \times N^2 \times II \times \sqrt{N \times II} = N^{3.5} II^{0.5}$    (7)
¹ [18] shows the H-tree layout has linear layout area, which implies linear wire length in the area, which in turn implies linear energy.
4.4. Leakage
During the computation, the FPGA will also leak energy proportional to the time for the computation and the resources that leak during the computation. If we put nothing else on the FPGA and use a fixed-size FPGA that does not offer any power gating for unused components, leakage increases with runtime and hence with II:

$E_{leak} \propto N^2 \times II \times E_{FPGA\,leak}$    (8)
However, if we use a design with perfect power gating of unused components, the leakage should scale with the utilized logic. For the case where N×II < BRAM18K:

$E_{leak} \propto \frac{N}{II} \times N^2 \times II = N^3$    (9)
If we exploit the smaller resources of the II > 1 designs to use a smaller FPGA, we can get some of the effects of Eq. 9. Similarly, if we exploit the smaller resource utilization of the II > 1 designs to put additional logic onto the FPGA that fills the resources unused by the matrix multiply, the leakage attributable to the multiply should scale closer to Eq. 9.
4.5. Total Energy
Putting all the energy components together and assuming perfect power gating (Eq. 9), we have the total energy as follows:

$E_{total} = E_{compute} + E_{memory} + E_{wire} + E_{leak} = \begin{cases} N^3\,(c_1 + c_2/II), & \text{if } N \times II \le \mathrm{BRAM18K} \\ N^3\,(c_3 + c_4 \times N + c_5 \times II^{0.5}), & \text{if } N \times II > \mathrm{BRAM18K} \end{cases}$    (10)
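To make the use of Eq. 10 concrete, the sketch below evaluates the piecewise model over the divisors of N and reports the minimizing II; the constants c1–c5 here are illustrative placeholders, not the fitted values obtained in Section 5.

#include <stdio.h>
#include <math.h>

#define BRAM18K_WORDS 576.0   /* 18432 bits / 32-bit floats, as in Section 4.3 */

/* Illustrative placeholder constants; the real c1..c5 come from the
 * regression fit against Vivado-mapped designs in Section 5. */
static const double c1 = 1.0, c2 = 0.6, c3 = 1.0, c4 = 0.004, c5 = 0.02;

/* Eq. 10: total energy (arbitrary units) for problem size n and pipeline ii. */
static double total_energy(double n, double ii) {
    double n3 = n * n * n;
    if (n * ii <= BRAM18K_WORDS)
        return n3 * (c1 + c2 / ii);
    return n3 * (c3 + c4 * n + c5 * sqrt(ii));
}

int main(void) {
    const int n = 64;
    int best_ii = 1;
    double best_e = total_energy(n, 1);
    for (int ii = 1; ii <= n; ii++) {
        if (n % ii) continue;             /* II must divide N in this kernel */
        double e = total_energy(n, ii);
        printf("II=%-3d  E=%.3e%s\n", ii, e,
               n * ii > BRAM18K_WORDS ? "  (multi-BRAM banks)" : "");
        if (e < best_e) { best_e = e; best_ii = ii; }
    }
    printf("minimum-energy II = %d\n", best_ii);
    return 0;
}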
5. Results
For small N, when the design is not memory dominated, we can expect to see decreasing energy until N×II = BRAM18K, driven by broadcast wiring energy. Beyond that, we expect to see energy increase with II due to wiring energy between BRAMs and the computation within a PE.

We mapped the HLS designs to a Virtex-7 XC7VX485T using Vivado 2015.1.5. We simulated each mapped design in Vivado with random a and b matrices. We then used the Switching Activity Interchange Format (SAIF) file generated from post-implementation simulation to estimate the energy required by the mapped designs. From the mapped designs, we used a linear regression fit to determine the constants c1–c5 in Eq. 10.
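As an illustration of this fitting step (not the actual tool flow), the sketch below fits c1 and c2 of the N×II ≤ BRAM18K branch of Eq. 10 by ordinary least squares on hypothetical (II, energy/N^3) samples, using the closed-form solution of the 2×2 normal equations.

#include <stdio.h>

/* Fit E/N^3 = c1 + c2/II by least squares: regress y on x = 1/II. */
int main(void) {
    /* Hypothetical measurements (II, energy normalized by N^3). */
    const double ii[] = {1, 2, 4, 8};
    const double y[]  = {1.60, 1.31, 1.16, 1.08};
    const int m = 4;

    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int k = 0; k < m; k++) {
        double x = 1.0 / ii[k];
        sx += x; sy += y[k]; sxx += x * x; sxy += x * y[k];
    }
    /* Normal equations for y = c1 + c2 * x. */
    double det = m * sxx - sx * sx;
    double c2 = (m * sxy - sx * sy) / det;
    double c1 = (sy - c2 * sx) / m;
    printf("fitted c1 = %.3f, c2 = %.3f\n", c1, c2);
    return 0;
}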
Fig. 4 shows how the energy components scale with II for the Vivado-mapped designs, along with the total energy from our fitted model. We can see the mostly flat DSP and logic energies that match the analytic description. We also see that the interconnect energy and the overall energy drop with increasing II up to II = 16, where N×II = BRAM18K. We see the interconnect energy grow after that, as expected. However, we also see that BRAM energy, rather than remaining flat, increases with II after II = 16. Here, the Vivado-mapped designs are unnecessarily activating all of the BRAMs, not just the BRAM that holds the data needed on each cycle. This makes the total design energy unreasonably high for large II. It should be possible to avoid activating the unused BRAMs, as illustrated in [15, 19]. Making the perfect power-gating assumption, we see that energy is minimized at II = 8. However, the effect is small, and the benefit over II = 1 is less than 3%. If we get less than perfect power gating, this effect will easily be dominated by an increase in leakage energy with II.

Figure 4: Energy Scaling with II for N = 64 Matrix Multiply

Figure 5: Scaling with N for Matrix Multiply
Fig. 5 shows how energy scales with N, including how this affects the optimal II, and our model fit. The II for the minimum-energy point decreases with N, since larger N means the single-BRAM capacity is reached at a lower II. Since all energy components scale as N^3 in the region where N×II ≤ BRAM18K, the energy proportions remain the same as N grows.
Note that all the PEs are generating the same addresses for their local b and c memories. Consequently, the design can use a single address generator for all the BRAMs. However, for some values of N and II, Vivado HLS will not share the address generators, resulting in much larger logic energy at those design points. Also, as we pointed out earlier, without adding the p_loop in Listing 1, the HLS tool fails to share the registers properly.
6. Conclusion and Future Work
Interconnect energy within our matrix-multiply kernel is
minimized for an II that is typically greater than one. With
efficient power gating or alternate use of chip resources, this
can lead to minimum total energy at a point other than the
fully pipelined, II=1 point. Nonetheless, the effect is small
and the fully pipelined design often uses the least energy in
practice, both due to leakage and other discrete and non-ideal
scaling effects.
The energy modeling framework illustrated here should be adaptable to other kernels. However, there is good reason to believe that kernels will differ in how they scale in key areas. We expect interconnect energy to scale differently for other tasks, or even for different implementations of the same task. For example, using the systolic-array implementation of matrix multiply [20], one may see different scaling. Our matrix-multiply kernel had near-perfect sharing of logic as II increased, which will not be the case for less regular tasks. Consequently, it will be useful to characterize how these components scale for other tasks and develop a suitably parameterized energy model that can be adapted to various task characteristics. Ultimately, we hope model generation can be automated to provide high-level guidance for designers. As illustrated here, these models may also help to identify inefficiencies in current mapping tools that should be addressed to achieve energy-efficient designs.
Acknowledgments
This work is partially supported by the Center for
Domain-Specific Computing under the Intel Award 20134321
and NSF Award CCF-1436827. It is also supported in part by
C-FAR, one of the six centers of STARnet, a Semiconductor
Research Corporation program sponsored by MARCO and
DARPA.
References
[1] A. Duran and M. Klemm, “The Intel many integrated core architecture,” in HPCS, July 2012, pp. 365–366.
[2] S. Che et al., “Accelerating compute-intensive applications with GPUs and FPGAs,” in SASP, June 2008, pp. 101–107.
[3] H. K. Phoon et al., “A highly compatible architecture design for optimum FPGA to structured-ASIC migration,” in ICSE, Oct 2006, pp. 506–510.
[4] M. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in DAC, June 2012, pp. 1131–1136.
[5] P. Li et al., “Resource-aware throughput optimization for high-level synthesis,” in FPGA, 2015, pp. 200–209.
[6] J. Cong et al., “A fully pipelined and dynamically composable architecture of CGRA,” in FCCM, May 2014, pp. 9–16.
[7] J. Cong et al., “Automatic memory partitioning and scheduling for throughput and power optimization,” TODAES, vol. 16, no. 2, pp. 15:1–15:25, Apr 2011.
[8] Xilinx, “Vivado High-Level Synthesis.” [Online]. Available: http://www.xilinx.com/products/design-tools/vivado/integration/esl-design/index.htm
[9] M. Lam, “Software pipelining: An effective scheduling technique for VLIW machines,” in PLDI, 1988, pp. 318–328.
[10] A. DeHon, “Fundamental underpinnings of reconfigurable computing architectures,” Proc. IEEE, vol. 103, no. 3, pp. 355–378, March 2015.
[11] B. Landman and R. L. Russo, “On a pin versus block relationship for partitions of logic graphs,” TC, vol. C-20, no. 12, pp. 1469–1479, Dec 1971.
[12] K. Poon et al., “A detailed power model for field-programmable gate arrays,” TODAES, vol. 10, no. 2, pp. 279–302, Apr 2005.
[13] F. Li et al., “Power modeling and characteristics of field programmable gate arrays,” TCAD, vol. 24, no. 11, pp. 1712–1724, Nov 2005.
[14] S. Rajavel and A. Akoglu, “An analytical energy model to accelerate FPGA logic architecture investigation,” in ICFPT, Dec 2011, pp. 1–8.
[15] E. Kadric et al., “Impact of memory architecture on FPGA energy consumption,” in FPGA, 2015, pp. 146–155.
[16] Xilinx 7 Series DSP48E1 Slice User Guide.
[17] Xilinx 7 Series FPGAs Memory Resources.
[18] C. E. Leiserson, “Area efficient graph layouts (for VLSI),” in FOCS, 1980, pp. 270–281.
[19] R. Tessier et al., “Power-efficient RAM mapping algorithms for FPGA embedded memory blocks,” TCAD, vol. 26, no. 2, pp. 278–290, Feb 2007.
[20] J.-W. Jang, S. Choi, and V. K. Prasanna, “Energy-efficient matrix multiplication on FPGAs,” TVLSI, vol. 13, no. 11, pp. 1305–1319, November 2005.