DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based
Text Generation
Seongmin Hong
seongminhong@kaist.ac.kr
KAIST
Daejeon, South Korea
Seungjae Moon
sjaemoon@kaist.ac.kr
KAIST
Daejeon, South Korea
Junsoo Kim
junsoo999@kaist.ac.kr
KAIST
Daejeon, South Korea
Sungjae Lee
sung-jae.lee@navercorp.com
NAVER CLOVA
Seongnam, South Korea
Minsub Kim
minssub.kim@navercorp.com
NAVER CLOVA
Seongnam, South Korea
Dongsoo Lee
dongsoo.lee@navercorp.com
NAVER CLOVA
Seongnam, South Korea
Joo-Young Kim
jooyoung1203@kaist.ac.kr
KAIST
Daejeon, South Korea
Abstract—Transformer is a deep learning language model
widely used for natural language processing (NLP) services
in datacenters. Among transformer models, Generative Pre-
trained Transformer (GPT) has achieved remarkable perfor-
mance in text generation, or natural language generation
(NLG), which needs the processing of a large input context in
the summarization stage, followed by the generation stage that
produces a single word at a time. Conventional platforms
such as the GPU are specialized for the parallel processing of
large inputs in the summarization stage, but their performance
significantly degrades in the generation stage due to its sequen-
tial characteristic. Therefore, an efficient hardware platform is
required to address the high latency caused by the sequential
characteristic of text generation.
In this paper, we present DFX, a multi-FPGA acceleration
appliance that executes GPT-2 model inference end-to-end with
low latency and high throughput in both summarization and
generation stages. DFX uses model parallelism and optimized
dataflow that is model-and-hardware-aware for fast simulta-
neous workload execution among devices. Its compute cores
operate on custom instructions and provide GPT-2 operations
end-to-end. We implement the proposed hardware architecture
on four Xilinx Alveo U280 FPGAs and utilize all of the channels
of the high bandwidth memory (HBM) and the maximum
number of compute resources for high hardware efficiency.
DFX achieves 5.58× speedup and 3.99× energy efficiency over
four NVIDIA V100 GPUs on the modern GPT-2 model. DFX
is also 8.21× more cost-effective than the GPU appliance,
suggesting that it is a promising solution for text generation
workloads in cloud datacenters.
Keywords-Natural Language Processing; GPT; Text Genera-
tion; Datacenter; Multi-FPGA Acceleration; Model Parallelism
I. INTRODUCTION
Transformer [1] is a deep learning language model that
uses the mechanism of attention, which gives a different
weight of significance to each part of the input data. By
solving the recursion and lack of global dependency problem
of recurrent neural network (RNN) [2] and long short-term
memory (LSTM) [3], the transformer is becoming the de facto
standard for natural language processing (NLP) applications
such as text generation [4], [5], text classification [6], [7], and
machine translation [8], [9]. Among them, text generation,
broadly referred to as natural language generation (NLG),
is related to the automatic generation of human-readable
text by a computer. It has become of great importance in
emerging applications such as dialogue system [10], [11],
[12] and topic-to-essay generation [13], [14], [15], with a
rapid growth rate of 20% [16] in the NLG market. Among
transformer models, the Generative Pre-trained Transformer
(GPT) is widely used in cloud services, achieving remarkable
performance particularly in text generation applications.

Figure 1. Illustration of transformer-based text generation.
In the text generation process, consisting of the summariza-
tion and generation stages, the language model continuously
generates sequential output words (i.e., output tokens) using
the input context made of multiple input words (i.e., input
tokens), as shown in Figure 1. In the summarization stage,
the language model processes a batch of input tokens with a
single run and generates a new output token. The generation
stage iterates the language model processing to generate the
subsequent output tokens, in which each iteration takes the
single output token from the previous iteration as input to
generate a single output token. Meanwhile, the language
model accumulates the contextual features throughout the
iterations. In current server platforms, GPU [17] is used to
accelerate text generation. Its massively parallel compute
units yield high performance in the summarization stage as
the input tokens can be computed simultaneously. However, a
significant performance degradation occurs in the generation
stage because GPU is not suitable for sequential processing,
suffering from severe underutilization.
Several architectures [18], [19], [20], [21] have been pro-
posed to accelerate the transformer. The attention mechanism
[1], composed of matrix multiplication and softmax for
contextual understanding, has been their primary operation of
concern because it is the most computationally intensive task
in the transformer. However, a language service requires an
architecture that considers the entirety of the transformer
model. For datacenters to adopt the above accelerator
architectures, the server platforms would need CPU or extra
compute modules to cover the complete operations, which
would lead to large processing overhead. Therefore, a unified
and programmable architecture that can support the whole
GPT operations end-to-end is necessary.
In this paper, we propose DFX, a multi-FPGA acceleration
appliance that specializes in text generation workloads
covering end-to-end inference of variously sized GPT models.
To address the sequential characteristic of text generation,
the DFX compute core is optimized for single-token processing,
which is impractical on GPU. It also uses an efficient tiling
scheme and dataflow based on the characteristics of GPT for
maximum high bandwidth memory (HBM) [22] bandwidth
usage. To address the increasing model size, DFX uses
model parallelism on the multi-device system to increase the
physical number of compute cores that work in parallel while
evenly assigning full workload to each device. Furthermore,
we exploit FPGAs because the transformer-based model
continues to undergo modifications and expansions for
different language services in the datacenter. The FPGA-
based accelerator provides fully reprogrammable hardware to
support new operations and larger dimensions of the evolving
transformer with minimum cost for redesign when compared
to an ASIC-based accelerator.
The main contributions of our work are as follows.
• We identify that the generation stage of the text generation workload is the bottleneck on parallel hardware such as GPU due to its sequential characteristic.
• We design a custom programmable compute core optimized for the end-to-end acceleration of GPT inference with high hardware utilization.
• We utilize the full HBM bandwidth complemented with an efficient tiling scheme and dataflow based on the characteristics of GPT to achieve low latency and high throughput.
• We apply model parallelism and an efficient network to the multi-FPGA system by evenly distributing the model parameters to each FPGA in a way that requires minimal data synchronization among FPGAs and achieves maximal parallel computation.
• We build a multi-FPGA appliance with low upfront and operating cost that accelerates transformer-based language services, achieving multiple times better performance and efficiency than the conventional GPU platform. We believe this new hardware platform is promising for handling ever-increasing text generation workloads in datacenters.

Figure 2. GPT-2 structure and illustration of summarization and generation stages in text generation.
II. BACKGROUND
A. GPT Language Model
GPT is based on the transformer [1] structure that achieves
the best accuracy in NLP. The primitive transformer has two
parts, encoder and decoder, to process the input and output
sequence, respectively. However, GPT includes only the
decoder because it focuses on generating text (i.e., creating
the next word sequence) by looking at the given context. GPT
is able to remove the encoder by using an alternate method
called token embedding, a process that uses pre-trained
matrices in place of the encoder. Furthermore, the model
size of GPT and its decoder layer is constantly increasing
with more parameters and operations to gain better accuracy
and sophistication in its token generation. Recently, OpenAI
announced the latest GPT model, GPT-3, but the model itself
is not available in the public domain [23]. In this paper, we
base our work on the publicly available GPT-2 model. Note that
our hardware acceleration strategies for GPT-2 are applicable
to GPT-3 because it has the same model structure but with
a larger size. Figure 2 shows GPT-2's model structure
and the GPT-2-based text generation workload.
GPT-2 Structure
The token embedding, located at the
beginning of the decoder, is responsible for converting an
input word(s) into an embedding vector. The input word is
converted to the numeric token ID based on a dictionary.
Then, the pre-trained matrices, word token embedding (WTE)
and word position embedding (WPE), are indexed with the
token ID to obtain the corresponding vectors. WTE contains
token-related encoding, and WPE contains position-related
encoding. The two vectors are added to get the embedding
vector. LM head, located at the end of the decoder, has the
opposite role to the token embedding. It converts the output
embedding vector into the token ID. This process requires a
matrix multiplication with the transpose of WTE and selects
the token ID with the highest probability value by applying
the softmax function. The selected token ID represents the
generated word.
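For concreteness, the two conversions can be sketched in a few lines of NumPy (randomly initialized WTE and WPE and illustrative dimensions; the real matrices are pre-trained, and this is not the actual DFX implementation):

```python
import numpy as np

vocab, max_pos, emb = 50257, 1024, 1600                  # illustrative GPT-2-like sizes
WTE = np.random.randn(vocab, emb).astype(np.float16)     # word token embedding
WPE = np.random.randn(max_pos, emb).astype(np.float16)   # word position embedding

def token_embedding(token_id, position):
    # Index WTE and WPE with the token ID / position and add the two vectors.
    return WTE[token_id] + WPE[position]

def lm_head(out_emb):
    # Multiply the output embedding vector by WTE^T, apply softmax, and select
    # the token ID with the highest probability.
    logits = (out_emb @ WTE.T).astype(np.float32)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs))
```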
GPT-2 has N decoder layers between token embedding and LM head, in which N is determined by the model size. As shown in Figure 2, one decoder layer is largely divided into four operations: self-attention, feed-forward network, layer normalization, and residual. Self-attention, a type of attention for the decoder, is a key part of the transformer [1]; it creates the Query, Key, and Value matrices to obtain the attention matrix. Query is related to the current given word, while Key and Value represent the flow of the entire context. GPT-2 uses multi-head attention [24], a method of dividing attention weights into H columns to execute H independent matrix operations in parallel. H represents the number of attention heads, and this hyperparameter is increasing with the model size. Another
significant operation is the feed-forward network, which is
commonly used in deep neural networks. It is made up
of two fully connected (FC) layers and Gaussian Error
Linear Unit (GELU) activation function [25]. Lastly, the
layer normalization and residual, well-known techniques in
previous works [26], [27], are placed around self-attention
and feed-forward network to fine-tune the large model.
To generate tokens with the given context, GPT-2 contains
a summarization and generation stage. The summarization
stage takes the entire context as input, so the decoder’s input
dimension after the token embedding is n × emb, in which n is the length of the context in tokens and emb is the length of the embedding vector; for reference, emb = 1600 for the 1.5B model. The embedding vector is fed to the decoder, which involves a series of matrix multiplications with weights of size emb × emb or larger, to produce an output matrix with the same initial dimension of n × emb. Only the last row of the output matrix is processed in LM head, and the first subsequent token is generated. The Key and Value matrices that represent the context are also created in the summarization stage. In the generation stage, the previously generated token enters the decoder, so the input dimension is 1 × emb. Since the tokens generated are determined by the previous context, the generation stage updates the Key and Value matrices by appending a row with each new input's context. For instance, if "Hello, my name" is the context with an input token length of 4, 4 × emb Key and Value matrices are formed, and the first output token that represents the word "is" is generated in the summarization stage. If the requested output token length is 3, the output token enters two iterations of the generation stage, and the Key and Value matrices increase their row dimension by 1 after each iteration. From the decoder, the next tokens that represent "James" and "." are outputted sequentially. Finally, the generated tokens are detokenized altogether to form the sentence "Hello, my name is James."
The length of the given context and the length of the output
words affect the amount of computation in the summarization
and generation stage, respectively, so the time spent on either
stage varies for different workloads.
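The two stages can be summarized with the following sketch, where model, decode, and detokenize are placeholder helpers standing in for the operations described above (not an actual API):

```python
def generate_text(model, input_tokens, num_output_tokens):
    # Summarization stage: process the whole n x emb context in one run.
    # The decoder builds the Key/Value matrices (one pair per layer) and only
    # the last row of its output goes through LM head to emit the first token.
    kv_cache = model.empty_kv_cache()
    token = model.decode(input_tokens, kv_cache)

    outputs = [token]
    # Generation stage: feed back one token at a time (1 x emb input);
    # each iteration appends one row to every layer's Key/Value matrices.
    for _ in range(num_output_tokens - 1):
        token = model.decode([token], kv_cache)
        outputs.append(token)
    return model.detokenize(outputs)
```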
GPT-2 Workload
Completing the processes above, GPT-
2 is able to generate words or sentences from the input
context. Some prominent text generation workloads include
dialogue system and topic-to-essay generation (e.g., chatbot
and article writing). Depending on the workload, the ratio
of context to generation varies. For instance, the chatbot
service has an average input request of 50 tokens and then
produces an output of 50 tokens, having a ratio of
1:1. In contrast, the article writing application from OpenAI
allows the user to input up to 50 tokens, then produces
up to 150 tokens, having a wide range of ratios from 50:1
to 1:150 [28]. In other less widely used applications for
GPT-2 like question-and-answer, the input context is much
longer to generate a few-word answer. In datacenters, GPT-2
is applied for language services that tend to require more
output tokens in the generation stages than the input tokens
in the summarization stage.
B. Parallelism in Deep Learning
To train or inference large NLP models, multiple com-
puting devices, usually GPUs, are used [29], [30]. Among
various ways to process a single model through multiple
workers, the following two parallelism schemes are mainly
used.
Data Parallelism
Data parallelism [31] is the method of
splitting the input batch across multiple workers. The workers
individually perform operations with their own batch data.
This method is suitable for training but not inference because
a single batch size, or non-batched input, is commonly used
for inference applications that involve dynamic user requests.
Model Parallelism
Model parallelism separates the model
parameters across multiple workers and processes them
simultaneously. It is beneficial for large models such as
GPT-2 [24] and BERT [32] because the size of the model
allocated to each worker is reduced. Two widely used model
parallelism schemes are pipelined parallelism [33] and intra-
layer parallelism [34]. In pipelined parallelism, only one
worker performs a group of operations of the model and
transfers its output to different workers that process other
operations of the model. The entire process is pipelined for
high throughput, but the latency cannot be reduced. In intra-
layer parallelism, the parallelizable operations (e.g., matrix
multiplication) can be divided across multiple devices, so the
execution time of each operation is significantly reduced. However,
synchronization may be required before certain operations
that require the entire output, so the performance is dependent
on the number of synchronizations and physical devices.

Figure 3. Latency of text generation with increased number of input tokens (leftward) and output tokens (rightward) on GPU for GPT-2 1.5B model.
III. MOTIVATION
A. Sequential Characteristic
As described in Section II-A, GPT-2's summarization and
generation stages have different computational properties. The
summarization stage needs multiple tokens simultaneously,
so processing multiple tokens in parallel is advantageous. On
the other hand, the generation stage is a rather sequential
process that processes a single token one by one. Thus, the
opportunity for parallel processing is low. In datacenters,
GPU is typically used to run the text generation applications.
GPU is ideal for processing large input tokens in the summa-
rization stage with its massively parallel compute units and
high memory bandwidth, but it shows significant performance
degradation in the generation stage. The sequential process of
the generation stage is not parallelizable, and the operations
are not compute-intensive enough to effectively utilize GPU’s
large compute units. As a result, significant underutilization
occurs in the GPU. Figure 3 shows that each additional
output token leads to a large increase in latency, 75.45 ms on
average, whereas each additional input token increases the
latency only by 0.02 ms on average for the GPU with the GPT-2
1.5B model. Since text generation workloads typically create
long output tokens, devising an architecture that speeds up
the generation stage while maintaining high throughput in
both stages is imperative.
We also examine the effect of batch size on the latency
and throughput of the GPT-2. On the application level, if the
datacenter chooses to batch the input of different users, the
latency increases with the batch size because of the time spent
gathering the input from different users [35]. Consequently,
current datacenters prefer to run the model without fully
gathering the input (i.e., non-batched input). In this case,
the utilization decreases linearly by the number of unfilled
inputs in the given batch, so an optimized datapath that can
run a single input token with low latency is required.
B. End-to-End Acceleration
Most previous works target the acceleration of only the
attention mechanism in transformers [18], [19], [36], and
few include the feed-forward network [20]. However, GPT-
2 contains additional processes: token embedding, layer
normalization, residual, and LM head.

Figure 4. GPT-2 latency and number of operations breakdown on GPU.

Previous works
disregard accelerating these additional processes because
the time spent on the attention mechanism outweighs other
processes. However, if these architectures are to run GPT-2
end-to-end on an existing application or service, the rest of
the processes would need to be completed on the host. As
the model repeats the decoder layer processing many times,
extensive data transactions between the host and accelerator
through the PCIe would easily become the bottleneck in the
system performance. We need an accelerator that supports
the entire GPT-2 process without missing any functions in
the middle for practical usage.
Moreover, GPT-2 contains operations that are suboptimal
for the GPU to complete. Figure 4 shows that the time spent
on layer normalization and residual is 22.8% for the
GPU even when the number of raw mathematical operations
that account for them is extremely low at 0.11%. This
breakdown demonstrates that low-level operations of the
layer normalization and residual are inefficient in the GPU.
Domain-specific accelerators allow for hardware design that
is specialized for these complex computations. Therefore, an
alternative accelerator optimized for all GPT-2 operations is
necessary.
C. Parallel Computing
GPT-2 requires massive computations with large model
parameters, which suggests the need for a parallelism scheme
that divides the large model into multiple nodes for parallel
computation. With the growing dimensions of the GPT-2
model, a single device is not sufficient because it lacks
both memory bandwidth and capacity to do the required
computations. The latency is also critical in text generation
workloads, so multiple devices must share the workload
to reduce the overall execution time for the given input
token. To address this shortcoming, a multi-device system that
adopts model parallelism and an efficient network is necessary to
maximize the amount of parallel computation with minimal
additional latency.
IV. DF X ARCHITECTURE
A. Architecture Overview
DFX architecture is designed to efficiently accelerate large
transformer-based language models based on FPGAs. As
shown in Figure 5, DFX is a server appliance architecture
that consists of dual-socket CPUs and multiple FPGAs. One
CPU and a homogeneous cluster of four FPGAs form a
system to compute an independent workload. Each FPGA
contains one compute core for a total of four cores per cluster.
Figure 5. Overall DFX appliance architecture. Left is the illustration of GPT-2 model parallelism. Center is the mapping of the partitioned models into the DFX appliance. Right is the accelerator's microarchitecture.
The FPGAs are connected to the host CPU via PCIe Gen
3 subsystem that transfers data at 16 GB/s. The FPGA-to-
FPGA communication is enabled by a QSFP transceiver
that transfers data at 100 Gb/s. As each FPGA is limited
to two QSFP ports, a ring network is chosen instead of
other network topologies that require more node-to-node
connections. Although we chose four FPGAs per cluster, it
is scalable within a server appliance with only consideration
of monetary cost.
B. Homogeneous Multi-FPGA Cluster
In order to efficiently process the large-scale language
model, we apply model parallelism to the model parameters.
In particular, we adopt intra-layer parallelism scheme [34]
instead of pipelined parallelism scheme [33]. A specific
intra-layer partitioning method is applied to reduce the
latency of matrix operations, proportional to the number of
workers, with minimal synchronization overhead. In contrast,
pipelining incurs high latency because all the workers execute
the entirety of each operation for a single input. Moreover, if
the final output of the model is the following input
(i.e., a feedback loop), as in text generation, the difference
in latency between the two schemes would increase linearly
per decoder layer. Figure 6 illustrates DFX's intra-layer
parallelism strategy. The weight matrices for multi-head
attention are divided head-wise, and the weight matrices for
the fully-connected layer are divided column-wise into two
portions (i.e., the number of FPGAs) so that each FPGA can
individually work on the assigned partition. To this end, the
partitioned matrices are stored in the memory of the FPGA
where their assigned core is, and each core executes identical
operations in parallel with the partitioned model parameters.
Each core then obtains the final result of the matrix operations,
which is a subvector of the output vector. Each subvector is
then circulated to the other FPGAs through the ring network
for data synchronization. After synchronization, each core
has the complete vector and is ready to proceed to the next
vector operation, such as residual. Overall, we need this
synchronization once during and once after self-attention and
feed-forward network, which sums to a total of four times per decoder layer. Hence, we minimize the data synchronization and transfer needed among the FPGAs while taking advantage of model parallelism on the predominant self-attention and feed-forward network operations. Overall, each FPGA runs identical operations on identical hardware to run GPT-2 end-to-end, so the four FPGAs form a homogeneous cluster.

Figure 6. Intra-layer model parallelism strategy on 2 FPGAs.
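As a functional illustration of the column-wise split and the subsequent synchronization in Figure 6, the following NumPy sketch uses two workers and toy dimensions (the head-wise split of the attention weights follows the same pattern along the head axis; this is not the actual DFX dataflow):

```python
import numpy as np

emb = 1024                                    # toy embedding size
x = np.random.randn(emb).astype(np.float32)   # the vector every core already holds
W = np.random.randn(emb, emb).astype(np.float32)
b = np.random.randn(emb).astype(np.float32)

# Column-wise partitioning: each FPGA stores and multiplies only its half.
W_parts = np.split(W, 2, axis=1)              # emb x (emb/2) per core
b_parts = np.split(b, 2)

partial = [x @ Wp + bp for Wp, bp in zip(W_parts, b_parts)]  # computed in parallel

# "Sync": subvectors circulate over the ring so every core ends up with the
# complete row before row-wise operations such as LayerNorm and residual.
full = np.concatenate(partial)

assert np.allclose(full, x @ W + b, atol=1e-4)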
Memory Mapping
The FPGA harnesses 8 GB HBM
and 32 GB DDR, whose theoretical peak bandwidths are
460 GB/s and 38 GB/s, respectively. Since GPT-2 requires
the partitioned model parameters frequently and in large
portions, the memory bandwidth has a significant impact on
the overall performance. Therefore, the weight matrices are
stored in the HBM. On the other hand, input/output tokens,
bias vectors, and other model parameters are stored in the
DDR because these data are accessed once per few iterations
of matrix operations or once per entire decoder stage (e.g.,
WTE and WPE), thus having a negligible effect on the overall
performance. DFX also utilizes the standard half-precision
floating-point (FP16) model parameters to retain the inference
accuracy.
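As a rough capacity check under this mapping (a back-of-the-envelope estimate, not a figure reported here):

```latex
1.5\times10^{9}\ \text{parameters} \times 2\ \text{B (FP16)} \approx 3\ \text{GB of weights}
\;\Rightarrow\; \approx 0.75\ \text{GB per FPGA when partitioned across four devices, well within each card's 8 GB HBM.}
```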
C. Instruction Set Architecture
DFX has a flexible and custom instruction set architecture
(ISA) at the assembly language level to support the end-to-
end processing for GPT-2 inference, unlike previous NLP
accelerators that primarily focus on attention [18], [19], [20].
Instruction Set
There are three types in the DFX ISA: compute, dma, and router instructions.
Algorithm 1 GPT-2 Decoder Layer
Input: in_emb, input embedding vector
Output: out_emb, output embedding vector
Parameter: H, number of attention heads
1:  /* Layer Norm */
2:  lnorm1 = LayerNorm(in_emb, γ_l1, β_l1)
3:  /* Self-Attention */
4:  value = Conv1D(lnorm1, W_v, b_v)
5:  key = Conv1D(lnorm1, W_k, b_k)
6:  query = Conv1D(lnorm1, W_q, b_q)
7:  for h = 0 to H do
8:      mat = MaskedMM(query[h], key^T[h])
9:      redu_max = ReduMax(mat)
10:     score = Softmax(mat − redu_max)
11:     attn′[h] = MM(score, value[h])
12: end for
13: attn = Sync(attn′)
14: c_attn′ = Conv1D(attn, W_a, b_a)
15: c_attn = Sync(c_attn′)
16: /* Residual */
17: c_attn = c_attn + in_emb
18: /* Layer Norm */
19: lnorm2 = LayerNorm(c_attn, γ_l2, β_l2)
20: /* Feed-Forward Network */
21: ffn1′ = GELU(Conv1D(lnorm2, W_f1, b_f1))
22: ffn1 = Sync(ffn1′)
23: ffn2′ = Conv1D(ffn1, W_f2, b_f2)
24: ffn2 = Sync(ffn2′)
25: /* Residual */
26: out_emb = ffn2 + c_attn
The compute instructions are for running the main processing units and have the format (type, src1, src2, dst) with additional bits to determine if the source or destination location is off-chip memory or on-chip register file. The dma and router instructions are for controlling the DMA and the network router to move data of the given transfer size to and from the cores and have the format (type, src, dst, xfer_size).
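For illustration, one plausible host-side representation of these two formats is shown below (the field names and flag bits are assumptions for the sketch, not DFX's actual binary encoding):

```python
from dataclasses import dataclass

@dataclass
class ComputeInst:                 # (type, src1, src2, dst)
    op: str                        # e.g., "Conv1D", "MaskedMM", "add", "exp"
    src1: int
    src2: int
    dst: int
    src_in_memory: bool = False    # extra bits: off-chip memory vs. on-chip register file
    dst_in_memory: bool = False

@dataclass
class TransferInst:                # shared (type, src, dst, xfer_size) format
    kind: str                      # "dma" or "router"
    src: int
    dst: int
    xfer_size: int                 # transfer size moved to/from the core
```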
Each instruction type is executed through instruction chaining [35], in which sequences of dependent instructions operate with minimal stalling. Meanwhile, instructions without dependencies work in parallel; for instance, compute processes data, dma fetches data, and router fills the buffer with data from the peer device simultaneously for synchronization. The combination of instruction chaining and parallel execution enables continuous use of memory and communication bandwidth. Algorithm 1 shows the pseudocode of the GPT-2 decoder layer using the ISA.
Compute Instructions
The compute instructions account for the majority of the core instruction set and are composed of two groups to control the main processing units: matrix instructions and vector instructions.
Matrix instructions are for executing matrix-vector multiplication and additional functions such as GELU and reduce max. Matrices are loaded in tiles, and vectors are also loaded in portions. Any matrix-matrix multiplication is done by a sequence of matrix-vector multiplications. The description of the major matrix instructions is as follows.
1) Conv1D: This essential matrix instruction, written as the equation Ax + b, is used in Query, Key, and Value matrix generation and the feed-forward network. In this instruction, weight matrix A, input vector x, and bias vector b are required for execution. Conv1D has a convolutional aspect in that if its input is longer than the maximum input size, the operation is performed through a sliding window.
2) MaskedMM: Masked matrix multiplication (MM) has the equation Ax. MaskedMM calculates Query × Key^T, also known as the Score matrix. Note that the Query matrix is loaded as vectors. The masking operation puts a −∞ mask on the upper-diagonal elements of the Score matrix to indicate that the current token is not impacted by future contexts. In combination with Softmax, a vector instruction, MaskedMM creates a lower triangular matrix and outputs the maximum value of each row.
3) MM: The MM instruction is the same as MaskedMM without masking. It is used in LM head for calculating the logit, an intermediate value produced while converting the output embedding vector to its token ID, and in the attention layer for calculating Score × Value.
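A small NumPy sketch of the MaskedMM behavior together with the subsequent Softmax (the summarization-stage case where the Score matrix is square; DFX streams Query row by row and processes this per attention head):

```python
import numpy as np

def masked_mm(Q, K_T):
    # Score = Q x K^T, with -inf placed on the elements above the diagonal so a
    # token cannot attend to future context; the per-row maximum (reduce max)
    # is emitted alongside for the Softmax that follows.
    score = Q @ K_T
    score[np.triu(np.ones(score.shape, dtype=bool), k=1)] = -np.inf
    return score, score.max(axis=1)

def softmax_rows(score, row_max):
    e = np.exp(score - row_max[:, None])          # row max subtracted, as in Algorithm 1
    return e / e.sum(axis=1, keepdims=True)       # masked entries become exactly 0

n, d_head = 4, 64
Q = np.random.randn(n, d_head)
K_T = np.random.randn(d_head, n)
S, m = masked_mm(Q, K_T)
attn_weights = softmax_rows(S, m)                 # lower triangular, rows sum to 1
```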
Vector instructions execute low-level vector-vector and vector-scalar operations along with load and store. They support various basic operations, including add, sub, mul, accum, recip_sqrt, recip, and exp. Thus, some high-level operations such as LayerNorm and Softmax are effectively implemented by several vector instructions.
1) LayerNorm: Layer normalization has the equation $y(x_i) = \gamma_i \frac{x_i - \mu}{\sigma} + \beta_i$, in which µ and σ are the mean and standard deviation, and γ and β are the weight and bias vectors, respectively. Calculating the mean requires accum and mul instructions, and the standard deviation additionally requires recip_sqrt. The formula is then executed by the sub, mul, and add instructions. The parameters are fetched to the register file through the load instruction.
2) Softmax: Softmax has the equation $y(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$, in which the sum runs over the j elements in the row. This operation can be performed with basic vector instructions, such as exp, add, and accum. The summation is similar to calculating the mean in LayerNorm. The division is substituted by the recip and mul instructions.
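To make the decomposition concrete, the NumPy sketch below builds both operations only from the listed primitives (accum, mul, sub, add, recip, recip_sqrt, exp); it is a functional model of the instruction sequence, not the hardware, and the small epsilon is an addition for numerical safety:

```python
import numpy as np

# Stand-ins for the DFX vector/scalar primitives.
accum      = lambda v: v.sum()
recip      = lambda s: 1.0 / s
recip_sqrt = lambda s: 1.0 / np.sqrt(s)

def layernorm(x, gamma, beta, eps=1e-5):
    mean = accum(x) * recip(x.size)                      # mean: accum + mul
    var = accum((x - mean) * (x - mean)) * recip(x.size)
    inv_std = recip_sqrt(var + eps)                      # standard deviation: recip_sqrt
    return gamma * ((x - mean) * inv_std) + beta         # sub, mul, add

def softmax(x):
    e = np.exp(x - x.max())                              # exp (max subtracted via reduce max)
    return e * recip(accum(e))                           # accum, then recip + mul instead of divide
```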
V. MICROARCHITECTURE
Figure 7 shows the proposed compute core's microarchitecture, which mainly consists of the matrix processing unit and the vector processing unit. The primary goal of the microarchitecture is to efficiently process text generation workloads that have sequential processes with non-batched input. In the following subsections, we explain the details of the microarchitecture.
A. Control Unit
The control unit contains logic to control the overall flow of
data by keeping track of the state of each unit and arbitrating
which modules to run. It is composed of the controller, scheduler, and scoreboard.

Figure 7. DFX compute core microarchitecture.
Controller
The controller’s main job is to receive the start
signal and system configuration from the host. The system
configuration includes the core ID and the number of cores in
the system, and the number of decoder layers and tokens that
the system needs to run. These parameters determine the
behavior of each core. The core ID and the number of cores
direct the corresponding core on which section of the model
weights to work on and which peer device to receive from
and transmit to. The number of decoder layers determines
when single token processing completes, and the number of
input and output tokens determines when the entire service
completes. Since a different portion of the HBM needs to
be accessed for each layer, the layer number designates the
address the DMA needs to access. The token number is used
specifically for knowing where to mask during MaskedMM.
Lastly, the controller returns the done signal back to the host
once the entire GPT-2 operation finishes.
Scheduler
The scheduler receives the decoded system
configuration from the controller and instructions from the
instruction buffer. The scheduler contains multiple finite state
machines, one for each instruction type, that check the status
of the DMA, processing units, register file, and the router
to decide whether to run or wait on each instruction type.
The chosen instruction is sent to the scoreboard for the last
dependency check with the running instruction.
Scoreboard
The register file needs to check for depen-
dencies to run instructions based on the chaining method.
Since a sequence of instructions may cause data hazards, the
scoreboard monitors source and destination addresses. The
scoreboard uses a RAM to represent the address space and
marks the current instruction's address with a stale bit when in execution and with a valid bit when in writeback.
If the source and destination addresses overlap, the next
instruction stalls until the current computation finishes.
B. Direct Memory Access
The DMA, which contains the read and write interface,
serves a vital role in distributing the data that is transferred
at high bandwidth. To maximize the bandwidth of the HBM,
the DMA’s read and write interface is connected to all 32
HBM channels and handles single-channel data bitwidth of
512 bits at 200 MHz for a total of 32 × 512 bits per cycle. The DMA stores and loads tiled weights, Key, and Value to and from the HBM, optimized for the matrix operation. In our dataflow, the output Value needs to be transposed when being written, so a transpose unit is placed in the DMA.
Besides the HBM channels, the single DDR channel is also
accessible by the DMA. The input token, bias, WTE, and
WPE are read from DDR to their corresponding buffers in the
DMA. The final output token is also written from the DMA
to DDR. Moreover, weights and biases cannot be reused in
matrix multiplications due to no input batching, so they are
buffered in the DMA and streamed into the processing units
for the computation with the preloaded input.
Tiling Scheme
DFX uses an optimized tiling scheme
that maximizes the number of computations and throughput
in the memory-intensive generation stage while retaining
performance in the summarization stage. To process a single
token in the generation stage, a large amount of weights
needs to be read from the HBM for the matrix multiplication.
Therefore, the weights are tiled in the HBM, and the DMA
reads the tiled weights at the maximum read bandwidth of
32 × 512 bits per cycle. This dimension can be rearranged to d × l × BW_data weight bits, in which d is the tile dimension, l is the number of lanes, and BW_data is the data bitwidth. The number of lanes is the number of columns in a tile that can be computed in parallel by the matrix function units (see Section V-C). Since DFX uses FP16 data as its data type, BW_data is set to 16. We propose a model-and-hardware-aware tiling scheme that finds 1) the optimal d and l values for loading the weights of size emb × emb or larger to the DMA and 2) the effective loading direction that translates to the order of matrix-vector multiplication.
We conduct a design space exploration to determine that the optimal d and l are 64 and 16, respectively. We evaluate the performance with different d values, and the corresponding l values are chosen to maximize the memory bandwidth with FP16 data: (d, l) = {(8, 128), (16, 64), (32, 32), (64, 16), (128, 8)}. Figure 8(a) shows that (d, l) = {(16, 64), (32, 32), (64, 16)} have the best performance on multi-head attention with negligible difference. Since the attention head dimension, H, for the state-of-the-art models, GPT-2 and GPT-3, is around 64 [24], [37], we find that d > 64 leads to underutilized compute units and thus lower performance when computing the multi-head attention. Specifically, performance degradation occurs when computing Query × Key^T because Key^T has H rows, which is smaller than d when d > 64. Similarly, l > 64 also leads to performance degradation when calculating Score × Value because Value has H columns. Then, we synthesize the hardware resource utilization required for the three choices to find that d = 64 requires the least amount of hardware resources, as shown in Figure 8(b). For d = 16 and d = 32, a larger l is required to maintain the same number of operations. With larger l, the number of MACs remains unchanged, but the resources in the matrix processing unit (e.g., accumulators, operators in the special function unit, and the control logic) increase linearly. Therefore, we standardize the hardware with d = 64 and l = 16.

Figure 8. Design choices in tile dimension and lane number and their impact on (a) multi-head attention performance and (b) resource utilization for the matrix processing unit.

Figure 9. Illustration of tiling scheme for matrix-vector multiplication.
Furthermore, the DMA loads d × l weights in the horizontal direction to fill in a tile, and moves to the tile below, as shown in Figure 9. Moving in the horizontal direction maximizes input reuse, but it requires a significant number of buffers to store the partial sums that are produced when the input of length d iterates across many columns of the weight matrix. As the core's requirements of deep pipelining and other buffers cause a shortage of on-chip memory, completing the horizontal direction is infeasible. The vertical direction decreases the number of buffers to one, but it removes input reuse. The inability to reuse the input increases the amount of register file access, which decreases the throughput. Therefore, the zigzag direction with a tile size of d × d balances hardware resources and data reuse for maximum performance.
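A behavioral NumPy sketch of one possible reading of this zigzag order with d = 64 and l = 16 (dimensions assumed divisible by d; this models the dataflow only, not the RTL):

```python
import numpy as np

d, l = 64, 16   # tile dimension and number of lanes

def tiled_matvec(x, W):
    # x: (emb,), W: (emb, cols). Within one d-wide column strip, l-column groups
    # of a d x d tile are swept horizontally, then the DMA moves to the tile
    # below; one d-wide partial-sum buffer suffices, and each d-element input
    # slice is reused d/l times.
    emb, cols = W.shape
    y = np.empty(cols, dtype=np.float32)
    for c0 in range(0, cols, d):                      # next d-wide column strip
        partial = np.zeros(d, dtype=np.float32)
        for r0 in range(0, emb, d):                   # move down tile by tile
            x_slice = x[r0:r0 + d]
            for c in range(c0, c0 + d, l):            # l lanes of MAC trees per step
                tile = W[r0:r0 + d, c:c + l]          # d x l weights streamed from HBM
                partial[c - c0:c - c0 + l] += x_slice @ tile
        y[c0:c0 + d] = partial
    return y

x = np.random.randn(256).astype(np.float32)
W = np.random.randn(256, 128).astype(np.float32)
assert np.allclose(tiled_matvec(x, W), x @ W, atol=1e-3)
```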
Transpose Scheme
In a standard attention operation, Key needs to be transposed, but in our dataflow, the multiplicand Value needs to be transposed because the read from the HBM is column-wise and the write is row-wise; therefore, the intermediate matrices are transposed by default when loaded to the DMA. Value requires high memory capacity (e.g., 0.31 MB per token for the 1.5B model), so the conventional transpose scheme of transposing the entire matrix in the on-chip memory is inefficient. To address this issue, DFX transposes the Value matrix while its partial tiles are being written to the off-chip memory, instead of doing it when they are read. The long latency of the transpose can be completely hidden by changing the computation order. Based on GPT's order of operation, Value^T is needed after Query, Key, and Value are generated. Therefore, we rearrange the DFX instructions so that Value is calculated earlier than Query and Key. This rearrangement guarantees a sufficient period of time for the Value transpose while Query and Key are being generated.

Figure 10. DFX processing units. (a) Matrix processing unit. (b) Vector processing unit.
C. Processing Units
The DFX core has two processing units, matrix processing
unit (MPU) and vector processing unit (VPU), as shown in
Figure 10. These processing units are designed to execute
main mathematical operations required for the end-to-end
acceleration of GPT-2, and they fully exploit parallel comput-
ing and hardware resources. The two processing units consist
of four main functional units, matrix function unit and vector
function unit, each accompanied by a special function unit,
which all consist of FP16 operators. The functional units are
composed of deep and diversified pipelines for maximum
throughput and utilize bypasses at each sub-computation to
asynchronously execute instructions at low latency.
Matrix Function Unit
The matrix instructions operate
on the matrix function unit (MFU). Its primary workload
is matrix-vector multiplication. The MFU contains tree-based multiplier-accumulators (MACs) that take vectors of d dimensions as input. The unit is also composed of l lanes, which means l tree-based MAC units operate in parallel. The input remains constant throughout the lanes, but l different multiplicands from different columns of the weight matrix are passed to each lane, so d × l multiplications are done in parallel. The products in each lane are then passed to the parallel adder tree of depth log2(d) to calculate the partial sum.
Each FP16 multiplier and adder is mapped to one digital signal processing slice (DSP) and two DSPs, respectively. The multiplier takes 6 cycles, and the adder takes 11 cycles. The MFU uses a total of 3 × (d × l) DSPs: d × l DSPs for the multipliers, 2 × (d − 1) × l DSPs for the adder trees, and 2 × l DSPs for scalar additions. In our case, d and l are set to 64 and 16 as explained in Section V-B, comprising 3072 DSPs for the MFU.
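Substituting d = 64 and l = 16 into the counts above reproduces the stated total:

```latex
\underbrace{d\,l}_{\text{multipliers}} + \underbrace{2(d-1)\,l}_{\text{adder trees}} + \underbrace{2l}_{\text{scalar additions}}
= 1024 + 2016 + 32 = 3072 \ \text{DSPs}.
```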
Vector Function Unit
The vector instructions operate on
the vector function unit (VFU). VFU is a floating-point
arithmetic logic unit (ALU) that supports element-wise
vector operations. Specifically, the VFU supports addition,
subtraction, and multiplication of two vectors of dimension d.
Similar to the MFU, DSP is used for all VFU operations. Ad-
dition, subtraction, multiplication, and exponential operation
take 11 cycles, 11 cycles, 6 cycles, and 4 cycles, respectively.
Exponential operation uses two DSPs, and other operations
use one DSP each. No instruction requires more than one
ALU operation, so all instructions are completed in the
shortest possible cycles without synchronization. Additionally,
VFU supports bypass to reduce unnecessary computational
cycles. For instance, load and store instructions do not require
any computation, so the data can skip the execution stage. As
VFU has a bypass path that directly connects the input and
output ports, the load and store instructions take only one
cycle. The data hazards that occur with such asynchronous
dataflow are handled in the scoreboard.
Special Function Units
The special function units (SFUs)
handle the nonlinear functions in GPT-2. The output from the
MFU and VFU is passed into SFU_M and SFU_V, respectively.
The SFUs use the combination of DSP, combinational logic,
and the lookup table method for optimal hardware utilization.
SFU_M is responsible for executing computations that follow matrix-vector multiplication, such as masking, GELU, vectorization, and reduce max. The masking unit creates a lower triangular matrix based on the tile information, in which elements above the diagonal of the output matrix are masked with the closest representable value to −∞, which is eventually zeroed out after softmax. For the division of the result by the number of attention heads, a scalar constant, we use a multiplier instead to save hardware resources. To support the GELU activation function with the equation $y(x) = 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,(x + 0.044715x^{3})\right]\right)$, a lookup table is used with linear approximation. We sample 2048 inputs that achieve a mean squared error of 0 in half-precision floating-point and choose [−8, 8] as the range because the slope converges on either side of this range. Linear approximation is sufficient for GELU, which has piecewise-linear characteristics, and it reduces the hardware overhead of supporting complex mathematical operations.
The vectorizer uses an asymmetric buffer to concatenate
outputs to match the tiling, and it is placed after the above
modules to increase hardware reuse. Lastly, the reduce max
unit, which finds either max or argmax value of the given
vector, is designed using a parallel tree of comparators.
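A NumPy sketch of the lookup table with linear approximation for GELU, using 2048 samples over [−8, 8] as described above (the index/rounding scheme and the clamping outside the range are assumptions of the sketch, not the hardware design):

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

N, LO, HI = 2048, -8.0, 8.0
xs = np.linspace(LO, HI, N)
lut = gelu(xs).astype(np.float16)            # 2048-entry FP16 table

def gelu_lut(x):
    # Clamp to the sampled range, locate the surrounding table entries, and
    # linearly interpolate between them.
    x = np.clip(np.asarray(x, dtype=np.float64), LO, HI)
    pos = (x - LO) / (HI - LO) * (N - 1)
    i0 = np.minimum(pos.astype(np.int64), N - 2)
    frac = pos - i0
    return (1.0 - frac) * lut[i0] + frac * lut[i0 + 1]
```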
SFU_V is responsible for computations that follow the vector
operations by VFU; they are accumulation followed by scalar
reciprocal, multiplication, addition, and reciprocal square root.
An adder tree is in the SFU instead of in the VFU because
VFU supports instructions that only require vector outputs.
The rest of the functions are provided by the floating-point
DSP. Similar to SFU_M, SFU_V uses a multiplier to divide by a constant value, the embedding size.
Both SFUs also utilize bypasses so that operations that
do not require certain hardware in the dataflow can skip the
hardware without cycle penalties. The ordering of GPT-2
operations can be conveniently alternated with the use of
matrix and vector instructions, which is advantageous in
speeding up the sequential generation.
D. Register File Manager
The register file manager contains on-chip memory struc-
tures or register files for storing a multitude of FP16 data
prior to and post computation in the processing units. We
have two types of register files: vector and scalar register files.
These register files are responsible for communicating with
the memory interface via the DMA and network via the router.
The register file manager also contains operand collectors
that generate and collect the processing unit instructions and
determine which register file data to access based on the
instructions.
Matrix Operand Collector
The matrix operand collec-
tor generates matrix microcodes based on the instruction mentioned in Section IV-C during runtime. The runtime
generation of microcodes decreases the amount of instruction
transfer from the host. The matrix operand collector passes
these microcodes and operands, such as the input vector,
weight matrix, and bias vector, to the MPU for execution. It
reads a single input vector from the vector register file while
taking the weight and bias from the DMA buffer. It counts
the tiling order and allocates the corresponding input and
weight to the MPU. The identical input vector is broadcasted
to the parallel hardware, while different weights and biases
are distributed to each vector lane. In addition, the double
buffer is used for all operands to reduce the latency and
obtain high throughput.
Vector Operand Collector
The vector operand collector
generates the microcodes for the VPU to execute vector instructions, similar to the matrix operand collector. Since the VPU
needs various operand types, the vector operand collector can
read both vector and scalar register files. It can also access the DMA and network router buffers to perform the DMA or network instructions (i.e., load, store, and synchronization).

Figure 11. Illustration of data synchronization with lightweight router.
E. Router
The multi-FPGA network is enabled by the lightweight
router. Each core utilizes the router to synchronize the
data in the register files with every other core in peer
devices across the ring network. Figure 11 shows the router
structure and the data synchronization. The router seamlessly
transfers 64 × 16-bit data to fetch the output vector from the processing units and pass it peer-to-peer (inter-device).
The router contains a control unit that indicates which
device’s core to communicate with, buffers to hold the
transmitted and received vectors, and a reorder module that
uses the core ID to organize the data order to be identical
in every core. Unlike general routers, this router does not
contain additional logic for packet encoding or decoding.
The synchronization is necessary after executing a Conv1D instruction in the self-attention and feed-forward network
because model parallelism leads to each core only computing
a portion of the output matrix’s row, and the next operation
like layer normalization and residual requires the entire row.
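A functional Python sketch of this ring synchronization for four cores (a model of the data movement only; the real routers stream 64 × 16-bit words over the QSFP links and reorder by core ID in hardware):

```python
def ring_sync(partials):
    """partials[i] is core i's subvector (a Python list). Returns the full,
    identically ordered vector that every core ends up holding."""
    n = len(partials)
    # received[c] maps originating core ID -> subvector held by core c.
    received = [{c: partials[c]} for c in range(n)]
    carry = [(c, partials[c]) for c in range(n)]   # what each core forwards next
    for _ in range(n - 1):
        nxt = [None] * n
        for c in range(n):
            src_id, data = carry[(c - 1) % n]      # receive from the left neighbor
            received[c][src_id] = data
            nxt[c] = (src_id, data)                # and forward it on the next step
        carry = nxt
    # Reorder module: arrange subvectors by core ID so every core assembles
    # the same complete vector.
    return [sum((received[c][i] for i in range(n)), []) for c in range(n)]

# Example: 4 cores, each holding a 2-element slice of an 8-element row.
full = ring_sync([[0, 1], [2, 3], [4, 5], [6, 7]])
assert all(v == [0, 1, 2, 3, 4, 5, 6, 7] for v in full)
```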
The network’s peer-to-peer communication is enabled
by Aurora 64b/66b IP [38]. The Aurora IP implements a
light link-layer protocol for high-speed serial communication
between two devices. The protocol uses 64b/66b encoding,
which requires a low resource cost with only 3% transmission
overhead. Consequently, the router provides a lightweight
communication interface between supported devices, resulting
in low latency data communication.
VI. APPLIANCE IMPLEMENTATION
We build a DFX appliance prototype that uses Intel Xeon
Gold 6226R CPU and four Xilinx Alveo U280 data center
acceleration cards [39] for evaluation. Although it uses a
single homogeneous multi-FPGA cluster with four FPGA
cards, the appliance itself is capable of harnessing two sets
of these configurations and/or increasing the number of FPGA
cards per cluster in a 4U server chassis. The server appliance
can be easily extended as it is already based on a dual-socket
motherboard with 20 PCIe Gen3 x16 slots. Installing more FPGA cards and configuring their clusters in the system would be sufficient. Figure 12 shows the setup of the DFX server appliance hardware.

Figure 12. Image of DFX server appliance.

Figure 13. FPGA layout and resource utilization on Xilinx Alveo U280.

Component     | LUT           | FF             | BRAM          | URAM         | DSP
Register File | 6K (0.53%)    | 110K (4.22%)   | 88.5 (4.39%)  | 0 (0.0%)     | 0 (0.0%)
MPU           | 170K (13.06%) | 381K (14.65%)  | 56 (2.78%)    | 0 (0.0%)     | 3136 (34.75%)
VPU           | 36K (2.77%)   | 55K (2.13%)    | 1.5 (0.07%)   | 0 (0.0%)     | 390 (4.32%)
DMA           | 38K (2.97%)   | 97K (3.74%)    | 134.5 (6.67%) | 52 (5.42%)   | 0 (0.0%)
Router        | 3K (0.28%)    | 13K (0.55%)    | 24 (1.19%)    | 0 (0.0%)     | 0 (0.0%)
Interconnect  | 180K (13.83%) | 303K (11.64%)  | 1237 (10.11%) | 0 (0.0%)     | 4 (0.04%)
Total         | 520K (39.93%) | 1107K (42.52%) | 1192 (59.13%) | 104 (10.83%) | 3533 (39.15%)
U280 FPGA chip uses a chiplet-based (i.e., multi-die)
design with three super logic regions (SLRs) and supports 8
GB of HBM with 32 channels. We integrate 4 DFX cores,
1 core per device, across the four FPGAs. Considering the
total bitwidth of the data path is 32 × 512, the place-and-route
(PnR) over three different dies is challenging. Since the HBM
controller is physically located in the bottom SLR (i.e., SLR0)
and the DFX core is large enough over the dies, we discover
that the implementation is eventually constrained by the
number of super long line routes (SLLs) which connect the
logic between the dies. To handle this multi-die crossing issue
effectively, we first decide to split the DFX microarchitecture
into kernels, in which the kernel represents a top-level design
module to be implemented within a single die. Then, we
map the kernel with the DMA and MPU modules into the
SLR closest to the HBM, or SLR0, because these modules
frequently access the memory interconnect that requires the
most amount of routing; otherwise, these routings would
need to be SLR-crossing and exceed the number of available
SLLs. However, when we try to map all the MPU lanes
and memory channels in a single die, the kernel causes PnR
failure due to routing congestion. As a result, we map the
maximum possible lanes of MPU that can meet the routing
constraint of SLR0 and map the rest into the other SLRs. By
having separate kernels that minimize die-crossing signals,
the DFX core overcomes the routing congestion and achieves
the maximum bandwidth utilization out of the device.
We run all U280 FPGAs at 200 MHz kernel frequency
and 410 MHz memory interface frequency, utilizing 39.93%
of LUT, 42.52% of FF, 59.13% of BRAM, 10.83% of
URAM, and 39.15% of DSP. Figure 13 shows the final
FPGA layout and resource utilization of one of the U280s.

Figure 14. Inference latency of DFX compared to the GPU appliance on various GPT-2 models: 345M (1 GPU vs. 1 FPGA), 774M (2 GPUs vs. 2 FPGAs), and 1.5B (4 GPUs vs. 4 FPGAs), plotted as latency (ms) against [Input Size:Output Size]; DFX achieves average speedups of 3.20×, 4.46×, and 5.58×, respectively.

Table I. GPT-2 MODEL CONFIGURATION

Number of Parameters | Embedding Dimension | Number of Attention Heads | Head Dimension | Number of Layers
345M                 | 1024                | 16                        | 64             | 24
774M                 | 1280                | 20                        | 64             | 36
1.5B                 | 1536                | 24                        | 64             | 48
We use Xilinx Vivado [40] to synthesize the hardware written
in SystemVerilog and the Xilinx Vitis 2020.2 platform [41]
for the host-FPGA communication.
VII. EVALUATION
We use the DFX appliance prototype to evaluate the system
performance. We use a GPU appliance, a custom server of
four NVIDIA V100 GPUs [42], as the evaluation baseline
to compare the results with DFX. The V100 GPU has the
most comparable hardware specification to the U280 FPGA,
especially the memory capacity and bandwidth, to yield a
fair comparison. We run the GPT-2 models on this GPU
appliance using NVIDIA’s GPU-optimized Megatron-LM
source code [34] and CUDA Toolkit 11.1 with the provided
parallelism scheme that supports both multi-GPU training
and inference. In addition, we run the cloud TPU [43], [44]
to analyze the performance of different accelerator platforms.
Regarding each model, we use the open-source 345M model
from NVIDIA Megatron-LM [34], and 774M and 1.5B model
from OpenAI [24]. We slightly adjust OpenAI’s 1.5B model’s
number of attention heads from 25 to 24 because the original
25-head configuration is difficult to parallelize evenly on both
hardware platforms. Table I shows the GPT-2 configuration
for each model. We use Xilinx Board Utility (xbutil) and
NVIDIA system management interface (nvidia-smi) for power
measurements. We measure the inference accuracy, latency,
throughput, and energy efficiency. Furthermore, we discuss
the scalability of DFX and compare the cost of the two
systems.
A. Inference Accuracy
Methodology
To ensure that DFX does not incur any
accuracy loss on GPT-2, we compare the accuracy for widely
used open-source datasets with the baseline V100 GPU on
the 345M model. We compare Winograd Schema Challenge
(WSC) [45], Children's Book Common Noun (CBT-CN),
and Children’s Book Named Entities (CBT-NE) [46], which
predict a word based on the given context.
Accuracy
DFX achieves no loss, 0.3% loss, and 0.15%
gain in accuracy for WSC, CBT-CN, and CBT-NE, respec-
tively, when compared to the baseline GPU. The GPU runs its
kernel with FP16 operators, a standard for NLP applications.
To minimize the error in text generation applications, DFX
also runs its cores with FP16 operators based on the Xilinx
Floating-Point Operator IP. Both FP16 operators are based on
IEEE 754 with 1-bit sign, 5-bit exponent, and 10-bit mantissa.
Since all the operations are identical in both systems except
for the GELU operation, the difference comes from the
subtle difference in approximation between the GPU and
DFX, which is negligible.
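As an illustration of where such small differences can arise, the sketch below (not taken from the DFX implementation) compares the exact GELU with the common tanh-based approximation from [25], rounding the results to IEEE 754 half precision via NumPy's float16; which variant each platform actually implements is not specified here, so the code is only indicative.

import math
import numpy as np

def gelu_exact(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Widely used tanh approximation of GELU [25].
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

for x in np.linspace(-4.0, 4.0, 9):
    exact = np.float16(gelu_exact(float(x)))    # round to FP16 (1/5/10 bits)
    approx = np.float16(gelu_tanh(float(x)))
    print(f"x={x:+.1f}  exact={float(exact):+.4f}  tanh={float(approx):+.4f}")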
B. Performance Analysis
Methodology
We evaluate the end-to-end performance of
DFX on various GPT-2 models (345M, 774M, and 1.5B)
with different combinations of input and output token lengths
to represent the dialogue system, topic-to-essay generation,
and other text generation workloads. We use the same model
and the same number of accelerators in both appliances for
a fair comparison. We compare one V100 GPU and one
U280 FPGA for the 345M model, two V100 GPUs and two
U280 FPGAs for the 774M model, and four V100 GPUs
and four U280 FPGAs for the 1.5B model. Specifically, we
run each model with input lengths of 32, 64, and 128 tokens
and various output lengths between 1 and 256 tokens, which
are the typical ranges of user requests for transformer-based
language services in datacenters [28].
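The sweep described above can be summarized in a few lines; the sketch below only encodes the model sizes and [input:output] combinations used in Figure 14, and run_inference() is a hypothetical placeholder for the actual GPU or FPGA measurement harness.

import itertools

MODELS = {"345M": 1, "774M": 2, "1.5B": 4}   # model -> number of accelerators
INPUT_LENS = [32, 64, 128]                   # input tokens
OUTPUT_LENS = [1, 4, 16, 64, 256]            # generated tokens

def run_inference(model: str, num_devices: int, in_len: int, out_len: int) -> float:
    """Hypothetical harness call; would return end-to-end latency in milliseconds."""
    raise NotImplementedError  # stands in for the GPU/FPGA measurement code

def sweep() -> dict:
    # Measure every (model, input length, output length) combination.
    results = {}
    for (model, n_dev), in_len, out_len in itertools.product(
            MODELS.items(), INPUT_LENS, OUTPUT_LENS):
        results[(model, in_len, out_len)] = run_inference(model, n_dev, in_len, out_len)
    return results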
Latency
Figure 14 shows the text generation latency of
the GPU appliance and DFX on various GPT-2 models,
as well as the speedup that DFX achieves. Note that the
Y-axis for latency is in log scale. The result shows that
DFX achieves an average of 5.58× speedup compared to
the GPU appliance for the 1.5B model. For the workload
with significantly more output tokens than input tokens,
i.e., 32:256, DFX shows substantially lower latency, 10.03×,
than the GPU appliance. For the 345M and 774M
models, DFX achieves an average of 3.20× and 4.46×
speedup, respectively, compared to the GPU appliance with
the equivalent number of accelerators.
Figure 15. Latency breakdown of 4 FPGAs on the 1.5B model. (Self-attention 43.0%, feed-forward network 29.6%, synchronization 17.3%, LayerNorm 9.3%, residual 0.8%.)

Figure 16. Throughput and energy efficiency of DFX compared to the GPU appliance on the 1.5B model. (X-axis: [Input Size:Output Size]; normalized energy efficiency, 3.99× on average; tokens per second, 3.78× on average.)

The results also show that the additional number of output tokens leads to a more
significant increase in latency on the GPU appliance than on
DFX. In fact, the speedup of DFX over the GPU appliance
can be greater for even smaller input and larger output sizes.
Although future NLG applications may require longer token
generation, we focus only on the range of current use
cases. The overall speedup of DFX is attenuated with a larger
input size because the GPU is able to take advantage of its
massively parallel computation to execute the large input. As
long as the ratio between the input and output lengths is lower
than 4:1, which is the case for text generation workloads,
DFX performs better than the GPU appliance.
Figure 15 shows the latency breakdown of running the 1.5B
model on 4 FPGAs. The result shows that the majority of the
time is consumed by self-attention and feed-forward network
at 72.6%. At 17.3%, synchronization may seem critical in
DFX compared to the GPU appliance because the GPU
has high-speed interconnects such as NVLink [47] to lower
the synchronization latency. However, DFX has 5.58× lower
overall latency, so the high synchronization proportion is
attributed to the speedup in other operations.
Throughput and Energy Efficiency
Figure 16 shows the
throughput and energy efficiency of the two appliances on the
1.5B model. DFX achieves an average of 3.78× higher
throughput and 3.99× higher energy efficiency compared to
the GPU appliance. The throughput is measured by dividing
the number of output tokens by the text generation latency.
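For clarity, the metric is simply the ratio below; the example numbers are approximate and only meant to show that generating 64 tokens in roughly 880 ms corresponds to the ~72.7 tokens/sec reported for DFX in Table II.

def tokens_per_second(output_tokens: int, latency_ms: float) -> float:
    # Throughput = generated tokens / end-to-end generation latency (in seconds).
    return output_tokens / (latency_ms / 1000.0)

print(tokens_per_second(64, 880.0))   # ~72.7 tokens/sec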
Figure 17. Performance comparison with GPU, TPU, and DFX (1 FPGA) on the 345M model. (Y-axis: GFLOPS, log scale; bars for the summarization, generation, and total stages.)

Figure 18. Scalability of DFX on the 345M model. (Y-axis: tokens per second for 1, 2, and 4 FPGAs.)

The
throughput result shows that the GPU maintains a relatively
constant throughput even when the output tokens are scaled
up, which indicates that the performance is bottlenecked
by low hardware utilization during the generation stage.
Furthermore, we observe that 1) each V100 GPU consumes
only 47.5 W, on average, based on the nvidia-smi tool, and 2)
the average power consumption decreases as the number of
output tokens increases. Since the GPU runs at a high base
clock frequency of 1.23 GHz, the low power consumption can
only be explained by low hardware utilization. Meanwhile,
DFX runs at an even lower 45 W, not because of low hardware
utilization but because the FPGA runs at a lower operating
frequency of 200 MHz.
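A back-of-the-envelope check, using only the per-device power figures quoted above and the 3.78× throughput advantage from Figure 16, reproduces the reported energy-efficiency gap; this is a consistency check, not a measurement.

throughput_ratio = 3.78          # DFX over GPU appliance (tokens/sec), Figure 16
power_ratio = 47.5 / 45.0        # V100 power / U280 power per device (W)
energy_efficiency_ratio = throughput_ratio * power_ratio
print(f"{energy_efficiency_ratio:.2f}x")   # ~3.99x, matching Figure 16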
We also analyze the GFLOPS performance of different
accelerator platforms, DFX accelerator (1 FPGA), TPU, and
GPU, for the 345M model with 64:64 tokens, as shown in
Figure 17. The GPU and TPU show similar behavior with
high throughput in the summarization stage (1632.1 and
674.5 GFLOPS) and significantly reduced throughput in the
generation stage (40.6 and 8.2 GFLOPS), which implies that
these devices are highly utilized in a batched process but
severely underutilized in a non-batchable process. Meanwhile,
DFX retains an average of 184.1 GFLOPS during both
summarization and generation stages because its dataflow
is specialized for the iterative matrix-vector multiplication
rather than the infrequent matrix-matrix multiplication.
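The difference between the two stages is easiest to see at the level of operand shapes. The sketch below uses the 345M model's embedding dimension from Table I and a 64-token prompt with random weights; it only contrasts the matrix-matrix product of the summarization stage with the per-token matrix-vector product of the generation stage.

import numpy as np

d_model = 1024                                                 # 345M embedding dimension (Table I)
W = np.random.randn(d_model, 4 * d_model).astype(np.float16)   # one FFN weight matrix

# Summarization: all prompt tokens are processed at once -> matrix-matrix product.
prompt = np.random.randn(64, d_model).astype(np.float16)
summ_out = prompt @ W                                          # (64, 1024) x (1024, 4096)

# Generation: one new token per step -> a matrix-vector product per step.
token = np.random.randn(1, d_model).astype(np.float16)
gen_out = token @ W                                            # (1, 1024) x (1024, 4096)

print(summ_out.shape, gen_out.shape)                           # (64, 4096) (1, 4096)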
Scalability
Figure 18 shows the scalability of the U280
FPGAs in DFX for the 345M model with 64:64 tokens. On
average, DFX achieves 93.10 tokens/sec for 1 FPGA, 146.25
tokens/sec for 2 FPGAs, and 207.56 tokens/sec for 4 FPGAs.
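The per-doubling scaling factors follow directly from these measured throughputs; a quick check using the Figure 18 numbers:

throughput = {1: 93.10, 2: 146.25, 4: 207.56}    # tokens/sec vs. number of FPGAs
print(throughput[2] / throughput[1])             # ~1.57x from 1 to 2 FPGAs
print(throughput[4] / throughput[2])             # ~1.42x from 2 to 4 FPGAs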
The performance of DFX scales nearly linearly, by a factor of
roughly 1.5× for each doubling of FPGAs, which means the processing
unit of DFX is designed to retain high utilization with more
devices, even on relatively small models. The performance
gain is not directly proportional to the number of devices
because we do not parallelize layer normalization and residual
due to their even larger synchronization overhead. There is
also a marginal drop in throughput with each additional
FPGA due to more data synchronization.

Table II
APPLIANCE COST ANALYSIS
 | GPU Appliance | DFX
CPUs | 2× Intel Xeon Gold 14-Core @ 2.2 GHz | 2× Intel Xeon Gold 16-Core @ 2.9 GHz
Memory | 384 GB DDR4 | 512 GB DDR4
Storage | 12 TB NVMe | 4 TB NVMe
Accelerators | 4× NVIDIA Tesla V100, 32 GB HBM2 (900 GB/s) | 4× Xilinx Alveo U280, 8 GB HBM2 (460 GB/s)
Performance | 13.01 tokens/sec | 72.68 tokens/sec
Cost | $45,832 ($11,458 per GPU) | $31,180 ($7,795 per FPGA)
Performance / Cost | 283.86 tokens/sec/million$ | 2330.98 tokens/sec/million$
C. Cost Analysis
Table II shows the cost analysis of DFX and the GPU
appliance. DFX has an upfront cost that is $14,652 lower than
that of the GPU appliance when 4 devices are installed on
both appliances [48], [49], [50]. For comparison, we exclude
the cost of components (e.g., CPU and storage) other than
the accelerators. To measure the overall cost-effectiveness,
we consider both performance and upfront cost (i.e., retail
price). We choose the 1.5B model with the input-to-output
token ratio of 64:64 to measure the performance per cost,
which is representative of the chatbot service as described in
Section II-A. Comparing the performance-to-cost ratio, DFX
is 8.21× more cost-effective than the GPU appliance.
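The cost-effectiveness figure follows directly from the performance and accelerator cost rows of Table II; the short check below reproduces it.

gpu = {"tokens_per_sec": 13.01, "cost_usd": 45_832}   # Table II, GPU appliance
dfx = {"tokens_per_sec": 72.68, "cost_usd": 31_180}   # Table II, DFX

def perf_per_million_usd(system: dict) -> float:
    return system["tokens_per_sec"] / system["cost_usd"] * 1e6

print(perf_per_million_usd(gpu))                              # ~283.9 tokens/sec/million$
print(perf_per_million_usd(dfx))                              # ~2331.0 tokens/sec/million$
print(perf_per_million_usd(dfx) / perf_per_million_usd(gpu))  # ~8.21x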
We can draw similar conclusions between DFX and the
NVIDIA DGX system because our custom GPU appliance is
comparable in cost and performance to DGX-1 [51]. DGX-1
is a purchasable GPU-based appliance that harnesses 8 Tesla
V100 GPUs for machine learning workloads in datacenters.
Our custom GPU appliance is an upgraded version of DGX-1
with the same number of GPUs, in which each GPU has a
better performance grade and a larger HBM2 than those of
DGX-1. As the GPU-based appliance would perform worse
when scaled up to 8 devices, the stated improvement in cost-
effectiveness for DFX would be even greater when compared
to real datacenter appliances.
VIII. RELATED WORK
Accelerator for Transformer Model
Hardware accel-
erators that support transformer-based NLP models have
been recently proposed. A3 [18], GOBO [36], SpAtten [20],
EdgeBERT [52], and ELSA [19] discuss designs that only
speed up the attention mechanism in the transformer using
pruning, quantization, or both, without taking into
consideration the end-to-end process. For instance, SpAtten
uses pruning to accelerate the attention mechanism and the
layer normalization process but does not consider token
embedding, residual, and LM head. Our work targets the
language services at the datacenter, especially focusing on
text generation workloads, so there is less emphasis on ag-
gressively speeding up a specific operation in the transformer
but more emphasis on the model-parallel dataflow that can
address both memory and compute-intensive problems in its
end-to-end inference.
Hardware Architecture for Datacenters
Many domain-
specific acceleration architectures have been proposed for
machine learning but few at the scale of datacenters. Mi-
crosoft’s Brainwave [35] is a relevant work that implements
an FPGA-based neural processing unit for datacenters with
high-performance processing units. However, Brainwave does
not utilize high-bandwidth off-chip memory, so it cannot
effectively run memory-intensive workloads, such as GPT
inference. Google [43], [53], [54] proposes the systolic-array-based
TPU architecture for various DNN applications, and Facebook
[55] designs accelerators specialized for recommendation
systems. A few of these previous architectures for datacenters
[35], [54] support the transformer, but none have design
optimizations to drastically improve the performance for text
generation workloads specifically.
IX. CONCLUSION
This paper presents DFX, a low-latency multi-FPGA
appliance for accelerating transformer-based text generation
workloads. Our work identifies the need for designing a
datacenter-level system that provides low-latency inference,
end-to-end acceleration, and parallel computing for text
generation. We combine multi-device hardware with model
parallelism, custom instructions and dataflow, and other
hardware optimizations to maximize the potential of the
specified hardware. Based on the implementation results
of our DFX appliance prototype with four Xilinx Alveo
U280 FPGAs, DFX achieves 5.58×, 3.99×, and 8.21×
improvements in performance, energy efficiency, and cost-
effectiveness, respectively, compared to conventional multi-
GPU appliances.
ACKNOWLEDGEMENTS
This research was supported in part by NAVER CLOVA
and the MSIT (Ministry of Science and ICT), Korea,
under the ITRC (Information Technology Research Center)
support program (IITP-2020-0-01847) supervised by the IITP
(Institute for Information & Communications Technology
Planning & Evaluation).
REFERENCES
[1]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is
all you need,” in Advances in neural information processing
systems, 2017, pp. 5998–6008.
[2]
K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau,
F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase
representations using rnn encoder-decoder for statistical ma-
chine translation,” arXiv preprint arXiv:1406.1078, 2014.
[3]
S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[4]
Y. Zhang, G. Wang, C. Li, Z. Gan, C. Brockett, and
B. Dolan, “Pointer: Constrained progressive text generation
via insertion-based generative pre-training,” arXiv preprint
arXiv:2005.00558, 2020.
[5]
D. Pascual, B. Egressy, C. Meister, R. Cotterell, and R. Watten-
hofer, “A plug-and-play method for controlled text generation,”
arXiv preprint arXiv:2109.09707, 2021.
[6]
S. González-Carvajal and E. C. Garrido-Merchán, “Comparing
bert against traditional machine learning text classification,”
arXiv preprint arXiv:2005.13012, 2020.
[7]
S. Garg and G. Ramakrishnan, “Bae: Bert-based adver-
sarial examples for text classification,” arXiv preprint
arXiv:2004.01970, 2020.
[8]
Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi,
W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al.,
“Google’s neural machine translation system: Bridging the
gap between human and machine translation,” arXiv preprint
arXiv:1609.08144, 2016.
[9]
Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li, D. F. Wong, and
L. S. Chao, “Learning deep transformer models for machine
translation,” arXiv preprint arXiv:1906.01787, 2019.
[10]
Z. Lin, P. Xu, G. I. Winata, F. B. Siddique, Z. Liu, J. Shin,
and P. Fung, “Caire: An end-to-end empathetic chatbot,” in
Proceedings of the AAAI Conference on Artificial Intelligence,
vol. 34, no. 09, 2020, pp. 13 622–13 623.
[11]
P. Budzianowski and I. Vulić, “Hello, it’s gpt-2–how can i help
you? towards the use of pretrained language models for task-
oriented dialogue systems,” arXiv preprint arXiv:1907.05774,
2019.
[12]
T. Wolf, V. Sanh, J. Chaumond, and C. Delangue, “Trans-
fertransfo: A transfer learning approach for neural network
based conversational agents,” arXiv preprint arXiv:1901.08149,
2019.
[13]
X. Feng, M. Liu, J. Liu, B. Qin, Y. Sun, and T. Liu, “Topic-to-
essay generation with neural networks.” in Proceedings of the
27th International Joint Conference on Artificial Intelligence,
2018, pp. 4078–4084.
[14]
P. Yang, L. Li, F. Luo, T. Liu, and X. Sun, “Enhancing topic-
to-essay generation with external commonsense knowledge,”
pp. 2002–2012, 2019.
[15]
L. Qiao, J. Yan, F. Meng, Z. Yang, and J. Zhou, “A sentiment-
controllable topic-to-essay generator with topic knowledge
graph,” arXiv preprint arXiv:2010.05511, 2020.
[16]
“Natural language generation market,” Markets and Markets,
Tech. Rep., 2018.
[17]
“Nvidia dgx appliance.” [Online]. Available: https://www.
nvidia.com/en-us/data-center/dgx- station-a100/
[18]
T. J. Ham, S. J. Jung, S. Kim, Y. H. Oh, Y. Park, Y. Song, J.-H.
Park, S. Lee, K. Park, J. W. Lee et al., “A^3: Accelerating
attention mechanisms in neural networks with approximation,”
in 2020 IEEE International Symposium on High Performance
Computer Architecture (HPCA). IEEE, 2020, pp. 328–341.
[19]
T. J. Ham, Y. Lee, S. H. Seo, S. Kim, H. Choi, S. J. Jung, and
J. W. Lee, “Elsa: Hardware-software co-design for efficient,
lightweight self-attention mechanism in neural networks,” in
2021 ACM/IEEE 48th Annual International Symposium on
Computer Architecture (ISCA). IEEE, 2021, pp. 692–705.
[20]
H. Wang, Z. Zhang, and S. Han, “Spatten: Efficient sparse
attention architecture with cascade token and head pruning,”
in 2021 IEEE International Symposium on High-Performance
Computer Architecture (HPCA). IEEE, 2021, pp. 97–110.
[21]
L. Lu, Y. Jin, H. Bi, Z. Luo, P. Li, T. Wang, and Y. Liang,
“Sanger: A co-design framework for enabling sparse attention
using reconfigurable architecture,” in MICRO-54: 54th Annual
IEEE/ACM International Symposium on Microarchitecture,
2021, pp. 977–991.
[22]
D. U. Lee, K. W. Kim, K. W. Kim, H. Kim, J. Y. Kim, Y. J.
Park, J. H. Kim, D. S. Kim, H. B. Park, J. W. Shin et al., “25.2
a 1.2 v 8gb 8-channel 128gb/s high-bandwidth memory (hbm)
stacked dram with effective microbump i/o test methods using
29nm process and tsv,” in 2014 IEEE International Solid-
State Circuits Conference Digest of Technical Papers (ISSCC).
IEEE, 2014, pp. 432–433.
[23]
“Openai api.” [Online]. Available: https://openai.com/blog/
openai-api/
[24]
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever
et al., “Language models are unsupervised multitask learners,”
OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[25]
D. Hendrycks and K. Gimpel, “Gaussian error linear units
(gelus),” arXiv preprint arXiv:1606.08415, 2016.
[26]
J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,”
arXiv preprint arXiv:1607.06450, 2016.
[27]
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
for image recognition,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, 2016, pp. 770–
778.
[28]
“Input:output token ratio.” [Online]. Available: https:
//beta.openai.com/docs/usage-guidelines/use-case-guidelines
[29]
N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani,
P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young
et al., “Mesh-tensorflow: Deep learning for supercomputers,”
arXiv preprint arXiv:1811.02084, 2018.
[30]
Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen,
H. Lee, J. Ngiam, Q. V. Le, Y. Wu et al., “Gpipe: Efficient
training of giant neural networks using pipeline parallelism,”
Advances in neural information processing systems, vol. 32,
pp. 103–112, 2019.
[31]
T. H. Cormen and M. T. Goodrich, “A bridging model
for parallel computation, communication, and i/o,” ACM
Computing Surveys (CSUR), vol. 28, no. 4es, pp. 208–es,
1996.
[32]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert:
Pre-training of deep bidirectional transformers for language
understanding,” arXiv preprint arXiv:1810.04805, 2018.
[33]
J. M. Tarnawski, A. Phanishayee, N. Devanur, D. Mahajan, and
F. Nina Paravecino, “Efficient algorithms for device placement
of dnn graph operators,” Advances in Neural Information
Processing Systems, vol. 33, pp. 15 451–15 463, 2020.
[34]
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and
B. Catanzaro, “Megatron-lm: Training multi-billion parameter
language models using model parallelism,” arXiv preprint
arXiv:1909.08053, 2019.
[35]
J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill,
M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi
et al., “A configurable cloud-scale dnn processor for real-time
ai,” in 2018 ACM/IEEE 45th Annual International Symposium
on Computer Architecture (ISCA). IEEE, 2018, pp. 1–14.
[36]
A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, “Gobo:
Quantizing attention-based nlp models for low latency and
energy efficient inference,” in 2020 53rd Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO).
IEEE, 2020, pp. 811–824.
[37]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan,
P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell
et al., “Language models are few-shot learners,” arXiv preprint
arXiv:2005.14165, 2020.
[38]
“Xilinx aurora 64b/66b ip.” [Online]. Available: https://www.
xilinx.com/products/intellectual-property/aurora64b66b.html
[39]
“Xilinx alveo u280 data center accelerator card.” [Online].
Available: https://www.xilinx.com/products/boards-and-kits/
alveo/u280.html
[40]
“Xilinx vivado.” [Online]. Available: https://www.xilinx.com/
support/university/vivado.html
[41]
“Xilinx vitis unified software platform.” [Online].
Available: https://www.xilinx.com/products/design-tools/vitis/
vitis-platform.html
[42]
“Nvidia v100 gpu.” [Online]. Available: https://www.nvidia.
com/en-us/data-center/v100/
[43]
N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal,
R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al.,
“In-datacenter performance analysis of a tensor processing unit,”
in Proceedings of the 44th annual international symposium
on computer architecture, 2017, pp. 1–12.
[44]
“Google cloud tpu.” [Online]. Available: https://cloud.google.
com/tpu
[45]
H. Levesque, E. Davis, and L. Morgenstern, “The winograd
schema challenge,” in Thirteenth International Conference on
the Principles of Knowledge Representation and Reasoning,
2012.
[46]
O. Bajgar, R. Kadlec, and J. Kleindienst, “Embracing data
abundance: Booktest dataset for reading comprehension,”
arXiv preprint arXiv:1610.00956, 2016.
[47]
A. Li, S. L. Song, J. Chen, J. Li, X. Liu, N. R. Tallent, and K. J.
Barker, “Evaluating modern gpu interconnect: Pcie, nvlink,
nv-sli, nvswitch and gpudirect,” IEEE Transactions on Parallel
and Distributed Systems, vol. 31, no. 1, pp. 94–110, 2019.
[48]
“Nvidia tesla v100 cost.” [Online]. Avail-
able: https://www.microway.com/hpc-tech-tips/nvidia-tesla-
v100-price-analysis/
[49]
“Nvidia tesla v100 cost - amazon.” [Online].
Available: https://www.amazon.com/NVIDIA-Tesla-Volta-
Accelerator-Graphics/dp/B07JVNHFFX
[50]
“Xilinx alveo u280 cost.” [Online]. Available: https:
//colfaxdirect.com/store/pc/viewPrd.asp?idproduct=3720
[51]
“Nvidia dgx-1 cost.” [Online]. Avail-
able: https://www.anandtech.com/show/12587/nvidias-dgx2-
sixteen-v100-gpus-30-tb-of-nvme-only-400k
[52]
T. Tambe, C. Hooper, L. Pentecost, T. Jia, E.-Y. Yang, M. Do-
nato, V. Sanh, P. Whatmough, A. M. Rush, D. Brooks et al.,
“Edgebert: Sentence-level energy optimizations for latency-
aware multi-task nlp inference,” in MICRO-54: 54th Annual
IEEE/ACM International Symposium on Microarchitecture,
2021, pp. 830–844.
[53]
N. P. Jouppi, D. H. Yoon, G. Kurian, S. Li, N. Patil, J. Laudon,
C. Young, and D. Patterson, “A domain-specific supercomputer
for training deep neural networks,” Communications of the
ACM, vol. 63, no. 7, pp. 67–78, 2020.
[54]
N. P. Jouppi, D. H. Yoon, M. Ashcraft, M. Gottscho, T. B.
Jablin, G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma et al., “Ten
lessons from three generations shaped google’s tpuv4i: Indus-
trial product,” in 2021 ACM/IEEE 48th Annual International
Symposium on Computer Architecture (ISCA). IEEE, 2021,
pp. 1–14.
[55]
M. Anderson, B. Chen, S. Chen, S. Deng, J. Fix, M. Gschwind,
A. Kalaiah, C. Kim, J. Lee, J. Liang et al., “First-generation
inference accelerator deployment at facebook,” arXiv preprint
arXiv:2107.04140, 2021.