Accepted Manuscript: 2022 IEEE International Symposium on Secure and Private Execution Environment Design
Accelerating Polynomial Multiplication for
Homomorphic Encryption on GPUs
Kaustubh Shivdikar, Gilbert Jonatan, Evelio Mora, Neal Livesay, Rashmi Agrawal§,
Ajay Joshi§, José L. Abellán, John Kim, David Kaeli
Northeastern University, §Boston University, KAIST University, Universidad Católica de Murcia
Abstract—Homomorphic Encryption (HE) enables users to
securely outsource both the storage and computation of sensitive
data to untrusted servers. Not only does HE offer an attractive
solution for security in cloud systems, but lattice-based HE
systems are also believed to be resistant to attacks by quantum
computers. However, current HE implementations suffer from
prohibitively high latency. For lattice-based HE to become viable
for real-world systems, it is necessary for the key bottlenecks—
particularly polynomial multiplication—to be highly efficient.
In this paper, we present a characterization of GPU-based
implementations of polynomial multiplication. We begin with
a survey of modular reduction techniques and analyze several
variants of the widely-used Barrett modular reduction algorithm.
We then propose a modular reduction variant optimized for 64-bit
integer words on the GPU, obtaining a 1.8× speedup over the
existing comparable implementations. Next, we explore the following
GPU-specific improvements for polynomial multiplication
targeted at optimizing latency and throughput: 1) We present a
2D mixed-radix, multi-block implementation of NTT that results
in a 1.85× average speedup over the previous state-of-the-art.
2) We explore shared memory optimizations aimed at reducing
redundant memory accesses, further improving speedups by
1.2×. 3) Finally, we fuse the Hadamard product with neighboring
stages of the NTT, reducing the twiddle factor memory footprint
by 50%. By combining our NTT optimizations, we achieve
an overall speedup of 123.13× and 2.37× over the previous
state-of-the-art CPU and GPU implementations of NTT kernels, respectively.
Index Terms—Lattice-based cryptography, Homomorphic En-
cryption, Number Theoretic Transform, Modular arithmetic,
Negacyclic convolution, GPU acceleration
I. INTRODUCTION
Computation is increasingly outsourced to remote cloud-
computing services [1], [2]. Encryption provides security
as data is transmitted over the internet. However, classical
encryption schemes require that data be decrypted prior to
performing computation, exposing sensitive data to untrusted
cloud providers [3], [4]. Using Homomorphic Encryption (HE)
allows computations to be run directly on encrypted operands,
offering ideal security in the cloud-computing era (Figure 1).
Moreover, many of the breakthrough HE schemes are lattice-
based, and are believed to be resistant to attacks by quantum
computers [5].
One major challenge in deploying HE in real-world systems
is overcoming the high computational costs associated with
HE. For computation on data encrypted via state-of-the-art HE
Fig. 1. HE provides security from eavesdroppers on the web as well as
untrusted cloud services, as encrypted data can be computed on directly.
schemes—such as HE for Arithmetic of Approximate Num-
bers [6] (also known as HEAAN or CKKS) and TFHE [7]—
a slowdown of 4–6 orders of magnitude is reported, as
compared to running the same computation on unencrypted
data [8], [9]. We aim to accelerate HE by targeting the main
operation in these schemes (and, more generally, in lattice-
based cryptography): polynomial multiplication [10], [11],
[12]. The Number Theoretic Transform (NTT) and modular
reduction are two key bottlenecks in polynomial multiplication
(and, by extension, in HE), as evidenced by the performance
profiling of several lattice-based cryptographic algorithms by
Koteshwara et al. [13]. As lattice-based HE schemes have
continued to establish themselves as leading candidates for
privacy-preserving computing and other applications, there has
been an increased focus on optimization and acceleration of
these core operations [14], [15], [16].
For most real-world applications of lattice-based HE, the
number N of polynomial coefficients and the modulus Q
need to be large to guarantee a strong level of security and
a higher degree of parallelism [9]. For example, N = 2^16
and ⌈log2(Q)⌉ = 1240 are the default values in the HEAAN
library. The large values for N and Q translate to heavy workload
demands, requiring a significant amount of computational
power to evaluate modular arithmetic expressions, as well as
placing high demands on the memory bandwidth utilization.
HE workloads possess high levels of data parallelism [17].
Existing compute systems such as general-purpose CPUs do
not scale well since they are unable to fully exploit this
parallelism for such data-intensive workloads. However, the
SIMD-style GPU platforms, with their thousands of cores
and high bandwidth memory (HBM), are natural candidates
arXiv:2209.01290v1 [cs.CR] 2 Sep 2022
Fig. 2. Our contributions: 4 major optimizations incorporated into 3 kernels.
for accelerating these highly parallelizable workloads. The
potential of the GPU platform to accelerate HE has motivated
a rapidly growing body of work over the past year [9], [18],
[19], [20], [21], [22], [23], [24], [25], [26].
To address performance bottlenecks in existing polynomial
multiplication algorithms, we begin by analyzing the Barrett
modular reduction algorithm [27], as well as the algorithm’s
variants [21], [26], [28] which have been utilized in prior
HE schemes. We then analyze various NTT implementations,
including mixed-radix and 2D implementations, which we
tune to improve memory efficiency. Finally, we apply a
number of GPU-specific optimizations to further accelerate
HE. By combining all our optimizations, we achieve an overall
speedup of 123.13× and 2.37× over the previous state-of-the-art
CPU [29] and GPU [21] implementations of NTT kernels,
respectively. Our key contributions are as follows (Figure 2):
1) We propose an instantiation of the Dhem–Quisquater [28]
class of Barrett reduction variants
which is optimized for HE, providing a 1.85× speedup
over prior studies [21], [23], [26], [30].
2) We present a mixed-radix, 2D NTT implementation to
effectively exploit temporal and spatial locality, resulting
in a 1.91× speedup over the radix-2 baseline.
3) We propose a fused polynomial multiplication algorithm,
which fuses the Hadamard product with its neighboring
butterfly operations using an application of Karatsuba's
Algorithm [31]. This reduces the twiddle factor memory
footprint size by 50%.
4) We incorporate the use of low-latency, persistent shared
memory in our single-block NTT GPU kernel implementation,
reducing the number of redundant data fetches
from global memory, providing a further 1.25× speedup.
II. MODULAR REDUCTION
Modular reduction is a key operation and computational
bottleneck in lattice-based cryptography [32]. This section
is a self-contained survey of modular reduction algorithms,
particularly Barrett reduction [27], a widely-used algorithm
that we utilize in our work.
Following Shoup [33], we define the bit length len(a) of
a positive integer a to be the number of bits in the binary
representation of a; more precisely, len(a) = ⌊log2(a)⌋ + 1.
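In code, this bit length is simply the integer's binary length; a quick Python illustration (the function name is ours):

```python
def bit_len(a: int) -> int:
    """len(a) = floor(log2(a)) + 1 for a positive integer a."""
    assert a > 0
    return a.bit_length()  # Python's built-in binary length

# 994705409 (a 30-bit prime used later in this section) has len = 30
print(bit_len(1), bit_len(8), bit_len(994705409))  # 1 4 30
```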
A. Background: modular reduction and arithmetic
Let x mod q denote the remainder of a nonnegative integer
x divided by a positive integer q. The naive method
for performing modular reduction—i.e., the computation of
x mod q—is via an integer division operation:

x mod q = x − ⌊x/q⌋ × q.

However, there are a number of alternative methods for
performing modular reduction—especially in conjunction with
arithmetic operations such as addition and multiplication—that
avoid expensive integer division operations.
For example, Algorithm 1 specifies a simple and efficient
computation of the modular reduction of a sum. Let β denote
the word size (e.g., β = 32 or 64). Observe that either a + b
lies in [0, q) and is reduced, or a + b lies in [q, 2q) and requires
a single correctional subtraction to become reduced (see lines
2–3). The restriction len(q) ≤ β − 1 prevents overflow of the
transient operations (i.e., a + b).
Algorithm 1 A baseline modular addition algorithm
Require: 0 ≤ a, b < q, len(q) ≤ β − 1
Ensure: sum = (a + b) mod q
1: sum ← a + b
2: if sum ≥ q then
3: sum ← sum − q
4: return sum
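A direct transcription of Algorithm 1 in Python (a sketch; on a GPU the branch would typically compile to a predicated select):

```python
def mod_add(a: int, b: int, q: int) -> int:
    """Algorithm 1: modular addition with one correctional subtraction.

    Assumes 0 <= a, b < q and len(q) <= beta - 1, so a + b cannot
    overflow a beta-bit word on real hardware."""
    s = a + b
    if s >= q:        # a + b lies in [q, 2q)
        s -= q
    return s

print(mod_add(5, 9, 11))  # 3
```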
There are multiple methods for reducing products. In lattice-
based cryptography, commonly used algorithms for imple-
mentations on hardware platforms such as CPUs and GPUs
include the algorithms of Barrett [27], Montgomery [34], and
Shoup [35], [36]. In this paper, we select Barrett’s algorithm
as our baseline, as Barrett’s algorithm enjoys the following
1) Low overhead: It requires a low-cost pre-computation
(and storage) of a single word-size integer µ.
2) Versatility: It may be used effectively in contexts where
multiple products are reduced modulo q.
3) Generality: It does not restrict to special classes of
moduli, such as Mersenne primes (see, e.g., [37], [38]).
4) Performant: It is significantly faster than integer di-
vision, and has comparable runtime performance with
Montgomery’s algorithm (see, e.g., [39]).
Barrett's algorithm is used in many open-source libraries,
including cuHE [40], PALISADE [41], and HEAAN [20], [6].
The Barrett reduction algorithm, and our proposed variant for
use in HE, are analyzed in Section II-B.
B. Barrett modular reduction: analysis and optimization
Next, we provide details of Barrett modular reduction and
then explore potential improvements. Algorithm 2 specifies
the classical reduction algorithm of Barrett [27]. Note that if
Fig. 3. Modular reduction profile comparison of architectural parameters (a, b)
and causes of warp stalls (c, d): (a) NTT workload modular reduction comparison;
(b) inverse-NTT workload modular reduction comparison; (c) NTT stall histogram
across various modular reduction algorithms; (d) inverse-NTT stall histogram
across various modular reduction algorithms.
Algorithm 2 Classical Barrett reduction
Require: m = len(q) ≤ β − 2, 0 ≤ x < 2^{2m}, µ = ⌊2^{2m}/q⌋
Ensure: rem = x mod q
1: c ← x ≫ (m − 1)
2: quot ← (c × µ) ≫ (m + 1)
3: rem ← x − quot × q
4: if rem ≥ q then
5: rem ← rem − q
6: if rem ≥ q then
7: rem ← rem − q
8: return rem
0 ≤ a, b < q and m = len(q), then x = a × b satisfies
the condition 0 ≤ x < 2^{2m} specified in Algorithm 2.
This algorithm is commonly used in HE acceleration studies
targeting a GPU [24], [30], [23], [42]. As noted by Sahu
et al. [23], the pre-computed constant µ and the transient
operations (excluding the product c × µ) are preferably
word-sized. This condition imposes the restriction len(q) ≤ β − 2.
Note that the classical Barrett reduction may require zero,
one, or two correctional subtractions; see lines 4–7 in
Algorithm 2. As noted by Barrett [27], a second conditional
subtraction is required in approximately 1% of the cases. There
have been several attempts to modify Barrett's algorithm to
eliminate the need for a second conditional subtraction. The
algorithms proposed by Özerk et al. [21] and Lee et al. [26]
each require two correctional subtractions to fully reduce the
product of a = 994674970 and b = 994705408 modulo
q = 994705409, although we found experimentally that Özerk
et al.'s proposed reduction algorithm only requires a second
conditional subtraction in 0.22% of cases.
Dhem–Quisquater [28] defines a class of Barrett modular
Algorithm 3 Dhem–Quisquater's modified Barrett reduction
Require: m = len(q) ≤ β − 4, 0 ≤ x < 2^{2m}, µ = ⌊2^{2m+3}/q⌋
Ensure: rem = x mod q
1: c ← x ≫ (m − 2)
2: quot ← (c × µ) ≫ (m + 5)
3: rem ← x − quot × q
4: if rem ≥ q then
5: rem ← rem − q
6: return rem
reduction variants (with parameters α and β) that require at
most one correctional subtraction. A commonly used (see,
e.g., Kong and Philips [43] and Wu et al. [44]) instantiation
of Dhem–Quisquater's class of algorithms is specified in
Algorithm 3 (setting parameters α = N + 3 and β = −2,
as defined in Dhem–Quisquater [28]). Notably, this instantiation
is used in the PALISADE HE Software Library [41].
Although Algorithm 3 provides an improvement in algorithmic
complexity over Algorithm 2, it further restricts the modulus
to at most length (β − 4) to ensure µ is word-sized.
As discussed by Kim et al. [20], restrictions on the modulus
size are significant in the context of optimizing HE, as the
modulus size is inversely related to the workload size. To
elaborate, polynomial multiplication is typically performed with
respect to a large composite modulus Q. If each prime factor
of Q is m bits, then the Chinese Remainder Theorem can be
used to partition the computation of polynomial multiplication
with respect to Q into ⌈len(Q)/m⌉ simpler computations of
polynomial multiplication with respect to the m-bit factors.
For example, if len(Q) = 1240, then the restriction from
30-bit to 28-bit moduli increases the workload size (i.e.,
⌈len(Q)/m⌉) by 7.14%.
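The workload-size arithmetic above can be checked directly:

```python
import math

len_Q = 1240                         # bit length of the composite modulus Q
limbs_30 = math.ceil(len_Q / 30)     # CRT residues with 30-bit prime factors
limbs_28 = math.ceil(len_Q / 28)     # CRT residues with 28-bit prime factors

print(limbs_30, limbs_28)                         # 42 45
print(round((limbs_28 / limbs_30 - 1) * 100, 2))  # 7.14
```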
Algorithm 4 Proposed Barrett reduction optimized for a GPU
Require: m = len(q) ≤ β − 2, 0 ≤ x < 2^{2m}, µ = ⌊2^{2m+1}/q⌋
Ensure: rem = x mod q
1: c ← x ≫ (m − 2)
2: quot ← (c × µ) ≫ (m + 3)
3: rem ← x − quot × q
4: if rem ≥ q then
5: rem ← rem − q
6: return rem
Therefore, we propose Algorithm 4 for use in HE implementations
on a GPU. Similar to Algorithm 3, Algorithm 4 is
an instantiation of Dhem–Quisquater [28] (for α = N + 1 and
β = −2) that requires at most one correctional subtraction.
However, Algorithm 4 allows for moduli q of length up to
β − 2, and thus results in no increase in the workload size.
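A word-level sketch of Algorithms 2 and 4 in Python (arbitrary-precision integers stand in for the GPU's fixed-width words; the subtraction counts follow the analysis above):

```python
def barrett_classical(x: int, q: int, m: int) -> int:
    """Algorithm 2 (classical Barrett): up to two correctional subtractions."""
    mu = (1 << (2 * m)) // q          # precomputed: floor(2^(2m) / q)
    c = x >> (m - 1)
    quot = (c * mu) >> (m + 1)
    rem = x - quot * q
    if rem >= q:
        rem -= q
    if rem >= q:
        rem -= q
    return rem

def barrett_proposed(x: int, q: int, m: int) -> int:
    """Algorithm 4 (proposed variant): at most one correctional subtraction."""
    mu = (1 << (2 * m + 1)) // q      # precomputed: floor(2^(2m+1) / q)
    c = x >> (m - 2)
    quot = (c * mu) >> (m + 3)
    rem = x - quot * q
    if rem >= q:
        rem -= q
    return rem

# 30-bit example from Section II-B
a, b, q = 994674970, 994705408, 994705409
m = q.bit_length()                    # m = len(q) = 30
x = a * b                             # x < 2^(2m)
assert barrett_classical(x, q, m) == x % q
assert barrett_proposed(x, q, m) == x % q
```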
Figure 3 provides a snapshot of the performance of various
modular reduction kernels on a V100 GPU. A detailed
description of each parameter is provided in Tables I and
II in Section IV. The values in Figure 3(a,b) are normalized
to the built-in implementation of modular reduction on GPUs
(which utilizes the modulo % operator). In Figure 3(a,b) we
see significant improvements in the proposed Barrett reduction,
as marked by the speedups due to improved compute and
memory throughput. The performance improvements achieved
can be attributed to our implementation requiring at most
1 correctional subtraction (as compared to 2 for others).
Figure 3(c,d) highlight the primary causes of kernel
stalls while executing the NTT and inverse-NTT kernels, respectively. In
Figure 3(c), the longest stall (measured in the average number
of cycles per instruction) for the NTT workload is due to a
"Math Pipe Throttle", which results when the kernel begins to
saturate the ALU instruction pipeline (see Table I). Figure 3(d)
reports the cause of stalls in inverse-NTT, with the longest
stall caused by a "Wait", which signifies the scheduler has an
abundance of "Ready" warps and is starting to saturate the
streaming multiprocessors (SMs) (see Table II).
In Figure 4, we present a comparison of the implementations
of the modular reduction algorithms described in this section.
We report the execution time of a single modular reduction
operation for 28-, 29-, and 30-bit prime numbers as run on a
V100 GPU. The operands and moduli are randomly sampled
from a uniform distribution. The classical Barrett reduction
algorithm is significantly faster than reduction by integer
division (i.e., the built-in reduction), as shown in Figure 4.
Algorithm 4 has nearly identical performance to Algorithm 3
for 28-bit moduli (while permitting 29- and 30-bit moduli as
well). Algorithm 4 has a 1.22× speedup over the classical
Barrett reduction for 30-bit primes. To our knowledge, the
specific instantiation of Dhem–Quisquater modular reduction
specified in Algorithm 4 does not appear in an open-source
library nor in the literature.
Fig. 4. Execution times of modular reduction implementations for 28-, 29-, and
30-bit prime numbers (on the V100 GPU), averaged over 10,000 iterations.
The error bars represent ranges. The "builtin reduction" uses the CUDA %
construct for modular reduction. Bars compare the builtin reduction (x % q),
classical Barrett (Algo. 2), Dhem–Quisquater (Algo. 3; N/A for 29- and 30-bit
moduli), and the proposed variant (Algo. 4).
III. POLYNOMIAL MULTIPLICATION
For m > 0, define Z_m to be the set {0, 1, 2, . . . , m − 1}
together with the operations of modular addition (a, b) ↦
(a + b) mod m and modular multiplication (a, b) ↦
(a × b) mod m. The naive algorithm for multiplying two polynomials
Σ_{i=0}^{N−1} a_i x^i and Σ_{i=0}^{N−1} b_i x^i requires order N² arithmetic
operations. It is well known [45] that the number of operations
can be reduced to the order of N log(N) using the Fast Fourier
Transform (FFT) algorithm.
It is convenient to represent a polynomial Σ_{i=0}^{N−1} a_i x^i as an
N-dimensional coefficient vector a = (a_0, a_1, . . . , a_{N−1}).
A. Background: Number Theoretic Transform
In this section, we give a brief review of the Discrete Fourier
Transform (DFT) and the Fast Fourier Transform (FFT) for the
special case where the field of coefficients is Z_q, for q a prime.
The DFT and FFT over Z_q are both commonly—and often
confusingly—referred to as the Number Theoretic Transform
(NTT). In the classical setup for the NTT, the parameters N,
q, and ω satisfy the following properties:
1) N > 1 is a power of 2;
2) q is a prime number such that N divides q − 1; and
3) ω is a primitive Nth root of unity in Z_q; i.e., ω^i = 1 if
and only if i is a multiple of N.
The N-point NTT (DFT) with respect to ω is the function
NTT_ω : (Z_q)^N → (Z_q)^N defined by NTT_ω(a) =
(Σ_{i=0}^{N−1} a[i] ω^{ij})_{j=0}^{N−1}. The inverse transformation of NTT_ω
is (1/N)·NTT_{ω⁻¹}. Famously, the cyclic convolution [46] of
vectors a and b in (Z_q)^N can be computed in the order
of N log(N) arithmetic operations via the expression
(1/N)·NTT_{ω⁻¹}(NTT_ω(a) ⊙ NTT_ω(b)), where ⊙ denotes the
Hadamard product (i.e., entry-wise multiplication) on (Z_q)^N.
A closely related operation to cyclic convolution is negacyclic
convolution, which is widely known as polynomial multiplication
in the context of lattice-based cryptography [47].
The setup for polynomial multiplication has parameters N, q,
and ψ satisfying the following properties:
1) N > 1 is a power of 2;
2) q is a prime such that 2N is a divisor of q − 1; and
3) ψ is a primitive 2Nth root of unity in Z_q (which implies
that ω = ψ² is a primitive Nth root of unity).
Let Ψ and Ψ⁻¹ denote the vectors of "twiddle factors" in
(Z_q)^N, defined by Ψ[i] = ψ^i and Ψ⁻¹[i] = ψ^{−i} for all i.
Fig. 5. (a) Negacyclic convolution block diagram. (b) Hadamard product and its neighboring butterflies. (c) Fusion of butterflies into Hadamard product.
Then the negacyclic convolution a ⊛ b of vectors a and b in
(Z_q)^N satisfies the following relation [48]:

a ⊛ b = Ψ⁻¹ ⊙ (1/N)·NTT_{ω⁻¹}(NTT_ω(Ψ ⊙ a) ⊙ NTT_ω(Ψ ⊙ b)).
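The negacyclic relation (the standard ψ-weighted NTT identity) can be checked numerically for small parameters; the following sketch uses a naive O(N²) NTT with toy values N = 8, q = 17, ψ = 3 (so 2N = 16 divides q − 1 = 16):

```python
N, q, psi = 8, 17, 3            # 2N | q-1; psi is a primitive 2N-th root of unity mod q
omega = psi * psi % q           # primitive N-th root of unity
inv_psi, inv_omega, inv_N = (pow(v, -1, q) for v in (psi, omega, N))

def ntt(a, w):
    """Naive O(N^2) DFT over Z_q with root w."""
    return [sum(a[i] * pow(w, i * j, q) for i in range(N)) % q for j in range(N)]

def negacyclic_direct(a, b):
    """Schoolbook negacyclic convolution: x^N wraps around as -1."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            s, k = (1, i + j) if i + j < N else (-1, i + j - N)
            c[k] = (c[k] + s * a[i] * b[j]) % q
    return c

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
# a ⊛ b = Ψ^{-1} ⊙ (1/N)·NTT_{ω^{-1}}( NTT_ω(Ψ ⊙ a) ⊙ NTT_ω(Ψ ⊙ b) )
wa = ntt([a[i] * pow(psi, i, q) % q for i in range(N)], omega)
wb = ntt([b[i] * pow(psi, i, q) % q for i in range(N)], omega)
inv = ntt([x * y % q for x, y in zip(wa, wb)], inv_omega)
c = [inv[i] * inv_N % q * pow(inv_psi, i, q) % q for i in range(N)]
assert c == negacyclic_direct(a, b)
```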
The NTT algorithm (i.e., the FFT) used to compute the NTT
mathematical function (i.e., the DFT) consists of an iteration of
stages, in which computations are performed in the form of
butterfly operations. The computational graphs for the well-
studied (radix-2) Cooley–Tukey (CT) butterfly [49] and the
Gentleman–Sande (GS) butterfly [50] are shown in Figure 6.
Fig. 6. The Cooley–Tukey (left) and Gentleman–Sande butterflies (right).
Pöppelmann et al. [47] define an elegant algorithmic specification
for polynomial multiplication using NTTs based on the
CT and GS butterflies. Their design utilizes two specialized
variants of the FFT/NTT:
1) the merged CT NTT, NTT^{CT}_{no→bo}, defined by Roy et
al. [51] (see Algorithm 5); and
2) the merged GS NTT, NTT^{GS}_{bo→no}, defined by
Pöppelmann et al. [47] (see Algorithm 6).
Algorithm 5 Merged CT NTT, NTT^{CT}_{no→bo}
Require: a ∈ (Z_q)^N, permuted twiddle factors Ψ_br
1: m ← 1
2: k ← N/2
3: while m < N do
4: for i = 0 to m − 1 do
5: jFirst ← 2 × i × k
6: jLast ← jFirst + k − 1
7: ξ ← Ψ_br[m + i]
8: for j = jFirst to jLast do
9: [a[j], a[j+k]] ← [a[j] + ξ × a[j+k] mod q, a[j] − ξ × a[j+k] mod q]
10: m ← 2 × m
11: k ← k/2
12: return a
Algorithm 6 Merged GS NTT, NTT^{GS}_{bo→no}
Require: a ∈ (Z_q)^N, permuted twiddle factors Ψ_br
1: m ← N/2
2: k ← 1
3: while m ≥ 1 do
4: for i = 0 to m − 1 do
5: jFirst ← 2 × i × k
6: jLast ← jFirst + k − 1
7: ξ ← Ψ_br[m + i]
8: for j = jFirst to jLast do
9: [a[j], a[j+k]] ← [a[j] + a[j+k] mod q, ξ × (a[j] − a[j+k]) mod q]
10: m ← m/2
11: k ← 2 × k
12: return a
In Algorithms 5 and 6, br denotes the bit-reversal of a
log2(N)-bit binary sequence, and Ψ_br denotes the twiddle
factors permuted with respect to br; i.e., Ψ_br[i] = ψ^{br(i)} for
all i in [0, N). Polynomial multiplication can be computed via
the merged CT and GS NTTs as follows [47]:

a ⊛ b = (1/N)·NTT^{GS⁻¹}_{bo→no}(NTT^{CT}_{no→bo}(a) ⊙ NTT^{CT}_{no→bo}(b))   (1)
The advantages of this algorithmic specification for polyno-
mial multiplication include the following:
1) Hadamard products omitted: The multiplications by powers
of ψ, i.e., the Hadamard products with Ψ and Ψ⁻¹,
are "merged" into the NTT computations, saving a total
of 3N modular multiplications.
2) Bit-reversal permutations omitted: The merged CT NTT
takes its input in normal order and returns its output
in bit-reversed order (hence no→bo), and
vice versa for the merged GS NTT. This removes the
need for intermediate permutations to correct the order.
3) Good spatial locality: In the merged CT NTT, the
twiddle factors Ψbr are read in sequential order. In the
merged GS NTT, the twiddle factors are read sequen-
tially during each stage.
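The merged transforms can be sketched in Python (following the standard Longa–Naehrig formulation of Algorithms 5 and 6; N = 8, q = 17, ψ = 3 are toy parameters of our choosing), verifying both the round trip and the negacyclic convolution pipeline:

```python
def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r, i = (r << 1) | (i & 1), i >> 1
    return r

def ntt_ct(a, psi_br, q):
    """Algorithm 5: merged CT NTT, normal order in, bit-reversed order out."""
    a, n = a[:], len(a)
    t, m = n, 1
    while m < n:
        t //= 2
        for i in range(m):
            s = psi_br[m + i]
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t] * s % q
                a[j], a[j + t] = (u + v) % q, (u - v) % q
        m *= 2
    return a

def intt_gs(a, inv_psi_br, q):
    """Algorithm 6 with psi^-1 twiddles, plus a final 1/N scaling."""
    a, n = a[:], len(a)
    t, m = 1, n // 2
    while m >= 1:
        j1 = 0
        for i in range(m):
            s = inv_psi_br[m + i]
            for j in range(j1, j1 + t):
                u, v = a[j], a[j + t]
                a[j], a[j + t] = (u + v) % q, (u - v) * s % q
            j1 += 2 * t
        t *= 2
        m //= 2
    inv_n = pow(n, -1, q)
    return [x * inv_n % q for x in a]

N, q, psi = 8, 17, 3
bits = N.bit_length() - 1
psi_br = [pow(psi, bit_reverse(i, bits), q) for i in range(N)]       # Ψ_br
inv_psi_br = [pow(psi, -bit_reverse(i, bits), q) for i in range(N)]  # Ψ⁻¹_br

a = [3, 1, 4, 1, 5, 9, 2, 6]
b = [2, 7, 1, 8, 2, 8, 1, 8]
assert intt_gs(ntt_ct(a, psi_br, q), inv_psi_br, q) == a  # round trip

# Equation (1): the merged pipeline computes the negacyclic convolution
fa, fb = ntt_ct(a, psi_br, q), ntt_ct(b, psi_br, q)
c = intt_gs([x * y % q for x, y in zip(fa, fb)], inv_psi_br, q)
ref = [0] * N                                  # schoolbook reference
for i in range(N):
    for j in range(N):
        k, s = (i + j, 1) if i + j < N else (i + j - N, -1)
        ref[k] = (ref[k] + s * a[i] * b[j]) % q
assert c == ref
```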
Zhang et al. [52] propose a technique to merge the
1/N-scaling operation in Equation (1) into the GS NTT. Rather
than performing entry-wise modular multiplication by 1/N,
Zhang et al. multiply the output of each butterfly operation
by 1/2 modulo q. Observe that:

x/2 mod q = x ≫ 1, if x is even;
x/2 mod q = (x ≫ 1) + (q + 1)/2, if x is odd.

The computation of x/2 mod q can be implemented without
divisions, products, or branching via the expression

(x ≫ 1) + (x & 1) × ((q + 1) ≫ 1).
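The halving trick can be checked exhaustively for a small odd modulus (a sketch; the function name is ours):

```python
def halve_mod(x: int, q: int) -> int:
    """Compute x/2 mod q for odd q without division or branching:
    (x >> 1) + (x & 1) * ((q + 1) >> 1)."""
    return (x >> 1) + (x & 1) * ((q + 1) >> 1)

q = 17
inv2 = pow(2, -1, q)  # 2^{-1} mod q, the reference value
for x in range(q):
    assert halve_mod(x, q) % q == x * inv2 % q
```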
Özerk et al. [21] use this technique to merge 1/N-scaling into
NTT^{GS⁻¹}_{bo→no} (also see their open-source code [53]). We write
NTT^{GS⁻¹}_{bo→no, 1/N} to denote the merging of NTT^{GS⁻¹}_{bo→no} with
1/N-scaling. Incorporating this NTT into (1) gives the following
algorithm specification for polynomial multiplication:

a ⊛ b = NTT^{GS⁻¹}_{bo→no, 1/N}(NTT^{CT}_{no→bo}(a) ⊙ NTT^{CT}_{no→bo}(b))   (2)

This algorithm specification is the basis for all of our implementations
of polynomial multiplication.
B. Proposed optimization: fused polynomial multiplication
Alkim et al. [54] propose several techniques for integrating
the Hadamard product with its neighboring butterflies. They
specify polynomial multiplication algorithms involving one,
two, and three-stage integrations. These algorithms have sig-
nificantly reduced complexity for the multiplication of two
polynomials. However, the complexity of multiplying larger
numbers of polynomials may be significantly increased, espe-
cially when more stages are integrated.
We propose a single-stage fused polynomial multiplication,
which offers significant speedup for multiplying two polynomials
at minimized cost for multiplying larger numbers of
polynomials. Our proposal uses Karatsuba's algorithm [31] to
reduce the number of modular products by N/2 compared to
the single-stage algorithm of Alkim et al. [54].
Consider the computational subgraph of Equation (2) induced
by the final stage of NTT^{CT}_{no→bo}, the Hadamard product
⊙, and the first stage of NTT^{GS⁻¹}_{bo→no, 1/N}. Each of the N/2
connected components in this graph is of the form

[c0, c1] = (1/2) · GS_{α⁻¹}(CT_α([a0, a1]) ⊙ CT_α([b0, b1]))   (3)

for some twiddle factor α and inputs a0, a1, b0, and b1 (see
Figure 5). Thus, the computation for each component consists
of 5 (modular) product operations, 2 scaling-by-1/2 operations, 6
sum/difference operations, and 2 twiddle-factor memory accesses. The output
of the computation in expression (3) is

[c0, c1] = [a0 × b0 + α² × a1 × b1 mod q,  a0 × b1 + a1 × b0 mod q].   (4)
Algorithm 7 also computes expression (3), but requires 4
products, 0 scalings by 1/2, 5 sums/differences, and 1 memory
access. This variation on Karatsuba's Algorithm [31] relies on
the fact that a0 × b1 + a1 × b0 mod q is equivalent to
(a0 + a1) × (b0 + b1) − a0 × b0 − a1 × b1 mod q.
Algorithm 7 Butterflies fused into the Hadamard product
Require: [a0, a1], [b0, b1] ∈ (Z_q)², twiddle factor α² ∈ Z_q
Ensure: [c0, c1] = [a0 × b0 + α² × a1 × b1 mod q, a0 × b1 + a1 × b0 mod q]
1: prod1 ← a0 × b0 mod q
2: prod2 ← a1 × b1 mod q
3: sum1 ← a0 + a1 mod q
4: sum2 ← b0 + b1 mod q
5: prod3 ← sum1 × sum2 mod q
6: prod4 ← α² × prod2 mod q
7: sum3 ← prod1 + prod4 mod q
8: sum4 ← prod3 − prod1 mod q
9: sum5 ← sum4 − prod2 mod q
10: [c0, c1] ← [sum3, sum5]
11: return [c0, c1]
We say that Algorithm 7 fuses the CT and GS butterflies into
the Hadamard product.
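Algorithm 7 can be checked against expression (4) directly (a sketch with toy values; α² is passed in precomputed, as in the algorithm):

```python
def fused_butterfly(a0, a1, b0, b1, alpha_sq, q):
    """Algorithm 7: Karatsuba-style fusion of the CT and GS butterflies
    into the Hadamard product (4 products, 5 sums/differences)."""
    prod1 = a0 * b0 % q
    prod2 = a1 * b1 % q
    sum1 = (a0 + a1) % q
    sum2 = (b0 + b1) % q
    prod3 = sum1 * sum2 % q           # (a0+a1)(b0+b1)
    prod4 = alpha_sq * prod2 % q
    sum3 = (prod1 + prod4) % q        # c0 = a0*b0 + alpha^2 * a1*b1
    sum4 = (prod3 - prod1) % q
    sum5 = (sum4 - prod2) % q         # c1 = a0*b1 + a1*b0
    return sum3, sum5

q, alpha = 17, 3
a0, a1, b0, b1 = 5, 11, 7, 13
c0, c1 = fused_butterfly(a0, a1, b0, b1, alpha * alpha % q, q)
assert c0 == (a0 * b0 + alpha * alpha * a1 * b1) % q   # expression (4), first entry
assert c1 == (a0 * b1 + a1 * b0) % q                   # expression (4), second entry
```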
To define the fused polynomial multiplication algorithm,
we first define truncated versions of the CT and GS NTTs.
Define the truncated CT NTT, ÑTT^{CT}_{no→bo}, to be the merged
CT NTT with the final stage omitted (i.e., line 3 in Algorithm 5
is replaced with "while m < (N/2) do"). Likewise, define
the truncated GS NTT, ÑTT^{GS⁻¹}_{bo→no, 1/2}, to be the merged GS
NTT with the first stage omitted (i.e., line 3 in Algorithm 6
is replaced with "while m > 1 do"). Our proposed fused
polynomial multiplication is specified in Algorithm 8.
Algorithm 8 Proposed fused polynomial multiplication
Require: a, b ∈ (Z_q)^N, permuted twiddle factors Ψ_br
Ensure: c = a ⊛ b
1: â ← ÑTT^{CT}_{no→bo}(a)
2: b̂ ← ÑTT^{CT}_{no→bo}(b)
3: for i = 0 to N/2 − 1 do
4: u ← â[2i] × b̂[2i] mod q
5: v ← â[2i+1] × b̂[2i+1] mod q
6: w ← (â[2i] + â[2i+1]) × (b̂[2i] + b̂[2i+1]) mod q
7: y ← w − u mod q
8: ĉ[2i+1] ← y − v mod q
9: z ← v × Ψ_br[N/4 + ⌊i/2⌋] mod q
10: if i is even then
11: ĉ[2i] ← u + z mod q
12: else
13: ĉ[2i] ← u − z mod q
14: c ← ÑTT^{GS⁻¹}_{bo→no, 1/2}(ĉ)
15: return c
The benefits of our proposed fused polynomial multiplica-
tion algorithm include the following:
1) Fewer operations: N/2 fewer modular product operations,
N fewer scaling-by-1/2 operations, N/2 fewer
sum/difference operations, and N/2 fewer memory accesses
(but N/4 additional negations).
2) Number of twiddle factors halved: The second half of the
entries in the twiddle factor arrays for each of the merged
NTTs are not used in fused polynomial multiplication
and can be omitted.
3) Re-use of recently-accessed twiddle factors: The twiddle
factors read in the last stage of the truncated CT NTT
are immediately re-used in the fused Hadamard product.
IV. METHODOLOGY
We implement our polynomial multiplication kernels targeting
NVIDIA's 7th generation Volta GPU architecture, the
V100 PCIe GPU with 16 GB onboard memory. The V100 has
a multi-level memory, as shown in Figure 7. The V100 features
a highly tuned high-bandwidth memory (HBM2), which is
called global memory in the CUDA framework. The global
memory, being the largest in capacity, has the highest latency
to access data (1029 cycles) [55]. The V100 provides a 128
KB L1 data cache and a 128 KB L1 instruction cache per
SM, as well as a unified L2 cache for data and instructions
(6.1 MB in size). Each SM on a V100 has shared memory
(configurable in size up to 96 KB). Data accesses to
shared memory are much more efficient (i.e., 19 cycles) as
compared to accesses to global memory (1029 cycles) [56].
Effective use of the memory hierarchy, and especially shared
memory, on a GPU is critical to obtaining the best performance [57].
Our single-block implementation of NTT utilizes
shared memory for local data caching, thus reducing the
number of redundant fetches from global memory [58] by
a factor of log(N) (where N is the size of the input
coefficient array). Furthermore, we improve cache efficiency
by increasing the spatial locality of our data access patterns,
exploiting memory coalescing on the GPU [58], as described
in Section V-C.
We obtain performance metrics for our kernels using
hardware performance counters and binary instrumentation
tools. We explore performance bottlenecks using a variety
of tools, including the NVIDIA Binary Instrumentation Tool
(NVBit) [59] for tracing memory transactions, Nsight
Compute for fetching performance counters, and Nsight
Systems [60] for kernel scheduler performance, as
well as for measuring synchronization overheads. We compare
Fig. 7. V100 GPU memory hierarchy and latency comparison: 16 GB high
bandwidth memory (1029 cycles); 6144 KB unified, 16-way set-associative
L2 cache with 64 B cache lines (193 cycles); per-SM 128 KB L1 data cache
(28 cycles), 128 KB L1 instruction cache, and up to 96 KB shared memory
(19 cycles).
kernel performance based on "Architectural Profile" and "Stall
Profile" plots. The "Arch Profile" compares the relative change
as compared to a baseline (see Table I), whereas the "Stall
Profile" provides information on the primary causes of a kernel
stall during execution (see Table II).
TABLE I. Architectural profile parameters.
Parameter | Description
SM Throughput | % of cycles the SM was busy
Avg. IPC | Average # of instructions per cycle
ALU | ALU pipeline utilization
DRAM B/W | % of peak memory transactions the DRAM processed per second
L1$ and L2$ B/W | % of peak memory transactions the L1$ and L2$ processed per second, respectively
L1$ and L2$ Hit-Rate | % of memory transactions the L1$ and L2$ fulfilled successfully
Regs/Thread | # of registers used by each thread of the warp
Issued Warps | Avg. # of warps issued per second by scheduler
TABLE II. Warp stall reasons.
Type of stall | Reason
Long Scoreboard | Waiting for a scoreboard dependency on an L1$ operation
Math Pipe Throttle | Waiting for the ALU execution pipe to be available
Wait | Waiting on fixed-latency execution dependency; indicates a highly optimized kernel
Not Selected | Waiting for the scheduler to select the warp; indicates warps oversubscribed to scheduler
Selected | Warp was selected by the micro scheduler
Barrier | Waiting for sibling warps at a sync barrier; indicates diverging code paths before a barrier
LG Throttle | Waiting for the L1 instruction queue
Short Scoreboard | Scoreboard dependency on shared memory; indicates higher shared memory utilization
MIO Throttle | Stalled on MIO (memory I/O) instruction queue
Branch Resolving | Waiting for a branch target to be computed
Dispatch Stall | Warp stalled because dispatcher holds back issuing due to conflicts or events
IMC Miss | Waiting for an immediate constant cache (IMC) miss
No Instruction | Waiting after an instruction cache miss
We observe that the NTT kernel is a memory-bound work-
load, heavily bottlenecked by the GPU’s DRAM latency. The
butterfly operation is one of the key computations within
the NTT kernel. This operation is characterized by strided
accesses, with the stride varying with each stage. The changes
in the stride lead to non-sequential memory accesses, reducing
the spatial locality of the NTT kernel. To effectively leverage
memory coalescing, we can partition data carefully across
CUDA threads [61]. We propose three different implemen-
tations of NTT kernels, each optimized for different input
sizes and employing different data partitioning techniques.
Fig. 8. (a) Architectural performance profile of shared memory NTT and iNTT workloads compared against respective global memory workloads. (b) Stall
profile of NTT workload comparing global and shared memory kernels. (c) Stall profile of inverse-NTT workload comparing global vs. shared memory kernels.
We follow a similar approach here as described by Özerk
et al. [21], though we leverage a number of algorithmic
optimizations, combined with code optimizations, that are
unique to this work.
The three implementations of polynomial multiplication
proposed in this work are listed below:
• LOS-NTT (Latency-optimized Single-block NTT): for
single polynomial multiplication with N ≤ 2^11.
• LOM-NTT (Latency-optimized Multi-block NTT): for
single polynomial multiplication with N > 2^11.
• TOM-NTT (Throughput-optimized Multi-block NTT):
for multiple polynomial multiplications with no con-
straints on N.
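The choice among the three kernel variants reduces to a simple dispatch on input size and batch count. A minimal sketch of that dispatch logic follows; the function name and string labels are hypothetical, not taken from the paper's code:

```python
def pick_ntt_kernel(n: int, batch: int) -> str:
    """Select an NTT kernel variant per the criteria above.

    n: number of polynomial coefficients (a power of two)
    batch: number of independent polynomial multiplications
    """
    if batch > 1:
        return "TOM-NTT"   # throughput-optimized, no constraint on n
    # single multiplication: pick by input size
    return "LOS-NTT" if n <= 2**11 else "LOM-NTT"
```

For example, a single multiplication with N = 2^16 would dispatch to the multi-block latency-optimized kernel.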
A. Latency optimized Single-block NTT
The LOS-NTT kernel performs all the NTT operations
within a single block of the CUDA kernel. Using a single block
for computing the entire NTT workload has the following
advantages:
• The overhead of a single block-level barrier
(__syncthreads()) is significantly lower than a kernel-
level (multi-block) barrier.
• We can leverage shared memory, which can only be
addressed within the scope of a single block.
• Since all threads of a block share the same L1 and L2
caches, and L1 is write-through, write updates by any
thread are reflected in L2 across all threads.
Our LOS-NTT implementation consists of two phases,
separated by a block-level barrier. The first phase transfers the
input coefficient vectors (of size N) from high-latency global
memory to the faster, low-latency shared memory. The second
phase performs the merged NTTs, as defined in Algorithms 5
and 6. This phase consists of two nested loops. The outer loop
iterates over the log(N) stages of the CT algorithm, while the
inner loop performs N/2 iterations over the elements of the
input coefficient vector. These iterations are free from any
loop-carried dependencies, allowing them to run in parallel.
We capitalize on this inherent parallelism by assigning each
iteration of the inner loop to a separate CUDA thread. We
further improve the performance of our kernel with four GPU-
specific optimizations.
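The loop structure described above can be sketched on the CPU as follows. This is the textbook iterative Cooley-Tukey NTT; the paper's merged Algorithms 5 and 6 additionally fold the negacyclic weighting into the twiddle factors, which is omitted here. On the GPU, each iteration of the inner butterfly loop maps to one CUDA thread:

```python
def bit_reverse_permute(a):
    # Reorder input so the in-place Cooley-Tukey stages below
    # produce output in natural order.
    n, j = len(a), 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    return a

def ntt_ct(a, q, g):
    # Iterative Cooley-Tukey NTT mod q; g is a primitive n-th root
    # of unity mod q. The outer loop runs log2(n) stages; within a
    # stage, the n/2 butterflies carry no loop dependency, so each
    # can be assigned to a separate CUDA thread.
    a = bit_reverse_permute(list(a))
    n = len(a)
    length = 2
    while length <= n:
        w_m = pow(g, n // length, q)          # stage twiddle base
        for start in range(0, n, length):     # independent sub-blocks
            w = 1
            for j in range(start, start + length // 2):
                u = a[j]
                v = a[j + length // 2] * w % q
                a[j] = (u + v) % q            # butterfly
                a[j + length // 2] = (u - v) % q
                w = w * w_m % q
        length <<= 1
    return a
```

Note how the per-stage stride (length/2) changes every stage, producing the strided, non-sequential access pattern discussed above.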
1) Shared Memory Optimization: Each stage of the CT im-
plementation is characterized by multiple butterfly operations
of varying strides. These butterfly operations result in strided
memory accesses, with the step size varying from 1 to N
(where N can be as large as 2^16). We optimize for memory
access efficiency by storing the input coefficient vector, as
well as the outputs of butterfly operations, in persistent shared
memory, which is significantly faster than accessing global
memory. We utilize 8 KB of shared memory per SM for storing
the input polynomial coefficients, as well as the output of
intermediate stages. Using shared memory incurs the overhead
of transferring input coefficients to shared memory and the
final results back to global memory. Despite these additional
overheads, incorporating shared memory allows us to obtain
a 1.25× speedup over the use of only global memory
(Figure 8). Figure 8(a) shows a large drop in L1$ and
L2$ performance. The primary reason for this degraded
performance is that memory transactions that access shared
memory do not count towards L1 and L2 cache performance.
Since all the coalesced memory transactions to global
memory (which counted towards cache performance) are
now redirected towards shared memory (which is excluded
from the cache performance counters), the L1 and L2 cache
bandwidth and hit-rate take a performance hit. Figures 8(b)
and 8(c) identify the primary causes of stalls for the NTT
and inverse-NTT kernels, respectively. The “Long Scoreboard”
stall is caused by dependencies in L1 cache operations. The
large drop in the stall values for “Long Scoreboard” in
Figure 8(b) indicates that memory pressure is reduced in the
L1 cache, and indirectly in the L2 cache and DRAM. Similarly,
in Figure 8(c), the increase in the average “Math Pipe Throttle”
stall values is tied to the compute throughput of the inverse-
NTT kernel.
2) Barrett’s Modular Reduction Optimization: We further
accelerate our NTT kernel with our modified Barrett
implementation, specifically designed for GPU execution, as
shown in Section II-B. The smaller number of correctional
subtractions in our implementation allows us to obtain
a 1.85× average speedup over previous work [21] and a
1.72× speedup over the built-in modulus operation. We also
obtain execution times similar to the 28-bit modified Barrett
reduction reported for PALISADE [41]. To our knowledge,
our proposed Barrett variant is the fastest Barrett modular
reduction for general 30-bit and 62-bit moduli.
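For reference, the textbook Barrett reduction for a k-bit modulus looks as follows. This generic form may need up to two correctional subtractions; the variant proposed in Section II-B is engineered to require fewer on 64-bit GPU words, and is not reproduced here:

```python
def barrett_setup(q: int, k: int) -> int:
    # Precompute mu = floor(2^(2k) / q) once per modulus; this
    # replaces the runtime division with a multiply and a shift.
    return (1 << (2 * k)) // q

def barrett_reduce(x: int, q: int, k: int, mu: int) -> int:
    # Reduce x (0 <= x < q^2) modulo q without a hardware divide.
    # The estimate t underestimates x // q by at most 2, hence the
    # correction loop runs at most twice.
    t = (x * mu) >> (2 * k)
    r = x - t * q
    while r >= q:
        r -= q
    return r
```

For the 62-bit moduli used in this paper, k = 62 and x is the up-to-124-bit product of two residues.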
3) Mixed Radix Optimization: The naive implementation
of NTT and inverse NTT used in this study is based on a
radix-2 algorithm. In this implementation, each thread within a
block operates on 2 elements of the input coefficient array. We
improve the performance of our kernel by experimenting with
radix-4, -8, and -16 implementations that distribute 4, 8, and 16
elements per thread, respectively. Higher radix implementa-
tions improve temporal locality, as the input coefficient vector
data is reused. Unfortunately, this improvement in temporal
locality is associated with a significant loss in parallelism. We
also experiment with kernels that use radix 4 or radix 8 for
single-block kernels and radix 16 for multi-block kernels.
Fig. 9. Higher radix comparison for (a) NTT and (b) inverse-NTT kernels.
We also experiment with 2-dimensional NTT implemen-
tations. A 2D NTT maps the data into matrix form, thus
treating our coefficient vector as a row-major square matrix. This
allows us to perform a column-wise NTT followed by a row-
wise NTT. A polynomial of degree N−1 can be mapped into
a √N × √N matrix. This also divides the NTT kernel into
two stages (column-wise NTT and row-wise NTT). The first
stage computes √N independent √N-point column-wise NTT
operations, followed by a second stage that computes √N
√N-point row-wise NTTs. Each √N-point NTT is
mapped to a block with √N/2 threads, where each thread is
responsible for computing a radix-2 butterfly. The 2D NTT
approach allows us to map the data while preserving spatial
locality. We further accelerate our computation by pipelining
the two stages of column-wise and row-wise NTT operations,
thus presenting two variants of our 2D implementation (2D
Serial and 2D Pipelined). This approach provides an average
2.91× speedup for the NTT and inverse-NTT kernels over the
naive radix-2 implementations (Figure 9). This improvement
in execution time can be largely attributed to the increased
memory throughput for NTT (Figure 9(a)), as well as inverse-
NTT (Figure 9(b)). The improved memory throughput also
contributes to the increased compute throughput, as continuous
streaming of data from DRAM no longer starves the SMs of
input operands.
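The 2D decomposition can be checked against a direct NTT with a small CPU sketch. Note that the general four-step decomposition also requires a point-wise twiddle multiplication between the column and row stages (implicit in the description above); in optimized kernels these factors are typically folded into one of the neighboring stages:

```python
def ntt_naive(a, q, w):
    # Direct O(n^2) NTT, used as a building block and reference.
    n = len(a)
    return [sum(a[i] * pow(w, i * k, q) for i in range(n)) % q
            for k in range(n)]

def ntt_2d(a, q, g, n1, n2):
    # Four-step NTT of size n = n1*n2, with g a primitive n-th root
    # of unity mod q: column NTTs, inter-stage twiddles, row NTTs.
    n = n1 * n2
    A = [a[r * n2:(r + 1) * n2] for r in range(n1)]  # row-major n1 x n2
    w1, w2 = pow(g, n2, q), pow(g, n1, q)
    for c in range(n2):              # n2 column-wise n1-point NTTs
        col = ntt_naive([A[r][c] for r in range(n1)], q, w1)
        for r in range(n1):
            A[r][c] = col[r]
    for r in range(n1):              # inter-stage twiddles g^(r*c)
        for c in range(n2):
            A[r][c] = A[r][c] * pow(g, r * c, q) % q
    for r in range(n1):              # n1 row-wise n2-point NTTs
        A[r] = ntt_naive(A[r], q, w2)
    # Output index k = k1 + n1*k2 corresponds to entry A[k1][k2].
    return [A[k % n1][k // n1] for k in range(n)]
```

On the GPU, each column (and later each row) transform becomes an independent block-level NTT, which is what makes the two stages pipelineable.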
4) Fused Polynomial Multiplication Optimization: Finally,
we propose an optimization that fuses the last stage of
merged CT NTT, the Hadamard product, and the first stage
of merged GS NTT. Figure 5(c) shows the implementation of
our fused polynomial multiplication. Our implementation of
the fused kernel significantly reduces the number of multipli-
cation operations and re-uses recently-cached twiddle factors.
Experimental results show a 6.1% and 2.4% reduction in
execution time as compared to the naive implementation of
polynomial multiplication, for input sizes of N = 2^11 and
N = 2^16, respectively.
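End to end, the non-fused polynomial multiplication pipeline computes a negacyclic product in Z_q[x]/(x^n + 1): forward NTTs, Hadamard product, inverse NTT. A self-checking sketch using naive O(n^2) transforms follows; the paper's kernels use the merged CT/GS algorithms, and the fused variant combines the middle steps, neither of which is reproduced here:

```python
def ntt_naive(a, q, w):
    n = len(a)
    return [sum(a[i] * pow(w, i * k, q) for i in range(n)) % q
            for k in range(n)]

def polymul_negacyclic(a, b, q, psi):
    # Multiply a and b in Z_q[x]/(x^n + 1). psi is a primitive 2n-th
    # root of unity mod q; scaling by its powers turns cyclic
    # convolution into negacyclic convolution. Optimized kernels
    # merge these psi powers directly into the NTT twiddle factors.
    n = len(a)
    w = psi * psi % q                        # primitive n-th root
    aw = [a[i] * pow(psi, i, q) % q for i in range(n)]
    bw = [b[i] * pow(psi, i, q) % q for i in range(n)]
    A, B = ntt_naive(aw, q, w), ntt_naive(bw, q, w)
    C = [x * y % q for x, y in zip(A, B)]    # Hadamard product
    cw = ntt_naive(C, q, pow(w, -1, q))      # inverse NTT (unscaled)
    n_inv, psi_inv = pow(n, -1, q), pow(psi, -1, q)
    return [cw[i] * n_inv % q * pow(psi_inv, i, q) % q
            for i in range(n)]

def polymul_schoolbook(a, b, q):
    # Reference negacyclic product: x^n wraps around as -1.
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            s = a[i] * b[j]
            if i + j < n:
                c[i + j] = (c[i + j] + s) % q
            else:
                c[i + j - n] = (c[i + j - n] - s) % q
    return c
```

The fusion opportunity sits exactly at the Hadamard step: the last forward stage, the point-wise products, and the first inverse stage all touch the same data and closely related twiddle factors.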
B. Latency optimized Multi-block NTT
Fig. 10. The Multi-block NTT task distribution.
The LOM-NTT kernel is designed to handle large input
arrays (N > 2^11). The LOM-NTT kernel distributes tasks
using a similar strategy to the LOS-NTT kernel, except
that it spreads them over multiple blocks. This allows us to
employ multiple SMs to execute the workload in parallel.
The LOM-NTT kernel splits a single N-point NTT between
multiple blocks. Because of the use of multiple blocks, this im-
plementation requires kernel-wide barriers for synchronization
between stages. We use the LOM-NTT kernel to decompose
a single N-point NTT into multiple 2^11-point NTTs. We then
incorporate our LOS-NTT (Single-block) kernel to evaluate
all the 2^11-point NTTs, harnessing the optimizations of shared
memory and block-level barriers. We show the task distribution
of our LOM-NTT for N = 2^16 in Figure 10.
C. Throughput-optimized Multi-block NTT
The throughput-optimized kernel is designed to compute
multiple NTTs simultaneously. Unlike the latency-optimized
kernels that compute just a single NTT operation, TOM-
NTT is optimized to compute up to 2^15 NTT operations
simultaneously, with each NTT computation being a 2^16-point
NTT (the size of each input coefficient vector is 2^16). The
TOM-NTT kernel is fed 2 input matrices. The first matrix
holds the input coefficient vectors. These vectors, of size 2^16,
are stacked in the matrix in row-major format. This matrix
is then transferred to the GPU and stored in global memory in
column-major format, coalescing reads across threads into a
single memory transaction. The second input matrix contains
the twiddle factors. We store twiddle factors in a similar way
to the coefficient matrix. Each input matrix is of dimension
2^16 × 2^15. Both matrices, when combined, completely fill the
16 GB of DRAM storage on the V100 GPU. The TOM-
NTT kernel executes the 2^16-point NTT over 32,768 vectors in
628 ms. With an average execution time of 19.17 µs per NTT
operation, this kernel exhibits close to linear weak scaling.
A. Experimental Methodology
We present three different NTT kernels in this work, along
with four optimizations tailored for the GPU platform. We
evaluate the performance of our Single-block NTT kernel
for input coefficient vector sizes of N = 2^11 and of our
Multi-block kernel for vector sizes of N = 2^12 to 2^16.
We incrementally add each of the four optimizations to our
NTT kernels and report performance improvements. Twiddle
factors are pre-computed on the CPU and hence do not add
to the compute overhead on the GPU. We report on multiple
performance metrics for each approach, leveraging profiling
tools on the GPU platform. For each optimization, the speedup
achieved is reported using the respective non-optimized kernel
as the baseline for comparison. Finally, we evaluate weak
scaling for our throughput-optimized TOM-NTT kernel.
B. Performance Metrics
We incrementally add optimizations to our NTT kernels
and report performance improvements in Table III (for input
coefficient size N = 2^16). For each optimization, the speedup
achieved is reported, using the respective non-optimized kernel
as the baseline for comparison.
TABLE III. Incremental optimization results (global memory kernel with ⌈log2(q)⌉ = 62 used as baseline).
Optimization | Relative Speedup | Δ L1$ Hit-Rate | Δ DRAM B/W
SM-only | 1.2× | −27.3% | +20.0%
SM + Alg4 | 1.72× | +10.86% | +3.2%
SM + Alg4 + 2D | 2.91× | +5.85% | +16.14%
SM + Alg4 + 2D + FHP | 1.02× | +0.3% | −0.33%
Our shared memory optimized kernel, when compared
against the global memory kernel, achieves a 20% improve-
ment in DRAM bandwidth utilization and a 1.2× speedup.
Data is transferred between DRAM and shared memory using
coalesced memory transactions, improving DRAM bandwidth
utilization.
Next, we compare the execution time of our NTT kernel
implementation when incorporating the various modular reduction
techniques shown in Figure 3. We compare our best
performing NTT kernel (highlighted in Table IV) to Özerk
et al. [21] and find a 1.85× speedup for N = 2^16 and a
1.13× speedup for N = 2^14. The radix-4, -8, and -16
and 2D implementations provide additional speedup due to the
increased temporal, as well as spatial, locality of 4-, 8-, and 16-
point butterfly operations, as compared to the baseline radix-2
implementation. The effects of increased data locality are
reflected in the 5.85% improvement in the L1 cache hit-rate.
Our best performing kernel, the 2D NTT, achieves a 2.91×
speedup over the radix-2 implementation (Figure 9).
Fig. 11. Timing for the Single-block NTT.
Our fused polynomial multiplication kernel reduced the execution time
for the last stage of the merged CT NTT kernel, the Hadamard
product, and the first stage of the merged GS NTT kernel,
from 8.5 µs down to 6.5 µs, resulting in a 1.3× speedup
as compared to its non-fused counterpart. When incorporated
within a polynomial multiplication kernel, this translates to a
6.1% improvement for the Single-block kernel (for size N = 2^11)
and a 2.4% improvement for the Multi-block kernel (for size
N = 2^16).
We also measured the scalability of our fastest single-block
NTT implementation. As our multi-block kernel implemen-
tation leverages our Single-block code, we also analyzed the
performance of the Single-block kernel by varying the input
polynomial size and the hardware resources used. On each
iteration, we double the size of the input array, as well as the
number of potential SMs utilized (by doubling the number
of blocks in the kernel). We observe that our Single-block
kernel exhibits close to linear weak scaling, as execution times
remain near constant as we increase both the input size and
the hardware resources utilized (Figure 11).
We also evaluate our TOM-NTT kernel that is optimized for
operating on a large number of NTT operations simultaneously
(working with up to 215 input coefficient vectors, each of size
216 elements). With an average execution time of 19.17 µs
per NTT operation, this kernel exhibits close to linear weak
scaling. Including all optimizations, our NTT kernels achieve a
speedup of 123.13× and 2.37× over the previous state-of-the-
art CPU [29] and GPU [21] implementations of NTT kernels,
respectively.
Table IV presents runtimes of various implementations of
NTT and iNTT, adding our own runtimes to Table 8 in the
work by Özerk et al. [21]. Prior studies have explored
accelerated NTT on FPGAs [62] and custom accelerators [8].
But these custom solutions are not typically found on general-
purpose systems. On the other hand, GPUs are ubiquitous and
easily programmed. In recent years, there has been growing
interest in using a GPU to exploit the parallelism present in
NTT [19], [20], [21]. In particular, Özerk et al. [21] propose
an efficient hybrid kernel approach to accelerate NTT. Our
LOS-NTT and LOM-NTT kernels are inspired by their work;
however, we provide further optimizations, such as our
fused Hadamard product, an improved version of Barrett
reduction, and an exploration of higher-radix NTTs. Kim et al. [20]
also propose some optimizations on NTTs, such as batching
using shared memory. We explored how those optimizations
could address the limitations we faced when implementing a
kernel with a radix higher than 4.
TABLE IV. NTT and iNTT runtimes (in µs) of prior GPU implementations and our work.
Work | Platform | N | ⌈log2(q)⌉ | NTT | iNTT
cuHE [40] | GTX 690 | 2^14 | 64^c | 56 | 65.3
cuHE [40] | GTX 690 | 2^15 | 64^c | 71.2 | 83.6
cuHE [40]^a | Tesla K80 | 2^14 | 64^c | 12.9 | 12.5
cuHE [40]^a | Tesla K80 | 2^15 | 64^c | 19 | 21.6
cuHE [40]^b | GTX 1070 | 2^14 | 64^c | 66.8 |
Faster NTT [63] | Tesla K80 | 2^14 | 64^c | 9.6 | 9.7
Faster NTT [63] | Tesla K80 | 2^15 | 64^c | 15.3 | 16.2
Accl NTT [24] | GTX 1070 | 2^14 | 64^c | 57.8 |
Bootstrap HE [20] | Titan V | 2^14 | 60 | 44.1 |
Bootstrap HE [20] | Titan V | 2^15 | 60 | 84.2 |
Re-encrypt [23] | GTX 1050 | 2^14 | NA | 255 |
Re-encrypt [23] | GTX 1050 | 2^15 | NA | 470 |
Re-encrypt [23] | RTX 1080 | 2^14 | NA | 375 |
Re-encrypt [23] | RTX 1080 | 2^15 | NA | 425 |
Efficient NTT [21] | GTX 980 | 2^14 | 55 | 51 | 41
Efficient NTT [21] | GTX 980 | 2^15 | 55 | 73 | 52
Efficient NTT [21] | GTX 1080 | 2^14 | 55 | 33 | 20
Efficient NTT [21] | GTX 1080 | 2^15 | 55 | 36 | 24
Efficient NTT [21] | Tesla V100 | 2^14 | 55 | 29 | 21
Efficient NTT [21] | Tesla V100 | 2^15 | 55 | 39 | 23
Our Work | Tesla A100 | 2^14 | 62 | 13.3 | 10.9
Our Work | Tesla A100 | 2^16 | 62 | 16.5 | 18.7
Our Work | Tesla V100 | 2^14 | 30 | 8.7 | 10.0
Our Work | Tesla V100 | 2^16 | 30 | 13.1 | 13.4
Our Work | Tesla V100 | 2^14 | 62 | 11.5 | 11.9
Our Work | Tesla V100 | 2^16 | 62 | 16.4 | 17.3
uses constant prime q = 0xFFFFFFFF00000001
^a results are from [63]; ^b results are from [24]; ^c actual q_i is restricted by q_i^2 < 2^64 − 2^32 + 1
Alkim et al. [54] define and analyze several algorithms very
similar to Algorithm 8. They consider truncating their
NTTs not only by one stage, but also by two and three stages. Although some
of Alkim et al.’s algorithms utilize Karatsuba’s algorithm,
they do not consider using Karatsuba’s algorithm to merge
a single innermost pair of NTT stages. In our tests, our
fused polynomial multiplication implementation provides an
additional speedup of 6.1% and 2.4% over the naive
implementation of polynomial multiplication, for input sizes
of N = 2^11 and N = 2^16, respectively, using Alkim et al.’s
(k−1)-level NTT multiplication algorithm.
There is a Barrett reduction variant proposed by Yu et
al. [64] that requires no correctional subtractions. We found
that this algorithm has severe trade-offs in terms of operational
complexity as a function of workload size, which makes it less
attractive for use with HE.
In this work, we presented an analysis and proposed im-
plementations of polynomial multiplication, the key compu-
tational bottleneck in lattice-based HE systems, while tar-
geting the V100 GPU platform. Specifically, we analyzed
Barrett’s modular reduction algorithm and several variants.
We studied the interplay between algorithmic improvements
(such as multi-radix NTTs) and low-level kernel optimizations
tailored towards the GPU (including memory coalescing). Our
NTT optimizations achieve an overall speedup of 123.13×
and 2.37×over the previous state-of-the-art CPU [29] and
GPU [21] implementations of NTT kernels, respectively.
This work was supported in part by the Institute
for Experiential AI, the Harold Alfond Foundation,
the NSF IUCRC Center for Hardware and Embedded
Systems Security and Trust (CHEST), the RedHat
Collaboratory, and project grant PID2020-112827GB-
I00 funded by MCIN/AEI/10.13039/501100011033.
[1] A. Ghosh and I. Arce, “Guest Editors’ Introduction: In Cloud
Computing We Trust - But Should We?” IEEE Secur. Priv., vol. 8,
no. 6, pp. 14–16, 2010. [Online]. Available:
stamp/stamp.jsp?arnumber=5655238 1
[2] L. Branch, W. Eller, T. Bias, M. McCawley, D. Myers, B. Gerber, and
J. Bassler, “Trends in malware attacks against United States healthcare
organizations, 2016–2017,” Global Biosecurity, vol. 1, no. 1, 2019. 1
[3] M. Jayaweera, K. Shivdikar, Y. Wang, and D. Kaeli, “JAXED: Reverse
Engineering DNN Architectures Leveraging JIT GEMM Libraries, in
2021 Int. Symp. on Secure and Private Execution Environ. Design
(SEED). IEEE, 2021, pp. 189–202. [Online]. Available: https:
// auth.php/JAXED Reverse Engineering
DNN Architectures Leveraging JIT GEMM Libraries.pdf 1
[4] S. Thakkar, K. Shivdikar, and C. Warty, “Video steganography
using encrypted payload for satellite communication,” in 2017
IEEE Aerospace Conf. IEEE, 2017, pp. 1–11. [Online]. Available: auth.php/Video Steganography.pdf 1
[5] E. L. Cominetti and M. A. Simplicio, “Fast additive partially homomor-
phic encryption from the approximate common divisor problem, IEEE
Trans. Inf. Forensics Secur., vol. 15, pp. 2988–2998, 2020. 1
[6] J. H. Cheon, A. Kim, M. Kim, and Y. Song, “Homomorphic encryption
for arithmetic of approximate numbers,” in Advances in Cryptology—
ASIACRYPT 2017, T. Takagi and T. Peyrin, Eds. Springer, 2017. 1,
[7] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène, “TFHE: fast
fully homomorphic encryption over the torus, J. Cryptol., vol. 33, no. 1,
pp. 34–91, 2020. 1
[8] N. Samardzic, A. Feldmann, A. Krastev, S. Devadas, R. Dreslinski,
C. Peikert, and D. Sanchez, “F1: A Fast and Programmable Acceler-
ator for Fully Homomorphic Encryption,” in MICRO-54: 54th Annu.
IEEE/ACM Int. Symp. on Microarchitecture, ser. MICRO ’21. New
York, NY, USA: ACM, 2021, pp. 238–252. 1,10
[9] W. Jung, E. Lee, S. Kim, J. Kim, N. Kim, K. Lee, C. Min, J. H. Cheon,
and J. H. Anh, “Accelerating fully homomorphic encryption through
architecture-centric analysis and optimization,” IEEE Access, vol. 9, pp.
98 772–98 789, 2021. 1,2
[10] C. Gentry, “Fully homomorphic encryption using ideal lattices,” in Proc.
of the 41st Annu. ACM Symp. on Theory of Comput.—STOC 2009.
ACM, 2009, pp. 169–178. 1
[11] V. Lyubashevsky, C. Peikert, and O. Regev, “On Ideal Lattices
and Learning with Errors over Rings, in Advances in Cryptology—
EUROCRYPT 2010, H. Gilbert, Ed. Berlin, Heidelberg: Springer Berlin
Heidelberg, 2010, pp. 1–23. 1
[12] P. Longa and M. Naehrig, “Speeding up the number theoretic transform
for faster ideal lattice-based cryptography, in Int. Conf. on Cryptology
and Netw. Security. Springer, 2016. 1
[13] S. Koteshwara, M. Kumar, and P. Pattnaik, “Performance Optimization
of Lattice Post-Quantum Cryptographic Algorithms on Many-Core Pro-
cessors,” in 2020 IEEE Int. Symp. on Performance Anal. of Syst. and
Softw. (ISPASS), 2020, pp. 223–225. 1
[14] DARPA. (2021) DARPA Selects Researchers to Accelerate Use of Fully
Homomorphic Encryption. [Online]. Available:
news-events/2021-03- 08 1
[15] A. Kim, M. Deryabin, J. Eom, R. Choi, Y. Lee, W. Ghang, and D. Yoo,
“General bootstrapping approach for RLWE-based homomorphic en-
cryption,” Cryptology ePrint Archive, 2021. 1
[16] V. Kadykov and A. Levina, “Homomorphic properties within lattice-
based encryption systems,” in 2021 10th Mediterranean Conf. on Em-
bedded Comput. (MECO). IEEE, 2021, pp. 1–4. 1
[17] P. Martins and L. Sousa, “Enhancing data parallelism of fully homomor-
phic encryption,” in Int. Conf. on Inf. Security and Cryptology. Springer,
2016, pp. 194–207. 1
[18] W. Jung, S. Kim, J. H. Ahn, J. H. Cheon, and Y. Lee, “Over 100x
faster bootstrapping in fully homomorphic encryption through memory-
centric optimization with gpus,” IACR Transactions on Cryptographic
Hardware and Embedded Syst., Aug. 2021. 2
[19] Y. Zhai, M. Ibrahim, Y. Qiu, F. Boemer, Z. Chen, A. Titov, and
A. Lyashevsky, Accelerating Encrypted Computing on Intel GPUs,”
2022 IEEE Int. Parallel and Distrib. Process. Symp. (IPDPS), 2022. 2,
[20] S. Kim, W. Jung, J. Park, and J. H. Ahn, “Accelerating number theoretic
transformations for bootstrappable homomorphic encryption on GPUs,”
in 2020 IEEE Int. Symp. on Workload Characterization (IISWC), 2020.
[21] Ö. Özerk, C. Elgezen, A. C. Mert, E. Öztürk, and E. Savaş, “Efficient
number theoretic transform implementation on GPU for homomorphic
encryption,” J. Supercomput., pp. 1–33, 2021. 2,3,
[22] S. Durrani, M. S. Chughtai, M. Hidayetoglu, R. Tahir, A. Dakkak,
L. Rauchwerger, F. Zaffar, and W.-m. Hwu, “Accelerating fourier and
number theoretic transforms using tensor cores and warp shuffles, in
Int. Conf. on Parallel Arch. and Compilation Tech. (PACT), 2021. 2
[23] G. Sahu and K. Rohloff, Accelerating Lattice Based Proxy Re-
encryption Schemes on GPUs,” in Cryptology and Netw. Security,
S. Krenn, H. Shulman, and S. Vaudenay, Eds. Springer, 2020. 2,
[24] J. Goey, W. Lee, B. Goi, and W. Yap, “Accelerating number theoretic
transform in GPU platform for fully homomorphic encryption,” J.
Supercomput., vol. 77, no. 2, pp. 1455–1474, 2021. 2,3,11
[25] A. A. Badawi, B. Veeravalli, J. Lin, N. Xiao, M. Kazuaki, and A. K. M.
Mi, “Multi-GPU Design and Performance Evaluation of Homomorphic
Encryption on GPU Clusters,” IEEE Trans. Parallel Distrib. Syst.,
vol. 32, pp. 379–391, 2021. 2
[26] W.-K. Lee, S. Akleylek, D. C.-K. Wong, W.-S. Yap, B.-M. Goi, and
S.-O. Hwang, “Parallel implementation of Nussbaumer algorithm and
number theoretic transform on a GPU platform: application to qTESLA,”
J. Supercomput., vol. 77, no. 4, pp. 3289–3314, 2021. 2,3
[27] P. Barrett, “Implementing the Rivest Shamir and Adleman Public Key
Encryption Algorithm on a Standard Digital Signal Processor, in
Advances in Cryptology CRYPTO’ 86, A. M. Odlyzko, Ed. Berlin,
Heidelberg: Springer Berlin Heidelberg, 1987, pp. 311–323. 2,3
[28] J. F. Dhem and J. J. Quisquater, “Recent results on modular multi-
plications for smart cards,” in Smart Card Research and Applications.
Springer Berlin Heidelberg, 2000. 2,3,4
[29] Microsoft SEAL (release 4.0). Microsoft Research, Redmond, WA.
[Online]. Available: 2,10,11
[30] W. Wang, Y. Hu, L. Chen, X. Huang, and B. Sunar, “Accelerating fully
homomorphic encryption using GPU,” in 2012 IEEE Conf. on High
Performance Extreme Comput., 2012, pp. 1–5. 2,3
[31] A. A. Karatsuba and Y. P. Ofman, “Multiplication of many-digital
numbers by automatic computers,” in Doklady Akademii Nauk, vol. 145,
no. 2. Russian Academy of Sciences, 1962, pp. 293–294. 2,6
[32] J. H. Cheon, K. Han, A. Kim, M. Kim, and Y. Song, “Bootstrapping
for approximate homomorphic encryption,” in Annu. Int. Conf. on the
Theory and Appl. of Cryptographic Tech. Springer, 2018, pp. 360–384.
[33] V. Shoup, A Computational Introduction to Number Theory and Algebra,
2nd ed. USA: Cambridge University Press, 2009. 2
[34] P. L. Montgomery, “Modular multiplication without trial division, Math.
Comput., vol. 44, pp. 519–521, 1985. 2
[35] V. Shoup. NTL: A library for doing number theory. [Online]. Available: 2
[36] D. Harvey, “Faster arithmetic for number-theoretic transforms, J. Symb.
Comput., vol. 60, pp. 113–119, 2014. 2
[37] T. Acar and D. Shumow, “Modular reduction without pre-computation
for special moduli,” Microsoft Research, Redmond, WA, USA, 2010. 2
[38] M. Knezevic, F. Vercauteren, and I. M. R. Verbauwhede, “Speeding Up
Barrett and Montgomery Modular Multiplications,” in IEEE Transac-
tions on Comput., 2009. 2
[39] L. Hars, “Long modular multiplication for cryptographic applications,”
in Int. Workshop on Cryptographic Hardware and Embedded Syst.
Springer, 2004, pp. 45–61. 2
[40] W. Dai and B. Sunar, “cuHE: A homomorphic encryption accelerator
library,” in Int. Conf. on Cryptography and Inf. Security in the Balkans.
Springer, 2015, pp. 169–186. 2,11
[41] PALISADE Homomorphic Encryption Software Library (release
1.11.5). [Online]. Available: 2,3,8
[42] W. Dai, Y. Doröz, and B. Sunar, “Accelerating NTRU based homomor-
phic encryption using GPUs,” in 2014 IEEE High Performance Extreme
Comput. Conf. (HPEC). IEEE, 2014, pp. 1–6. 3
[43] Y. Kong and B. Phillips, “Comparison of Montgomery and Barrett
modular multipliers on FPGAs,” in 2006 Fortieth Asilomar Conf. on
Signals, Syst. and Computers, 2006, pp. 1687–1691. 3
[44] T. Wu, S.-G. Li, and L.-T. Liu, “Modular multiplier by folding Barrett
modular reduction,” in 2012 IEEE 11th Int. Conf. on Solid-State and
Integrated Circuit Technol., 2012, pp. 1–3. 3
[45] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction
to algorithms. MIT press, 2009. 4
[46] J. Von Zur Gathen and J. Gerhard, Modern Computer Algebra. Cam-
bridge University Press, 2013. 4
[47] T. Pöppelmann, T. Oder, and T. Güneysu, “High-Performance Ideal
Lattice-Based Cryptography on 8-Bit ATxmega Microcontrollers,” in
Progress in Cryptology—LATINCRYPT. Springer, 2015, pp. 346–365.
[48] R. Crandall and B. Fagin, “Discrete Weighted Transforms and Large-
Integer Arithmetic,” Math. Comput., vol. 62, pp. 305–324, 1994. 5
[49] J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation
of complex fourier series,” Math. Comput., vol. 19, pp. 297–301, 1965.
[50] W. M. Gentleman and G. Sande, “Fast fourier transforms: For fun and
profit,” in Proc. of the November 7–10, 1966, Fall Joint Comput. Conf.,
ser. AFIPS ’66 (Fall). New York, NY, USA: Association for Computing
Machinery, 1966, pp. 563–578. 5
[51] S. S. Roy, F. Vercauteren, N. Mentens, D. D. Chen, and I. Verbauwhede,
“Compact Ring-LWE Cryptoprocessor,” in Cryptographic Hardware and
Embedded Syst.—CHES 2014, L. Batina and M. Robshaw, Eds. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2014, pp. 371–391. 5
[52] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei, and L. Liu, “Highly Effi-
cient Architecture of NewHope-NIST on FPGA using Low-Complexity
NTT/INTT,” IACR Transactions on Cryptographic Hardware and Em-
bedded Syst., vol. 2020, no. 2, pp. 49–72, Mar. 2020. 5
[53] Ö. Özerk, C. Elgezen, and A. C. Mert. (retrieved Oct 2021) gpu-ntt.
[Online]. Available: ntt 6
[54] E. Alkım, Y. A. Bilgin, and M. Cenk, “Compact and Simple RLWE
Based Key Encapsulation Mechanism, in Progress in Cryptology—
LATINCRYPT 2019. Springer, 2019, pp. 237–256. 6,11
[55] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissecting
the NVIDIA volta GPU architecture via microbenchmarking,” arXiv
preprint arXiv:1804.06826, 2018. 7
[56] T. Baruah, K. Shivdikar, S. Dong, Y. Sun, S. A. Mojumder, K. Jung,
J. L. Abellán, Y. Ukidave, A. Joshi, J. Kim, and D. Kaeli, “GNNMark:
A Benchmark Suite to Characterize Graph Neural Network Training
on GPUs,” in 2021 IEEE Int. Symp. on Performance Anal. of Syst.
and Softw. (ISPASS). IEEE, 2021, pp. 13–23. [Online]. Available: auth.php/GNNMark.pdf 7
[57] Y. Sun, T. Baruah, S. A. Mojumder, S. Dong, X. Gong, S. Treadway,
Y. Bao, S. Hance, C. McCardwell, V. Zhao, H. Barclay, A. K.
Ziabari, Z. Chen, R. Ubal, J. L. Abellán, J. Kim, A. Joshi, and
D. Kaeli, “MGPUSim: Enabling Multi-GPU Performance Modeling
and Optimization,” in Proc. of the 46th Int. Symp. on Comput.
Architecture, ser. ISCA ’19. New York, NY, USA: Association
for Computing Machinery, 2019, p. 197–209. [Online]. Available: 7
[58] K. Shivdikar, “SMASH: Sparse Matrix Atomic Scratchpad Hashing,”
Master’s thesis, Northeastern University, 2021. [Online]. Available: auth.php/SMASH Thesis.pdf 7
[59] O. Villa, M. Stephenson, D. Nellans, and S. W. Keckler, “Nvbit: A
dynamic binary instrumentation framework for nvidia gpus, in Proc. of
the 52nd Annu. IEEE/ACM Int. Symp. on Microarchitecture, 2019, pp.
372–383. 7
[60] (2022, Apr) Nvidia Nsight Systems. [Online]. Available: https:
// 7
[61] K. Shivdikar, K. Paneri, and D. Kaeli, “Speeding up DNNs using HPL
based Fine-grained Tiling for Distrib. Multi-GPU Training,” Boston Area
Architecture Workshop, 2018 (BAAW/BARC), 2018. [Online]. Available: auth.php/BARC speeding.pdf 7
[62] M. Riazi, K. Laine, B. Pelton, and W. Dai, “HEAX: An architecture for
computing on encrypted data,” Int. Conf. on Architectural Support for
Programming Languages and Operating Syst. - ASPLOS, 2020. 10
[63] A. Al Badawi, B. Veeravalli, and K. M. M. Aung, “Faster number
theoretic transform on graphics processors for ring learning with errors
based cryptography, in 2018 IEEE Int. Conf. on Service Operations and
Logistics, and Informatics (SOLI). IEEE, 2018, pp. 26–31. 11
[64] H. Yu, G. Bai, and H. Hao, “Efficient Modular Reduction Algorithm
Without Correction Phase,” in Frontiers in Algorithmics, J. Wang and
C. Yap, Eds. Springer, 2015. 11
Graph Neural Networks (GNNs) have emerged as a promising class of Machine Learning algorithms to train on non-euclidean data. GNNs are widely used in recommender systems, drug discovery, text understanding, and traffic forecasting. Due to the energy efficiency and high-performance capabilities of GPUs, GPUs are a natural choice for accelerating the training of GNNs. Thus, we want to better understand the architectural and system-level implications of training GNNs on GPUs. Presently, there is no benchmark suite available designed to study GNN training workloads. In this work, we address this need by presenting GNNMark, a feature-rich benchmark suite that covers the diversity present in GNN training workloads, datasets, and GNN frameworks. Our benchmark suite consists of GNN workloads that utilize a variety of different graph-based data structures, including homogeneous graphs, dynamic graphs, and heterogeneous graphs commonly used in a number of application domains that we mentioned above. We use this benchmark suite to explore and characterize GNN training behavior on GPUs. We study a variety of aspects of GNN execution, including both, compute and memory behavior, highlighting major bottlenecks observed during GNN training. At the system level, we study various aspects, including the scalability of training GNNs across a multi-GPU system, as well as the sparsity of data, encountered during training. The insights derived from our work can be leveraged by both hardware and software developers to improve both the hardware and software performance of GNN training on GPUs