IACR Transactions on Cryptographic Hardware and Embedded Systems
ISSN 2569-2925, Vol. 2023, No. 2, pp. 358–380. DOI:10.46586/tches.v2023.i2.358-380
Speeding Up Multi-Scalar Multiplication over
Fixed Points Towards Efficient zkSNARKs
Guiwen Luo, Shihui Fuand Guang Gong
Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON,
Canada, {guiwen.luo,shihui.fu,ggong}@uwaterloo.ca
Abstract. The arithmetic of computing multiple scalar multiplications in an elliptic
curve group then adding them together is called multi-scalar multiplication (MSM).
MSM over fixed points dominates the time consumption in the pairing-based trusted
setup zero-knowledge succinct non-interactive argument of knowledge (zkSNARK),
thus for practical applications we would appreciate fast algorithms to compute it.
This paper proposes a bucket set construction that can be utilized in the context
of Pippenger’s bucket method to speed up MSM over fixed points with the help of
precomputation. If instantiating the proposed construction over the BLS12-381 curve, when computing n-scalar multiplications for n = 2^e (10 ≤ e ≤ 21), theoretical analysis indicates that the proposed construction saves more than 21% computational cost compared to Pippenger's bucket method, and that it saves 2.6% to 9.6% computational cost compared to the most popular variant of Pippenger's bucket method. Finally, our experimental result demonstrates the feasibility of accelerating the computation of MSM over fixed points using large precomputation tables as well as the effectiveness of our new construction.
Keywords: Multi-scalar multiplication · Pippenger's bucket method · zkSNARK · blockchain
1 Introduction
In recent years, zero-knowledge succinct non-interactive argument of knowledge (zkSNARK) has gained tremendous interest from its theoretical development to its practical implementation because it provides an elegant privacy protection solution. Popular examples include anonymous transactions in Zcash [BCG+14] and smart contract
verification over private inputs [Ebe] in Ethereum. Many zkSNARKs with trusted
setup rely on pairing-based cryptography and are in general very efficient. Groth
et al. [GOS06,Gro06,Gro09,Gro10,GOS12,GS12] first introduced pairing-based zero-
knowledge proofs, leading to the extensive research work in this area [Lip12,GGPR13,
DFGK14,Gro16,MBKM19,GWC19,CHM+19,BFS20,BDFG21].
All pairing-based trusted setup zkSNARKs in the literature follow a common paradigm,
where the prover computes a proof consisting of several points in an elliptic curve group
by generic group operations and the verifier checks the proof by a number of pairings in
the verification equation. Basically, it requires the prover and verifier to conduct their
computation by only using linear operations to the points built in the common reference
string. This computation is indeed MSM over fixed points. MSM dominates the overall
time for generating and verifying the proof. Thus, fast algorithms for computing MSM
over fixed points are desirable and necessary.
Shihui Fu is currently with Delft University of Technology, Delft, Netherlands, shihui.fu@tudelft.nl.
Licensed under Creative Commons License CC-BY 4.0.
Received: 2022-10-15 Accepted: 2022-12-15 Published: 2023-03-06
MSM in those applications shows the characteristic of having a large amount of points.
For example, one of the most classical zkSNARK applications is to prove the knowledge of a preimage for a cryptographic hash function. When using the traditional SHA-256, which is compiled to an arithmetic circuit with 22,272 AND gates when the preimage is 512 bits [CGGN17], it will lead to the computation of an MSM with more than 22,272 points.
When utilizing the zkSNARK-friendly hash function Poseidon [GKR+21], the MSM still has
hundreds of points.
1.1 Related work
The most popular method for scalar multiplication in an elliptic curve group is the binary algorithm, known as the doubling-and-addition method (the square-and-multiply method in the exponentiation setting) [Knu97, Section 4.6.3]. The GLV method [GLV01] and the GLS method [GLS11] decompose the scalar into 2, 4, 6 or 8 dimensions, then compute the corresponding MSM. When the point for scalar multiplication is fixed, precomputation can be used to reduce the computational cost. Knuth's 5-window algorithm utilizes a precomputation of 16 points to speed up scalar multiplication [Knu97, BC89]. If a bigger window and more storage for precomputed points are used, the windowing method can be even faster. Pippenger's bucket method and its variants decompose the scalars, then sort all points into buckets with respect to their scalars, and finally utilize
an accumulation algorithm to add them together [Pip76,BDLO12]. Another line of
research lies in constructing new number systems to represent the scalar, such as basic
digit sets [Mat82,BGMW95] and multi-base number systems [DKS09,SIM12,YWLT13].
Researchers also try to make the addition arithmetic more efficient by using different curve representations, such as projective coordinates and Jacobian coordinates that eliminate the inversion operations, and the Montgomery form that only utilizes the x-coordinate [Mon87]. Differential addition chains (DACs) are used in company with x-only coordinate systems, for example, PRAC chains [Mon92], DJB chains [Ber06] and other multi-dimensional DACs [Bro15, Rao15]. Most of the aforementioned techniques can be applied to MSM where the number of points is small.
When the number of points in an MSM is large, which is the situation in pairing-based trusted setup zkSNARK applications, Pippenger's bucket method and its variants are the state-of-the-art algorithms that outperform other competitors. Bernstein et al. [BDLO12] investigated the Bos-Coster method [DR94, Section 4], the Straus method [Str64] and Pippenger's bucket method, then chose Pippenger's bucket method to implement batch forgery identification for elliptic curve signatures, which marked the beginning of the extensive deployment of Pippenger's bucket method for computing MSM with a large number of points.
In practice, all of the popular zkSNARK-oriented implementations, such as Zcash [Zca],
TurboPLONK [GJW20], Bellman [bel], gnark [gna], choose Pippenger’s bucket method or
its variants to compute MSM over fixed points.
1.2 Our contribution
This paper proposes a new bucket set construction that yields an efficient algorithm to compute MSM over fixed points in the context of Pippenger's bucket method. Our construction targets n-scalar multiplication with 2^10 ≤ n ≤ 2^21, which is desirable for many pairing-based trusted setup zkSNARK applications. Our main contributions are summarized as follows.
A new subsum accumulation algorithm. After sorting points into buckets with respect to their scalars, Pippenger's bucket method computes intermediate subsums and utilizes an accumulation algorithm to add those subsums together. The original subsum accumulation algorithm (Algorithm 1, presented in Section 2.3) is applicable for the situation where the scalars in the bucket set are consecutive. When the scalars in the bucket set are inconsecutive, Algorithm 1 would be less efficient. This paper proposes a new subsum accumulation algorithm (Algorithm 3) that accumulates m intermediate subsums using at most 2m + d − 3 additions, where d is the maximum difference between two neighboring elements in the bucket set.
A construction of a bucket set that yields an efficient algorithm to compute MSM over fixed points. The proposed bucket set construction carefully selects integer elements from [0, q/2] so that for all t (0 ≤ t ≤ q), there exist an integer b in the bucket set and an integer m ∈ {1, 2, 3} such that the following assertion holds,

t = mb  or  t = q − mb.

When instantiating over the BLS12-381 curve [Bow17], this construction yields an algorithm that takes advantage of 3nh precomputed points to evaluate the n-scalar multiplication over fixed points where all scalars are smaller than a 255-bit prime r, using at most approximately

(nh + 0.21q) additions, if q = 2^c (10 ≤ c ≤ 31, c ≠ 15, 16, 17),
(nh + 0.28q) additions, if q = 2^16,

where h = ⌈log_q r⌉. The theoretical analysis shows that for n = 2^e (10 ≤ e ≤ 21), the proposed algorithm saves more than 21% computational cost compared to Pippenger's bucket method, and that it saves 2.6% to 9.6% computational cost compared to the most popular variant of Pippenger's bucket method, which is reviewed in Section 2.3.2.
The feasibility of accelerating the computation of MSM by taking advantage of large precomputation tables and the effectiveness of our new construction are demonstrated by our implementation. We implemented the popular variant of Pippenger's bucket method and our construction based on the BLS12-381 library blst [bls]. When computing n-scalar multiplication over fixed points in the BLS12-381 curve groups, the experimental result shows that the proposed construction saves more than 17.7% of the computing cost compared to the Pippenger's bucket method implementation built in blst for n = 2^e (10 ≤ e ≤ 21), and that it saves 3.1% to 9.2% of the computing cost compared to the variant of Pippenger's bucket method for n = 2^e (10 ≤ e ≤ 21, e ≠ 16, 20).
The paper is organized as follows. In Section 2, several popular MSM algorithms, including Pippenger's bucket method and one of its popular variants, are reviewed. Then we propose a new subsum accumulation algorithm in Section 3. In Section 4, we present a framework for computing MSM over fixed points that takes advantage of precomputation. This framework is used to derive our new MSM algorithm. Section 5 is dedicated to the construction of our new bucket set and multiplier set. We instantiate our construction over the BLS12-381 curve in Section 6 and do the theoretical time complexity analysis. In the end, we present the implementation and experimental results in Section 7.
Let us first introduce the notations used throughout the paper before diving into the
content.
Notations. Without special explanation hereinafter, let E be an elliptic curve group and r be its order. Let ⌊x⌋ be the largest integer that is equal to or smaller than x, and ⌈x⌉ be the smallest integer that is equal to or greater than x. Let || be bit string concatenation. Notation S_{n,r} represents the following MSM over fixed points,

S_{n,r} = a_1 P_1 + a_2 P_2 + · · · + a_n P_n,   (1)
where the a_i's are scalars such that 0 ≤ a_i < r and the P_i's are fixed points in E. Radix q = 2^c is an integer used to express a scalar in its radix q representation. Integer h is the length of a scalar in its radix q representation, i.e., h = ⌈log_q r⌉. The term addition refers to the point addition arithmetic in E. Let us assume for simplicity that the computational cost of doubling and that of addition in E are the same, denoted as A. This is the norm in Pippenger-like algorithms, where the major operations are additions. The storage size of a point is denoted as P.
2 Recap of multi-scalar multiplication methods
In this section we review several widely used methods that compute S_{n,r} with large n, namely the trivial method, the Straus method, Pippenger's bucket method and one of the variants of Pippenger's bucket method.
2.1 Trivial method
In the trivial method, each a_i P_i in (1) is computed separately by the doubling and addition method, then the n intermediate results are added together to obtain the final result. In the worst case each scalar multiplication costs about 2 · (⌈log_2 r⌉ − 1) · A, so the total cost of computing S_{n,r} is

[2 · (⌈log_2 r⌉ − 1) · n + (n − 1)] · A ≈ 2n log_2 r · A.   (2)

If the non-adjacent form is used to represent the scalar a_i (i = 1, 2, · · · , n), because every non-zero digit has to be adjacent to two 0s, in the worst case there are half non-zero digits in a_i. The cost of each scalar multiplication would drop to about (3/2)⌈log_2 r⌉ · A. The time complexity of computing S_{n,r} in the worst case is about

[(3/2)⌈log_2 r⌉ · n + (n − 1)] · A ≈ (3/2) · n log_2 r · A.   (3)
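To make the operation count behind (2) concrete, here is a minimal Python sketch (ours, not from the paper) that models the group additively with plain integers, so that a "point" is an int, doubling and addition are ordinary integer additions, and only the operation count is meaningful:

```python
def double_and_add(a, P):
    """Left-to-right doubling-and-addition, counting group operations.

    The group is modeled as the integers under addition: doubling is
    R + R, and the correct result of a-scalar-multiplication is a * P.
    """
    ops = 0
    R = 0
    for bit in bin(a)[2:]:       # most significant bit first
        if R != 0:
            R = R + R            # one doubling
            ops += 1
        if bit == '1':
            if R != 0:
                ops += 1         # one addition (adding into zero is free)
            R = R + P
    return R, ops

# 181 = 0b10110101: 7 doublings and 4 additions after the leading bit
assert double_and_add(181, 7) == (181 * 7, 11)
```

In the worst case (all scalar bits set) the count approaches 2 · (⌈log_2 r⌉ − 1), matching the per-scalar estimate used in (2).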
2.2 Straus method
In order to compute S_{n,r}, the Straus method [Str64] precomputes the 2^{nc} points

{b_1 P_1 + b_2 P_2 + · · · + b_n P_n | 0 ≤ b_i ≤ 2^c − 1, i = 1, 2, · · · , n},

where c is a small integer. It then divides each a_i (in its binary form with the high order bit to the left) into segments of length c, i.e.,

a_i = a_{i,h−1} || a_{i,h−2} || · · · || a_{i,1} || a_{i,0} = Σ_{j=0}^{h−1} a_{ij} 2^{jc},  i = 1, 2, · · · , n,   (4)

where h = ⌈log_2(r)/c⌉ and 0 ≤ a_{ij} < 2^c for 0 ≤ j ≤ h − 1. It retrieves the point

S_{n,2^c} = a_{1,h−1} P_1 + a_{2,h−1} P_2 + · · · + a_{n,h−1} P_n   (5)

from the precomputation table, doubles it c times, and adds the precomputed point

a_{1,h−2} P_1 + a_{2,h−2} P_2 + · · · + a_{n,h−2} P_n   (6)

to obtain

S_{n,2^{2c}} = (a_{1,h−1} || a_{1,h−2}) P_1 + (a_{2,h−1} || a_{2,h−2}) P_2 + · · · + (a_{n,h−1} || a_{n,h−2}) P_n.   (7)

By repeating this process h − 1 times, we obtain

S_{n,2^{hc}} = (a_{1,h−1} || a_{1,h−2} || · · · || a_{1,0}) P_1 + (a_{2,h−1} || a_{2,h−2} || · · · || a_{2,0}) P_2 + · · · + (a_{n,h−1} || a_{n,h−2} || · · · || a_{n,0}) P_n.

S_{n,2^{hc}} is exactly what we aim to compute, i.e., S_{n,r}.
The Straus method is only suitable for small n, because when n goes big the precomputation would be exponentially large. One variant that can be used for a large number n is to only store the n · (2^c − 1) precomputed values

{b_i P_i | 1 ≤ b_i ≤ 2^c − 1, i = 1, 2, · · · , n},

where c is a small integer. At the j-th iteration of (5)(6)(7) (j = 0, 1, 2, · · · , h − 1) in the Straus method, separately add together the precomputed points a_{1j} P_1, a_{2j} P_2, · · · , a_{nj} P_n with n − 1 additions to obtain

a_{1j} P_1 + a_{2j} P_2 + · · · + a_{nj} P_n.

The storage size would drop from 2^{nc} · P to

n(2^c − 1) · P.

This process repeats h times; each time it conducts n additions and c doublings (the last time does not require doubling), so the computational cost is approximately

(n + c)h · A.   (8)
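The low-storage variant above can be sketched as follows (Python, ours, not from the paper; points are again modeled as plain integers under addition, and the function and table names are our own):

```python
def straus_variant(scalars, points, c, h):
    """Low-storage Straus: precompute {b * P_i | 1 <= b < 2**c}, then
    interleave h rounds of c doublings with up to n table additions.
    Points are modeled as plain integers under addition."""
    n = len(points)
    table = {(i, b): b * points[i] for i in range(n) for b in range(1, 2 ** c)}
    acc = 0
    for j in range(h - 1, -1, -1):           # high digit first
        if j != h - 1:
            for _ in range(c):
                acc += acc                   # c doublings per round
        for i, a in enumerate(scalars):
            d = (a >> (c * j)) & (2 ** c - 1)
            if d:
                acc += table[(i, d)]         # at most n additions per round
    return acc

scalars, points = [23, 45, 67], [3, 5, 7]
assert straus_variant(scalars, points, 2, 4) == sum(a * P for a, P in zip(scalars, points))
```

The h rounds of n additions plus c doublings are exactly the (n + c)h count in (8).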
2.3 Pippenger’s bucket method
Here we introduce Pippenger's bucket method presented in [BDLO12, Section 4], which is an application of Pippenger's algorithm [Pip76].
Pippenger's bucket method proceeds the same as the Straus method except for computing

S_{n,2^c} = a_{1j} P_1 + a_{2j} P_2 + · · · + a_{nj} P_n,   (9)

where j = 0, 1, 2, · · · , h − 1 and h = ⌈log_2(r)/c⌉.
Pippenger's bucket method evaluates (9) by first sorting all the points into 2^c − 1 buckets with respect to their scalars. We denote the intermediate subsum of those points corresponding to scalar i as S_i. It computes all S_i's (i = 1, 2, · · · , 2^c − 1) using at most n − (2^c − 1) additions. Finally it computes S_{n,2^c} = Σ_{i=1}^{2^c−1} i · S_i by Algorithm 1 using at most 2(2^c − 2) additions.
Algorithm 1 Subsum accumulation algorithm I
Input: S_1, S_2, · · · , S_m.
Output: 1·S_1 + 2·S_2 + · · · + m·S_m.
  tmp = 0
  tmp1 = 0
  for i = m to 1 do
    tmp = tmp + S_i
    tmp1 = tmp1 + tmp
  return tmp1
The correctness of Algorithm 1 is ensured by the following equation,

Σ_{i=1}^{m} i S_i = Σ_{i=1}^{m} Σ_{j=1}^{i} S_i = Σ_{j=1}^{m} Σ_{i=j}^{m} S_i.
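A direct transcription of Algorithm 1 (Python, ours; "points" are plain integers) shows the two running sums at work:

```python
def subsum_accumulation_I(S):
    """Algorithm 1: compute 1*S[0] + 2*S[1] + ... + m*S[m-1] with at
    most 2(m - 1) additions; points are modeled as integers."""
    tmp = tmp1 = 0
    for Si in reversed(S):      # i = m down to 1
        tmp += Si               # running suffix sum  sum_{i >= j} S_i
        tmp1 += tmp             # accumulates  sum_j sum_{i >= j} S_i
    return tmp1

S = [4, 9, 2, 7]
assert subsum_accumulation_I(S) == sum((i + 1) * s for i, s in enumerate(S))
```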
The computation of S_{n,2^c} costs n − (2^c − 1) + 2(2^c − 2) ≈ (n + 2^c) additions. The computational cost of S_{n,r} is thus approximately

(n + 2^c)h · A.   (10)

Compared to (8), at first glimpse it seems that Pippenger's bucket method is less efficient than the Straus method, but this might not be right for large n. Because there is no precomputation requirement in Pippenger's bucket method, a bigger c can be selected to minimize the overall computational cost.
2.3.1 The variant
In the aforementioned Pippenger's bucket method, one downside is that Algorithm 1 runs h times. If there is storage available for precomputation, this shortcoming can be circumvented by the variant presented in [BGMW95].
Choose a radix q = 2^c and partition a_i (i = 1, 2, · · · , n) into segments as follows,

a_i = a_{i,h−1} || a_{i,h−2} || · · · || a_{i,0} = Σ_{j=0}^{h−1} a_{ij} q^j,   (11)

where h = ⌈log_q r⌉ and 0 ≤ a_{ij} < q (0 ≤ j ≤ h − 1). It follows that

S_{n,r} = a_1 P_1 + a_2 P_2 + · · · + a_n P_n
        = Σ_{i=1}^{n} (Σ_{j=0}^{h−1} a_{ij} q^j) P_i
        = Σ_{i=1}^{n} Σ_{j=0}^{h−1} a_{ij} · (q^j P_i)
        =: S_{nh,q}.   (12)

We precompute the following points

{q^j P_i | i = 1, 2, · · · , n, j = 0, 1, 2, · · · , h − 1},

which requires a storage size of

nh · P,

then S_{n,r} = S_{nh,q} can be computed by using Algorithm 1 only once. The computational cost is

[nh − (q − 1) + 2(q − 2)] · A ≈ (nh + q) · A.
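The variant can be sketched end to end (Python, ours, not from the paper; points are modeled as plain integers under addition, so correctness is simply checked against Σ a_i P_i):

```python
def pippenger_variant(scalars, points, c, h):
    """The [BGMW95] variant: precompute q^j * P_i, flatten S_{n,r} into
    an nh-scalar MSM with digits < q, sort into q - 1 buckets, and run
    the subsum accumulation once.  Points are modeled as integers."""
    q = 2 ** c
    pre = {(i, j): (q ** j) * P for i, P in enumerate(points) for j in range(h)}
    buckets = [0] * q                   # buckets[k] collects points with digit k
    for i, a in enumerate(scalars):
        for j in range(h):
            d = (a >> (c * j)) & (q - 1)
            if d:
                buckets[d] += pre[(i, j)]
    # Algorithm 1: accumulate 1*buckets[1] + ... + (q-1)*buckets[q-1]
    acc = run = 0
    for k in range(q - 1, 0, -1):
        run += buckets[k]
        acc += run
    return acc

scalars, points = [1234, 5678, 91011], [3, 5, 7]
assert pippenger_variant(scalars, points, 4, 5) == sum(a * P for a, P in zip(scalars, points))
```

The single bucket pass over nh digits plus one accumulation is where the (nh + q) count comes from.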
2.3.2 Further optimization
Pippenger's bucket method and the variant can be further optimized by halving the size of the bucket set. Let radix q = 2^c. Using the observation that in an elliptic curve group −P is obtained from P by taking the negative of its y coordinate with almost no cost, all the buckets can be restricted to scalars that are no more than q/2 if

q^{h−1} < r ≤ q/2 · q^{h−1},

where h = ⌈log_q r⌉. Algorithm 2 can be used to convert a scalar a (0 ≤ a < r) from its standard q-ary form to the representation where every digit is in the range [−q/2, q/2].
Algorithm 2 Scalar conversion I
Input: {a_j}_{0≤j≤h−1}, 0 ≤ a_j < q, such that a = Σ_{j=0}^{h−1} a_j q^j.
Output: {b_j}_{0≤j≤h−1}, −q/2 ≤ b_j ≤ q/2, such that a = Σ_{j=0}^{h−1} b_j q^j.
1: for j = 0 to h − 2 by 1 do
2:   if a_j ≤ q/2 then
3:     b_j = a_j
4:   else
5:     b_j = a_j − q
6:     a_{j+1} = a_{j+1} + 1
7: b_{h−1} = a_{h−1}
8: return {b_j}_{0≤j≤h−1}
The correctness of Algorithm 2 is straightforward. Notice that the assumption ensures a_{h−1} ≤ q/2 − 1, so b_{h−1} ≤ q/2 considering the possible carry bit from a_{h−2}.
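Algorithm 2 can be sketched as follows (Python, ours; q = 2^c and the standard digits are extracted by masking; the exhaustive check at the end covers every scalar whose top digit is below q/2, matching the assumption above):

```python
def scalar_conversion_I(a, c, h):
    """Algorithm 2: rewrite the standard q-ary digits of a (q = 2**c)
    into signed digits b_j in [-q/2, q/2]; assumes the top digit is at
    most q/2 - 1 so it can absorb the final carry."""
    q = 2 ** c
    digits = [(a >> (c * j)) & (q - 1) for j in range(h)]
    b = [0] * h
    for j in range(h - 1):
        if digits[j] <= q // 2:
            b[j] = digits[j]
        else:
            b[j] = digits[j] - q       # borrow q here ...
            digits[j + 1] += 1         # ... and carry 1 upward
    b[h - 1] = digits[h - 1]
    return b

c, h = 4, 3
for a in range(2 ** (c * h - 1)):      # scalars whose top digit is < q/2
    b = scalar_conversion_I(a, c, h)
    assert a == sum(bj * 16 ** j for j, bj in enumerate(b))
    assert all(-8 <= bj <= 8 for bj in b)
```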
The time complexity of Pippenger's bucket method would thus drop to

h(n + q/2) · A,   (13)

and the complexity of the variant would be

(nh + q/2) · A.   (14)

Henceforward, when mentioning Pippenger's bucket method and Pippenger's variant, we refer to the algorithms whose time complexities are (13) and (14) respectively.
2.4 Comparison of multi-scalar multiplication algorithms
We summarize in Table 1 the precomputation storage and the time complexity of computing S_{n,r} by the aforementioned methods together with our construction proposed in Section 5. Here q = 2^c and h = ⌈log_q r⌉. Radix q is selected to minimize the computational cost. The time complexities of Pippenger's bucket method and Pippenger's variant hold if r ≤ q/2 · q^{h−1}. The time complexity of our construction holds when r/q^h is small.

Table 1: Comparison of different methods that compute S_{n,r}

Method                     | Storage   | Worst case complexity
Trivial method             | n · P     | 3/2 · (n log_2 r) · A
Straus method [Str64]      | n2^c · P  | h(n + c) · A
Pippenger [Pip76, BDLO12]  | n · P     | h(n + q/2) · A
Pippenger variant [BGMW95] | nh · P    | (nh + q/2) · A
Our construction           | 3nh · P   | (nh + 0.21q) · A
3 A new subsum accumulation algorithm
During the computation of S_{n,r} by Pippenger's bucket method, after sorting every point into the bucket with respect to its scalar and computing the intermediate subsums S_i's, what remains is invoking a subsum accumulation algorithm to compute

S = b_1 S_1 + b_2 S_2 + · · · + b_m S_m,

where 1 ≤ b_1 ≤ b_2 ≤ · · · ≤ b_m. When the set {b_i}_{1≤i≤m} is not a sequence of consecutive integers, Algorithm 1 shows the limitation of handling such a case with less efficiency. One may utilize the Bos-Coster method [DR94, Section 4] to deal with this case, but it is a recursive algorithm and its complexity is not easy to analyze. Here we propose a straightforward algorithm to tackle this case.
Define b_0 = 0 and let

d = max_{1≤i≤m} {b_i − b_{i−1}},

then S can be computed by Algorithm 3.
Algorithm 3 Subsum accumulation algorithm II
Input: b_1, b_2, · · · , b_m, S_1, S_2, · · · , S_m.
Output: S = b_1 S_1 + b_2 S_2 + · · · + b_m S_m.
1: Define a length-(d + 1) array tmp = [0] × (d + 1)
2: for i = m to 1 by −1 do
3:   tmp[0] = tmp[0] + S_i
4:   k = b_i − b_{i−1}
5:   if k ≥ 1 then
6:     tmp[k] = tmp[k] + tmp[0]
7: return 1 · tmp[1] + 2 · tmp[2] + · · · + d · tmp[d]
Denote δ_j = b_j − b_{j−1}, then b_i = Σ_{j=1}^{i} δ_j. The correctness of Algorithm 3 comes from the following equation,

Σ_{i=1}^{m} b_i S_i = Σ_{i=1}^{m} (Σ_{j=1}^{i} δ_j) S_i
                    = Σ_{j=1}^{m} δ_j (Σ_{i=j}^{m} S_i)
                    = Σ_{k=1}^{d} k Σ_{j=1, δ_j=k}^{m} (Σ_{i=j}^{m} S_i).   (15)
During the execution of Algorithm 3, the temp variable tmp[0] stores Σ_{i=j}^{m} S_i when the loop index i equals j, and the temp variable tmp[k] stores Σ_{j=1, δ_j=k}^{m} (Σ_{i=j}^{m} S_i) for 1 ≤ k ≤ d after the for loop.
If {b_i}_{1≤i≤m} is strictly increasing and k in line 4 goes through {1, 2, · · · , d}, then in the for loop (lines 2–6), each iteration executes exactly 2 additions. Since all d + 1 temp variables in tmp are initialized as 0's, there are d + 1 additions with addend 0, which have no computational cost, so the for loop executes 2m − (d + 1) additions. Line 7 is computed by the subsum accumulation Algorithm 1 with 2(d − 1) additions. In total, the cost of Algorithm 3 is 2m + d − 3 additions.
If {b_i}_{1≤i≤m} is not strictly increasing, which means sometimes k in line 4 equals 0, then the corresponding for iteration only executes one addition by skipping the if part.
If k in line 4 does not go through all integers in {1, 2, · · · , d}, there exists a tmp[k] (1 ≤ k ≤ d) which skips the for loop and stays at 0. In the for loop, the addition saved by the fact that tmp[k] is initialized as 0 will no longer be saved. In the meantime, when line 7 is executed, at least one addition will be saved because tmp[k] = 0, so the total cost will not increase.
To sum up, the cost of Algorithm 3 in the worst case is (2m + d − 3) · A. When d = 1, Algorithm 3 degenerates to Algorithm 1.
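Algorithm 3, including line 7's final accumulation via Algorithm 1, can be transcribed as follows (Python, ours; subsums are plain integers):

```python
def subsum_accumulation_II(b, S):
    """Algorithm 3: compute b_1*S_1 + ... + b_m*S_m for nondecreasing
    labels b_1 <= ... <= b_m with at most 2m + d - 3 additions, where
    d is the maximum gap between neighboring labels (b_0 = 0)."""
    d = max(bi - prev for bi, prev in zip(b, [0] + b[:-1]))
    tmp = [0] * (d + 1)
    for i in range(len(b) - 1, -1, -1):
        tmp[0] += S[i]                       # suffix sum  sum_{i' >= i} S_{i'}
        k = b[i] - (b[i - 1] if i > 0 else 0)
        if k >= 1:
            tmp[k] += tmp[0]                 # bank the suffix sum at gap k
    # line 7: 1*tmp[1] + 2*tmp[2] + ... + d*tmp[d], via Algorithm 1
    acc = run = 0
    for t in reversed(tmp[1:]):
        run += t
        acc += run
    return acc

b, S = [1, 3, 4, 4, 9], [5, 2, 7, 1, 6]
assert subsum_accumulation_II(b, S) == sum(bi * si for bi, si in zip(b, S))
```

With consecutive labels (d = 1) the function reduces to the behavior of Algorithm 1, as noted above.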
4 A framework of computing multi-scalar multiplication
over fixed points
The following framework is inspired by Brickell et al. [BGMW95], who presented a similar method to compute single scalar multiplication using the notion of basic digit sets. They did not consider the possible overflow of the most significant digit of a scalar, which is not a big issue in single scalar multiplication while it matters in MSM S_{n,r}, because the overflow will increase the computational cost by at most n additions. Here we give a straightforward illustration of the framework without the involvement of basic digit sets.
Suppose we are going to compute S_{n,r}. Let M be a set of integers, and B be a set of non-negative integers with 0 ∈ B. Given a scalar a_i (0 ≤ a_i < r) in its radix q representation

a_i = Σ_{j=0}^{h−1} a_{ij} q^j,

where h = ⌈log_q r⌉, if every a_{ij} (1 ≤ i ≤ n, 0 ≤ j ≤ h − 1) is the product of an element from set M and an element from set B, i.e.,

a_{ij} = m_{ij} b_{ij},  m_{ij} ∈ M,  b_{ij} ∈ B,
then S_{n,r} can be computed as follows,

S_{n,r} = Σ_{i=1}^{n} a_i P_i = Σ_{i=1}^{n} (Σ_{j=0}^{h−1} a_{ij} q^j) P_i
        = Σ_{i=1}^{n} (Σ_{j=0}^{h−1} m_{ij} b_{ij} q^j) P_i
        = Σ_{i=1}^{n} Σ_{j=0}^{h−1} b_{ij} · (m_{ij} q^j P_i).   (16)

Denote P_{ij} = m_{ij} q^j P_i, then

S_{n,r} = Σ_{i=1}^{n} Σ_{j=0}^{h−1} b_{ij} P_{ij}
        = Σ_{k∈B} k · (Σ_{i,j s.t. b_{ij}=k} P_{ij}).   (17)
Suppose those nh|M| points

{m q^j P_i | 1 ≤ i ≤ n, 0 ≤ j ≤ h − 1, m ∈ M}   (18)

are precomputed, and define the intermediate subsums S_k,

S_k = Σ_{i,j s.t. b_{ij}=k} P_{ij},  k ∈ B.

Equation (17) can be evaluated by first computing all S_k's (k ∈ B) with at most nh − (|B| − 1) additions; the reason is straightforward since there are nh points being sorted into |B| − 1 subsums. The remainder is computed by Algorithm 3 with at most 2(|B| − 1) + d − 3 additions, where d is the maximum difference between two neighboring elements in B.
To sum up, the worst case time complexity of computing S_{n,r} is

(nh + |B| + d − 4) · A,   (19)

where h = ⌈log_q r⌉, with the help of

nh|M|   (20)

precomputed points.
Set M is called a multiplier set, because the set of precomputed points contains the points multiplied by every element from M. Set B is called a bucket set, since all points are sorted into subsum buckets with respect to the scalars in B. This framework is translated into Algorithm 4.
Algorithm 4 Multi-scalar multiplication over fixed points
Input: Scalars a_1, a_2, · · · , a_n, fixed points P_1, P_2, · · · , P_n, radix q, scalar length h, multiplier set M = {m_0, m_1, · · · , m_{|M|−1}}, bucket set B = {b_0, b_1, · · · , b_{|B|−1}}.
Output: S_{n,r} = Σ_{i=1}^{n} a_i P_i.
1: Precompute a length-nh|M| point array precomputation, such that precomputation[|M|((i − 1)h + j) + k] = m_k q^j P_i. Precompute a hash table mindex to record the index of every multiplier, such that mindex[m_k] = k. Precompute a hash table bindex to record the index of every bucket, such that bindex[b_k] = k.
2: Convert every a_i to its standard q-ary form, then convert it to a_i = Σ_{j=0}^{h−1} m_{ij} b_{ij} q^j.
3: Create a length-nh scalar array scalars, such that scalars[(i − 1)h + j] = b_{ij}. Create a length-nh array points recording the index of points, such that points[(i − 1)h + j] = |M|((i − 1)h + j) + mindex[m_{ij}]. The n-scalar multiplication S_{n,r} is equivalent to the following nh-scalar multiplication

Σ_{i=0}^{nh−1} scalars[i] · precomputation[points[i]],

where every scalar in scalars is from the bucket set B.
4: Create a length-|B| point array buckets to record the intermediate subsums, and initialize every point to infinity. For 0 ≤ i ≤ nh − 1, add the point precomputation[points[i]] to the bucket buckets[bindex[scalars[i]]].
5: Invoke Algorithm 3 to compute Σ_{i=0}^{|B|−1} b_i · buckets[i], and return the result.
If we denote the expected number of zero elements in the length-nh array scalars as f, and assume all elements in the length-|B| array buckets in Step 5 are non-zero, then the average cost can be estimated as

(nh + |B| + d − f) · A.   (21)

From (19)(21) we can see that, given n and r, in order to reduce the time complexity of computing S_{n,r}, we can choose a larger radix q to make h smaller, or find a smaller bucket set B. These two alternatives are closely related.
Here are two examples of utilizing this framework.
Example 1. Under this framework, Pippenger's variant presented in Section 2.3.2 has

M = {−1, 1},  B = {0, 1, 2, · · · , 2^{c−1}}.
Example 2. For radix q = 2^c such that

q^{h−1} < r ≤ 1/4 · q · q^{h−1},

we denote λ = q mod 3, λ ∈ {1, 2}. The multiplier set is picked as

M = {1, −1, 3, −3},   (22)

and the corresponding bucket set is

B = {i | 0 ≤ i ≤ q/4} ∪ {3i − λ | for all i s.t. q/4 ≤ 3i − λ ≤ q/2}.   (23)

It can be shown that this is a valid construction and that |B| ≤ q/3 + 2, thus the pair (M, B) yields an algorithm to compute S_{n,r} using at most

(nh + ⌈q/3⌉) · A.
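Example 2's claims are easy to check numerically; the following script (Python, ours, not from the paper; q = 2^10 as an instance) verifies both the size bound and the digit coverage:

```python
q = 2 ** 10                       # radix; any power of two >= 16 behaves the same
lam = q % 3                       # lambda in {1, 2} since q is a power of two
B = set(range(q // 4 + 1)) | {3 * i - lam for i in range(q)
                              if q // 4 <= 3 * i - lam <= q // 2}

# size bound |B| <= q/3 + 2 claimed in Example 2
assert len(B) <= q // 3 + 2

# coverage: every t in [0, q/2] equals m*b or q - m*b with m in {1, 3}, b in B
reach = {m * b for m in (1, 3) for b in B} | {q - m * b for m in (1, 3) for b in B}
assert all(t in reach for t in range(q // 2 + 1))
```

The coverage check mirrors the validity argument: t ≡ 0 mod 3 uses m = 3 with b = t/3 ≤ q/4; t ≡ −λ mod 3 lies in the {3i − λ} part of B; and for t ≡ λ mod 3, q − t is divisible by 3 with (q − t)/3 ≤ q/4.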
5 A construction of multiplier set and bucket set
In this section, we construct a pair of multiplier set and bucket set (M, B) that can be utilized to speed up the computation of S_{n,r} under the framework presented in Section 4. The essential difficulty in reducing the size of B is to make sure that every scalar in [0, r − 1] can be converted to its radix q representation where every digit is the product of an element from M and an element from B.
Given a scalar a (0 ≤ a < r) in its standard q-ary representation

a = Σ_{j=0}^{h−1} a_j q^j,  0 ≤ a_j < q,   (24)

we will show that our construction enables the scalar conversion from its standard q-ary form to the required radix q representation, thus yielding an efficient S_{n,r} computation algorithm.
5.1 Our construction
For radix q = 2^c (10 ≤ c ≤ 31), the multiplier set is picked as

M = {1, 2, 3, −1, −2, −3}.   (25)

The bucket set B is established by an algorithm.
In order to determine B, let us first define three auxiliary sets B_0, B_1 and B_2. Let r_{h−1} = ⌊r/q^{h−1}⌋ be the maximum leading term of a scalar in its standard q-ary expression,

B_0 = {0} ∪ {b | 1 ≤ b ≤ q/2, s.t. ω_2(b) + ω_3(b) ≡ 0 mod 2},
B_2 = {0} ∪ {b | 1 ≤ b ≤ r_{h−1} + 1, s.t. ω_2(b) + ω_3(b) ≡ 0 mod 2},   (26)

where ω_2(b) represents the exponent of the factor 2 in b, and ω_3(b) represents the exponent of the factor 3 in b. For instance, if b = 2^e k with 2 ∤ k, then ω_2(b) = e. From the definitions, B_0 (or B_2) has the property that for all 0 ≤ t ≤ q/2 (or 0 ≤ t ≤ r_{h−1} + 1), there exist an element b ∈ B_0 (or b ∈ B_2) and an integer m ∈ {1, 2, 3} such that

t = mb.

Set B_0 itself is a valid bucket set construction, which was also mentioned in [BGMW95]. Since we can utilize the negative elements in M, there are redundant elements to be removed from B_0. Set B_1 is defined by Algorithm 5, and the following Property 1 holds for B_1.
Algorithm 5 Construction of auxiliary set B_1
Input: B_0, q.
Output: B_1.
1: B_1 = B_0
2: for i = q/4 to q/2 − 1 by 1 do
3:   if i is in B_0 and q − 2·i is in B_0 then
4:     B_1.remove(q − 2·i)
5: for i = ⌊q/6⌋ to q/4 − 1 by 1 do
6:   if i is in B_0 and q − 3·i is in B_0 then
7:     B_1.remove(q − 3·i)
8: return B_1
Property 1. Given q = 2^c (10 ≤ c ≤ 31), for all t (0 ≤ t ≤ q), there exist an element b ∈ B_1 and an integer m ∈ {1, 2, 3} such that

t = mb  or  t = q − mb.

This property is verified by computation using Algorithm 7. It is also asserted by computation that exchanging the two for loops in Algorithm 5 would construct the same B_1.
Finally the bucket set is proposed to be

B = B_1 ∪ B_2.   (27)
Property 2. For the multiplier set M and the bucket set B defined in (25) (27), a scalar a (0 ≤ a < r) can be expressed (not necessarily uniquely) as follows,

a = Σ_{j=0}^{h−1} m_j b_j q^j,  m_j ∈ M,  b_j ∈ B.   (28)

Proof. By Property 1 we know that an arbitrary integer t ∈ [0, q] can be expressed as

t = mb + αq,  m ∈ M,  b ∈ B,  α ∈ {0, 1},

and by the definition of B_2 we know that an integer t ∈ [0, r_{h−1} + 1] can be expressed as

t = mb,  m ∈ {1, 2, 3},  b ∈ B.

Back to Property 2, Algorithm 6 can be used to convert a from its standard q-ary representation defined in (24) to its radix q representation defined in (28).

Algorithm 6 Scalar conversion II
Input: {a_j}_{0≤j≤h−1}, 0 ≤ a_j < q, such that a = Σ_{j=0}^{h−1} a_j q^j.
Output: {(m_j, b_j)}_{0≤j≤h−1}, m_j ∈ M, b_j ∈ B, such that a = Σ_{j=0}^{h−1} m_j b_j q^j.
1: for j = 0 to h − 2 by 1 do
2:   Obtain m_j, b_j, α_j such that a_j = m_j b_j + α_j q
3:   a_{j+1} = α_j + a_{j+1}
4: Obtain m_{h−1}, b_{h−1} such that a_{h−1} = m_{h−1} b_{h−1}
5: return {(m_j, b_j)}_{0≤j≤h−1}

The correctness of Algorithm 6 comes from the fact that

i) a_0 ∈ [0, q − 1],
ii) α_j + a_{j+1} ∈ [0, q] for all 0 ≤ j ≤ h − 3,
iii) α_{h−2} + a_{h−1} ∈ [0, r_{h−1} + 1].
For every t ∈ [0, q], a hash table H is precomputed to store its decomposition, i.e., H(t) = (m, b, α) such that t = mb + αq, m ∈ M, b ∈ B, α ∈ {0, 1}. Steps 2 and 4 in Algorithm 6 are executed by retrieving the decomposition from the hash table. For the proposed (M, B), the hash table H can be realized by a length-(q + 1) array decomposition using Algorithm 7. decomposition is also utilized to verify Property 1 by checking whether in decomposition there is any entry whose last element is −1.

Algorithm 7 Construction of the digit decomposition hash table
Input: M, B defined in (25) (27).
Output: Length-(q + 1) array decomposition, a realization of the hash table H.
1: Define a length-(q + 1) array decomposition and initiate every entry to be [0, 0, −1].
2: for m ∈ {−1, −2, −3} do
3:   for b ∈ B do
4:     if m·b + q ≥ 0 then
5:       decomposition[m·b + q] = [m, b, 1]
6: for m ∈ {1, 2, 3} do
7:   for b ∈ B do
8:     if m·b ≤ q then
9:       decomposition[m·b] = [m, b, 0]
10: return decomposition
When instantiating over the BLS12-381 curve, it is calculated that approximately

|B| = 0.21q, if q = 2^c (10 ≤ c ≤ 31, c ≠ 15, 16, 17),
|B| = 0.28q, if q = 2^16.   (29)

See Section 6 and Appendix A for the detailed parameters. It is checked that the maximum difference between two neighboring elements in B is d = 6.
For a point P = (x, y) on an elliptic curve E in short Weierstrass form, −P = (x, −y) can be obtained for almost no cost, hence the points associated with the negative elements in M can be excluded from the precomputation table. Correspondingly, in Step 3 of Algorithm 4, a length-nh boolean array is added to record the signs of the multipliers. In Step 4, if a multiplier is negative, the corresponding point should be subtracted from the intermediate subsum.
Proposition 1.
Given number of points
n
and group order
r
, suppose
q
= 2
c
(10
c
31)
and
h
=
dlogqre
, the multiplier set and bucket set defined in (25) (27) yield an algorithm
to compute MSM Sn,r over BLS12-381 curve using at most approximately
((nh + 0.21q)·A, q = 2c(10 c31, c 6= 15,16,17),
(nh + 0.28q)·A, q = 216 ,(30)
with the help of 3nh precomputed points
mqjPi|1in, 0jh1, m {1,2,3}.
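The cost formula of Proposition 1 can be evaluated numerically to find the optimal radix for a given n. The sketch below is an illustrative reconstruction (not the authors' code): it takes h = ⌈255/c⌉ for the 255-bit group order, the |B|/q ratios of (29) and Table 6 (0.21 by default, 0.28 for c = 16, roughly 0.528 for the abandoned radixes c = 15, 17), and minimizes nh + |B| over c. It reproduces the "Our construction" column of Table 2:

```python
from math import ceil

# Worst-case addition count of Proposition 1: nh + |B|, with |B| ~ ratio(c) * 2^c.
def ratio(c):
    return {15: 0.528, 16: 0.28, 17: 0.528}.get(c, 0.21)

def cost(n, c, rbits=255):
    h = ceil(rbits / c)                   # h = ceil(log_q r) for q = 2^c
    return n * h + ratio(c) * 2 ** c

def optimal_radix(n):
    return min(range(10, 32), key=lambda c: cost(n, c))

# Matches Table 2 / Table 3: q = 2^13 for n = 2^10, q = 2^22 for n = 2^20,
# with about 2.22 * 10^4 additions in the first case.
assert optimal_radix(2 ** 10) == 13
assert optimal_radix(2 ** 20) == 22
assert round(cost(2 ** 10, 13)) == 22200
```

The search makes the trade-off visible: a larger radix shrinks h (fewer per-scalar additions) but grows the bucket set |B| that must be merged at the end.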
6 Instantiation

In this section we instantiate our construction over the BLS12-381 curve [BLS02, Bow17], and present a theoretical comparison against Pippenger's bucket method and Pippenger's variant.
BLS12-381 is a pairing-friendly elliptic curve initially designed by Sean Bowe for the cryptocurrency system Zcash [Zca, Bow17]. It is widely deployed in blockchain applications such as Zcash, Ethereum [eth], Chia [chi], DFINITY [dfi], and Algorand [alg]. It provides about 126-bit security [Pol78, BD19, GMT20].
The BLS12-381 curve is defined by the equation

E : y^2 = x^3 + 4

over the prime field F_p, where

p = 0x1a0111ea397fe69a4b1ba7b6434bacd7
      64774b84f38512bf6730d2a0f6b0f624
      1eabfffeb153ffffb9feffffffffaaab

is the 381-bit field characteristic (in hexadecimal), and its embedding degree is 12. The two subgroups G1 ⊆ E(F_p) and G2 ⊆ E(F_{p^2}) over which bilinear pairings are defined have the same 255-bit prime order

r = 0x73eda753299d7d483339d80809a1d805
      53bda402fffe5bfeffffffff00000001.
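The quoted parameters can be sanity-checked directly in a few lines (a verification sketch using only the constants above): p has 381 bits, r has 255 bits, and embedding degree 12 means 12 is the smallest k with r | p^k − 1.

```python
# Sanity checks on the BLS12-381 parameters quoted above.
p = int(
    "1a0111ea397fe69a4b1ba7b6434bacd7"
    "64774b84f38512bf6730d2a0f6b0f624"
    "1eabfffeb153ffffb9feffffffffaaab", 16)
r = int(
    "73eda753299d7d483339d80809a1d805"
    "53bda402fffe5bfeffffffff00000001", 16)

assert p.bit_length() == 381
assert r.bit_length() == 255
# Embedding degree 12: r divides p^12 - 1 but no smaller p^k - 1.
assert pow(p, 12, r) == 1
assert all(pow(p, k, r) != 1 for k in range(1, 12))
```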
6.1 Theoretical analysis

A radix q is called optimal if it minimizes the number of additions required to compute S_{n,r} in the worst case. The optimal q and its corresponding scalar length h for the different methods are summarized in Table 2. The precomputation size presented in this table is in terms of points in G1 with affine coordinates; the precomputation size over G2 would be double its counterpart over G1.

Pippenger's bucket method and Pippenger's variant are the two methods introduced in Section 2.3.2. Our construction refers to the proposed construction presented in Section 5.
Table 2: Radix q, length h and precomputation size utilized to compute S_{n,r}

              Pippenger               Pippenger variant        Our construction
  n       q     h   Storage        q     h   Storage        q     h   Storage
  2^10    2^8   32  96.0 KB        2^12  22  2.06 MB        2^13  20  5.62 MB
  2^11    2^10  26  192 KB         2^13  20  3.75 MB        2^14  19  10.6 MB
  2^12    2^10  26  384 KB         2^13  20  7.50 MB        2^14  19  21.3 MB
  2^13    2^11  24  768 KB         2^14  19  14.2 MB        2^16  16  36.0 MB
  2^14    2^12  22  1.50 MB        2^16  16  24.0 MB        2^16  16  72.0 MB
  2^15    2^13  20  3.00 MB        2^16  16  48.0 MB        2^16  16  144 MB
  2^16    2^13  20  6.00 MB        2^16  16  96.0 MB        2^19  14  252 MB
  2^17    2^16  16  12.0 MB        2^18  15  180 MB         2^20  13  468 MB
  2^18    2^16  16  24.0 MB        2^19  14  336 MB         2^20  13  936 MB
  2^19    2^16  16  48.0 MB        2^20  13  624 MB         2^20  13  1.83 GB
  2^20    2^16  16  96.0 MB        2^20  13  1.22 GB        2^22  12  3.38 GB
  2^21    2^19  14  192 MB         2^22  12  2.25 GB        2^22  12  6.75 GB
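The storage column can be cross-checked from the point counts: an affine BLS12-381 G1 point occupies 96 bytes (as noted in Section 7.2), and the three methods precompute n, nh, and 3nh points respectively. A quick arithmetic sketch:

```python
# Cross-check of Table 2's storage figures for n = 2^10.
POINT_BYTES = 96                      # affine BLS12-381 G1 point

def storage_bytes(n_points):
    return n_points * POINT_BYTES

n = 2 ** 10
# Pippenger stores the n base points themselves (q = 2^8, h = 32).
assert storage_bytes(n) == 96 * 1024                    # 96.0 KB
# The variant stores n*h points (q = 2^12, h = 22).
assert storage_bytes(n * 22) / 2 ** 20 == 2.0625        # ~2.06 MB
# Our construction stores 3*n*h points (q = 2^13, h = 20).
assert storage_bytes(3 * n * 20) / 2 ** 20 == 5.625     # ~5.62 MB
```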
The number of additions taken to compute S_{n,r} in the worst case, together with the corresponding comparisons, are summarized in Table 3, where

Improv1 = (Pippenger − (Our construction)) / Pippenger,
Improv2 = ((Pippenger variant) − (Our construction)) / (Pippenger variant).
Table 3 shows that, theoretically, when computing S_{n,r} over the BLS12-381 curve for n = 2^e (10 ≤ e ≤ 21), our construction saves 21% to 40% of the additions compared to Pippenger's bucket method, and 2.6% to 9.6% compared to Pippenger's variant.
Table 3: Comparison of number of additions taken to compute S_{n,r} in the worst case

  n      Pippenger     Pippenger variant   Our construction   Improv1   Improv2
  2^10   3.69 × 10^4   2.46 × 10^4         2.22 × 10^4        39.8%     9.6%
  2^11   6.66 × 10^4   4.51 × 10^4         4.23 × 10^4        36.4%     6.1%
  2^12   1.20 × 10^5   8.60 × 10^4         8.12 × 10^4        32.2%     5.6%
  2^13   2.21 × 10^5   1.64 × 10^5         1.49 × 10^5        32.4%     8.8%
  2^14   4.06 × 10^5   2.95 × 10^5         2.80 × 10^5        30.8%     4.9%
  2^15   7.37 × 10^5   5.57 × 10^5         5.43 × 10^5        26.4%     2.6%
  2^16   1.39 × 10^6   1.08 × 10^6         1.03 × 10^6        26.3%     5.0%
  2^17   2.62 × 10^6   2.10 × 10^6         1.92 × 10^6        26.6%     8.2%
  2^18   4.72 × 10^6   3.93 × 10^6         3.63 × 10^6        23.1%     7.7%
  2^19   8.91 × 10^6   7.34 × 10^6         7.04 × 10^6        21.1%     4.1%
  2^20   1.73 × 10^7   1.42 × 10^7         1.35 × 10^7        22.2%     4.9%
  2^21   3.30 × 10^7   2.73 × 10^7         2.60 × 10^7        21.2%     4.5%
It is noted that the proposed bucket sets listed in Appendix A are sufficient to compute S_{n,r} over BLS12-381 for n = 2^e (22 ≤ e ≤ 28). Our method still shows a 2.8%–5.8% theoretical improvement over Pippenger's variant in those cases, but its drawback is that the precomputation table would be too large.
6.2 Time complexity: worst case versus average case

We show in this section that the difference between the worst-case time complexity and the average case is tiny, hence the worst case is used in this paper as the representative. The result depends on the group order r, which is why we perform the average-case analysis after instantiation. It is done by estimating the expected number of zero elements, denoted f, in the length-nh array scalars of Algorithm 4.
Suppose the group order in its standard q-ary form is r = Σ_{j=0}^{h−1} r_j q^j. For every uniformly randomly picked scalar a (0 ≤ a < r), when a is converted to the standard q-ary form

a = Σ_{j=0}^{h−1} a_j q^j,   0 ≤ a_j < q,

for simplicity we assume that

Pr[a_j = 0] ≈ 1/q,   Pr[a_j = q − 1] ≈ 1/q,   0 ≤ j ≤ h − 2,

and

Pr[a_{h−1} = 0] = q^{h−1}/r = q^{h−1} / (Σ_{j=0}^{h−1} r_j q^j) ≈ 1/(r_{h−1} + 1).
Let us first do the analysis for our construction. When a scalar a is converted by Algorithm 6 from its standard q-ary form to the radix-q representation

a = Σ_{j=0}^{h−1} m_j b_j q^j,   m_j ∈ M, b_j ∈ B,  (31)

we know that b_j = 0 (1 ≤ j ≤ h − 2) if and only if a_j = 0 and the carry bit from the previous digit α_{j−1} = 0, or a_j = q − 1 and the carry bit α_{j−1} = 1. Assume the probability of the carry bit being 0 is λ, which is equal to the probability of α = 0 in the array decomposition determined by Algorithm 7. Then

Pr[b_j = 0] = λ · (1/q) + (1 − λ) · (1/q) = 1/q,   1 ≤ j ≤ h − 2,

and

Pr[b_0 = 0] = 1/q,   Pr[b_{h−1} = 0] = λ · 1/(r_{h−1} + 1).

If a scalar is converted to the representation in (31), the expected number of j's such that b_j = 0 is

(h − 1)/q + λ/(r_{h−1} + 1).
When running Algorithm 4, the expected number of zeros in the array scalars is

f = n(h − 1)/q + λ · n/(r_{h−1} + 1).  (32)
Define

I = (worst-case time complexity − average-case time complexity) / (worst-case time complexity)

to measure the difference between the worst-case time complexity and the average-case time complexity. Our method utilizes a radix q comparable to n, so n(h − 1)/q is a small number that can be ignored. It follows that

I = f/(nh + |B| + 2) ≈ (λ · n/(r_{h−1} + 1)) / (nh + |B| + 2) < (λ · n/(r_{h−1} + 1)) / (nh) = λ/((r_{h−1} + 1) h).  (33)
For the radixes q = 2^c (10 ≤ c ≤ 22, c ≠ 15, 17) used in Table 2, the triples (q, h, r_{h−1}) can be found in Appendix A, and it is checked that λ < 0.7 for those radixes. It follows that I < 1%, which means the difference is small.
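Plugging sample triples from Appendix A into the bound (33) makes the claim concrete (a small numeric sketch; λ = 0.7 is used as the stated upper bound):

```python
# Bound (33): I < lambda / ((r_{h-1} + 1) * h), with lambda < 0.7.
def I_bound(h, r_top, lam=0.7):
    return lam / ((r_top + 1) * h)

# Sample (h, r_{h-1}) pairs from Table 6 (Appendix A).
assert I_bound(26, 28) < 0.01      # q = 2^10: I < 0.7/(29*26)  ~ 0.09%
assert I_bound(20, 231) < 0.01     # q = 2^13: I < 0.7/(232*20) ~ 0.015%
assert I_bound(16, 29677) < 0.01   # q = 2^16: vanishingly small
```

Even the loosest of these is well under the 1% threshold stated above.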
A similar analysis also applies to Pippenger's bucket method and Pippenger's variant, and (33) still holds for them. Those two methods have λ ≈ 0.5. For q = 2^c (8 ≤ c ≤ 22), their I is even smaller than that of our construction.
7 Implementation

In order to assess the cost of scalar conversion and the impact of the memory locality issues caused by large precomputation sizes, we conducted our experiments on the basis of blst, a BLS12-381 signature library written in C and assembly [bls]. The blst library includes the addition/doubling arithmetic and an implementation of Pippenger's bucket method over G1 and G2. We implemented Pippenger's variant and our construction following Algorithm 4, and we invoked the Pippenger's bucket method implementation built into blst.
7.1 Implementation analysis

In terms of the scalar conversion, which is Step 2 of Algorithm 4, a scalar is given as a length-8 uint32_t array. Both Pippenger's bucket method and Pippenger's variant need to first convert the scalar to its standard q-ary form, then convert it to the expression where the absolute value of every digit is no more than q/2 using Algorithm 2. Our construction first converts the scalar to its standard q-ary form, then converts it to the expression where every digit is the product of an element from the multiplier set and an element from the bucket set, using Algorithm 6 with the help of the decomposition hash table. We utilize a length-(q + 1) array decomposition to realize this hash table, so the concern boils down to retrieving data from the array decomposition.
In terms of Step 4 in Algorithm 4, where all points are sorted into different buckets, each addition is done between a point fetched from the array precomputation and another point fetched from the array buckets. We treat the n fixed points as the length-n precomputation array for Pippenger's bucket method. We have the following observations.

- Pippenger's bucket method computes equation (9) h times, so in total it fetches data nh times from its length-n array precomputation, and nh times from its length-(0.5q + 1) array buckets.

- Pippenger's variant fetches data nh times from its length-nh array precomputation, and nh times from its length-(0.5q + 1) array buckets.

- Our construction fetches data nh times from its length-3nh array precomputation, and nh times from its array buckets, whose length is roughly 0.21q (q = 2^c, c ∉ {15, 16, 17}).
Pippenger’s variant and our construction show some advantages here regarding the num-
ber of fetch operations, since their
h
’s are usually smaller than that of Pippenger’s
bucket method. Their disadvantages are that the fetch operations are executed in larger
precomputation
and
buckets
arrays. Step 4 of Algorithm 4is a simple loop, so we utilize
prefetch to mitigate the impact of memory access of large arrays.
It is noted that, in terms of fetching data from the buckets array, our construction has some advantage over Pippenger's variant when the same radix q is used, because our construction uses a smaller buckets array in that case. Even when our radix q (q ≠ 2^16) is twice as big, our construction still keeps this advantage.
7.2 Experimental result

Our experiments were run on an Apple 14-inch MacBook Pro with a 3.2 GHz M1 Pro chip and 16 GB of memory. The M1 Pro has advantages such as a large cache and high memory bandwidth. Most importantly, its cache line is 128 bytes, which is sufficient to accommodate a BLS12-381 G1 point, whose size is 96 bytes. These characteristics are expected to provide some benefit when fetching data from large arrays.

The experimental results are presented in Tables 4 and 5. Both Pippenger's variant and our construction use the optimal radixes presented in Table 2, while the Pippenger's bucket method built into blst utilizes slightly different radixes, explicitly

q = 2^{e−2} for n = 2^e (10 ≤ e ≤ 12), and q = 2^{e−3} for n = 2^e (13 ≤ e ≤ 21).
We keep blst's implementation intact because, on one hand, our focus is on the comparison between Pippenger's variant and our construction, and on the other hand, blst's implementation can serve as a performance benchmark. In Table 4, s.c. v. denotes the time spent by Pippenger's variant on the scalar conversion for all n scalars in S_{n,r}, while s.c. c. denotes that of our construction. In Table 5, Improv1 is the comparison between Pippenger's bucket method and our construction, and Improv2 is the comparison between Pippenger's variant and our construction.

Both Pippenger's variant and our construction show a huge improvement compared to Pippenger's bucket method, which demonstrates the feasibility of speeding up the computation of S_{n,r} using large precomputation tables.
If we focus on the comparison between Pippenger's variant and our construction, we have the following observations when computing S_{n,r} in G1 for n = 2^e (10 ≤ e ≤ 21), and in G2 for n = 2^e (10 ≤ e ≤ 20).¹
- In terms of the computation time of scalar conversion, in G1 Pippenger's variant spends 0.9–1.5% of its entire S_{n,r} computation time, while our construction spends 1.1–2.8%. Because in G2 the addition arithmetic takes relatively more time compared to that in G1, the percentages are smaller: in G2 Pippenger's variant spends 0.4–0.6% of its whole S_{n,r} computation time, while our construction spends 0.4–1.1%.
- Our construction does not perform well for n = 2^16 and n = 2^20. For n = 2^16, our optimal radix is q = 2^19, which is 8 times larger than that of Pippenger's variant. For n = 2^20, our optimal radix is q = 2^22, which is 4 times larger than that of Pippenger's variant. Since the radix value is even larger than n, it has a negative impact on fetching data from the array buckets, as the analysis in the previous section indicates. When we change to smaller radixes, specifically q = 2^18 for n = 2^16 and q = 2^20 for n = 2^20, our construction outperforms Pippenger's variant again, as the results marked with an asterisk show, although theoretically those radixes are not optimal.
- Our construction outperforms Pippenger's variant for n = 2^e (10 ≤ e ≤ 21, e ≠ 16, 20). In those cases, our construction demonstrates a 3.1%–9.2% improvement over Pippenger's variant, as Table 5 shows.
Table 4: Experimental time taken to compute S_{n,r} by different methods

                                        G1                           G2
  n       s.c. v.   s.c. c.    Pip.     Pip. v.  Constr.    Pip.     Pip. v.  Constr.
  2^10    123 us    134 us     15.2 ms  9.91 ms  9.03 ms    37.3 ms  24.4 ms  22.2 ms
  2^11    248 us    265 us     27.1 ms  18.3 ms  17.1 ms    66.4 ms  44.9 ms  41.9 ms
  2^12    497 us    548 us     48.5 ms  34.2 ms  32.2 ms    119 ms   83.3 ms  78.5 ms
  2^13    920 us    657 us     89.4 ms  64.8 ms  62.0 ms    221 ms   160 ms   155 ms
  2^14    1.07 ms   1.23 ms    165 ms   122 ms   114 ms     404 ms   300 ms   279 ms
  2^15    2.14 ms   2.40 ms    303 ms   224 ms   217 ms     734 ms   541 ms   522 ms
  2^16    4.29 ms   6.47 ms    551 ms   422 ms   430 ms     1.35 s   1.03 s   1.05 s
  2^16*   4.29 ms   7.63 ms    554 ms   424 ms   418 ms     1.34 s   1.03 s   1.01 s
  2^17    12.1 ms   14.9 ms    1.06 s   864 ms   822 ms     2.54 s   2.05 s   1.99 s
  2^18    17.8 ms   28.0 ms    1.93 s   1.60 s   1.49 s     4.69 s   3.88 s   3.61 s
  2^19    31.3 ms   54.6 ms    3.55 s   2.98 s   2.83 s     8.63 s   7.28 s   6.83 s
  2^20    62.7 ms   149 ms     6.84 s   5.62 s   5.63 s     16.7 s   13.7 s   13.5 s
  2^20*   62.7 ms   109 ms     6.84 s   5.61 s   5.51 s     16.6 s   13.7 s   13.3 s
  2^21    120 ms    296 ms     13.2 s   11.2 s   10.7 s     -        -        -

Rows marked with * use the smaller, theoretically non-optimal radixes discussed above.

¹ We did not test in G2 for n = 2^21 due to the memory size restriction of the test device.
Table 5: Our method versus Pippenger's bucket method and Pippenger's variant

                G1                     G2
  n       Improv1   Improv2     Improv1   Improv2
  2^10    40.6%     8.86%       40.6%     9.26%
  2^11    36.8%     6.54%       37.0%     6.78%
  2^12    33.7%     5.78%       34.2%     5.74%
  2^13    30.7%     4.40%       29.7%     3.13%
  2^14    31.2%     6.54%       31.0%     7.29%
  2^15    28.4%     3.19%       29.0%     3.61%
  2^16    21.9%     -1.88%      21.8%     -2.05%
  2^16*   24.6%     1.48%       24.9%     2.10%
  2^17    22.1%     4.91%       21.6%     3.08%
  2^18    22.8%     6.75%       23.0%     7.03%
  2^19    20.3%     5.12%       20.8%     6.06%
  2^20    17.7%     -0.13%      18.8%     1.45%
  2^20*   19.4%     1.69%       17.9%     2.66%
  2^21    19.0%     4.31%       -         -
Acknowledgments

We would like to thank the reviewers for providing detailed and valuable comments that helped us revise the manuscript. This work is supported by an NSERC SPG and a Ripple University Research Grant.
References
[alg] Algorand: The blockchain for futurefi. https://www.algorand.com/.
[BC89]
Jurjen Bos and Matthijs Coster. Addition chain heuristics. In Conference on
the Theory and Application of Cryptology, pages 400–407. Springer, 1989.
[BCG+14]
Eli Ben-Sasson, Alessandro Chiesa, Christina Garman, Matthew Green, Ian
Miers, Eran Tromer, and Madars Virza. Zerocash: Decentralized anonymous
payments from bitcoin. In 2014 IEEE Symposium on Security and Privacy, SP
2014, Berkeley, CA, USA, May 18-21, 2014, pages 459–474. IEEE Computer
Society, 2014.
[BD19]
Razvan Barbulescu and Sylvain Duquesne. Updating key size estimations for
pairings. Journal of Cryptology, 32(4):1298–1336, 2019.
[BDFG21]
Dan Boneh, Justin Drake, Ben Fisch, and Ariel Gabizon. Halo infinite: Proof-
carrying data from additive polynomial commitments. In Tal Malkin and
Chris Peikert, editors, Advances in Cryptology - CRYPTO 2021 - 41st Annual
International Cryptology Conference, CRYPTO 2021, Virtual Event, August
16-20, 2021, Proceedings, Part I, volume 12825 of Lecture Notes in Computer
Science, pages 649–680. Springer, 2021.
[BDLO12]
Daniel J Bernstein, Jeroen Doumen, Tanja Lange, and Jan-Jaap Oosterwijk.
Faster batch forgery identification. In International Conference on Cryptology
in India, pages 454–473. Springer, 2012.
[bel] bellman: A crate for building zk-snark circuits. https://github.com/zkcrypto/bellman.
[Ber06] Daniel J Bernstein. Differential addition chains. https://cr.yp.to/ecdh/diffchain-20060219.pdf, 2006.
[BFS20]
Benedikt Bünz, Ben Fisch, and Alan Szepieniec. Transparent snarks from
DARK compilers. In Anne Canteaut and Yuval Ishai, editors, Advances in
Cryptology - EUROCRYPT 2020 - 39th Annual International Conference on
the Theory and Applications of Cryptographic Techniques, Zagreb, Croatia,
May 10-14, 2020, Proceedings, Part I, volume 12105 of Lecture Notes in
Computer Science, pages 677–706. Springer, 2020.
[BGMW95]
Ernest F Brickell, Daniel M Gordon, Kevin S McCurley, and David B Wilson.
Fast exponentiation with precomputation: Algorithms and lower bounds.
preprint, Mar, 27, 1995.
[bls]
blst: a BLS12-381 signature library focused on performance and security
written in c and assembly. https://github.com/supranational/blst.
[BLS02]
Paulo SLM Barreto, Ben Lynn, and Michael Scott. Constructing elliptic
curves with prescribed embedding degrees. In International Conference on
Security in Communication Networks, pages 257–267. Springer, 2002.
[Bow17] Sean Bowe. BLS12-381: New zk-snark elliptic curve construction, 2017.
[Bro15]
Daniel R Brown. Multi-dimensional Montgomery ladders for elliptic curves,
February 17 2015. US Patent 8,958,551.
[CGGN17]
Matteo Campanelli, Rosario Gennaro, Steven Goldfeder, and Luca Nizzardo.
Zero-knowledge contingent payments revisited: Attacks and payments for
services. In Proceedings of the 2017 ACM SIGSAC Conference on Computer
and Communications Security, pages 229–243, 2017.
[chi] Chia network: a better blockchain and smart transaction platform. https://www.chia.net/.
[CHM+19]
Alessandro Chiesa, Yuncong Hu, Mary Maller, Pratyush Mishra, Noah Vesely,
and Nicholas P. Ward. Marlin: Preprocessing zksnarks with universal and
updatable SRS. IACR Cryptology ePrint Archive, 2019:1047, 2019.
[DFGK14]
George Danezis, Cédric Fournet, Jens Groth, and Markulf Kohlweiss. Square
span programs with applications to succinct NIZK arguments. In Palash
Sarkar and Tetsu Iwata, editors, Advances in Cryptology - ASIACRYPT 2014
- 20th International Conference on the Theory and Application of Cryptology
and Information Security, Kaoshiung, Taiwan, R.O.C., December 7-11, 2014.
Proceedings, Part I, volume 8873 of Lecture Notes in Computer Science, pages
532–550. Springer, 2014.
[dfi] Dfinity foundation: Internet computer. https://dfinity.org/.
[DKS09]
Christophe Doche, David R Kohel, and Francesco Sica. Double-base number
system for multi-scalar multiplications. In Annual International Conference
on the Theory and Applications of Cryptographic Techniques, pages 502–517.
Springer, 2009.
[DR94]
Peter De Rooij. Efficient exponentiation using precomputation and vector ad-
dition chains. In Workshop on the Theory and Application of of Cryptographic
Techniques, pages 389–399. Springer, 1994.
[Ebe] Jacob Eberhardt. Zokrates. https://zokrates.github.io/.
[eth]
Ethereum: a technology that’s home to digital money, global payments, and
applications. https://ethereum.org/en/.
[GGPR13]
Rosario Gennaro, Craig Gentry, Bryan Parno, and Mariana Raykova.
Quadratic span programs and succinct NIZKs without PCPs. In Thomas
Johansson and Phong Q. Nguyen, editors, Advances in Cryptology - EU-
ROCRYPT 2013, 32nd Annual International Conference on the Theory and
Applications of Cryptographic Techniques, Athens, Greece, May 26-30, 2013.
Proceedings, volume 7881 of Lecture Notes in Computer Science, pages 626–645.
Springer, 2013.
[GJW20]
Ariel Gabizon and Zachary J. Williamson. Proposal: The turbo-plonk program
syntax for specifying snark programs. 2020.
[GKR+21]
Lorenzo Grassi, Dmitry Khovratovich, Christian Rechberger, Arnab Roy, and
Markus Schofnegger. Poseidon: A new hash function for zero-knowledge proof
systems. In 30th USENIX Security Symposium (USENIX Security 21), pages
519–535, 2021.
[GLS11]
Steven D Galbraith, Xibin Lin, and Michael Scott. Endomorphisms for faster
elliptic curve cryptography on a large class of curves. Journal of cryptology,
24(3):446–469, 2011.
[GLV01]
Robert P Gallant, Robert J Lambert, and Scott A Vanstone. Faster point
multiplication on elliptic curves with efficient endomorphisms. In Annual
International Cryptology Conference, pages 190–200. Springer, 2001.
[GMT20]
Aurore Guillevic, Simon Masson, and Emmanuel Thomé. Cocks–pinch curves
of embedding degrees five to eight and optimal ate pairing computation.
Designs, Codes and Cryptography, 88(6):1047–1081, 2020.
[gna] gnark zk-SNARK library. https://github.com/ConsenSys/gnark.
[GOS06]
Jens Groth, Rafail Ostrovsky, and Amit Sahai. Non-interactive zaps and
new techniques for NIZK. In Cynthia Dwork, editor, Advances in Cryptology
- CRYPTO 2006, 26th Annual International Cryptology Conference, Santa
Barbara, California, USA, August 20-24, 2006, Proceedings, volume 4117 of
Lecture Notes in Computer Science, pages 97–111. Springer, 2006.
[GOS12]
Jens Groth, Rafail Ostrovsky, and Amit Sahai. New techniques for noninter-
active zero-knowledge. Journal of the ACM, 59(3):11:1–11:35, 2012.
[Gro06]
Jens Groth. Simulation-sound NIZK proofs for a practical language and
constant size group signatures. In Xuejia Lai and Kefei Chen, editors, Advances
in Cryptology - ASIACRYPT 2006, 12th International Conference on the
Theory and Application of Cryptology and Information Security, Shanghai,
China, December 3-7, 2006, Proceedings, volume 4284 of Lecture Notes in
Computer Science, pages 444–459. Springer, 2006.
[Gro09]
Jens Groth. Linear algebra with sub-linear zero-knowledge arguments. In
Shai Halevi, editor, Advances in Cryptology - CRYPTO 2009, 29th Annual
International Cryptology Conference, Santa Barbara, CA, USA, August 16-20,
2009. Proceedings, volume 5677 of Lecture Notes in Computer Science, pages
192–208. Springer, 2009.
[Gro10]
Jens Groth. Short pairing-based non-interactive zero-knowledge arguments.
In Masayuki Abe, editor, Advances in Cryptology - ASIACRYPT 2010 - 16th
International Conference on the Theory and Application of Cryptology and
Information Security, Singapore, December 5-9, 2010. Proceedings, volume
6477 of Lecture Notes in Computer Science, pages 321–340. Springer, 2010.
[Gro16]
Jens Groth. On the size of pairing-based non-interactive arguments. In
Marc Fischlin and Jean-Sébastien Coron, editors, Advances in Cryptology -
EUROCRYPT 2016 - 35th Annual International Conference on the Theory
and Applications of Cryptographic Techniques, Vienna, Austria, May 8-12,
2016, Proceedings, Part II, volume 9666 of Lecture Notes in Computer Science,
pages 305–326. Springer, 2016.
[GS12]
Jens Groth and Amit Sahai. Efficient noninteractive proof systems for bilinear
groups. SIAM Journal on Computing, 41(5):1193–1232, 2012.
[GWC19]
Ariel Gabizon, Zachary J. Williamson, and Oana Ciobotaru. PLONK: Per-
mutations over lagrange-bases for oecumenical noninteractive arguments of
knowledge. IACR Cryptol. ePrint Arch., page 953, 2019.
[Knu97]
Donald E Knuth. The Art of Programming, vol. 2 (3rd ed.), Seminumerical
algorithms. Addison Wesley Longman, 1997.
[Lip12]
Helger Lipmaa. Progression-free sets and sublinear pairing-based non-
interactive zero-knowledge arguments. In Ronald Cramer, editor, Theory of
Cryptography - 9th Theory of Cryptography Conference, TCC 2012, Taormina,
Sicily, Italy, March 19-21, 2012. Proceedings, volume 7194 of Lecture Notes
in Computer Science, pages 169–189. Springer, 2012.
[Mat82]
David W Matula. Basic digit sets for radix representation. Journal of the
ACM (JACM), 29(4):1131–1143, 1982.
[MBKM19]
Mary Maller, Sean Bowe, Markulf Kohlweiss, and Sarah Meiklejohn. Sonic:
Zero-knowledge snarks from linear-size universal and updatable structured
reference strings. In Lorenzo Cavallaro, Johannes Kinder, XiaoFeng Wang, and
Jonathan Katz, editors, Proceedings of the 2019 ACM SIGSAC Conference on
Computer and Communications Security, CCS 2019, London, UK, November
11-15, 2019, pages 2111–2128. ACM, 2019.
[Mon87]
Peter L Montgomery. Speeding the Pollard and elliptic curve methods of
factorization. Mathematics of computation, 48(177):243–264, 1987.
[Mon92] Peter L Montgomery. Evaluating recurrences of form x_{m+n} = f(x_m, x_n, x_{m−n}) via Lucas chains, 1983. https://cr.yp.to/bib/1992/montgomery-lucas.pdf, 1992.
[Pip76]
Nicholas Pippenger. On the evaluation of powers and related problems. In
17th Annual Symposium on Foundations of Computer Science (sfcs 1976),
pages 258–263. IEEE Computer Society, 1976.
[Pol78]
John M Pollard. Monte Carlo methods for index computation. Mathematics
of computation, 32(143):918–924, 1978.
[Rao15]
Srinivasa Rao Subramanya Rao. A note on Schoenmakers algorithm for multi
exponentiation. In 2015 12th International Joint Conference on e-Business
and Telecommunications (ICETE), volume 4, pages 384–391. IEEE, 2015.
[SIM12]
Vorapong Suppakitpaisarn, Hiroshi Imai, and Edahiro Masato. Fastest multi-
scalar multiplication based on optimal double-base chains. In World Congress
on Internet Security (WorldCIS-2012), pages 93–98. IEEE, 2012.
[Str64]
Ernst G Straus. Addition chains of vectors (problem 5125). American
Mathematical Monthly, 70(806-808):16, 1964.
[YWLT13]
Wei Yu, Kunpeng Wang, Bao Li, and Song Tian. Joint triple-base num-
ber system for multi-scalar multiplication. In International Conference on
Information Security Practice and Experience, pages 160–173. Springer, 2013.
[Zca] Zcash: Privacy-protecting digital currency. https://z.cash/.
Appendix

A Our bucket set constructions over BLS12-381 curve

Table 6 lists our bucket set constructions for q = 2^c (10 ≤ c ≤ 31, c ≠ 15, 17). The radixes 2^15 and 2^17 are abandoned because their |B|/q is too large.
Table 6: Bucket sets over BLS12-381 curve

  q      h    r_{h-1}    |B|          d    |B|/q
  2^10   26   28         218          6    0.213
  2^11   24   3          427          6    0.208
  2^12   22   7          857          6    0.209
  2^13   20   231        1725         6    0.211
  2^14   19   7          3417         6    0.209
  2^15   17   29677      17312        4    0.528
  2^16   16   29677      18343        6    0.280
  2^17   15   118710     69249        4    0.528
  2^18   15   7          54618        6    0.208
  2^19   14   231        109244       6    0.208
  2^20   13   29677      220931       6    0.211
  2^21   13   7          436906       6    0.208
  2^22   12   7419       874437       6    0.208
  2^23   12   3          1747625      6    0.208
  2^24   11   29677      3497731      6    0.208
  2^25   11   28         6990507      6    0.208
  2^26   10   1899369    14139299     6    0.211
  2^27   10   3709       27962333     6    0.208
  2^28   10   7          55924059     6    0.208
  2^29   9    7597479    112481229    6    0.210
  2^30   9    29677      223698691    6    0.208
  2^31   9    115        447392434    6    0.208
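The h and r_{h−1} columns of Table 6 can be recomputed from the group order r quoted in Section 6: h = ⌈255/c⌉ for q = 2^c, and r_{h−1} is the leading base-q digit of r. A quick cross-check:

```python
from math import ceil

# Group order r of BLS12-381 as quoted in Section 6.
r = int(
    "73eda753299d7d483339d80809a1d805"
    "53bda402fffe5bfeffffffff00000001", 16)

def top_digit(c):
    h = ceil(255 / c)              # scalar length for q = 2^c (r is 255 bits)
    return h, r >> (c * (h - 1))   # leading base-2^c digit r_{h-1}

# Spot-checks against rows of Table 6.
assert top_digit(10) == (26, 28)
assert top_digit(13) == (20, 231)
assert top_digit(16) == (16, 29677)
```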
... First, these two implementations require a pre-computation step, while our scheme does not. The pre-computation method is used to reduce the running time by precomputing a series of EC points and treating them as other base points in MSM, as shown in [LFG23]. Note that the number of precomputed points is multiple times greater than the original base points. ...
Article
Full-text available
Zero-knowledge proof is a critical cryptographic primitive. Its most practical type, called zero-knowledge Succinct Non-interactive ARgument of Knowledge (zkSNARK), has been deployed in various privacy-preserving applications such as cryptocurrencies and verifiable machine learning. Unfortunately, zkSNARK like Groth16 has a high overhead on its proof generation step, which consists of several time-consuming operations, including large-scale matrix-vector multiplication (MUL), number-theoretic transform (NTT), and multi-scalar multiplication (MSM). Therefore, this paper presents cuZK, an efficient GPU implementation of zkSNARK with the following three techniques to achieve high performance. First, we propose a new parallel MSM algorithm. This MSM algorithm achieves nearly perfect linear speedup over the Pippenger algorithm, a well-known serial MSM algorithm. Second, we parallelize the MUL operation. Along with our self-designed MSM scheme and well-studied NTT scheme, cuZK achieves the parallelization of all operations in the proof generation step. Third, cuZK reduces the latency overhead caused by CPU-GPU data transfer by 1) reducing redundant data transfer and 2) overlapping data transfer and device computation. The evaluation results show that our MSM module provides over 2.08x (up to 2.94x) speedup versus the state-of-the-art GPU implementation. cuZK achieves over 2.65x (up to 4.86x) speedup on standard benchmarks and 2.18× speedup on a GPU-accelerated cryptocurrency application, Filecoin.
Article
Full-text available
ECC is a popular public-key cryptographic algorithm, but it lacks an effective solution to multiple-point multiplication. This paper proposes a novel JSF-based fast implementation method for multiple-point multiplication. The proposed method requires a small storage space and has high performance, making it suitable for resource-constrained IoT application scenarios. This method stores and encodes the required coordinates in the pre-computation phase and uses table lookup operations to eliminate the conditional judgment operations in JSF-5, which improves the efficiency by about 70% compared to the conventional JSF-5 in generating the sparse form. This paper utilizes Co-Z combined with safegcd to achieve low computational complexity for curve coordinate pre-computation, which further reduces the complexity of multiple-point multiplication in the execution phase of the algorithm. The experiments were performed with two short Weierstrass elliptic curves, nistp256r1 and SM2. In comparison to the various CPU architectures used in the experiments, our proposed method showed an improvement of about 3% over 5-NAF.
Article
Full-text available
Recent algorithmic improvements of discrete logarithm computation in special extension fields threaten the security of pairing-friendly curves used in practice. A possible answer to this delicate situation is to propose alternative curves that are immune to these attacks, without compromising the efficiency of the pairing computation too much. We follow this direction, and focus on embedding degrees 5 to 8; we extend the Cocks–Pinch algorithm to obtain pairing-friendly curves with an efficient ate pairing. We carefully select our curve parameters so as to thwart possible attacks by “special” or “tower” Number Field Sieve algorithms. We target a 128-bit security level, and back this security claim by time estimates for the DLP computation. We also compare the efficiency of the optimal ate pairing computation on these curves to \(k=12\) curves (Barreto–Naehrig, Barreto–Lynn–Scott), \(k=16\) curves (Kachisa–Schaefer–Scott) and \(k=1\) curves (Chatterjee–Menezes–Rodríguez-Henríquez).
Conference Paper
Full-text available
The triple-base number system is used to speed up scalar multiplication. At present, the main methods to calculate a triple-base chain are greedy algorithms. We propose a new method, called the add/sub algorithm, to calculate scalar multiplication. The density of such chains gained by this algorithm with base {2,3,5} is 1 5·61426. It saves 22% additions compared with the binary/ternary method; 22.1% additions compared with the multibase non-adjacent form with base {2,3,5}; 13.7% additions compared with the greedy algorithm with base {2,3,5}; 20.9% compared with the tree approach with base {2,3}; and saves 4.1% additions compared with the add/sub algorithm with base {2,3,7}, which is the same algorithm with different parameters. To our knowledge, the add/sub algorithm with base {2,3,5} is the fastest among the existing algorithms. Also, recoding is very easy and efficient and together with the add/sub algorithm are very suitable for software implementation. In addition, we improve the greedy algorithm by plane search which searches for the best approximation with a time complexity of O(log 3 k) compared with that of the original of O(log 4 k).
Conference Paper
Ever since their introduction, zero-knowledge proofs have become an important tool for addressing privacy and scalability concerns in a variety of applications. In many systems each client downloads and verifies every new proof, and so proofs must be small and cheap to verify. The most practical schemes require either a trusted setup, as in (pre-processing) zk-SNARKs, or verification complexity that scales linearly with the complexity of the relation, as in Bulletproofs. The structured reference strings required by most zk-SNARK schemes can be constructed with multi-party computation protocols, but the resulting parameters are specific to an individual relation. Groth et al. discovered a zk-SNARK protocol with a universal structured reference string that is also updatable, but the string scales quadratically in the size of the supported relations. Here we describe a zero-knowledge SNARK, Sonic, which supports a universal and continually updatable structured reference string that scales linearly in size. We also describe a generally useful technique in which untrusted "helpers" can compute advice that allows batches of proofs to be verified more efficiently. Sonic proofs are constant size, and in the "helped" batch verification context the marginal cost of verification is comparable with the most efficient SNARKs in the literature.
Article
Recent progress on the Number Field Sieve (NFS) has forced a new assessment of the security of pairings. In this work we study the best attacks against some of the most popular pairings and propose new key sizes, using an analysis that is more precise than that in a recent article by Menezes, Sarkar and Singh. We also select pairing-friendly curves for standard security levels.
Conference Paper
Zero Knowledge Contingent Payment (ZKCP) protocols allow fair exchange of sold goods and payments over the Bitcoin network. In this paper we point out two main shortcomings of current proposals for ZKCP, and propose ways to address them. First we show an attack that allows a buyer to learn partial information about the digital good being sold, without paying for it. This break in the zero-knowledge condition of ZKCP is due to the fact that in the protocols we attack, the buyer is allowed to choose common parameters that normally should be selected by a trusted third party. We implemented and tested this attack: we present code that learns, without paying, the value of a Sudoku cell in the "Pay-to-Sudoku" ZKCP implementation. We also present ways to fix this attack that do not require a trusted third party. Second, we show that ZKCP are not suited for the purchase of digital services rather than goods. Current constructions of ZKCP do not allow a seller to receive payments after proving that a certain service has been rendered, but only for the sale of a specific digital good. We define the notion of Zero-Knowledge Contingent Service Payment (ZKCSP) protocols and construct two new protocols, for either public or private verification. We implemented our ZKCSP protocols for Proofs of Retrievability, where a client pays the server for providing a proof that the client's data is correctly stored by the server. We also implement a secure ZKCP protocol for "Pay-to-Sudoku" via our ZKCSP protocol, which does not require a trusted third party. A side product of our implementation effort is a new optimized circuit for SHA256 with less than a quarter of the number of AND gates of the best previously publicly available one. Our new SHA256 circuit may be of independent use for circuit-based MPC and FHE protocols that require SHA256 circuits.
Article
Differential addition chains (also known as strong addition chains, Lucas chains, and Chebyshev chains) are addition chains in which every sum is already accompanied by a difference. Low-cost differential addition chains are used to exponentiate efficiently in groups where the operation (a, b, a/b) ↦ ab is fast: in particular, to perform x-coordinate scalar multiplication P ↦ mP on an elliptic curve y² = x³ + Ax² + x. Similarly, low-cost two-dimensional differential addition chains are used to compute the function (P, Q, P − Q) ↦ mP + nQ efficiently on an elliptic curve. This paper presents two new constructive upper bounds on the costs of two-dimensional differential addition chains. The paper's new "binary" chain is very easy to compute and uses 3 additions (14 field multiplications in the elliptic-curve context) per exponent bit, with a uniform structure that helps protect against side-channel attacks. The paper's new "extended-gcd" chain takes more time to compute, does not have the uniform structure, and is not easy to analyze, but experiments show that it takes only about 1.77 additions (9.97 field multiplications) per exponent bit.
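The classic one-dimensional differential addition chain is the Montgomery ladder: it maintains the pair (kP, (k+1)P), so every addition kP + (k+1)P has the fixed difference P, which is in the chain from the start. A minimal sketch, using integer multiples of the base point as stand-ins for curve points:

```python
def ladder_multiples(m):
    """Montgomery ladder viewed as a binary one-dimensional differential
    addition chain for m. Tracks the pair (k, k+1); the difference of
    every sum formed is always 1, which the chain contains from the
    start."""
    r0, r1 = 0, 1            # multiples of the base point: 0*P and 1*P
    for bit in bin(m)[2:]:   # scan bits most-significant first
        if bit == '0':
            r0, r1 = 2 * r0, r0 + r1   # invariant r1 - r0 == 1 holds
        else:
            r0, r1 = r0 + r1, 2 * r1
    return r0                # equals m, i.e. the chain reaches mP
```

On a real Montgomery curve the addition step would use the stored difference's x-coordinate, which is exactly what the differential-chain property guarantees is available; the uniform per-bit structure is also what gives the side-channel protection mentioned above.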
Article
We describe some novel methods to compute the index of any integer relative to a given primitive root of a prime p. Our first method avoids the use of stored tables and apparently requires O(p^{1/2}) operations. Our second algorithm, which may be regarded as a method of catching kangaroos, is applicable when the index is known to lie in a certain interval; it requires O(w^{1/2}) operations for an interval of width w, but does not have complete certainty of success. It has several possible areas of application, including the factorization of integers.
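A hedged sketch of the kangaroo idea in a small prime-field multiplicative group: a tame kangaroo hops forward from g^b and lays a trap at its final position; a wild kangaroo hops from h = g^x under the same deterministic jump rule, and if it lands on the trap, the recorded distances reveal x. The function name, jump table, and step counts here are illustrative choices, not the paper's; as the abstract notes, the method can fail (here, by returning None).

```python
import math

def kangaroo(g, h, p, a, b):
    """Find x in [a, b] with g^x ≡ h (mod p), in expected O(sqrt(b - a))
    group operations. Deterministic walk; may return None on failure."""
    w = b - a
    k = max(1, w.bit_length() // 2 + 1)
    jumps = [2**i for i in range(k)]      # pseudorandom jump distances
    step = lambda y: jumps[y % k]         # jump chosen by current element
    N = 4 * math.isqrt(w) + 10            # tame walk length

    # Tame kangaroo: start at g^b, hop N times, set the trap.
    y, d = pow(g, b, p), 0
    for _ in range(N):
        s = step(y)
        y, d = y * pow(g, s, p) % p, d + s
    trap, trap_dist = y, d                # trap = g^(b + trap_dist)

    # Wild kangaroo: start at h = g^x, hop with the same rule.
    y, d = h % p, 0
    while d <= w + trap_dist:
        if y == trap:
            return b + trap_dist - d      # x + d = b + trap_dist
        s = step(y)
        y, d = y * pow(g, s, p) % p, d + s
    return None
```

Because both walks use the same deterministic step function, once the wild path lands on any point of the tame path the two coalesce and the wild kangaroo inevitably reaches the trap.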
Conference Paper
Batch signature verification detects whether a batch of signatures contains any forgeries. Batch forgery identification pinpoints the location of each forgery. Existing forgery-identification schemes vary in their strategies for selecting subbatches to verify (individual checks, binary search, combinatorial designs, etc.) and in their strategies for verifying subbatches. This paper exploits synergies between these two levels of strategies, reducing the cost of batch forgery identification for elliptic-curve signatures.
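As a baseline for the subbatch-selection strategies this abstract surveys, binary search repeatedly halves a failing batch until each forgery is isolated, so the verification cost grows with the number of forgeries rather than the batch size. The sketch below uses a toy `verify_batch` predicate in place of real batch signature verification.

```python
def find_forgeries(batch, verify_batch):
    """Locate the indices of all invalid items by recursive halving:
    verify a sub-batch once; if it passes, every item in it is good,
    otherwise split it and recurse. Cheap when forgeries are rare."""
    bad = []
    def recurse(lo, hi):
        if lo >= hi:
            return
        if verify_batch(batch[lo:hi]):
            return                       # whole sub-batch is valid
        if hi - lo == 1:
            bad.append(lo)               # isolated a single forgery
            return
        mid = (lo + hi) // 2
        recurse(lo, mid)
        recurse(mid, hi)
    recurse(0, len(batch))
    return bad
```

The synergy the paper exploits is that a real `verify_batch` for elliptic-curve signatures can itself be made cheaper than verifying its items one by one, so the choice of sub-batches and the way each sub-batch is checked should be optimized together.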