IACR Transactions on Cryptographic Hardware and Embedded Systems
ISSN 2569-2925, Vol. 2021, No. 2, pp. 254–274. DOI:10.46586/tches.v2021.i2.254-274
Time-Memory Analysis of Parallel Collision
Search Algorithms
Monika Trimoska, Sorina Ionica and Gilles Dequen
Laboratoire MIS, Université de Picardie Jules Verne, Amiens, France
{monika.trimoska,sorina.ionica,gilles.dequen}@u-picardie.fr
Abstract.
Parallel versions of collision search algorithms require a significant amount
of memory to store a proportion of the points computed by the pseudo-random walks.
Implementations available in the literature use a hash table to store these points
and allow fast memory access. We provide theoretical evidence that memory is an
important factor in determining the runtime of this method. We propose to replace
the traditional hash table by a simple structure, inspired by radix trees, which saves
space and provides fast look-up and insertion. In the case of many-collision search
algorithms, our variant has a constant-factor improved runtime. We give benchmarks
that show the linear parallel performance of the attack on elliptic curves discrete
logarithms and improved running times for meet-in-the-middle applications.
Keywords: discrete logarithm · parallelism · collision · elliptic curves · meet-in-the-middle · attack · trade-off · radix tree
1 Introduction
Given a function $f: S \to S$ on a finite set $S$, we call collision any pair $a, b$ of elements in $S$ such that $f(a) = f(b)$. Collision search has a broad range of applications in the cryptanalysis of both symmetric and asymmetric ciphers: computing discrete logarithms, finding collisions on hash functions and meet-in-the-middle attacks. Pollard's rho method [Pol78], initially proposed for solving factoring and discrete logs, can be adapted to find collisions for any random mapping $f$. The parallel collision search algorithm, proposed by van Oorschot and Wiener [vW99], builds on Pollard's rho method, and is expected to have a linear speedup compared to its sequential version. This algorithm computes several walks in parallel and stores some of the computed points, called distinguished points.
In this paper, we revisit the memory complexity of the parallel collision search algorithm, both for applications that need a small number of collisions (e.g. discrete logarithms) and those needing a large number of collisions, such as meet-in-the-middle attacks. In the case of discrete logarithms, collision search methods are the fastest known attacks in a generic group. In elliptic curve cryptography, subexponential attacks are known for solving the discrete log on curves defined over extension fields, but only generic attacks are known to work in the prime field case. Evaluating the performance of collision search algorithms is thus essential for understanding the security of curve-based cryptosystems. Several record-breaking implementations of this algorithm are available in the literature. Over a prime field, we note the computation of a discrete log in a 112-bit group on a curve of the form $y^2 = x^3 - 3x + b$ [BKK+12, BKM09]. This computation was performed on a PlayStation 3. More recently, Bernstein, Lange and Schwabe [BLS11] reported on an
This work was partially funded by the European Union under the 2014/2020 European Regional
Development Fund (FEDER).
Licensed under Creative Commons License CC-BY 4.0.
Received: 2020-10-15 Accepted: 2020-12-15 Published: 2021-02-23
implementation on the same platform and for the same curve, in which the use of the negation map gives a speed-up by a factor $\sqrt{2}$. Over binary fields, the current record is an FPGA implementation breaking a discrete logarithm in a 117-bit group [BEL+]. As for the meet-in-the-middle attack, this generic technique is widely used in cryptanalysis to break block ciphers (double and triple DES, GOST [Iso11]), lattice-based cryptosystems (NTRU [HGSW03, vV16]) and isogeny-based cryptosystems [ACC+18].
Two models of computation can be considered for this algorithm. The first one follows the shared memory paradigm, in which each thread computes distinguished points and stores them in the common memory. The second one is a message-passing model, where the threads computing points, called the clients, send the distinguished points to a separate process, running on a different machine called the server, which handles the memory and checks for collisions.
Firstly, we extend the analysis of the parallel collision search algorithm and present a formula for the expected runtime to find any given number of collisions, with and without a memory constraint. We show how to compute optimal values of $\theta$, the proportion of distinguished points, allowing to minimize the running time of collision search, both in the case of discrete logarithms and many-collision attacks. In the case where the available memory is limited, we determine the optimal value of $\theta$. Going further in the analysis, our formulae show that the actual running time of many-collision algorithms is critically reduced if the number of words $w$ that can be stored in memory is larger.
Secondly, we focus on the data structure used for the algorithm. To the best of our
knowledge, all existing implementations of parallel collision search algorithms use hash
tables to organize memory and allow fast lookup operations. In this paper, we introduce a
new structure, called Packed Radix-Tree-List (PRTL), which is inspired by radix trees.
We show that the use of this structure leads to better use of memory in implementations
and thus yields improved running times for many-collision applications.
Using the PRTL structure, we have implemented the parallel collision search algorithm
for discrete logarithms on elliptic curves defined over prime fields and experimented using a
Shared-Memory Parallelism (SMP) system. Our benchmarks demonstrate the performance
and scalability of this method. While in the case of a single discrete log, the PRTL
variant implementation yields running times similar to those of a hash table approach,
our experiments demonstrate that the new data structure gives faster limited-memory
multi-collision attacks.
Organisation. Section 2 reviews algorithms for solving the discrete logarithm problem and for meet-in-the-middle attacks. In Section 3, we revisit the proof for the time complexity of the collision finding algorithm for a small and a large number of collisions. Section 4 describes our choice for the data structure, complexity estimates and a comparison with hash tables. Finally, Section 5 presents our experimental results.
2 Parallel collision search
In this section, we briefly review Pollard's rho method and the parallel algorithm for searching collisions. Let $S$ be a finite set of cardinality $n$. In order to look for collisions for a function $f: S \to S$ with Pollard's rho method, the idea is to compute a sequence of elements $x_i = f(x_{i-1})$ starting at some random element $x_0$. Since $S$ is finite, eventually this sequence begins to cycle and we therefore obtain the desired collision $f(x_k) = f(x_{k+t})$, where $x_k$ is the point in the sequence before the cycle begins and $x_{k+t}$ is the last point on the cycle before getting to $x_{k+1}$ (hence $f(x_k) = f(x_{k+t}) = x_{k+1}$). One may show that the expected number of steps taken until the collision is found is $\sqrt{\pi n/2}$, and therefore that the memory complexity is also $O(\sqrt{\pi n/2})$. This algorithm can be further optimized to constant memory complexity by using Floyd's cycle-finding algorithm [Jou09, Bre80]. We do not further detail memory optimizations here, since they are inherently of a sequential nature and there is currently no known way to exploit these ideas in a parallel algorithm.
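For completeness, the constant-memory idea just mentioned can be sketched on a toy random map (our illustration, not the paper's implementation; the map $f$ is simply a random array):

```python
import random

random.seed(5)
n = 1 << 16
f = [random.randrange(n) for _ in range(n)]   # a toy random map on S = {0, ..., n-1}

def rho_collision(x0):
    # Phase 1 (Floyd): tortoise moves 1 step, hare 2 steps, until they meet
    tortoise, hare = f[x0], f[f[x0]]
    while tortoise != hare:
        tortoise, hare = f[tortoise], f[f[hare]]
    # Phase 2: restart the tortoise from x0; the two walks collide one step
    # before the cycle entry, giving distinct a, b with f[a] == f[b]
    tortoise = x0
    if tortoise == hare:
        return None        # x0 lies on the cycle itself: no tail, no collision
    while f[tortoise] != f[hare]:
        tortoise, hare = f[tortoise], f[hare]
    return (tortoise, hare) if tortoise != hare else None

pair, x0 = None, 0
while pair is None:        # retry from successive starting points if needed
    pair = rho_collision(x0)
    x0 += 1
a, b = pair
assert a != b and f[a] == f[b]
```

Only two pointers are kept at any time, which is the constant-memory property that, as noted above, does not parallelize.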
The parallel algorithm for collision search proposed by van Oorschot and Wiener [vW99] assigns to each thread the computation of a trail given by points $x_i = f(x_{i-1})$ starting at some point $x_0$. Only points that belong to a certain subset, called the set of distinguished points, are stored. This set is defined by points having an easily testable property. Whenever a thread computes a distinguished point $x_d$, it stores it in a common list of tuples $(x_0, x_d)$. If two walks collide, this is identified when they both reach a common distinguished point. We may then re-compute the paths; the points preceding the common point are distinct points that map to the same value.
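The trail mechanics can be sketched as follows (a toy sequential illustration of the data flow, not the authors' code; the distinguished-point property is taken to be four trailing zero bits):

```python
import random

random.seed(4)
n = 1 << 16
f = [random.randrange(n) for _ in range(n)]
DMASK = (1 << 4) - 1                 # distinguished: 4 trailing zero bits (theta = 1/16)

def trail(x0, cap=1 << 12):
    # walk from x0 until the first distinguished point, returning the whole path
    xs = [x0]
    while xs[-1] & DMASK:
        if len(xs) > cap:
            return None              # abandon trails trapped in a cycle
        xs.append(f[xs[-1]])
    return xs

store = {}                           # distinguished endpoint -> trail start
a = b = None
while a is None:
    x0 = random.randrange(n)
    xs = trail(x0)
    if xs is None:
        continue
    xd = xs[-1]
    if xd in store and store[xd] != x0:
        # two different starts reached the same distinguished point:
        # re-compute both trails and strip their common tail
        t1, t2 = trail(store[xd]), xs
        while t1 and t2 and t1[-1] == t2[-1]:
            t1.pop(); t2.pop()
        if t1 and t2:
            a, b = t1[-1], t2[-1]    # distinct points mapping to the same value
    store[xd] = x0
assert a != b and f[a] == f[b]
```

In the actual algorithm the `store` lookup is performed concurrently by all threads, which is why the choice of data structure matters (Section 4).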
Solving discrete logarithms. In this subsection, $S$ denotes a cyclic group of order $n$. We focus on the elliptic curve discrete logarithm problem (ECDLP) in a cyclic group $S = \langle P \rangle$, but the methods described in this paper apply to any finite cyclic group. We will assume that the curve $E$ is defined over a finite field $\mathbb{F}_p$, where $p$ is a prime number. Let $Q \in S$ and say we want to solve the discrete logarithm problem $Q = xP$, where $x \in \mathbb{Z}$. To apply the ideas explained above, we define a map $f: S \to S$ which behaves randomly and such that each time we compute $R_{i+1} = f(R_i)$ we can easily keep track of integers $a_i$ and $b_i$ such that $f(R_i) = a_iP + b_iQ$. Pollard's initial proposal for such a function was

$$f(R) = \begin{cases} R + P & \text{if } R \in S_1 \\ 2R & \text{if } R \in S_2 \\ R + Q & \text{if } R \in S_3, \end{cases} \qquad (1)$$

where the sets $S_i$, $i \in \{1, 2, 3\}$, are pairwise disjoint and give a partition of the group $S$.
As a consequence, whenever a collision $f(R_j) = f(R_k)$ occurs, we obtain an equality

$$a_jP + b_jQ = a_kP + b_kQ. \qquad (2)$$

This allows us to recover $x = (a_j - a_k)/(b_k - b_j) \pmod{n}$, provided that $b_k - b_j$ is not a multiple of $n$. Starting from $R_0$, a multiple of $P$, Pollard's rho method [Pol78] computes a sequence of points $R_i$ where $R_{i+1} = f(R_i)$. Since the group $S$ is finite, this sequence will produce a collision after $\sqrt{\pi n/2}$ iterations on average. In the parallel version, each thread computes a walk, and only distinguished points on this walk are stored in memory. These points are usually defined by an easily testable property, such as a certain number of trailing bits of their $x$-coordinate being zero. Whenever a thread computes such a point, it is stored in a common list, together with the corresponding $a$ and $b$. When two walks collide, this cannot be identified until the common distinguished point is computed. Then the discrete logarithm can be recovered from an equation of type (2).
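To make the $(a_i, b_i)$ bookkeeping concrete, here is a toy sketch (our illustration, not the paper's implementation). Since the method applies to any finite cyclic group, we use a small multiplicative subgroup of prime order instead of an elliptic curve; the walk mirrors Equation (1) with the partition decided by $R \bmod 3$:

```python
import random

p, n = 107, 53        # p = 2n + 1; the subgroup of squares has prime order n
g = 4                 # 4 = 2^2 generates the order-53 subgroup of Z_107^*
x_secret = 17
Q = pow(g, x_secret, p)   # public value; we pretend x_secret is unknown

def step(R, a, b):
    # analogue of Equation (1): partition by R mod 3, keeping track of
    # (a, b) such that R = g^a * Q^b
    if R % 3 == 0:
        return (R * g) % p, (a + 1) % n, b          # R <- R * g
    elif R % 3 == 1:
        return (R * R) % p, (2 * a) % n, (2 * b) % n  # R <- R^2
    else:
        return (R * Q) % p, a, (b + 1) % n          # R <- R * Q

random.seed(1)
found = None
while found is None:
    a, b = random.randrange(n), random.randrange(n)
    R = (pow(g, a, p) * pow(Q, b, p)) % p
    seen = {}
    for _ in range(4 * n):
        if R in seen:                    # collision: g^a2 Q^b2 = g^a Q^b
            a2, b2 = seen[R]
            if (b - b2) % n != 0:        # otherwise retry from a new start
                found = ((a2 - a) * pow(b - b2, -1, n)) % n
            break
        seen[R] = (a, b)
        R, a, b = step(R, a, b)

assert pow(g, found, p) == Q
```

The final line checks that the recovered exponent solves $Q = g^x$; since the group has prime order, the solution is unique modulo $n$.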
Many-collision applications. A first type of application of the van Oorschot and Wiener algorithm computing many collisions is the multi-user setting of both public and secret key schemes. In such a setting, it has been demonstrated that it is more efficient to recover individual keys one by one, by using a growing common database of distinguished points, instead of running the algorithm for each key separately (see [KS01, FJM14]). A second type of application concerns meet-in-the-middle attacks, which require finding a collision of the type $f_1(a) = f_2(b)$, where $f_1: D_1 \to R$ and $f_2: D_2 \to R$ are two functions with the same co-domain. As explained in [vW99], solving this equation may be formulated as a collision search problem on a single function $f: S \times \{1, 2\} \to S \times \{1, 2\}$, where the solution we need is of the type

$$f(a, 1) = f(b, 2), \qquad (3)$$

and $S$ is a set bijective to $D_1$. This collision is called the golden collision. The number of unordered pairs in $S$ is approximately $n^2/2$ and the probability that the two points in a pair map to the same value of $f$ is $1/n$. There are $n/2$ expected collisions for $f$ and there may be several solutions to Equation (3). If one assumes that all collisions are equally likely to occur, then in the worst case all possible $n/2$ collisions for $f$ are generated before finding the golden one. Because so many collisions are generated, memory complexity can be the bottleneck in meet-in-the-middle attacks and the memory constraint becomes an important factor in determining the running time of the algorithm. We further explain this idea in Section 3.
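The count of roughly $n/2$ collisions for a random map is easy to check empirically (a quick sanity check of the estimate above, not an experiment from the paper):

```python
import random
from collections import Counter

random.seed(2)
n = 1 << 16
f = [random.randrange(n) for _ in range(n)]   # random map on n elements

# There are ~ n^2/2 unordered pairs, each colliding with probability 1/n,
# hence about n/2 colliding pairs in total.
images = Counter(f)
pairs = sum(k * (k - 1) // 2 for k in images.values())
ratio = pairs / (n / 2)     # should be close to 1
```

For $n = 2^{16}$ the standard deviation of the pair count is about $\sqrt{n/2} \approx 181$, so `ratio` lands well within a few percent of 1.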
Computational model and data structure.
We consider a CPU implementation of the
shared memory variant of the algorithm, where each thread involved in the process performs
the same task of finding and storing distinguished points. In this case, the choice of a
data structure for allowing efficient lookup and insertion is significant. The most common
structure used in the literature is a hash table. In order to make parallel access to memory
possible, van Oorschot and Wiener [
vW99
] propose the use of the most significant bits of
distinguished points. Their idea is to divide the memory into segments, each corresponding
to a pattern on the first few bits. Threads read off these first bits and are directed towards
the right segment. Each segment is organized as a memory structure on its own.
In recent years, with the development of GPUs and programmable circuits, the client-
server model has been widely used for implementing parallel collision search. In this
setting, a large number of client chips are communicating with a central memory server
over the Internet. For computing discrete logarithms, [
BBB+09
] gives a comparison
between implementations on different architectures in this model. Current record-breaking
implementations of ECDLP also rely on this model [BKK+12,BKM09,BEL+].
Except for the need for a structure that allows efficient simultaneous access to memory,
all results in this paper apply to both the client-server and the SMP versions of the PCS
algorithm, even though our experimental results are obtained using a CPU implementation
following the SMP paradigm.
Notation. In the remainder of this paper, we denote by $\theta$ the proportion of distinguished points in a set $S$. We denote by $n$ the number of elements of $S$. We denote by $E$ an elliptic curve defined over a prime finite field $\mathbb{F}_p$ and by $E(\mathbb{F}_p)$ the group of points on $E$ defined over $\mathbb{F}_p$. Whenever the set $S$ is the group $E(\mathbb{F}_p)$, $n$ is the cardinality of this group. For simplicity, in this case, we assume that $n$ is prime (which is the optimal case in implementations).
3 Time complexity
Van Oorschot and Wiener [vW99] gave formulae for the expected running time of parallel collision search algorithms. In this section, we revisit the steps of their proof and show a careful analysis of the running time both for computing a single collision and for multiple-collision applications. Our refined formulae indicate that the actual running time of the algorithm depends on the proportion of distinguished points and allow us to determine the optimal choice of $\theta$ for actual implementations.
3.1 Finding one collision: elliptic curve discrete logarithm
Van Oorschot and Wiener [vW99] proved that the runtime for finding one collision is

$$O\left(\frac{1}{L}\sqrt{\frac{\pi n}{2}}\right),$$

with $L$ the number of threads we use. This is obtained by finding the expected number of computed points before a collision occurs and then intuitively dividing the clock time by $L$ when $L$ processors are involved. The proof of the following theorem, given in Appendix A, provides a more rigorous argument for the linear scalability of the algorithm.
Theorem 1. Let $S$ be a set with $n$ elements and $f: S \to S$ be a random map. In the parallel collision search algorithm, denote by $\theta$ the proportion of distinguished points, and let $t_c$ and $t_s$ denote the time for computing and storing a point respectively.

1. The expected running time to find one collision for $f$ is

$$T(\theta) = \left(\frac{1}{L}\sqrt{\frac{\pi n}{2}} + \frac{1}{\theta}\right)t_c + \left(\frac{\theta}{L}\sqrt{\frac{\pi n}{2}}\right)t_s. \qquad (4)$$

2. The worst case running time is

$$T(\theta) = \left(\frac{1}{L}\sqrt{\left(2 - \frac{\pi}{2}\right)n} + \frac{1}{L}\sqrt{\frac{\pi n}{2}} + \frac{1}{\theta}\right)t_c + \frac{\theta}{L}\left(\sqrt{\left(2 - \frac{\pi}{2}\right)n} + \sqrt{\frac{\pi n}{2}}\right)t_s. \qquad (5)$$
Remark 1. In the client-server model, clients do not have access to memory, but they send distinguished points to the server and thus $t_s$ stands for the cost of communication on the client side. We suppose that all the client processors are dedicated to computing points. On the server side, however, the analysis is different. Theorem 1 and the means of finding the optimal value for $\theta$ apply both to the shared memory implementation adopted in this paper, and to the more common distributed client-server model.
As we can see in Equations (4) and (5), the proportion of distinguished points we choose will influence the time complexity. The optimal value for $\theta$ is the one that gives the minimal runtime complexity. Most importantly, our analysis puts forward the idea that the optimal choice for $\theta$ depends essentially on the choices made for the implementation and memory management. From this formula, we easily deduce that if the proportion of distinguished points is too small or too large, the running time of the algorithm increases significantly.

By estimating the ratio $t_s/t_c$ for a given implementation, one can extrapolate the optimal value of $\theta$ by computing the zeros of the derivative of the function in Equation (4):

$$T'(\theta) = \frac{1}{L}\sqrt{\frac{\pi n}{2}}\,t_s - \frac{1}{\theta^2}\,t_c.$$
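Numerically, solving $T'(\theta) = 0$ gives $\theta = \sqrt{L t_c / (t_s \sqrt{\pi n/2})}$, and $-\log_2 \theta$ is the corresponding number of trailing zero bits. A small sketch (the cost ratio $t_s/t_c = 20$ below is an assumed placeholder, not a measured value):

```python
import math

def T(theta, n, L, tc, ts):
    # expected running time, Equation (4)
    r = math.sqrt(math.pi * n / 2)
    return (r / L + 1 / theta) * tc + (theta / L) * r * ts

def optimal_theta(n, L, tc, ts):
    # zero of T'(theta) = (1/L) sqrt(pi n / 2) ts - tc / theta^2
    return math.sqrt(L * tc / (ts * math.sqrt(math.pi * n / 2)))

n, L = 2 ** 65, 28            # 65-bit group, 28 threads, as in our experiments
tc, ts = 1.0, 20.0            # assumed cost ratio t_s / t_c = 20
th = optimal_theta(n, L, tc, ts)
bits = round(-math.log2(th))  # trailing zero bits defining distinguished points
```

Since $T$ is convex in $\theta > 0$, this zero of the derivative is the global minimum; with the assumed costs it lands in the same range of trailing-bit counts as the experimental optimum discussed below.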
Figure 1 gives timings for our implementation of the attack, using a hash table to store distinguished points. Timings shown in the figure are averaged over 100 runs on a 65-bit curve and support our theoretical findings.
Note that most recent implementations available in the literature choose the number of trailing bits giving the distinguished point property in a range between $0.178 \log n$ and $0.256 \log n$ (see [BEL+, BLS11, BKM09]). This value was determined by experimenting on curves defined over small size fields. Our theoretical findings confirm that these values were close to optimal, but we suggest that for future record-breaking implementations, the value of $\theta$ should be determined as explained above.
3.2 Finding many collisions
Using a simplified complexity analysis, van Oorschot and Wiener [
vW99
] put forward the
following heuristic.
Figure 1: Timings of solving ECDLP for different values of $\theta$, 65-bit curve, 28 threads. (Plot: runtime in µs against the number of trailing zero bits defining distinguished points; timings range from roughly 50 minutes at the extreme choices down to roughly 20 minutes near the optimum.)
Heuristic ([vW99]). Let $f: S \to S$ be a random map and assume that the memory can hold $w$ distinguished points. Then in the meet-in-the-middle attack the (conjectured) optimum proportion of distinguished points is $\theta \approx 2.25\sqrt{w/n}$. Under this assumption, the expected number of iterations required to complete the attack using these parameters is $\frac{2.5n}{L}\sqrt{\frac{n}{w}}$.
This heuristic suggests that in the case of many-collision attacks, a memory data structure allowing to store more distinguished points will yield a better time complexity. We give a more refined analysis of the running time of a parallel collision search for finding $m$ collisions.
Theorem 2. Let $S$ be a set with $n$ elements and $f: S \to S$ a random map. We denote by $\theta$ the proportion of distinguished points in $S$. The expected running time to find $m$ collisions for $f$ with a memory constraint of $w$ words is:

$$\frac{1}{L}\left(\frac{w}{\theta} + \left(m - \frac{w^2}{2\theta^2 n}\right)\frac{\theta n}{w} + \frac{2m}{\theta}\right). \qquad (6)$$
Proof. Let $X$ be the expected number of distinguished points calculated per thread before duplication. Let $T_1$ be the expected number of distinguished points computed until the first collision was found, and $T_i$, for any $i > 1$, the expected number of points stored in the memory after the $(i-1)$-th collision was found and before the $i$-th collision is found. As shown in Theorem 1, the expected number of points stored before finding the first collision is $T_1 = \theta\sqrt{\pi n/2}$. The probability of not having found the second collision after each thread has found and stored $T$ distinguished points is

$$P(X > T) = \left(1 - \frac{L + T_1}{\theta n}\right)^{L/\theta} \cdot \left(1 - \frac{2L + T_1}{\theta n}\right)^{L/\theta} \cdots \left(1 - \frac{TL + T_1}{\theta n}\right)^{L/\theta}.$$

As in the proof of Theorem 1, we approximate this expression by

$$P(X > T) = e^{-\frac{T^2L^2 + 2LT_1T}{2\theta^2 n}}.$$

Hence the expected number of distinguished points computed by one thread before the second collision is:

$$E(X) = \sum_{T=0}^{\infty} e^{-\frac{T^2L^2 + 2LT_1T}{2\theta^2 n}} \approx \int_0^{\infty} e^{-\frac{x^2L^2 + 2xLT_1}{2\theta^2 n}}\,dx = e^{\frac{T_1^2}{2\theta^2 n}}\int_0^{\infty} e^{-\frac{(xL + T_1)^2}{2\theta^2 n}}\,dx = \frac{\theta\sqrt{2n}}{L}\,e^{\frac{T_1^2}{2\theta^2 n}}\int_{\frac{T_1}{\theta\sqrt{2n}}}^{\infty} e^{-t^2}\,dt$$

$$= \frac{\theta\sqrt{2n}}{L}\,e^{\frac{T_1^2}{2\theta^2 n}}\left(\frac{\theta\sqrt{2n}\,e^{-\frac{T_1^2}{2\theta^2 n}}}{2T_1} - \int_{\frac{T_1}{\theta\sqrt{2n}}}^{\infty} \frac{e^{-t^2}}{2t^2}\,dt\right),$$

where the last equality is obtained by integration by parts. We denote by

$$U_k = T_1 + T_2 + \ldots + T_k.$$

By applying repeatedly the formula above (and neglecting the last integral), we have that $T_k = \frac{\theta^2 n}{L U_{k-1}}$. Therefore we have $U_k = U_{k-1} + \frac{\theta^2 n}{L U_{k-1}}$. By letting $V_k = \frac{L U_k}{\theta\sqrt{n}}$, we obtain a sequence given by the recurrence formula

$$V_k = V_{k-1} + \frac{1}{V_{k-1}}.$$
We will use the Cesàro–Stolz criterion to prove the convergence of this limit. First, we note that this sequence is increasing and tends to $\infty$. Moreover, we have that $V_k^2 = V_{k-1}^2 + 2 + \frac{1}{V_{k-1}^2}$. Hence $V_k^2 - V_{k-1}^2 \to 2$ as $k \to \infty$ and as per Cesàro–Stolz we have $V_k \sim \sqrt{2k}$. We conclude that

$$U_k \approx \frac{\theta\sqrt{2kn}}{L}. \qquad (7)$$
Since $U_k$ is the number of distinguished points computed per thread, the total number of stored points is $\theta\sqrt{2kn}$. Hence the memory will fill when $\theta\sqrt{2kn} = w$. This will occur after computing the first $k_w = \frac{w^2}{2\theta^2 n}$ collisions, and the expected total time for one thread is $\frac{w}{L\theta}$. When the memory is full, the time to find a collision is $\frac{\theta n}{w}$ (see [vW99] for a detailed explanation). Finally, to actually locate each collision, we need to restart the two colliding trails from their start, which requires $\frac{2}{\theta}$ steps on average.

To sum up, the total time to find $m$ collisions is:

$$\frac{1}{L}\left(\frac{w}{\theta} + \left(m - \frac{w^2}{2\theta^2 n}\right)\frac{\theta n}{w} + \frac{2m}{\theta}\right).$$
Remark 2. According to the formula obtained in Equation (7), we see that if the memory is not filled when running the algorithm for finding $n/2$ collisions, as in meet-in-the-middle applications, then we store $\theta n$ distinguished points, i.e. all distinguished points in $S$.
Note that the proof of Theorem 2 relies strongly on our formula for the expected total number of computed distinguished points for finding $m$ collisions, when $m$ is sufficiently large and the memory is not limited:

$$S_m \approx \theta\sqrt{2mn}. \qquad (8)$$

We confirmed this asymptotic formula experimentally by running a multi-collision algorithm for a curve over a 55-bit prime field. The comparison of our formula with the experimental
Table 1: Comparing the asymptotic value of $S_m$ to an experimental average.

Collisions   Experimental Avg.   $S_m$
100          238289              231704
500          530493              518107
1000         750572              732714
2000         1062581             1036215
5000         1681831             1638399
7000         1990671             1938581
results is in Table 1. Each value in this table is an average of 100 runs where we set $\theta = 1/2^{13}$. Furthermore, our formula coincides with the estimated workload for computing $k$ discrete logarithms in [KS01], which is obtained using a different analysis valid when $k < n^{1/4}$.
By minimizing the time complexity function obtained in Theorem 2, we obtain an estimate for the optimal value of $\theta$ to take, in order to compute $n/2$ collisions.
Corollary 1. The optimum proportion of distinguished points minimizing the time complexity bound in Theorem 2 is $\theta = \frac{\sqrt{w^2 + 2nw}}{n}$. Furthermore, by choosing this value for $\theta$, the running time of the parallel collision search algorithm for finding $n/2$ collisions is bounded by:

$$O\left(\frac{n}{L}\sqrt{1 + \frac{2n}{w}}\right). \qquad (9)$$

Proof. From Theorem 2, the runtime complexity is given by:

$$T(\theta) = \frac{1}{L}\left(\frac{w}{\theta} + \left(\frac{n}{2} - \frac{w^2}{2\theta^2 n}\right)\frac{\theta n}{w} + \frac{n}{\theta}\right).$$

By computing the zeros of the derivative:

$$T'(\theta) = \frac{n^2\theta^2 - w^2 - 2nw}{2Lw\theta^2},$$

we obtain that by taking $\theta = \frac{\sqrt{w^2 + 2nw}}{n}$, the time complexity is $O\left(\frac{n}{L}\sqrt{1 + \frac{2n}{w}}\right)$.
This eliminates some of the heuristics in [vW99] and confirms the asymptotic runtime. Most importantly, Corollary 1 suggests that in the case of applications that fill the available memory, the number of distinguished points we can store is an important factor in the running time complexity. More storage space yields a faster algorithm by a constant factor. We propose such an optimization in Section 4.
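The statement of Corollary 1 is easy to check numerically against the cost function of Theorem 2 (a sketch with illustrative parameter values, not measurements):

```python
import math

def T(theta, m, n, w, L):
    # expected time to find m collisions with a memory cap of w words, Equation (6)
    return (w / theta + (m - w * w / (2 * theta ** 2 * n)) * theta * n / w
            + 2 * m / theta) / L

n, L = 2 ** 40, 28
m = n // 2                                   # meet-in-the-middle regime

for w in (2 ** 16, 2 ** 24):
    th = math.sqrt(w * w + 2 * n * w) / n    # optimal theta from Corollary 1
    # theta* should minimize T for this memory size
    assert T(th, m, n, w, L) <= T(1.1 * th, m, n, w, L)
    assert T(th, m, n, w, L) <= T(0.9 * th, m, n, w, L)

# more memory gives a constant-factor speedup, cf. the bound (9)
t_small = T(math.sqrt(2 ** 32 + 2 * n * 2 ** 16) / n, m, n, 2 ** 16, L)
t_big = T(math.sqrt(2 ** 48 + 2 * n * 2 ** 24) / n, m, n, 2 ** 24, L)
```

With these parameters the bound (9) predicts a speedup of roughly $\sqrt{2^{24}/2^{16}} = 16$ when the memory grows from $2^{16}$ to $2^{24}$ words, which `t_small / t_big` reflects.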
4 Our approach for the data structure
In this section, we evaluate the memory complexity of parallel collision search algorithms. As explained in Section 2, van Oorschot and Wiener [vW99] proposed to divide the memory into segments to allow simultaneous access by threads. We revisit this construction, with the goal of minimizing memory consumption as well. Since in Section 3 we showed that the time complexity of collision search depends strongly on the available amount of memory, we propose an alternative structure called a Packed Radix-Tree-List, referred to as PRTL in this paper. We explain how to choose the densest implementation of this structure for collision search data storage in Section 5.

Since the PRTL is inspired by radix trees, we first describe the classic radix tree structure and then give a complexity analysis explaining why its straightforward implementation is not memory efficient. The PRTL structure keeps the memory gain of radix tree common prefixes but avoids the memory loss of manipulating pointers.
4.1 Radix tree structure
Each distinguished point from the collision search is represented as a number in a base of our choice, denoted by $b$. For example, in the case of attacks on discrete logs on an elliptic curve, we may represent a point by its $x$-coordinate. The first digit of this number in base $b$ gives the root node in the tree, the next digit is a child and so on. This leads to the construction of an acyclic graph which consists of $b$ connected components (i.e. a forest).

In regard to memory consumption, we take advantage of common prefixes to have a more compact structure. Let $c$ be the length of numbers written in base $b$ that we store in the tree and $K$ the number of distinguished points computed by our algorithm. To estimate the memory complexity of this approach, we give upper and lower bounds for the number of nodes that will be allocated in the radix tree before a collision is found.

Proposition 1. The expected number of nodes in the radix tree verifies the following inequalities:

$$\frac{b}{b-1}K + c - \log_b K - 1 \le N(K) \le \left(c - \log_b K + \frac{b}{b-1}\right)K. \qquad (10)$$

The proof of these inequalities is detailed in Appendix B.
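The bounds of Proposition 1 can be compared with an empirical node count on uniformly random keys (a quick sketch of ours; a trie node is identified with a distinct prefix, and the bounds are evaluated without ceilings):

```python
import math
import random

random.seed(3)
b, c, K = 2, 40, 1024                 # binary digits, length-40 keys, 1024 keys
keys = [random.getrandbits(c) for _ in range(K)]

# the number of radix tree nodes equals the number of distinct prefixes
prefixes = set()
for key in keys:
    for d in range(1, c + 1):
        prefixes.add((d, key >> (c - d)))
N = len(prefixes)

lower = b / (b - 1) * K + c - math.log(K, b) - 1
upper = (c - math.log(K, b) + b / (b - 1)) * K
```

For uniform keys the count lands close to the upper bound, illustrating the observation below that the average case is nearer to the worst case than to the best case.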
Traditionally, nodes in a tree are implemented as arrays of pointers to child nodes. This representation will lead to excessive memory consumption when the data to be stored follows a uniform random distribution, leading to sparsely populated branches and to the average distribution of nodes in the tree being closer to the worst case than to the best case.

The difference between the worst-case value and the best-case value can be approximated as $K(c - \log_b K)$. Depending on the application, this value may be large. Let us consider the case where a single collision is required for solving the ECDLP. By a theorem of Hasse [Sil86], we know that the number of points on the curve is given by $n = p + 1 - t$, with $|t| \le 2\sqrt{p}$. Since we assume that $n$ is prime, we approximate $\log n \approx \log p$. Hence an approximation of this difference is:

$$\theta\sqrt{\frac{\pi n}{2}}\left(\frac{1}{2}\log_b n - \log_b\sqrt{\frac{\pi}{2}}\right),$$

which implies that the tree is sparse. In the case of many-collision algorithms, $c \approx \log_b K$ and this deviation becomes negligible, resulting in a space-reduced data structure. We show how to handle sparse trees efficiently in Section 4.2.
4.2 Packed Radix-Tree-List
Starting from the analysis in Section 4.1, we look to construct a more memory-efficient structure by avoiding the properties of the classic radix tree that make it memory costly for our purposes. Intuitively, we see that the radix tree is dense at the upper levels and sparse at the lower ones. Hence it is more efficient to construct a radix tree up to a certain level and then add the points to linked lists, each list starting from a leaf of the tree. We denote by $l$ the level up to which we build the radix tree. We call this a Packed Radix-Tree-List¹. Figure 2 illustrates an example of an abstract Radix-Tree-List in base 4.

This idea was considered by Knuth [Knu98, Chapter 6.3] for improving on a table structure called a trie, introduced by Fredkin [Fre60]. Knuth considers a forest of radix trees that stop branching at a certain level, whose choice is a trade-off between space and fast

¹The 'packed' property is addressed in Section 5, where we give implementation details.
Figure 2: Radix-Tree-List structure with $b = 4$ and $l = 2$.
Table 2: Verifying experimentally the optimal level.

                   Average nb. of empty lists per run
K         l        Level l     Level l+1
5·10^6    18       0           37
7·10^6    18       0           0.84
10^7      19       0           75
access. Indeed, the more we branch, the faster the lookup is, but the more memory we
require. He suggests that the mixed strategy yields a faster lookup when we build a tree
up to a level where only a few keys are possible. Starting from this level a sequential
search through a list of the remaining keys is fast.
In our use case, we favor memory optimization over fast lookup, thus we use a different technique to decide on the tree level. First, we look to estimate up to which level the tree is complete for our use case. The number of leaves in a complete radix tree of depth $l$ is $b^l$. As per the coupon collector's problem, all the linked lists associated with a leaf will contain at least one point when the following inequality is verified:

$$K \ge b^l(\ln b^l + 0.577). \qquad (11)$$

We consider the highest value of $l$ which satisfies this inequality to be the optimal level, as it allows us to obtain the shortest linked lists while having a 100% rate of use of the memory structure. We verified this experimentally by inserting a given number of randomly obtained points of length 65, with $b = 2$, in the PRTL structure. The results are in Table 2. We performed 100 runs for each value of $K$ and counted the number of empty lists at the end of each run. None of the 300 runs finished with an empty list in the PRTL structure, which supports the claim that the obtained $l$ is small enough to have at least one point per list. Then, to confirm that $l$ is the highest possible value that achieves this, we reproduced the experiments by taking $l + 1$, which is the lowest value that does not satisfy Equation (11). The results show that $l + 1$ is not small enough to produce a 100% rate of use of the memory, therefore $l$ is in fact the optimal level to choose.
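The level selection of inequality (11) can be computed directly; this short sketch reproduces the $l$ column of Table 2:

```python
import math

def optimal_level(K, b=2):
    # largest l with K >= b^l (ln(b^l) + 0.577): the coupon collector
    # condition (11) ensuring every leaf list receives at least one point
    l = 0
    while K >= b ** (l + 1) * (math.log(b ** (l + 1)) + 0.577):
        l += 1
    return l

# matches the l column of Table 2
assert optimal_level(5 * 10 ** 6) == 18
assert optimal_level(7 * 10 ** 6) == 18
assert optimal_level(10 ** 7) == 19
```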
The attribution of a point to a leaf is determined by its prefix and we know in advance that all the leaves will be allocated. Therefore, in practice we do not actually have to construct the whole tree, but only the leaves. Hence, we allocate an array indexed by prefixes beforehand and then we insert each point in the list for the corresponding prefix. The operation used to map a point to an index is faster than a hash table function. More precisely, we perform a bitwise AND operation between the $x$-coordinate of the point and a precomputed mask to extract the prefix. Furthermore, the lists are sorted. Since we are doing a search-and-add operation, sorting the lists does not take additional time and proves to be more efficient than simply adding at the end of the list. Figure 3 illustrates the implementation of this structure.
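A minimal sketch of this layout (our illustration; the paper's implementation packs entries more tightly): an array of $2^l$ sorted lists indexed by $l$ key bits extracted with a precomputed mask, storing only the remaining bits of each key:

```python
import bisect

class PRTL:
    def __init__(self, l):
        self.l = l
        self.mask = (1 << l) - 1                    # precomputed mask
        self.lists = [[] for _ in range(1 << l)]    # all leaves allocated up front

    def search_and_add(self, key, data):
        # single pass: return the stored data if the key is already present
        # (a potential collision), otherwise insert, keeping the list sorted
        idx = key & self.mask        # bitwise AND extracts the index bits
        rest = key >> self.l         # only the remaining bits are stored
        lst = self.lists[idx]
        i = bisect.bisect_left(lst, (rest,))
        if i < len(lst) and lst[i][0] == rest:
            return lst[i][1]
        lst.insert(i, (rest, data))
        return None

t = PRTL(2)
assert t.search_and_add(0b1101, "start0") is None      # first insertion
assert t.search_and_add(0b1101, "start1") == "start0"  # same key: match found
assert t.search_and_add(0b0101, "start2") is None      # same index, different rest
```

The mask here selects the low-order bits of the key; as Remark 3 below notes, whether the index bits are taken as a prefix or a suffix depends on the distribution of the keys.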
Figure 3: PRTL implementation. Same points stored as in Figure 2.

Remark 3. When implementing the attack for curves defined over sparse primes, we advise taking an $l$-bit suffix instead of an $l$-bit prefix. Prefixes of numbers in sparse prime fields are not uniformly distributed and one might end up only with prefixes starting with the 0-bit, and therefore a half-empty array.
Remark 4. To experiment with this structure, we chose the example of ECDLP. In this case, we store the starting point of the Pollard walk kP and the first distinguished point we find, represented by the coefficient k and the x-coordinate respectively. Consequently, we store a pair (x-coordinate, k). However, the analysis and choices we made for constructing the PRTL are valid for every collision search application which needs to store pairs (key, data) and requires pairs to be efficiently looked up by keys. For the ECDLP, Bailey et al. [BBB+09] propose, for example, to store a 64-bit seed on the server instead of the initial point, which makes the pair (x-coordinate, seed).
Remark 5. In our implementation, we always use b = 2, and thus the parameter b will no longer be specified.
PRTL vs. hash table. We experimented with the ElfHash function, which is used in the UNIX ELF format for object files. It is a very fast hash function, and thus comparable to the mask operation in our implementation. Small differences in efficiency are negligible, since insertion is the less significant part of the algorithm. Indeed, recall that an insertion is performed only once every 1/θ iterations of the random map f.
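For reference, the ElfHash function as commonly published for the System V ELF format is:

```c
/* The classic ElfHash from the UNIX ELF object-file format; this is the
 * variant we mean when comparing against the single AND-mask operation. */
unsigned long elf_hash(const unsigned char *name) {
    unsigned long h = 0, g;
    while (*name) {
        h = (h << 4) + *name++;           /* shift in the next byte        */
        if ((g = h & 0xF0000000UL))
            h ^= g >> 24;                 /* fold the top nibble back in   */
        h &= ~g;                          /* and clear it                  */
    }
    return h;
}
```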
As is the practice with the parallel collision search, we allocate K indexes for the hash table, since we expect to have K stored points. Recall that this guarantees an average search time of O(1), but it does not avoid multi-collisions. Indeed, according to [Jou09, Section 6.3.2], in order to avoid 3-multi-collisions, one should choose a hash table with K^{3/2} buckets. Consequently, we insert points in the linked lists corresponding to their hash keys, as we did with the PRTL. Every element in the list holds a pair (key, data) and a link to the next element. The PRTL is more efficient in this regard, as we only need to store the suffix of the key.
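The element layout described above can be sketched as follows; the struct name is ours, and the key and data are stored out of line, behind pointers:

```c
#include <stdint.h>

/* One hash-table list element as described in the text: three pointers,
 * i.e. 24 bytes on a typical 64-bit machine, before counting the key and
 * data themselves.  The PRTL slot avoids two of these pointers by packing
 * the key-suffix and data inline. */
struct ht_entry {
    void *key;             /* pointer to the stored key  */
    void *data;            /* pointer to the stored data */
    struct ht_entry *next; /* link to the next element   */
};
```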
With this approach, we cannot be sure that 100% of the hash table indexes will have at least one element. We test this by inserting a given number of random points on a 65-bit curve and counting the number of empty lists at the end of each run, as we did to test the rate of use for the PRTL. We try out two different table sizes: the recommended hash table size and, for comparison, a size that matches the number of leaves in the PRTL. All results are an average of 100 runs.

Table 3: Test of the rate of memory use of a hash table structure.

Nb. of points | Avg. nb. of empty lists (size = K) | Avg. nb. of empty lists (size = 2^l)
5·10^6        | 2592960 (51.85%)                   | 98308 (37.50%)
7·10^6        | 3632679 (51.89%)                   | 98304 (37.50%)
10^7          | 5138792 (51.38%)                   | 196615 (37.50%)
Results in Table 3 show that when we choose a smaller table size, we have fewer empty lists, but the hash table is still not 100% full. Due to these results, when implementing a hash table we choose to allocate an array of pointers to slots, instead of allocating an array of actual slots that will not be filled. This is the optimal choice because we only waste 8 bytes for each empty slot, instead of 24 (the size of one slot).

Monika Trimoska, Sorina Ionica and Gilles Dequen 265

Since the results in Table 2 show that the array in the PRTL will be filled completely, when using this structure we allocate an array of slots directly. This makes the PRTL save a constant of 8K bytes compared to a hash table.
To sum up, the PRTL structure is less space-consuming and has a memory rate of use of 1. Note however that, by Equation (11), the average number of elements in a linked list corresponding to a prefix is K/b^l = l·log b + 0.577. This shows that the search time in our structure is negligible, and our benchmarks shown in Section 5 confirm that memory access has no impact on the total running time of the algorithm.

It is clear that when one implements the PRTL, this structure takes the form of a hash table where the hash function is in fact the reduction modulo a specific value calculated using Equation (11). It might seem counter-intuitive that the optimal solution for a hash function is the modulo function. However, collision search algorithms do not require a memory structure that has hash table properties, such as each key being assigned to a unique index. Finally, a well-distributed hash function is useful when we look to avoid multi-collisions. With collision search algorithms, the number of stored elements is so vast that we cannot possibly allocate a hash table of the appropriate size, and thus we are sure to have longer than usual linked lists. Fortunately, this is not a problem, since the insertion time is in this case not significant compared to the 1/θ random walk computations needed before each insertion. For example, 1/θ would be of order 2^32 for a 129-bit curve. On the other hand, as shown in Section 3, the available storage space is a significant factor in the time complexity, which makes the use of this alternative structure more appropriate for collision search.
PRTL vs. the data structure in [CLN+20]. It is worth noting here that the use of the modulo function to compute the index of a distinguished point in the array has already been proposed in the literature for a meet-in-the-middle attack on isogeny-based cryptography [CLN+20]. The data structure in [CLN+20] is an array with w entries, where w is the total number of distinguished points that can be stored in memory. Since this particular attack uses several versions of the function f, the choice of the modulo-w operation to compute the indexes in the array allows to easily detect whether two distinguished points in a collision were obtained using the same function. Moreover, the authors chose not to implement linked lists; hence, once a distinguished point having an index that was used before is found, the old point is thrown away. With respect to memory use, the proposed structure is similar to a hash table: since the size of the array is not computed in terms of the number of expected points, there will be empty slots in it when no distinguished points are computed for the corresponding indexes.
5 Implementation and benchmarks

To support our findings, we implemented the parallel collision search using both PRTLs and hash tables for discrete logarithms on elliptic curves defined over prime fields. Our C implementation relies on the GNU Multiple Precision Arithmetic Library [MM11] for large-number arithmetic, and on the OpenMP (Open Multi-Processing) interface [OPE] for shared-memory multiprocessing programming. Our experiments were performed on a 28-core Intel Xeon E5-2640 processor with 128 GB of RAM, and we experimented using between 1 and 28 threads. In this section, we first explain in detail the implementation of the PRTL structure and then show experimental results.

Packed RTL. An entry in the lists of the PRTL stores one (key, data) pair.
Table 4: Comparing the insertion runtime and memory occupation of a PRTL vs. a hash table.

K      | Memory (PRTL) | Memory (hash table) | Runtime (PRTL) | Runtime (hash table)
5·10^6 | 106 MB        | 324 MB              | 5.05 s         | 5.20 s
7·10^6 | 148 MB        | 454 MB              | 6.74 s         | 7.01 s
10^7   | 213 MB        | 649 MB              | 9.84 s         | 10.2 s

Table 5: Runtime and memory cost for attacking the ECDLP using PRTLs and hash tables.

Field  | Memory (PRTL / hash) | Memory per point (PRTL / hash) | Runtime (PRTL / hash) | Runtime per point (PRTL / hash)
55-bit | 402 KB / 1172 KB     | 19 B / 59 B                    | 35.16 s / 36.42 s     | 1.69 ms / 1.81 ms
60-bit | 618 KB / 1801 KB     | 20 B / 59 B                    | 210.33 s / 212.83 s   | 6.88 ms / 6.91 ms
65-bit | 1856 KB / 5212 KB    | 21 B / 60 B                    | 1292 s / 1291 s       | 14.90 ms / 14.95 ms
In order to have the best packed structure, we look to avoid wasting space on addressing, structure memory alignment and unintended padding. Hence, we propose to store all relevant data in one byte-vector. Our compact slot has the following structure:

struct link {
    uint8_t vector[VECTOR_SIZE]; /* packed key-suffix and data */
    struct link *next;           /* pointer to next slot       */
};

The key-suffix and data are bound in one single vector. In this way, we have at most 7 bits wasted due to alignment.
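A minimal sketch of the bit-level accessors for such a packed vector is given below; it works bit by bit for clarity (the function names are ours, and a real implementation would extract whole words at a time):

```c
#include <stdint.h>

/* Write nbits bits of val into the byte vector, starting at bit position pos. */
static void vec_set_bits(uint8_t *vec, unsigned pos, unsigned nbits, uint64_t val) {
    for (unsigned i = 0; i < nbits; i++) {
        unsigned p = pos + i;
        if ((val >> i) & 1) vec[p / 8] |= (uint8_t)(1u << (p % 8));
        else                vec[p / 8] &= (uint8_t)~(1u << (p % 8));
    }
}

/* Read back nbits bits starting at bit position pos. */
static uint64_t vec_get_bits(const uint8_t *vec, unsigned pos, unsigned nbits) {
    uint64_t val = 0;
    for (unsigned i = 0; i < nbits; i++) {
        unsigned p = pos + i;
        val |= (uint64_t)((vec[p / 8] >> (p % 8)) & 1) << i;
    }
    return val;
}
```

With these two functions, the key-suffix and the data can be laid out back to back at arbitrary bit offsets, so at most the last byte of the vector is partially used.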
We designed functions that allow us to extract and set values in the vector. Our implementation of a PRTL yields a better memory occupation, but most importantly, manipulating this structure does not slow down the overall runtime of the attack. We show experimental results that verify this in Table 4, where we insert a given number of random points on a 65-bit curve, using both a hash table and the PRTL. To have a measurement of the runtime that does not depend on the point computation time, we take θ = 1, meaning every point is a distinguished one. The key length is thus c = 65. All results are an average of 100 runs.

We show similar experiments in Table 5. This time, we performed actual attacks on the discrete log over elliptic curves, instead of inserting random points. Since the number of stored points is now random and can differ between two sets of runs, the runtime per stored point and the memory per stored point are the more relevant results.

The results are an average of 100 runs and they show that by using a PRTL for the storage of distinguished points we optimize the memory complexity by a factor of 3.
Calculating the exact memory occupation. Let f be the size of the field in bits and t the number of trailing bits set to zero in a distinguished point. We keep the notation K for the expected number of stored points and l for the level of the PRTL structure.

To calculate the expected memory occupation of the entire PRTL structure, we first calculate the size of a compact slot. Recall that a compact slot holds one byte-vector and a pointer to the next slot, because we use linked lists. Thus, one compact slot takes ⌈(f − l − t + f)/8⌉ + 8 bytes (f − l − t bits for the key-suffix and f bits for the data). The size of the slot is multiplied by K, as K slots will be allocated. To make the access to the shared memory safe, we use locks on the shared data structure. To minimize the time threads spend on locks, there is a lock for every entry in the array, which makes a total of 2^l locks. This adds 8·2^l bytes to the total memory occupation.
Let us compare the total cost to that of a hash table. As explained in Section 4, the entries in our hash table are linked lists as well. To store a pair (key, data), every element in the linked list holds a pointer to the key, a pointer to the data and a pointer to the next element, which need 24 bytes in total. To this we add ⌈(f − t)/8⌉ bytes for storing the key and ⌈f/8⌉ bytes for storing the data. All this is multiplied by K. Finally, we allocate 8K bytes for the array of pointers to slots and 8K bytes for the locks (one lock for each entry in the array). Hence, the total memory occupation of the hash table is (40 + ⌈(f − t)/8⌉ + ⌈f/8⌉)·K bytes.
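The two totals can be written as a small calculator; the parameter names follow the text, the function names are ours, and the values used in the test are arbitrary illustrations (the Table 6 entries additionally use the 64-bit seed optimization of Remark 4):

```c
#include <stdint.h>

static uint64_t ceil_div(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

/* PRTL: K compact slots of ceil((f-l-t + f)/8) + 8 bytes, plus 8 * 2^l
 * bytes of locks (one lock per array entry). */
static uint64_t prtl_bytes(uint64_t f, uint64_t t, uint64_t l, uint64_t K) {
    return (ceil_div((f - l - t) + f, 8) + 8) * K + ((uint64_t)8 << l);
}

/* Hash table: (40 + ceil((f-t)/8) + ceil(f/8)) * K bytes, where the 40
 * covers three pointers per element plus the pointer array and the locks. */
static uint64_t hash_bytes(uint64_t f, uint64_t t, uint64_t K) {
    return (40 + ceil_div(f - t, 8) + ceil_div(f, 8)) * K;
}
```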
Table 6 shows examples of memory requirements of large ECDLP computations calculated in this way. The first two lines in this table concern the computation in [BEL+] on the elliptic curve target117 over F_{2^127}. For this example, we suppose that a seed of 64 bits is stored instead of the k-coefficient representing the data (see Remark 4). On the first line we give the memory amount needed to store the expected number of distinguished points, while on the second line we consider the actual computation in [BEL+], which finished after collecting 968531433 distinguished points. Note that, in the case of the actual computation, the l parameter was calculated with respect to the estimated value of K, since l always needs to be set beforehand. Similarly, the size of the hash table and the number of locks correspond to the estimated value of K, as the table is allocated beforehand. On the third line, we also give the memory requirements for a discrete log computation on a 160-bit curve, with an estimated number of stored distinguished points.

Table 6: Memory requirements of large ECDLP computations using PRTLs and hash tables.

Field                           | θ      | K         | l  | PRTL     | Hash table
117.35-bit ([BEL+] estimation)  | 1/2^30 | 379821956 | 24 | 9.6 GB   | 23.1 GB
117.35-bit ([BEL+] computation) | 1/2^30 | 968531433 | 24 | 24 GB    | 49.6 GB
160-bit (estimation)            | 1/2^40 | 2^40      | 35 | 43155 GB | 82463 GB
ECDLP implementation details and scalability. Teske [Tes01] showed experimentally that the walk originally proposed by Pollard, described in Equation (1), performs on average slightly worse than a random walk. She proposes alternative mappings that lead to the same performance as expected in the random case: additive walks and mixed walks. In our implementation, we adopted the approach of using additive walks and we chose r = 20 as the number of sets S_i that give a partition of the group. Teske showed experimentally that if r ≥ 20, then additive walks are close to random walks.
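The step function of an additive walk can be sketched as follows. This is a toy model only: the group is modeled as Z_n under addition, whereas in the real attack each precomputed step is a random combination a_i·P + b_i·Q of curve points and the exponents are tracked alongside the walk; all names are ours.

```c
#include <stdint.h>

#define R 20  /* number of sets S_i in the partition, following Teske */

/* One step of an additive walk: the partition index is derived from the
 * current element, and the corresponding precomputed offset is added. */
static uint64_t walk_step(uint64_t x, const uint64_t step[R], uint64_t n) {
    unsigned i = (unsigned)(x % R);  /* which set S_i the element falls in */
    return (x + step[i]) % n;        /* additive update                    */
}
```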
In the theoretical model [vW99], the parallel collision search is considered to have linear scalability, and our time complexity in Theorem 1 confirms this. To assess the parallel performance of our implementation, we experimented with L ∈ {1, 2, 7, 14, 28} threads, solving the discrete log over a 60-bit curve. Table 7 shows the wall clock runtime and the parallel performance of the attack when we double the number of threads. The parallel performance is an indication of how the runtime of a program changes when the number of parallel processing elements increases. It is computed as

(L1·t1) / (L2·t2),

where t_i is the wall clock runtime with L_i threads and L1 < L2. A program is considered to scale linearly if the speedup is equal to the number of threads used, i.e. if the parallel performance is equal to 1 (or very close to 1, in practice). From our results, we conclude that the parallel performance is not as good as expected for a small number of threads, but gets closer to linear as the number of threads grows.

Table 7: Runtime and parallel performance of the attack on ECDLP. Results are based on 100 runs per value of L_i.

L1 | Runtime t1 | L2 | Runtime t2 | Parallel performance
1  | 2459 s     | 2  | 1699 s     | 0.72
7  | 776 s      | 14 | 411 s      | 0.94
14 | 411 s      | 28 | 210 s      | 0.97
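The metric above is a one-liner; the function name is ours, and the test values below reproduce the first and last rows of Table 7:

```c
/* Parallel performance as defined in the text: (L1*t1)/(L2*t2) for wall
 * clock runtimes t1, t2 measured with L1 < L2 threads.  A value of 1 means
 * perfectly linear scaling. */
static double parallel_perf(int L1, double t1, int L2, double t2) {
    return (L1 * t1) / (L2 * t2);
}
```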
Multi-collision search computation. To prove our claims from Section 3 that more storage space yields a faster algorithm, we ran a multi-collision search while limiting the available memory. When the memory is filled, each thread continues to search for collisions without adding new points. As a practical application of this computation, we chose the discrete logarithm in the multi-user setting [KS01, FJM14]. Hence, the data that we store for each distinguished point is the coefficient k, plus an integer representing the user. Results in Table 8 show that the PRTL yields a better runtime compared to a classic hash table due to the more efficient memory use.

Table 8: Runtime for multi-collision search on a 55-bit curve using PRTLs and hash tables. Values for the 1 GB memory limit are an average of 100 runs; values for the 2 GB and 4 GB memory limits are an average of 10 runs.

Collisions | Memory limit | Runtime (PRTL / hash) | Stored points (PRTL / hash)
4000000    | 1 GB         | 34.64 h / 58.80 h     | 46820082 / 12912177
16000000   | 2 GB         | 88.18 h / 137.46 h    | 93640161 / 25824345
50000000   | 4 GB         | 203.24 h / 276.80 h   | 168325978 / 51648716
6 Limitations of our approach and future work

Our discrete logarithm implementation was not conceived with the intent to break records and, as such, it does not include techniques for fast arithmetic operations on the curve. To quantify the impact, we would need to combine the new data structure with existing arithmetic optimization techniques and adapt the implementation to a client-server model, which takes into account the communication overhead. However, note that in current records of a single ECDLP computation [BEL+, BKM09], the memory requirements were quite small, and we do not think that the use of the PRTL would result in new records at the time of this writing. In any case, Table 6 shows that the use of the PRTL brings the amount of RAM memory needed to run the attack for current security levels closer to numbers which could be feasible in practice, especially if distributed memory were to be used.
However, we expect the use of the PRTL to have a visible impact on the running time of meet-in-the-middle applications, since these have significant memory requirements. To quantify this impact, we intend to adapt our analysis and experiments to the case of a golden collision search for meet-in-the-middle attacks [ACC+18, CLN+20, HGSW03]. As shown by van Oorschot and Wiener and verified experimentally in the literature, this attack requires several versions of the function f. Once a certain number of distinguished points have been produced, the function version needs to be changed and the distinguished points in memory discarded. All this suggests that at the implementation level, we might need to adapt the PRTL data structure and the choice of the parameter l giving the size of the array in the PRTL.
Finally, another perspective is to check whether it is possible to use fewer (or no) locks in the data structure, as they represent a non-negligible part of the memory requirements.
Locks are necessary to avoid inconsistencies. However, for real-world implementations, the
time to compute one distinguished point can be long enough for the probability of two
threads writing a distinguished point at the same time to the same slot to be negligibly
small. Our experimental results show that the algorithm can be parallelized linearly, which
suggests that the threads do not spend time waiting on a lock. However, more experiments
are needed to verify whether removing the locks is feasible.
7 Conclusion
We revisited the time complexity of the parallel collision search and explained how to
choose the optimal value for the proportion of distinguished points when implementing this
algorithm. We proposed an alternative memory structure for the parallel collision search
algorithm proposed by van Oorschot and Wiener [vW99]. We showed that this structure
yields a better memory complexity than the hash table variant of the algorithm. Moreover,
using the new memory structure, we obtained a better bound for the time complexity
of the parallel collision search, in the case where a large number of collisions is needed.
The experiments in this paper are limited to the cryptanalysis of the discrete logarithm
problem on elliptic curves. In future work, we will explore the applicability of our findings to other applications of the multi-collision search, such as isogeny-based cryptography and lattice-based cryptography.
Acknowledgements
We are grateful to the anonymous referees for their remarks which helped us improve the
clarity of our writing.
References
[ACC+18]
Gora Adj, Daniel Cervantes-Vázquez, Jesús-Javier Chi-Domínguez, Alfred
Menezes, and Francisco Rodríguez-Henríquez. On the cost of computing
isogenies between supersingular elliptic curves. In Carlos Cid and Michael
J. Jacobson Jr., editors, Selected Areas in Cryptography - SAC 2018 - 25th
International Conference, Calgary, AB, Canada, August 15-17, 2018, Revised
Selected Papers, volume 11349 of Lecture Notes in Computer Science, pages
322–343. Springer, 2018.
[BBB+09]
Daniel V. Bailey, Lejla Batina, Daniel J. Bernstein, Peter Birkner, Joppe W.
Bos, Hsieh-Chung Chen, Chen-Mou Cheng, Gauthier van Damme, Giacomo
de Meulenaer, Luis Julian Dominguez Perez, Junfeng Fan, Tim Güneysu,
Frank Gurkaynak, Thorsten Kleinjung, Tanja Lange, Nele Mentens, Ruben
Niederhagen, Christof Paar, Francesco Regazzoni, Peter Schwabe, Leif Uhsadel,
Anthony Van Herrewege, and Bo-Yin Yang. Breaking ECC2K-130. Cryptology ePrint Archive, Report 2009/541, 2009. https://eprint.iacr.org/2009/541.
[BEL+]
Daniel J. Bernstein, Susanne Engels, Tanja Lange, Ruben Niederhagen,
Christof Paar, Peter Schwabe, and Ralf Zimmermann. Faster elliptic-curve
discrete logarithms on FPGAs. https://eprint.iacr.org/2016/382.
[BKK+12]
Joppe W. Bos, Marcelo E. Kaihara, Thorsten Kleinjung, Arjen K. Lenstra, and
Peter L. Montgomery. Solving a 112-bit prime elliptic curve discrete logarithm
problem on game consoles using sloppy reduction. Int. J. Appl. Cryptogr.,
2(3):212–228, 2012.
[BKM09]
Joppe W. Bos, Marcelo E. Kaihara, and Peter L. Montgomery. Pollard rho on the PlayStation 3. Workshop record of SHARCS'09, http://www.hyperelliptic.org/tanja/SHARCS/record2.pdf, 2009.
[BLS11]
Daniel J. Bernstein, Tanja Lange, and Peter Schwabe. On the Correct Use of
the Negation Map in the Pollard rho Method. In Dario Catalano, Nelly Fazio,
Rosario Gennaro, and Antonio Nicolosi, editors, PKC 2011: 14th International
Conference on Theory and Practice of Public Key Cryptography, volume 6571 of
Lecture Notes in Computer Science, pages 128–146, Taormina, Italy, March 6–9,
2011. Springer, Heidelberg, Germany.
[Bre80]
Richard P. Brent. An improved Monte Carlo factorization algorithm. BIT,
20:176–184, 1980.
[CLN+20]
Craig Costello, Patrick Longa, Michael Naehrig, Joost Renes, and Fernando
Virdia. Improved Classical Cryptanalysis of SIKE in Practice. In Aggelos
Kiayias, Markulf Kohlweiss, Petros Wallden, and Vassilis Zikas, editors, Public-
Key Cryptography - PKC 2020 - 23rd IACR International Conference on
Practice and Theory of Public-Key Cryptography, Edinburgh, UK, May 4-7,
2020, Proceedings, Part II, volume 12111 of Lecture Notes in Computer Science,
pages 505–534. Springer, 2020.
[FJM14]
Pierre-Alain Fouque, Antoine Joux, and Chrysanthi Mavromati. Multi-user
collisions: Applications to discrete logarithm, Even-Mansour and PRINCE.
In Palash Sarkar and Tetsu Iwata, editors, Advances in Cryptology – ASIACRYPT 2014, Part I, volume 8873 of Lecture Notes in Computer Science,
pages 420–438, Kaoshiung, Taiwan, R.O.C., December 7–11, 2014. Springer,
Heidelberg, Germany.
[Fre60]
Edward Fredkin. Trie memory. Commun. ACM, 3(9):490–499, September
1960.
[HGSW03]
Nick Howgrave-Graham, Joseph H. Silverman, and William Whyte. A meet-in-the-middle attack on an NTRU private key. Technical report, NTRU Cryptosystems, 2003.
[Iso11]
Takanori Isobe. A single-key attack on the full GOST block cipher. In Antoine
Joux, editor, Fast Software Encryption – FSE 2011, volume 6733 of Lecture
Notes in Computer Science, pages 290–305, Lyngby, Denmark, February 13–16,
2011. Springer, Heidelberg, Germany.
[Jou09]
Antoine Joux. Algorithmic Cryptanalysis, chapter 7, pages 225–226. Chapman
& Hall/CRC, 2009.
[Knu98]
Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1998.
[KS01]
Fabian Kuhn and René Struik. Random walks revisited: Extensions of Pollard’s
rho algorithm for computing multiple discrete logarithms. In Serge Vaudenay
and Amr M. Youssef, editors, Selected Areas in Cryptography, pages 212–229,
Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.
[MM11]
Torbjörn Granlund and the GMP development team. GNU Multiple Precision Arithmetic Library. https://gmplib.org/, 2011.
[OPE]
Open Multi-Processing Specification for Parallel Programming. https://www.openmp.org/.
[Pol78]
John Pollard. Monte Carlo methods for index computation (mod p). Math. Comp., 32:918–924, 1978.
[Sil86]
Joseph H. Silverman. The Arithmetic of Elliptic Curves, volume 106 of
Graduate Texts in Mathematics. Springer, 1986.
[Tes01]
Edlyn Teske. On random walks for Pollard’s rho method. Math. Comp.,
70(234):809–825, 2001.
[vV16]
Christine van Vredendaal. Reduced memory meet-in-the-middle attack against
the NTRU private key. LMS Journal of Computation and Mathematics,
19(Issue A (Algorithmic Number Theory Symposium XII)):43–57, 2016.
[vW99]
Paul C. van Oorschot and Michael J. Wiener. Parallel collision search with
cryptanalytic applications. Journal of Cryptology, 12(1):1–28, 1999.
A Appendix: Proof of Theorem 1

Proof. 1. We call short path the chain of points computed by a thread between two consecutive distinguished points. The expected number of distinguished points produced after a certain clock time T is θLT. The probability of not having a collision at T = 1, for one thread, is 1 − L/n. Note that any of the L threads can cause a collision. Thus, the probability for all threads of not finding a collision on any point on the short walk is

(1 − L/n)^L

at the moment T = 1. Let X be the number of points calculated per thread before duplication. Hence:

P(X > T) = (1 − L/n)^L · (1 − 2L/n)^L · … · (1 − TL/n)^L.

To do this multiplication, we take a shortcut. When x is close to 0, a coarse first-order Taylor approximation for e^x is

e^x ≈ 1 + x.

Now we can rewrite our expression as:

P(X > T) ≈ (e^{−L/n} · e^{−2L/n} · … · e^{−TL/n})^L = (e^{−(L + 2L + … + TL)/n})^L
         = (e^{−T(T+1)L/(2n)})^L ≈ (e^{−T²L/(2n)})^L = e^{−T²L²/(2n)}.   (12)
This gives us the probability

P(X > T) = e^{−T²L²/(2n)},

and thus the expected number of points calculated per thread before duplication is

E(X) = Σ_{T=1}^{∞} T·P(X = T) = Σ_{T=1}^{∞} T·(P(X > T−1) − P(X > T)) = Σ_{T=0}^{∞} P(X > T).

We approximate

E(X) = Σ_{T=0}^{∞} e^{−T²L²/(2n)} ≈ ∫_0^∞ e^{−x²L²/(2n)} dx = (1/L)·√(πn/2).

Since the expected length of a short walk is 1/θ, the number of distinguished points found before a collision occurs is

(θ/L)·√(πn/2).

However, a collision might occur on any point of the walk, and it will not be detected until the walk reaches a distinguished one. We add 1/θ to the number of calculations for the discovery of the collision. Finally, the expected number of calculated points per thread is:

(1/L)·√(πn/2) + 1/θ.

The two main operations in our algorithm are computing the next point on the random walk and storing a distinguished point. Thus, the time complexity of our algorithm is:

T(θ) = ((1/L)·√(πn/2) + 1/θ)·t_c + ((θ/L)·√(πn/2))·t_s.   (13)
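As an aside, the estimate E(X) ≈ (1/L)·√(πn/2) can be checked numerically. The sketch below is not part of the proof: for L = 1 it simulates walks over a set of size n with a toy pseudo-random function until a value repeats, and the mean number of steps should approach √(πn/2) (≈ 1253 for n = 10^6).

```c
#include <stdint.h>
#include <stdlib.h>

/* Small deterministic PRNG used as a stand-in for the random map f. */
static uint64_t xorshift64(uint64_t *s) {
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
    return *s;
}

/* Mean number of draws in [0, n) before the first repeated value,
 * averaged over `trials` independent walks; seed must be nonzero. */
static double mean_collision_time(uint64_t n, int trials, uint64_t seed) {
    double total = 0;
    for (int t = 0; t < trials; t++) {
        char *seen = calloc(n, 1);   /* marks values already drawn */
        uint64_t steps = 0, x;
        do {
            x = xorshift64(&seed) % n;
            steps++;
        } while (!seen[x] && (seen[x] = 1));
        free(seen);
        total += steps;
    }
    return total / trials;   /* should be close to sqrt(pi*n/2) */
}
```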
2. To compute the worst-case time complexity, we compute the variance of the random variable X as σ²(X) = E(X²) − E(X)². Using a similar approximation as in Equation (12), we obtain that

E(X²) ≈ ∫_0^∞ e^{−xL²/(2n)} dx = 2n/L².

Hence the worst-case runtime is

T(θ) = ((1/L)·√((2 − π/2)·n) + (1/L)·√(πn/2) + 1/θ)·t_c + (θ/L)·(√((2 − π/2)·n) + √(πn/2))·t_s.

Remark 6. Note that the analysis above shows that the number of distinguished points computed by the algorithm is O(θ·√(πn/2)). This was proven by van Oorschot and Wiener in the first place.
B Appendix: Proof of Proposition 1

The lower and upper bounds in Equation (10) are given by the worst-case and best-case scenarios for the number of nodes.

Worst-case scenario. In the worst-case scenario, for each new word added to this structure we create as many nodes as possible. This means that the x-coordinates of the added points have the shortest possible common prefixes, as shown in Figure 4. For the first b points, we use bc nodes. After that, the first distinguished point that we find takes c − 1 nodes, since all possibilities for the first letter in the string were already created. This case is repeated (b − 1)b times, provided that K > b + (b − 1)b.

More generally, let k = ⌊log_b K⌋ − 1. We build the tree by allocating nodes as follows:
Figure 4: Worst-case scenario example with parameters K = 5 and b = 4.
- bc nodes for the first b points;
- (b − 1)b(c − 1) nodes for the next (b − 1)b points;
- (b − 1)b²(c − 2) nodes for the next (b − 1)b² points, etc.;
- (b − 1)b^k(c − k) nodes for (b − 1)b^k points.

For each of the remaining K − (b + Σ_{i=1}^{k} (b − 1)b^i) points we need c − k − 1 nodes. To sum up, the total number of nodes that bounds our worst-case scenario is given by:

N(K) = bc + Σ_{i=1}^{k} (b − 1)b^i·(c − i) + (K − b − b(b − 1)·Σ_{i=0}^{k−1} b^i)·(c − k − 1).

We simplify the sums and approximate by:

N(K) ≈ (b/(b − 1))·b^{k+1} + K·(c − k − 1).

Since k = ⌊log_b K⌋ − 1, we have that

N(K) ≈ (b/(b − 1))·b^{⌊log_b K⌋} + K·(c − ⌊log_b K⌋).   (14)
Best-case scenario. Let K be the number of distinguished points that we need to store and let k = ⌊log_b K⌋. In the best-case scenario, we may assume without loss of generality that each time a new point is added to the structure, the minimal number of nodes is used, i.e. the x-coordinate of the added point has the longest possible common prefix with some other point that was previously stored. For example, for the first point c nodes are allocated, for each of the next (b − 1) points one extra node is allocated, and so on, until all subtrees of depth 1, 2, etc. are filled one by one. Figure 5 gives an example of how 16 points are stored. If K > b^{c−1}, we fill the first tree and start a new one. Let x_i, for i ∈ {0, 1, …, k}, denote the i-th digit of K, from right to left. In full generality, since c > k, we use:

- x_k complete subtrees of depth k and an (x_k + 1)-th incomplete tree of depth k;
- the (x_k + 1)-th tree of depth k has x_{k−1} complete subtrees of depth k − 1 and an (x_{k−1} + 1)-th incomplete tree of depth k − 1, and so on recursively;
- c − k − 1 extra nodes.
Figure 5: Best-case scenario example with parameters K = 16 and b = 4.
Summing up all nodes, we get the following formula:

N(K) = Σ_{i=0}^{k} x_i·Σ_{j=0}^{i} b^j + k + c − k − 1 = (1/(b − 1))·Σ_{i=0}^{k} x_i·(b^{i+1} − 1) + c − 1
     = (b/(b − 1))·K + c − 1 − (1/(b − 1))·Σ_{i=0}^{k} x_i.

We conclude that:

N(K) ≈ (b/(b − 1))·K + c − k − 1.   (15)
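The best-case count admits the closed form above, using Σ x_i·b^i = K; a direct transcription (with hypothetical function names) is:

```c
#include <stdint.h>

/* Sum of the base-b digits of K, i.e. the term sum(x_i) in the formula. */
static uint64_t digit_sum(uint64_t K, uint64_t b) {
    uint64_t s = 0;
    for (; K; K /= b) s += K % b;
    return s;
}

/* Exact best-case node count derived above:
 * N(K) = (b*K - sum(x_i)) / (b - 1) + c - 1. */
static uint64_t best_case_nodes(uint64_t K, uint64_t b, uint64_t c) {
    return (b * K - digit_sum(K, b)) / (b - 1) + c - 1;
}
```

For the parameters of Figure 5 (K = 16, b = 4), the digit sum is 1, so the count is 63/3 + c − 1 = c + 20 nodes.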
... Unlike other attacks in this thesis, the pcs does not use a sat solver. Most of the contributions of this work are presented in [ TID17 ] . ...
Thesis
In this thesis, we explore the use of combinatorial techniques, such as graph-based algorithms and constraint satisfaction, in cryptanalysis. Our main focus is on the elliptic curve discrete logarithm problem. First, we tackle this problem in the case of elliptic curves defined over prime-degree binary extension fields, using the index calculus attack. A crucial step of this attack is solving the point decomposition problem, which consists in finding zeros of Semaev’s summation polynomials and can be reduced to the problem of solving a multivariate Boolean polynomial system. To this end, we encode the point decomposition problem as a logical formula and define it as an instance of the SAT problem. Then, we propose an original XOR-reasoning SAT solver, named WDSat, dedicated to this specific problem. As Semaev’s polynomials are symmetric, we extend the WDSat solver by adding a novel symmetry breaking technique that, in contrast to other symmetry breaking techniques, is not applied to the modelization or the choice of a factor base, but to the solving process. Experimental running times show that our SAT-based solving approach is significantly faster than current algebraic methods based on Gröbner basis computation. In addition, our solver outperforms other state-of-the-art SAT solvers, for this specific problem. Finally, we study the elliptic curve discrete logarithm problem in the general case. More specifically, we propose a new data structure for the Parallel Collision Search attack proposed by van Oorschot and Wiener, which has significant consequences on the memory and time complexity of this algorithm.
Chapter
The security guarantees of most isogeny-based protocols rely on the computational hardness of finding an isogeny between two supersingular isogenous curves defined over a prime field Fq with q a power of a large prime p. In most scenarios, the isogeny is known to be of degree ℓe for some small prime ℓ. We call this problem the Supersingular Fixed-Degree Isogeny Path (SIPFD) problem. It is believed that the most general version of SIPFD is not solvable faster than in exponential time by classical as well as quantum attackers. In a classical setting, a meet-in-the-middle algorithm is the fastest known strategy for solving the SIPFD. However, due to its stringent memory requirements, it quickly becomes infeasible for moderately large SIPFD instances. In a practical setting, one has therefore to resort to time-memory trade-offs to instantiate attacks on the SIPFD. This is particularly true for GPU platforms, which are inherently more memory-constrained than CPU architectures. In such a setting, a van Oorschot-Wiener-based collision finding algorithm offers a better asymptotic scaling. Finding the best algorithmic choice for solving instances under a fixed prime size, memory budget and computational platform remains so far an open problem. To answer this question, we present a precise estimation of the costs of both strategies considering most recent algorithmic improvements. As a second main contribution, we substantiate our estimations via optimized software implementations of both algorithms. In this context, we provide the first optimized GPU implementation of the van Oorschot-Wiener approach for solving the SIPFD. Based on practical measurements we extrapolate the running times for solving different-sized instances. Finally, we give estimates of the costs of computing a degree-288 isogeny using our CUDA software library running on an NVIDIA A100 GPU server.
Conference Paper
Full-text available
This paper describes a high-performance PlayStation 3 (PS3) implementation of the Pollard rho discrete logarithm algorithm on elliptic curves over prime fields. A record has been set using this implementation by solving an elliptic curve discrete logarithm problem (ECDLP) with domain parameters from a currently standardized elliptic curve over a 112-bit prime field. Solving this 112-bit ECDLP instance required 62.6 PS3 years. Arithmetic algorithms have been designed for the PS3 to exploit the SIMD architecture and the rich instruction set of its computational units. Though our implementation is targeted at a specific 112-bit modulus, most of our implementation strategies apply to other large moduli as well.
Article
Full-text available
We describe a cell processor implementation of Pollard's rho method to solve discrete logarithms in groups of elliptic curves over prime fields. The implementation was used on a cluster of PlayStation 3 game consoles to set a new record. We present in detail the underlying single instruction multiple data modular arithmetic.
Article
Full-text available
We consider Pollard’s rho method for discrete logarithm computation. Usually, in the analysis of its running time the assumption is made that a random walk in the underlying group is simulated. We show that this assumption does not hold for the walk originally suggested by Pollard: its performance is worse than in the random case. We study alternative walks that can be efficiently applied to compute discrete logarithms. We introduce a class of walks that lead to the same performance as expected in the random case. We show that this holds for arbitrarily large prime group orders, thus making Pollard’s rho method for prime group orders about 20% faster than before.
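The class of adding walks discussed in this abstract can be illustrated on a toy group. Below is a minimal Python sketch of Pollard's rho with an r-adding walk and Floyd's cycle finding, solving a discrete logarithm in a subgroup of Z_p^* of prime order q. The function name, parameters, and the retry-on-degenerate-collision strategy are ours, for illustration only; this is not the paper's implementation.

```python
import random

def rho_dlog(g, h, p, q, r=20, max_tries=20):
    """Toy Pollard rho: find x with g^x = h (mod p), where g has prime order q."""
    for seed in range(max_tries):
        rng = random.Random(seed)
        # An r-adding walk uses r fixed random multipliers g^a_i * h^b_i.
        mults = []
        for _ in range(r):
            a, b = rng.randrange(q), rng.randrange(q)
            mults.append((pow(g, a, p) * pow(h, b, p) % p, a, b))

        def step(y, a, b):
            m, da, db = mults[y % r]          # partition the group by y mod r
            return y * m % p, (a + da) % q, (b + db) % q

        # Floyd's cycle finding: the tortoise moves one step, the hare two.
        y1, a1, b1 = step(g, 1, 0)
        y2, a2, b2 = step(y1, a1, b1)
        while y1 != y2:
            y1, a1, b1 = step(y1, a1, b1)
            y2, a2, b2 = step(*step(y2, a2, b2))
        # Collision: g^{a1} h^{b1} = g^{a2} h^{b2}, so x = (a1-a2)/(b2-b1) mod q.
        if (b2 - b1) % q != 0:
            x = (a1 - a2) * pow(b2 - b1, q - 2, q) % q
            if pow(g, x, p) == h:
                return x
    return None                                # all seeds gave degenerate collisions
```

Using a different multiplier set (a fresh seed) whenever the collision is degenerate mirrors the usual practice of restarting the walk; the walk function itself is what the abstract's analysis is about.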
Article
Full-text available
In this report we describe a meet-in-the-middle attack on an NTRU private key. If the private key is chosen from a sample space with 2^M elements, then the security level of the cryptosystem is no more than 2^{M/2}. We also describe variants of this attack applicable to product form NTRU keys.
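The generic meet-in-the-middle principle behind this attack — split the key, tabulate one half, and search the other half for a match — can be sketched on a toy double encryption. Everything below (the toy cipher, the function names, the parameters) is ours and purely illustrative; it is not NTRU and not the paper's algorithm.

```python
def toy_enc(k, x):
    """A toy 16-bit keyed permutation, for demonstration only."""
    x = (x + k) & 0xFFFF
    x = ((x << 3) | (x >> 13)) & 0xFFFF    # rotate left by 3
    return x ^ k

def toy_dec(k, y):
    """Inverse of toy_enc for the same key k."""
    y ^= k
    y = ((y >> 3) | (y << 13)) & 0xFFFF    # rotate right by 3
    return (y - k) & 0xFFFF

def mitm(plain, cipher, keybits=8):
    """Recover (k1, k2) with toy_enc(k2, toy_enc(k1, plain)) == cipher.

    Costs ~2 * 2^keybits work plus 2^keybits memory, instead of 2^(2*keybits)."""
    # Forward table: intermediate value after the first half-key.
    table = {toy_enc(k1, plain): k1 for k1 in range(1 << keybits)}
    # Backward search: peel off the second half-key and look up the middle.
    for k2 in range(1 << keybits):
        mid = toy_dec(k2, cipher)
        if mid in table:
            return table[mid], k2
    return None
```

A real attack would filter candidate pairs against a second plaintext/ciphertext pair; here any returned pair is consistent with the one pair given.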
Article
NTRU is a public-key cryptosystem introduced at ANTS-III. The two most used techniques in attacking the NTRU private key are meet-in-the-middle attacks and lattice-basis reduction attacks. Howgrave-Graham combined both techniques in 2007 and pointed out that the largest obstacle to attacks is the memory capacity that is required for the meet-in-the-middle phase. In the present paper an algorithm is presented that applies low-memory techniques to find ‘golden’ collisions to Odlyzko’s meet-in-the-middle attack against the NTRU private key. Several aspects of NTRU secret keys and the algorithm are analysed. The running time of the algorithm with a maximum storage capacity of $w$ is estimated and experimentally verified. Experiments indicate that decreasing the storage capacity $w$ by a factor $1
Article
We describe some novel methods to compute the index of any integer relative to a given primitive root of a prime $p$. Our first method avoids the use of stored tables and apparently requires $O(p^{1/2})$ operations. Our second algorithm, which may be regarded as a method of catching kangaroos, is applicable when the index is known to lie in a certain interval; it requires $O(w^{1/2})$ operations for an interval of width $w$, but does not have complete certainty of success. It has several possible areas of application, including the factorization of integers.
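The "method of catching kangaroos" for an index known to lie in an interval of width w can be sketched as follows: a tame kangaroo starts from the top of the interval and leaves traps along its path, and a wild kangaroo starts from the target and hops over the same terrain until it lands in one. This toy Python rendering (names, jump-set construction, and retry loop are ours, not the paper's) works in a subgroup of Z_p^* of order q.

```python
import math
import random

def kangaroo(g, h, p, q, w, tries=100):
    """Toy kangaroo method: find x in [0, w) with g^x = h (mod p), ord(g) = q."""
    m = int(math.isqrt(w)) + 1                 # mean jump size ~ sqrt(w)
    for t in range(tries):                     # retry with a fresh jump set if unlucky
        rng = random.Random(t)
        k = 16
        jumps = [rng.randrange(1, 2 * m) for _ in range(k)]

        def step(y, d):
            e = jumps[y % k]                   # jump size depends only on y
            return y * pow(g, e, p) % p, d + e

        # Tame kangaroo: start at g^w, record every visited point with its distance.
        tame = {}
        y, d = pow(g, w, p), 0
        for _ in range(4 * m):
            tame[y] = d
            y, d = step(y, d)
        # Wild kangaroo: start at h = g^x and hop until it hits a tame point.
        y, d = h, 0
        limit = w + max(tame.values())
        while d <= limit:
            if y in tame:                      # paths met: w + d_tame = x + d_wild
                x = (w + tame[y] - d) % q
                if pow(g, x, p) == h:
                    return x
            y, d = step(y, d)
    return None                                # missed every trap on every try
```

As the abstract notes, a single run has no certainty of success; the retry loop here simply redraws the jump set, which suffices for a toy instance.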
Article
Pollard's Monte Carlo factorization algorithm usually finds a factor of a composite integer N in O(N^{1/4}) arithmetic operations. The algorithm is based on a cycle-finding algorithm of Floyd. We describe a cycle-finding algorithm which is about 36 percent faster than Floyd's (on the average), and apply it to give a Monte Carlo factorization algorithm which is similar to Pollard's but about 24 percent faster.
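The improved cycle-finding algorithm this abstract describes is commonly known as Brent's method: instead of Floyd's tortoise and hare moving at different speeds, a saved value is updated only at powers of two, which needs fewer evaluations of f on average. Below is a standard textbook rendering of the idea in Python (not the paper's exact presentation); it returns the tail length mu and the cycle length lam of the sequence x0, f(x0), f(f(x0)), ...

```python
def brent_cycle(f, x0):
    """Return (mu, lam): index of the cycle's start and the cycle's length."""
    power = lam = 1
    tortoise, hare = x0, f(x0)
    while tortoise != hare:
        if power == lam:              # start a new power-of-two window
            tortoise = hare           # teleport the saved value to the hare
            power *= 2
            lam = 0
        hare = f(hare)                # only the hare evaluates f each step
        lam += 1
    # lam is now the cycle length; find mu by walking two pointers lam apart.
    tortoise = hare = x0
    for _ in range(lam):
        hare = f(hare)
    mu = 0
    while tortoise != hare:
        tortoise, hare = f(tortoise), f(hare)
        mu += 1
    return mu, lam
```

In the factoring application only the cycle detection matters (combined with gcd computations), but the same loop structure applies.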
Article
A simple new technique of parallelizing methods for solving search problems which seek collisions in pseudo-random walks is presented. This technique can be adapted to a wide range of cryptanalytic problems which can be reduced to finding collisions. General constructions are given showing how to adapt the technique to finding discrete logarithms in cyclic groups, finding meaningful collisions in hash functions, and performing meet-in-the-middle attacks such as a known-plaintext attack on double encryption. The new technique greatly extends the reach of practical attacks, providing the most cost-effective means known to date for defeating: the small subgroup used in certain schemes based on discrete logarithms such as Schnorr, DSA, and elliptic curve cryptosystems; hash functions such as MD5, RIPEMD, SHA-1, MDC-2, and MDC-4; and double encryption and three-key triple encryption. The practical significance of the technique is illustrated by giving the design for three $10 million custom machines which could be built with current technology: one finds elliptic curve logarithms in GF(2^155) thereby defeating a proposed elliptic curve cryptosystem in expected time 32 days, the second finds MD5 collisions in expected time 21 days, and the last recovers a double-DES key from 2 known plaintexts in expected time 4 years, which is four orders of magnitude faster than the conventional meet-in-the-middle attack on double-DES. Based on this attack, double-DES offers only 17 more bits of security than single-DES.
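The distinguished-point idea underlying this parallelization can be sketched sequentially: many short trails iterate f from random starts, only "distinguished" points (here, points whose low bits are zero) are stored, and when two trails end at the same distinguished point they are re-walked to expose a collision in f. The toy single-threaded Python sketch below uses our own names and parameters; in the actual technique the trails run on independent processors sharing the store.

```python
import random

def find_collision(f, n_bits=16, dist_bits=4, trails=10_000, seed=0):
    """Return (a, b) with a != b and f(a) == f(b), or None if unlucky."""
    rng = random.Random(seed)
    mask = (1 << dist_bits) - 1                 # low bits zero => distinguished
    store = {}                                  # distinguished point -> (start, length)
    for _ in range(trails):
        start = rng.randrange(1 << n_bits)
        x, steps = start, 0
        while steps == 0 or x & mask:           # walk until a distinguished point
            x = f(x)
            steps += 1
            if steps > 20 * (1 << dist_bits):   # trail stuck in a cycle: abandon it
                break
        else:
            if x in store:                      # two trails reached the same point
                s2, l2 = store[x]
                a, la, b, lb = start, steps, s2, l2
                while la > lb:                  # equalize distances to the meeting
                    a, la = f(a), la - 1        # point by advancing the longer trail
                while lb > la:
                    b, lb = f(b), lb - 1
                if a == b:                      # one trail is a suffix of the other
                    continue
                while f(a) != f(b):             # step in tandem to the merge point
                    a, b = f(a), f(b)
                return a, b
            store[x] = (start, steps)
    return None
```

Only one point in 2^dist_bits is stored, so memory shrinks by that factor while each stored point represents a whole trail — the trade-off the parent article's analysis revolves around.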