Finding Optimum Parallel Coprocessor Design for Genus 2 Hyperelliptic Curve Cryptosystems.
ABSTRACT Hardware accelerators are often used in cryptographic applications for speeding up the highly arithmetic-intensive public-key primitives, e.g. in high-end smart cards. One of these emerging and very promising public-key schemes is based on hyperelliptic curve cryptosystems (HECC). In the open literature only a few considerations deal with hardware implementation issues of HECC. Our contribution appears to be the first one to propose architectures for the latest findings in efficient group arithmetic on HEC. The group operation of HECC allows parallelization at different levels: bit-level parallelization (via different digit-sizes in multipliers) and arithmetic operation-level parallelization (via replicated multipliers). We investigate the trade-offs between both parallelization options and identify speed and time-area optimized configurations. We found that a coprocessor using a single multiplier (D=8) instead of two or more is best suited. This coprocessor is able to compute group addition and doubling in 479 and 334 clock cycles, respectively. Providing more resources it is possible to achieve 288 and 248 clock cycles, respectively.
Finding Optimum Parallel Coprocessor Design
for Genus 2 Hyperelliptic Curve Cryptosystems
Guido Bertoni and Luca Breveglieri
Politecnico di Milano, Italy
Thomas Wollinger and Christof Paar
Communication Security Group (COSY)
Ruhr-Universitaet Bochum, Germany
Hardware accelerators are often used in cryptographic ap-
plications for speeding up the highly arithmetic-intensive
public-key primitives, e.g. in high-end smart cards. One
of these emerging and very promising public-key schemes
is based on HyperElliptic Curve Cryptosystems (HECC).
In the open literature only a few considerations deal with
hardware implementation issues of HECC.
Our contribution appears to be the first one to pro-
pose architectures for the latest findings in efficient group
arithmetic on HEC. The group operation of HECC al-
lows parallelization at different levels: bit-level paralleliza-
tion (via different digit-sizes in multipliers) and arithmetic
operation-level parallelization (via replicated multipliers).
We investigate the trade-offs between both parallelization
options and identify speed and time-area optimized
configurations. We found that a coprocessor using a single
multiplier (D = 8) instead of two or more is best suited. This
coprocessor is able to compute group addition and doubling
in 479 and 334 clock cycles, respectively. Providing more
resources it is possible to achieve 288 and 248 clock cycles,
respectively.
Keywords: hyperelliptic curve, hardware architecture,
coprocessor, parallelism, genus 2, embedded processor.
All modern security protocols, such as IPSec, SSL and TLS,
use symmetric-key as well as public-key cryptographic
algorithms. Hardware accelerators are often used to provide
the highly arithmetic-intensive public-key cryptographic
primitives. One example is high-end smart cards, where a
cryptographic coprocessor takes over all the computations
that are expensive in area and time.
In practical applications the most used public-key algo-
rithms are RSA and Elliptic Curve Cryptosystems (ECC).
One emerging and very promising public-key scheme is
the HyperElliptic Curve Cryptosystem (HECC). HECC has
been analyzed and implemented only recently both in soft-
ware [10, 11, 15, 17, 20, 21, 23–25, 27, 33] and in more
hardware-oriented platforms such as FPGAs [4,31,32].
The work at hand presents, for the first time, an archi-
tecture for a HECC coprocessor considering the most re-
cent explicit formulae to compute group operations. All of
the previous work implementing HECC in hardware used
the original Cantor algorithm, which is outdated. Further-
more, we present and evaluate different design options for
the HECC coprocessor. In order to do so, we wrote software
capable of scheduling the necessary operations, resulting in
an optimal architecture with respect to area and speed.
Parallelizing at the bit and arithmetic operation level, we found
that: 1) no more than three multiplier units are useful; 2)
architectures implementing one inversion and one multiplication
unit are the best choice; and 3) given sufficient resources,
group addition and doubling can be performed in
288 and 248 clock cycles, respectively. Moreover, we
explored the overlapping of two group operations and we
analyzed the usage of registers.
The rest of the paper is organized as follows. Section 2
summarizes previous work.
Section 3 gives a brief overview of the mathematical back-
ground of HECC. Section 4 presents the architecture of the
HECC coprocessor and Section 5 the used methodology.
Finally, we end this contribution with a discussion of our
results (Section 6) and some conclusions (Section 7).
This section gives a short overview of the hardware imple-
mentations targeting HECC and of the previous research
work to parallelize hardware ECC.
The first work discussing hardware architectures for the
implementation of HECC appeared in [31,32]. The authors
describe efficient architectures to implement the necessary
field operations and polynomial arithmetic in hardware. All
of the presented architectures are speed and area optimized.
In , they also estimated that for a hypothetical clock
frequency of 20 MHz, the scalar multiplication of HECC
would take 21.4 ms using the window NAF method.
In  the authors presented the first complete hardware
implementation of a hyperelliptic curve coprocessor. This
implementation targets a genus-2 HEC over $F_{2^{113}}$. The
target platform is a Xilinx II FPGA. Point addition and point
doubling with a clock frequency of 45 MHz take 105 µs and
90 µs, respectively. The scalar multiplication could be
computed in 10.1 ms.
Note that publications [4,31,32] adopt the Cantor algo-
rithm to compute group operations. Today, there exist more
efficient algorithms to compute group addition and group
doubling, the so-called explicit formulae (for more details
see Section 3.2).
In  the authors proposed a parallelization of the ex-
plicit group operation of HECC. They developed a gen-
eral methodology for obtaining parallel algorithms. The
methodology guarantees that the obtained parallel version
requires a minimum number of rounds. They show that for
the inversion free arithmetic  using 4, 8 and 12 multipli-
ers in parallel, scalar multiplication can be carried out in 27,
14 and 10 parallel rounds, respectively. When using affine
coordinates  and 8 multipliers it can be performed in 11
rounds, including an inversion round.
to use so many multipliers in parallel as stated in . The
work at hand attempts to consider not only the minimum
number of rounds (speed), but also the necessary devices
(area) as well as practical applications.
A similar work as that presented here for HECC can be
the number of operators and different coordinate systems is
presented. In  two scalar multiplications are scheduled
in parallel on the same architecture: the two operations are
executed in different coordinate systems to improve the use
of the operators. Note that the group operations of elliptic
curves are much less complex than those of the hyperelliptic
ones. In ECC the silicon area of the possible architecture is
easily bounded since the critical path can be computed by
hand, while in the case of HECC it is much more complex.
In this section we introduce briefly the theory of HECC,
restricting attention to the material relevant for this work
only. The interested reader is referred to [1, 9] for more
background on HECC.
3.1 Definition of HECC
Let F be a finite field and let $\overline{F}$ be the algebraic closure of
F. A hyperelliptic curve C of genus g ≥ 1 over F is the set
of the solutions $(x, y) \in \overline{F} \times \overline{F}$ to the following equation:
$C : y^2 + h(x)y = f(x)$
The polynomial h(x) ∈ F[x] is of degree at most g and
f(x) ∈ F[x] is a monic polynomial of degree 2g + 1. For
odd characteristic it suffices to let h(x) = 0 and to have
f(x) squarefree. Such a curve C is said to be non-singular
if there does not exist any pair $(x, y) \in \overline{F} \times \overline{F}$ satisfying
the equation of the curve C and the two partial derivative
equations $2y + h(x) = 0$ and $h'(x)y - f'(x) = 0$.
The so-called divisor D is defined as follows: $D = \sum m_i P_i$,
a formal weighted sum of points $P_i$ of the curve C (the
integers $m_i$ are the weights), with the additional condition
that $D^\sigma = \sum m_i P_i^\sigma$ is equal to D for all
the automorphisms σ of $\overline{F}$ over F (see  for details).
Divisors admit a reduced form. A reduced divisor can be
represented as a pair of polynomials u(x), v(x) [18, page
3.17]. Reduced divisors can be added (group addition),
e.g. $D_3 = D_1 + D_2$, or doubled (group doubling), e.g.
$D_2 = 2D_1 = D_1 + D_1$, and hence the so-called scalar
multiplication kD = D + ··· + D (k times) is defined. The
scalar multiplication kD is the basic operation of HECC
that we want to implement with a coprocessor.
The formulae given for the group operations (addition, dou-
bling) of HEC by Cantor  can be rewritten in explicit
form, thus resulting in more efficient arithmetic. The ex-
plicit formulae were first presented in . Starting with this
finding, a considerable effort of different research groups
has been put into finding more efficient operations. The
group operations of genus-2 curves have been studied most
intensively ( [7, 11–15, 17, 22, 29]), but also group opera-
tions on genus-3 curves ( [10,20,21,33]) and even genus-4
curves ( ) have been considered.
In the work at hand, we target our HECC coprocessor
for genus-2 curves using underlying fields of characteristic
two. We used the fastest explicit formulae available to date,
as presented in , where the authors introduced a group
doubling requiring a single field inversion, 9 field
multiplications and 6 field squarings. Group addition can be
computed with 1 field inversion, 21 field multiplications and 3
field squarings.
3.3 Security of HECC
It is widely accepted that for most cryptographic applications
based on EC or HEC, the necessary group is of order
at least ≈ $2^{160}$. Thus, for HECC over $F_q$, we must have at
least $g \cdot \log_2 |F_q| \approx 160$. In particular, we will need a field
order $|F_q| \approx 2^{80}$ for genus-2 curves. Even the very recent
attack found by Thériault  shows no progress in attacks
against genus-2 HEC.
4 Architecture of the HECC coprocessor
To implement the coprocessor we chose a standard archi-
tecture, see Figure 1. It contains a register file to store tem-
porary results and outputs. The size of each register was
chosen to be the dimension of the field, namely 81 bits. The
register file has two output ports to feed the operators and
one input port to receive the result. This guarantees feasi-
bility and ease of implementation. At any given clock cycle
only one field operation can start. If the operation is unary,
such as inversion, one bus remains idle.
Figure 1. Crypto-processor architecture.
The following list is a summary of how we implemented the
field operations:
• Addition: The addition of two elements requires the
modulo 2 addition of the coefficients of the field elements.
• Squaring: The squaring of a field element
$A = \sum_{i=0}^{m-1} a_i x^i$ is ruled by the following equation:
$A^2 \equiv \sum_{i=0}^{m-1} a_i x^{2i} \bmod F(x)$. Further details can be found
• Multiplication: We decided to use digit multipliers,
introduced in  for fields $GF(2^m)$. This kind of
multiplier allows a trade-off between speed, area and
power consumption. It works by processing several
multiplicand coefficients at the same time. The number
of coefficients processed in parallel is the digit-size D.
Given D, we denote by $d = \lceil m/D \rceil$ the total number
of digits in a polynomial of degree m − 1. Hence,
$C \equiv AB \bmod F(x) = A \sum_{i=0}^{d-1} B_i x^{Di} \bmod F(x)$,
where $B_i$ denotes the i-th digit of B.
• Inversion: The inversion is computed using the algorithm
proposed in . It is based on a modification of
Euclid's algorithm for computing the gcd of two polynomials.
The asymptotic complexity is linear with the
modulus both in time and area.
Table 1. Components of the coprocessor: area and time.
[D · m] AND & [D · m] XOR
[6 · m + log_2 m] AND & [6 · m + log_2 m] XOR
[2(6m + log_2 m)]
2 · m
In Table 1 we give the area and latency of each arithmetic
component we used. The given estimates assume
2-input gates and optimum field polynomials
$F(x) = x^m + \sum_{i=0}^{t} f_i x^i$, where m − t ≥ D.
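As an illustration of the digit multiplier and the squaring rule described above, the following is a minimal Python sketch, not the hardware design itself. Field elements are held as integers whose bit i is the coefficient of x^i. The function names (reduce_poly, clmul, num_digits, digit_mul, square) are hypothetical, and the small field GF(2^8) with F(x) = x^8 + x^4 + x^3 + x + 1 is used only because its products are easy to check; it is not the 81-bit field of the coprocessor.

```python
def reduce_poly(p, m, f):
    """Reduce polynomial p modulo f, where f has degree m (bit m of f is set)."""
    for bit in range(p.bit_length() - 1, m - 1, -1):
        if (p >> bit) & 1:
            p ^= f << (bit - m)  # cancel the term x^bit
    return p


def clmul(a, b):
    """Carry-less product of two polynomials over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def num_digits(m, D):
    """d = ceil(m / D), the number of D-bit digits in a field element."""
    return -(-m // D)


def digit_mul(a, b, m, f, D):
    """Multiply a and b modulo f, consuming b one D-bit digit per iteration."""
    mask = (1 << D) - 1
    c = 0
    # Horner evaluation, most significant digit of b first.
    for i in range(num_digits(m, D) - 1, -1, -1):
        b_i = (b >> (i * D)) & mask
        c = reduce_poly((c << D) ^ clmul(a, b_i), m, f)
    return c


def square(a, m, f):
    """Square a modulo f by moving coefficient a_i to position 2i."""
    s = 0
    for i in range(m):
        if (a >> i) & 1:
            s |= 1 << (2 * i)
    return reduce_poly(s, m, f)


if __name__ == "__main__":
    F8 = 0x11B  # x^8 + x^4 + x^3 + x + 1, illustrative field only
    for D in (1, 2, 4, 8):
        print(D, hex(digit_mul(0x57, 0x83, 8, F8, D)))
    print(num_digits(81, 8))  # digits per operand for m = 81, D = 8
```

Each loop iteration of digit_mul consumes one digit of the second operand, so the loop runs d = ⌈m/D⌉ times; for m = 81 and D = 8 that is 11 iterations. Whether the hardware needs further cycles per multiplication is not stated here.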
In this section we describe briefly our approach to find the
best suited architecture for the HECC coprocessor.
1. Input: First we evaluated the most recent findings re-
garding the group operation of HECC. The given for-
mulae were then prepared for the scheduler.
2. Scheduler: Our own software library, especially devel-
oped to schedule the HECC group operations, is the
heart of our methodology. The scheduler is based on
the method known as Operation Scheduling  and
works according to the As Soon As Possible (ASAP)
policy. There is a list of operations that should be ex-
ecuted by the architecture. The scheduler takes one
operation at a time and searches for the earliest time
slot where the operation can be executed. It is con-
strained by the number of available resources and by
the different times required to execute each operation.
The same methodology is used by compilers for
scheduling machine instructions. It should be noted
that this methodology is heuristic and does not guarantee
optimal results. To reach the optimal scheduling it is
necessary to use other methods instead, see . The
scheduler has the following parameters:
• HECC formulae.
• Implementation method for addition, inversion,
multiplication and squaring.
• Number of multiplication units.
• Different digit-sizes for the multiplier.
• Properties of the bus.
• Memory access time.
3. Testing: The results of the scheduler were tested by ap-
plying test vectors. In order to do so we implemented
HECC group operations with the NTL library .
4. Analysis: The results were analyzed and, if needed,
the input architecture was changed, in order to find a
better structure for the coprocessor.
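The ASAP policy described in step 2 can be sketched as follows. This is a simplified Python illustration, not the authors' scheduler: the function asap_schedule, the five-operation dependency graph and the latencies (11 cycles per multiplication, 1 per addition) are assumptions chosen for the example, and bus and memory-access constraints are ignored. The sketch also never places an operation into an idle gap before a unit's last booked operation, which a fuller implementation might do.

```python
def asap_schedule(ops, units, latency):
    """Schedule ops one at a time at the earliest feasible start.

    ops:     list of (name, kind, deps), listed so that deps come first
    units:   dict kind -> number of units of that kind
    latency: dict kind -> cycles per operation
    Returns dict name -> (start, end).
    """
    free_at = {kind: [0] * n for kind, n in units.items()}
    times = {}
    for name, kind, deps in ops:
        # An operation is ready once all of its inputs have been produced.
        ready = max((times[d][1] for d in deps), default=0)
        pool = free_at[kind]
        # Pick the unit that allows the earliest start.
        idx = min(range(len(pool)), key=lambda i: max(ready, pool[i]))
        start = max(ready, pool[idx])
        end = start + latency[kind]
        times[name] = (start, end)
        pool[idx] = end
    return times


def makespan(times):
    """Cycle at which the last operation finishes."""
    return max(end for _, end in times.values())


# Hypothetical toy graph: m4 = (m1 + m2) * m3
TOY_OPS = [
    ("m1", "mul", []),
    ("m2", "mul", []),
    ("m3", "mul", []),
    ("s", "add", ["m1", "m2"]),
    ("m4", "mul", ["s", "m3"]),
]
LATENCY = {"mul": 11, "add": 1}  # assumed latencies


def toy_makespan(n_mul):
    return makespan(asap_schedule(TOY_OPS, {"mul": n_mul, "add": 1}, LATENCY))


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, "multiplier(s):", toy_makespan(n), "cycles")
```

In this toy graph the total drops from 44 to 33 to 23 cycles for one, two and three multipliers and stays at 23 with a fourth, because only three multiplications are independent of each other. The real group operations have their own dependency structure, so these numbers say nothing about the actual tables.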
Traditionally, when evaluating the performance of cryp-
tographic implementations, emphasis was put first on the
throughput of the implementation and, second, on the
amount of hardware resources consumed to achieve the de-
sired throughput. We will consider both the hardware re-
quirements and the time constraints of the cryptographic
application. Hence, we are going to use the area-time prod-
uct and the optimal implementation will reach the highest
throughput consuming the smallest area.
We designed an architecture for the HECC coprocessor
using different design criteria. We varied the number of
multipliers (one, two, three and four), as well as their digit-
size D (D = 2,4,8,16). Hence, we changed different ar-
chitecture options, the processing power of the whole co-
processor and of each individual multiplier. Increasing the
processing power yields a speed up of the group operation,
but also causes a growth in area. Thus, there must exist an
optimum architecture, where the area-time product is minimal.
Our goal is to implement a simple controller computing
group operations with a fixed execution order. Hence, we
look at a static schedule. The alternative would be to imple-
ment a finite state machine executing the schedule directly,
controlling the availability of resources and deciding which
operation should be executed. This solution is feasible but
we consider it as too complex and expensive compared to a
simple controller executing the operations in a fixed order.
Recent research in hyperelliptic curve cryptosystems
provides a large number of explicit formulae to choose from
for computing group operations. We chose the up-to-date
fastest ones, as are given in .
In this section the results of the scheduling methodology are
presented. Our results are based on the following
considerations: i) we chose a set of keys to schedule the operations;
ii) we scheduled group addition and doubling according to
the values of the key; iii) we considered a long sequence
of concatenated group operations. If one implements scalar
multiplication using the double-and-add algorithm, the
following consecutive group operations should be scheduled:
addition after doubling, doubling after doubling and
doubling after addition.
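To make the three possible successions concrete, here is a small Python sketch of the left-to-right double-and-add loop that records only which group operation is executed, not the arithmetic itself. The helper names group_op_sequence and transitions and the example scalar are illustrative assumptions.

```python
from collections import Counter


def group_op_sequence(k):
    """Group operations of left-to-right double-and-add for scalar k >= 1.

    Returns a list of 'D' (doubling) and 'A' (addition).
    """
    bits = bin(k)[2:]
    ops = []
    for bit in bits[1:]:  # the leading 1 only loads the base divisor
        ops.append("D")
        if bit == "1":
            ops.append("A")
    return ops


def transitions(ops):
    """Count (previous, current) pairs of consecutive group operations."""
    return Counter(zip(ops, ops[1:]))


if __name__ == "__main__":
    ops = group_op_sequence(0b1011001)  # arbitrary example scalar
    print("".join(ops))
    for (prev, cur), n in sorted(transitions(ops).items()):
        print(f"{cur} after {prev}: {n}")
```

Because every addition is immediately preceded by a doubling and followed by one, the pair "addition after addition" never appears, which is why only three cases have to be scheduled.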
We examined all the cases in order to find out whether
we could schedule the second group operation in a way to
gain speed and achieve a higher hardware utilization. Our
results show that addition is always scheduled in the same
way. As for doubling, there are two different ways to
execute it, depending on the previous operation. When
doubling is executed after another doubling, we can gain speed;
thus we decided to integrate two options for computing
doubling. We have to allow negligible extra hardware for the
controller to decide which option to choose.
6.1 Time Requirements of the Group Operations
Tables 2 and 3 show the clock cycles necessary for performing
a group operation in the different system configurations.
Table 3 shows two different latency figures because
the time necessary to perform doubling depends on the
operation executed before. The leftmost figure in each cell of
Table 3 is the number of clock cycles necessary to compute
doubling after doubling. The rightmost is the
latency of doubling after addition.
One can see that by increasing the digit-size and the
number of multipliers, the latency of the group
operation decreases. For group addition our results show
that the performance does not increase in some cases, even
when augmenting the resources of the system. For exam-
ple, focusing on the speed of addition, using three rather
than four digit-size multipliers with D = 8 (Table 2, third
row) does not make any performance difference. The same
behavior can be observed for 2, 3 and 4 multipliers with
digit-size 16. This is due to the structure of the group oper-
ation, that shows no additional parallelism.
Focusing on doubling, there is almost no performance
gain in moving from 3 to 4 multipliers, regardless of the
digit-size (Table 3, rightmost two columns). In addition, there
is no performance gain in providing 2, 3 or 4 multipliers of
digit-size 8 or 16. Taking only performance into account,
one concludes from Tables 2 and 3 that the design option
using one inverter, two multipliers (D = 16), one adder
and one squarer is preferable. The architecture can perform
group addition and doubling in 289 and 248 clock cycles,
respectively. As stated in Section 5, the ASAP scheduling
policy does not guarantee an optimum solution; this is evident in
the case of 3 or 4 multipliers with D = 16. The scheduling
of doubling after addition is worse than the case with two
multipliers.
6.2 Parallel Computation of Group Operations
In the case of genus-2 HEC each group operation produces
as output one polynomial of degree at most one and one monic
polynomial of degree two. Hence, the output consists of
4 coefficients, namely four field elements. They are neither
produced at the same time nor are all necessary to start the
next group operation right away. This means that when one
field element (one coefficient) is computed, it can be used
by the next group operation. We measure the overlapping
of two group operations as the difference of the time when
the last field operation of the former group operation ends
execution minus the time when the first of the latter starts.
Table 2. Clock cycles per group addition.
Digit-size   Number of multipliers
Table 3. Clock cycles per group doubling.
Digit-size   Number of multipliers
             1           2           3           4
2            724 / 846   486 / 560   458 / 490   458 / 484
4            464 / 526   346 / 380   338 / 350   338 / 344
8            334 / 366   278 / 286   278 / 279   278 / 279
16           274 / 284   248 / 247   248 / 250   248 / 250
Table 4 shows the overlapping for doubling after addi-
tion. Overlapping decreases as the speed and the number of
multipliers increase. Similar behaviors have been observed
in the other cases (doubling after doubling and addition
after doubling). This decrease is due to the increase of the
parallelism in the tail of the operation.
6.3 Register Allocation
Schedulers fall into two broad families: unconstrained or
resource limited. We choose to upper bound the number of
resources as for busses and arithmetic units, while that of
registers is unbounded. Once the operations are scheduled,
we count the number of live registers and compute the regis-
ter allocation. This option is the simplest solution and gives
good results. In fact, each register stores a field element of
81 bits. Simulations demonstrate that changing the other
parameters of the system has a low impact on the number
of required registers. The system needs 18 and 20 registers
in the best and worst case, respectively.
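Counting live registers once a schedule is fixed amounts to finding the largest number of values that are alive at the same time. A minimal Python sketch of that count is given below; the function max_live and the lifetimes in the example are hypothetical and are not taken from the actual HECC schedule.

```python
def max_live(lifetimes):
    """Peak number of simultaneously live values.

    lifetimes: list of (birth, death) cycles; a value occupies a register
    during the half-open interval [birth, death).
    """
    events = []
    for birth, death in lifetimes:
        events.append((birth, 1))   # value is written
        events.append((death, -1))  # value is no longer needed
    # At equal times (t, -1) sorts before (t, 1), so a register freed in
    # cycle t can be reused by a value born in the same cycle.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    # Made-up lifetimes of five intermediate values.
    example = [(0, 5), (0, 3), (3, 8), (5, 9), (8, 10)]
    print(max_live(example))
```

When two group operations overlap, the lifetimes of both are merged into one list before counting, which is how overlapping raises the peak.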
If a designer wants to lower the number of required
registers, the number of registers can be traded for additional
latency. In order to do so, the designer should avoid overlapping
and start a group operation only when the previous one has finished.
We noted that a single group operation uses from 8 to 10
registers, while the maximum register number is reached
when two group operations overlap.
Table 4. Overlapping in clock cycles of doubling after addition.
Digit-size   Number of multipliers
6.4 Evaluation of Different Architecture Options
In this section we report and compare the different
architectures in terms of latency and area. In order to do so,
we listed the use of the different multipliers as a percentage
of the total time of one group operation. Thus, the figures
shown in Tables 5 and 6 are computed as follows:
$\frac{\#_{mul} \cdot t_{mul}}{t_{group}}$
where $\#_{mul}$ is the number of multiplications executed by
one multiplier unit during one group operation, $t_{mul}$ is the
time needed for one multiplication and $t_{group}$ is the total
execution time of the group operation.
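The utilization figure defined above is a simple ratio. The Python sketch below evaluates it for the single-multiplier, D = 8 configuration, using the 479-cycle group addition with its 21 field multiplications and the 334-cycle doubling with its 9 field multiplications. The multiplication time of 11 cycles is an assumption (one cycle per digit, ⌈81/8⌉ = 11), so the printed percentages are only indicative and are not the entries of Tables 5 and 6.

```python
def utilization(n_mul_ops, t_mul, t_group):
    """Fraction of a group operation during which one multiplier is busy."""
    return n_mul_ops * t_mul / t_group


T_MUL = 11  # assumed cycles per multiplication for m = 81, D = 8

if __name__ == "__main__":
    print(f"addition: {100 * utilization(21, T_MUL, 479):.1f}%")
    print(f"doubling: {100 * utilization(9, T_MUL, 334):.1f}%")
```

With several multipliers the same function is applied per unit, with n_mul_ops set to the number of multiplications the scheduler assigned to that unit.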
In an ideal scenario all the multipliers should be used
uniformly. However, one can see that the fourth multiplier,
and in some cases also the third multiplier, are used very
infrequently (see Table 6, the columns corresponding to the
3rd and 4th multiplier). Hence, for most applications it will
be unreasonable to provide these extra hardware units.
In Table 8 we show a comparison to find the optimal ar-
chitecture. The optimal implementation will achieve the
highest throughput consuming the smallest area (contrary
to some traditional cryptographic implementations, where
only best performance was evaluated). The analysis uses
the normalized area-time product (with respect to the low-
est area-time product). Table 8 shows that the architecture
using one inversion, one multiplication (D = 8), one addi-
tion and one squaring achieves the best area-time product.
To evaluate the latency of a complete scalar multiplica-
tion kD, reported in Table 7, we examined it in an average
case, where the integer k of 160 bits has half of its bits equal
to 1 and the rest equal to 0. This means that 80 and 160 ad-
ditions and doublings were performed, respectively. Half of
the 160 doublings are computed after another doubling and
half after an addition.
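The average case described above can be turned into a cycle count with a few lines of Python. The sketch below is only a rough estimate under stated assumptions: it uses the single-multiplier, D = 8 figures (479 cycles per addition, and 334 / 366 cycles for doubling after doubling / after addition from Table 3) and simply sums them, without modeling any further overlap or start-up cost. The function name scalar_mult_cycles is hypothetical, and the result is not claimed to match Table 7.

```python
def scalar_mult_cycles(bits, t_add, t_dbl_after_dbl, t_dbl_after_add):
    """Average-case cycle estimate for double-and-add with a bits-bit scalar.

    Assumes half of the scalar bits are 1, one doubling per bit, and that
    half of the doublings follow a doubling and half follow an addition.
    """
    n_add = bits // 2
    n_dbl = bits
    n_dbl_after_dbl = n_dbl // 2
    n_dbl_after_add = n_dbl - n_dbl_after_dbl
    return (n_add * t_add
            + n_dbl_after_dbl * t_dbl_after_dbl
            + n_dbl_after_add * t_dbl_after_add)


if __name__ == "__main__":
    cycles = scalar_mult_cycles(160, 479, 334, 366)
    print(cycles)  # 80*479 + 80*334 + 80*366 = 94320
```

Dividing the cycle count by an assumed clock frequency gives a time estimate; no frequency is fixed here because the text treats it as configuration dependent.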
We assumed that all the different configurations of the
system always work at the same frequency. This is a
worst-case assumption. In fact, the multiplier unit usually
dominates the frequency, and a smaller digit-size will yield a
higher clock frequency and thus speed up the system.
It should be noted that we omitted the register file area
in the estimation. We decided to do so after noting that the
required number of registers is almost the same in all the
configurations, and that the area consumed by a register can
vary depending on the implementation technology.