Finding Optimum Parallel Coprocessor Design
for Genus 2 Hyperelliptic Curve Cryptosystems
Guido Bertoni and Luca Breveglieri
Politecnico di Milano, Italy
{bertoni,breveglieri}@elet.polimi.it
Thomas Wollinger and Christof Paar
Communication Security Group (COSY)
Ruhr-Universitaet Bochum, Germany
{wollinger,cpaar}@crypto.rub.de
Abstract
Hardware accelerators are often used in cryptographic applications for speeding up the highly arithmetic-intensive public-key primitives, e.g. in high-end smart cards. One emerging and very promising public-key scheme is based on HyperElliptic Curve Cryptosystems (HECC). In the open literature only a few considerations deal with hardware implementation issues of HECC.
Our contribution appears to be the first one to propose architectures for the latest findings in efficient group arithmetic on HEC. The group operation of HECC allows parallelization at different levels: bit-level parallelization (via different digit-sizes in multipliers) and arithmetic operation-level parallelization (via replicated multipliers). We investigate the trade-offs between both parallelization options and identify speed and time-area optimized configurations. We found that a coprocessor using a single multiplier (D = 8) instead of two or more is best suited. This coprocessor is able to compute group addition and doubling in 479 and 334 clock cycles, respectively. Providing more resources it is possible to achieve 288 and 248 clock cycles, respectively.
Keywords: hyperelliptic curve, hardware architecture, coprocessor, parallelism, genus 2, embedded processor.
1 Introduction
All modern security protocols, such as IPSec, SSL and TLS, use symmetric-key as well as public-key cryptographic algorithms. In order to be able to provide highly arithmetic-intensive public-key cryptographic primitives, hardware accelerators are often used. An example is high-end smart cards, where a cryptographic coprocessor takes over all the expensive (area and time) computations.
In practical applications the most used public-key algorithms are RSA and Elliptic Curve Cryptosystems (ECC). One emerging and very promising public-key scheme is the HyperElliptic Curve Cryptosystem (HECC). HECC has been analyzed and implemented only recently both in software [10, 11, 15, 17, 20, 21, 23–25, 27, 33] and in more hardware-oriented platforms such as FPGAs [4, 31, 32].
The work at hand presents, for the first time, an architecture for a HECC coprocessor considering the most recent explicit formulae to compute group operations. All of the previous work implementing HECC in hardware used the original Cantor algorithm, which is outdated. Furthermore, we present and evaluate different design options for the HECC coprocessor. In order to do so, we wrote software capable of scheduling the necessary operations, resulting in an optimal architecture with respect to area and speed. Parallelizing at the bit and arithmetic operation level we found that: 1) no more than three multiplier units are useful; 2) architectures implementing one inversion and one multiplication unit are the best choice; 3) given sufficient resources, group addition and doubling can be performed in 288 and 248 clock cycles, respectively. Moreover, we explored the overlapping of two group operations and we analyzed the usage of registers.
The rest of the paper is organized as follows. Section 2 summarizes previous work. Section 3 gives a brief overview of the mathematical background of HECC. Section 4 presents the architecture of the HECC coprocessor and Section 5 the methodology used. Finally, we end this contribution with a discussion of our results (Section 6) and some conclusions (Section 7).
2 Previous Work
This section gives a short overview of the hardware implementations targeting HECC and of the previous research work to parallelize hardware ECC.
The first work discussing hardware architectures for the implementation of HECC appeared in [31, 32]. The authors describe efficient architectures to implement the necessary field operations and polynomial arithmetic in hardware. All of the presented architectures are speed and area optimized. In [31], they also estimated that for a hypothetical clock frequency of 20 MHz, the scalar multiplication of HECC would take 21.4 ms using the window NAF method.
In [4] the authors presented the first complete hardware implementation of a hyperelliptic curve coprocessor. This implementation targets a genus-2 HEC over $\mathbb{F}_{2^{113}}$. The target platform is a Xilinx II FPGA. Point addition and point doubling with a clock frequency of 45 MHz take 105 µs and 90 µs, respectively. The scalar multiplication could be computed in 10.1 ms.
Note that publications [4, 31, 32] adopt the Cantor algorithm to compute group operations. Today, there exist more efficient algorithms to compute group addition and group doubling, the so-called explicit formulae (for more details see Section 3.2).
In [16] the authors proposed a parallelization of the explicit group operation of HECC. They developed a general methodology for obtaining parallel algorithms. The methodology guarantees that the obtained parallel version requires a minimum number of rounds. They show that for the inversion-free arithmetic [12] using 4, 8 and 12 multipliers in parallel, scalar multiplication can be carried out in 27, 14 and 10 parallel rounds, respectively. When using affine coordinates [11] and 8 multipliers it can be performed in 11 rounds, including an inversion round.
Note that for an effective implementation it is impractical to use as many multipliers in parallel as stated in [16]. The work at hand attempts to consider not only the minimum number of rounds (speed), but also the necessary devices (area) as well as practical applications.
A work similar to that presented here for HECC can be found in [3] for ECC, where a study of the trade-off between the number of operators and different coordinate systems is presented. In [2] two scalar multiplications are scheduled in parallel on the same architecture: the two operations are executed in different coordinate systems to improve the use of the operators. Note that the group operations of elliptic curves are much less complex than those of the hyperelliptic ones. In ECC the silicon area of the possible architecture is easily bounded since the critical path can be computed by hand, while in the case of HECC it is much more complex.
3 Mathematical Background
In this section we briefly introduce the theory of HECC, restricting attention to the material relevant for this work only. The interested reader is referred to [1, 9] for more background on HECC.
3.1 Definition of HECC
Let $\mathbb{F}$ be a finite field and let $\overline{\mathbb{F}}$ be the algebraic closure of $\mathbb{F}$. A hyperelliptic curve $C$ of genus $g \geq 1$ over $\mathbb{F}$ is the set of the solutions $(x, y) \in \overline{\mathbb{F}} \times \overline{\mathbb{F}}$ to the following equation:
$C : y^2 + h(x)y = f(x)$
The polynomial $h(x) \in \mathbb{F}[x]$ is of degree at most $g$ and $f(x) \in \mathbb{F}[x]$ is a monic polynomial of degree $2g + 1$. For odd characteristic it suffices to let $h(x) = 0$ and to have $f(x)$ square-free. Such a curve $C$ is said to be non-singular if there does not exist any pair $(x, y) \in \overline{\mathbb{F}} \times \overline{\mathbb{F}}$ satisfying the equation of the curve $C$ and the two partial derivative equations $2y + h(x) = 0$ and $h'(x)y - f'(x) = 0$.
The so-called divisor $D$ is defined as $D = \sum m_i P_i$, a formal weighted sum of points $P_i$ of the curve $C$ (the integers $m_i$ are the weights), with the additional condition that $D^{\sigma} = \sum m_i P_i^{\sigma}$ is equal to $D$ for all the automorphisms $\sigma$ of $\overline{\mathbb{F}}$ over $\mathbb{F}$ (see [1] for details).
Divisors admit a reduced form. A reduced divisor can be represented as a pair of polynomials $u(x)$, $v(x)$ [18, page 3.17]. Reduced divisors can be added (group addition), e.g. $D_3 = D_1 + D_2$, or doubled (group doubling), e.g. $D_2 = 2D_1 = D_1 + D_1$, and hence the so-called scalar multiplication $kD = D + \cdots + D$ for $k$ times is defined. The scalar multiplication $kD$ is the basic operation of HECC, which we want to implement with a coprocessor.
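
To make the curve equation concrete, the following minimal sketch enumerates the affine points of a toy genus-2 curve whose coordinates are restricted to a small prime field. The curve $y^2 = x^5 + 1$ over $\mathbb{F}_7$ and the helper name `affine_points` are illustrative assumptions, not taken from this paper, which targets fields of characteristic two; the sketch also only looks at points with coordinates in the base field, not in its algebraic closure.

```python
def affine_points(p, h, f):
    """Return all (x, y) in F_p x F_p with y^2 + h(x)*y = f(x) (mod p).

    h and f are coefficient lists, lowest degree first.
    """
    def evaluate(poly, x):
        # Horner evaluation modulo p
        acc = 0
        for c in reversed(poly):
            acc = (acc * x + c) % p
        return acc

    points = []
    for x in range(p):
        hx, fx = evaluate(h, x), evaluate(f, x)
        for y in range(p):
            if (y * y + hx * y - fx) % p == 0:
                points.append((x, y))
    return points


# Toy parameters (assumption): genus g = 2, h(x) = 0, f(x) = x^5 + 1 over F_7.
P = 7
H = [0]
F = [1, 0, 0, 0, 0, 1]   # 1 + x^5, monic of degree 2g + 1 = 5
points = affine_points(P, H, F)
print(len(points), points)
```

A real cryptosystem works with divisors built from such points over a field of roughly 80 bits rather than with the points themselves; the brute-force loop is only meant to illustrate the defining equation.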
3.2 Group Operations
The formulae given for the group operations (addition, doubling) of HEC by Cantor [6] can be rewritten in explicit form, thus resulting in more efficient arithmetic. The explicit formulae were first presented in [7]. Starting with this finding, a considerable effort of different research groups has been put into finding more efficient operations. The group operations of genus-2 curves have been studied most intensively ([7, 11–15, 17, 22, 29]), but also group operations on genus-3 curves ([10, 20, 21, 33]) and even genus-4 curves ([23]) have been considered.
In the work at hand, we target our HECC coprocessor for genus-2 curves using underlying fields of characteristic two. We used the fastest explicit formulae available to date, as presented in [22], where the authors introduced a group doubling requiring a single field inversion, 9 field multiplications and 6 field squarings. Group addition can be computed with 1 field inversion, 21 field multiplications and 3 field squarings.
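
As a rough illustration of what these operation counts mean in clock cycles, the sketch below adds up the latencies of the inversions, multiplications and squarings as if they ran strictly one after another. The latencies are taken from Table 1 and the field size m = 81 from Section 4; the function name `serial_cycles` is hypothetical. The estimate ignores field additions, register-file and bus overhead as well as any parallelism, so it is not expected to reproduce the scheduled figures of Tables 2 and 3.

```python
from math import ceil

M = 81  # field size in bits (register width given in Section 4)

# Field-operation counts per group operation, as quoted from [22] above.
OP_COUNTS = {
    "addition": {"inv": 1, "mul": 21, "sqr": 3},
    "doubling": {"inv": 1, "mul": 9, "sqr": 6},
}


def serial_cycles(group_op, digit_size, m=M):
    """Cycles if all inversions, multiplications and squarings ran sequentially."""
    # Latencies from Table 1: inversion 2*m, multiplication ceil(m/D), squaring 1.
    latency = {"inv": 2 * m, "mul": ceil(m / digit_size), "sqr": 1}
    return sum(n * latency[op] for op, n in OP_COUNTS[group_op].items())


for d in (2, 4, 8, 16):
    print(d, serial_cycles("addition", d), serial_cycles("doubling", d))
```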
3.3 Security of HECC
It is widely accepted that for most cryptographic applications based on EC or HEC, the necessary group is of order at least $\approx 2^{160}$. Thus, for HECC over $\mathbb{F}_q$, we must have at least $g \cdot \log_2 |\mathbb{F}_q| \approx 160$. In particular, we will need a field order $|\mathbb{F}_q| \approx 2^{80}$ for genus-2 curves. Even the very recent attack found by Thériault [30] shows no progress in attacks against genus-2 HEC.
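
As a quick check of this bound for the parameters used in the rest of the paper (genus $g = 2$ and the 81-bit field of Section 4), the worked arithmetic is, with the group order approximated by $q^g$:

```latex
\[
  g \cdot \log_2 |\mathbb{F}_q| = 2 \cdot 81 = 162 \geq 160,
  \qquad \text{so the group order is roughly } q^{g} = 2^{162}.
\]
```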
4 Architecture of the HECC coprocessor
To implement the coprocessor we chose a standard architecture, see Figure 1. It contains a register file to store temporary results and outputs. The size of each register was chosen to be the dimension of the field, namely 81 bits. The register file has two output ports to feed the operators and one input port to receive the result. This guarantees feasibility and ease of implementation. At any given clock cycle only one field operation can start. If the operation is unary, such as inversion, one bus remains idle.
Figure 1. Cryptoprocessor architecture. (Block diagram: a control unit and a register file connected to the multiplier, squarer, inverter and adder through operation inputs, operation results and control signals.)
The following list is a summary of how we implemented the field arithmetic:
• Addition: The addition of two elements requires the modulo 2 addition of the coefficients of the field elements.
• Squaring: The squaring of a field element $A = \sum_{i=0}^{m-1} a_i x^i$ is ruled by the following equation: $A^2 \equiv \sum_{i=0}^{m-1} a_i x^{2i} \bmod F(x)$. Further details can be found in [19].
• Multiplication: We decided to use digit multipliers, introduced in [28] for fields $GF(2^m)$. This kind of multiplier allows a trade-off between speed, area and power consumption. It works by processing several multiplicand coefficients at the same time. The number of coefficients processed in parallel is the digit-size $D$. Given $D$, we denote by $d = \lceil m/D \rceil$ the total number of digits in a polynomial of degree $m - 1$. Hence, $C \equiv AB \bmod F(x) = A \sum_{i=0}^{d-1} B_i x^{Di} \bmod F(x)$.
• Inversion: The inversion is computed using the algorithm proposed in [5]. It is based on a modification of Euclid's algorithm for computing the gcd of two polynomials. The asymptotic complexity is linear with the modulus both in time and area.
Table 1. Components of the coprocessor: area and time.
Component    Area                                          Total number of gates    Latency [clock cycles]
Add          [m] XOR                                       [m]                      1
Sqr [19]     [m+t+1] XOR                                   [m+t+1]                  1
Mult [28]    [D·m] AND & [D·m] XOR                         [2Dm]                    ⌈m/D⌉
Inv [5]      [6·m + log2 m] AND & [6·m + log2 m] XOR       [2(6m + log2 m)]         2·m
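
The addition, squaring and digit-serial multiplication rules listed above can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware design of the paper: it uses the toy field $GF(2^8)$ with $F(x) = x^8 + x^4 + x^3 + x + 1$ instead of the 81-bit field (whose polynomial is not given here), stores polynomials as Python integers, and all function names are hypothetical. Each loop iteration of `digit_serial_mul` consumes one $D$-bit digit of $B$, which mirrors the $\lceil m/D \rceil$ latency of Table 1.

```python
M = 8          # toy field degree (assumption; the paper uses an 81-bit field)
POLY = 0x11B   # F(x) = x^8 + x^4 + x^3 + x + 1 (assumption)


def reduce_poly(value, m=M, poly=POLY):
    """Reduce a binary polynomial (stored as an int) modulo F(x)."""
    for bit in range(value.bit_length() - 1, m - 1, -1):
        if (value >> bit) & 1:
            value ^= poly << (bit - m)
    return value


def clmul(a, b):
    """Carry-less (polynomial) product of a and b, without reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def add(a, b):
    """Field addition: coefficient-wise addition modulo 2."""
    return a ^ b


def square(a, m=M):
    """A^2 = sum a_i x^(2i) mod F(x): spread the coefficients, then reduce."""
    spread = 0
    for i in range(m):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
    return reduce_poly(spread)


def digit_serial_mul(a, b, digit_size, m=M):
    """C = A * sum_i B_i x^(D*i) mod F(x), one D-bit digit of B per iteration."""
    d = -(-m // digit_size)                 # number of digits, ceil(m / D)
    mask = (1 << digit_size) - 1
    acc = 0
    for i in reversed(range(d)):            # most significant digit first
        digit = (b >> (i * digit_size)) & mask
        acc = reduce_poly((acc << digit_size) ^ clmul(a, digit))
    return acc


print(hex(digit_serial_mul(0x57, 0x83, 4)))
```

Changing `digit_size` changes only the number of loop iterations, not the result, which is the speed/area knob the paper explores in hardware.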
In Table 1 we give the area and latency for each arithmetic component we used. The given estimates assume 2-input gates and optimum field polynomials $F(x) = x^m + \sum_{i=0}^{t} f_i x^i$, where $m - t \geq D$.
5 Methodology
In this section we briefly describe our approach to find the best suited architecture for the HECC coprocessor.
1. Input: First we evaluated the most recent findings regarding the group operation of HECC. The given formulae were then prepared for the scheduler.
2. Scheduler: Our own software library, especially developed to schedule the HECC group operations, is the heart of our methodology. The scheduler is based on the method known as Operation Scheduling [8] and works according to the As Soon As Possible (ASAP) policy. There is a list of operations that should be executed by the architecture. The scheduler takes one operation at a time and searches for the earliest time slot where the operation can be executed. It is constrained by the number of available resources and by the different times required to execute each operation. The same methodology is used by compilers for scheduling machine instructions. It should be noted that this methodology is heuristic and does not guarantee optimal results. To reach the optimal scheduling it is necessary to use other methods instead, see [8]. The scheduler has the following parameters:
• HECC formulae.
• Implementation method for addition, inversion, multiplication and squaring.
• Number of multiplication units.
• Different digit-sizes for the multiplier.
• Properties of the bus.
• Memory access time.
3. Testing: The results of the scheduler were tested by applying test vectors. In order to do so we implemented HECC group operations with the NTL library [26].
4. Analysis: The results were analyzed and, if needed, the input architecture was changed, in order to find a better structure for the coprocessor.
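
The scheduling step described in item 2 can be sketched as a greedy ASAP list scheduler. The code below is a simplified illustration, not the authors' software library: the `Op` class, the function names and the four-operation example are hypothetical, the example is not taken from the explicit formulae of [22], and idle gaps left on a unit are never back-filled. The latencies follow Table 1 for m = 81 and D = 8, and the restriction that only one field operation may start per clock cycle follows Section 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str          # "mul", "sqr", "add" or "inv"
    deps: tuple = ()   # names of operations whose results are needed


def asap_schedule(ops, latency, units):
    """Greedy ASAP list scheduling.

    ops     -- operations listed in a dependency-respecting order
    latency -- clock cycles per unit type
    units   -- number of instances per unit type
    Returns {name: (start, end)}.
    """
    free_at = {u: [0] * n for u, n in units.items()}   # next free cycle per instance
    issue_cycles = set()   # at most one operation may start per cycle (Section 4)
    schedule = {}
    for op in ops:
        ready = max((schedule[d][1] for d in op.deps), default=0)
        pool = free_at[op.unit]
        idx = min(range(len(pool)), key=pool.__getitem__)  # instance that frees first
        start = max(ready, pool[idx])
        while start in issue_cycles:
            start += 1
        end = start + latency[op.unit]
        pool[idx] = end
        issue_cycles.add(start)
        schedule[op.name] = (start, end)
    return schedule


def makespan(schedule):
    return max(end for _, end in schedule.values())


# Hypothetical mini-formula: t4 = (a*b + c*d)^2
OPS = [
    Op("t1", "mul"),
    Op("t2", "mul"),
    Op("t3", "add", ("t1", "t2")),
    Op("t4", "sqr", ("t3",)),
]
LATENCY = {"mul": 11, "sqr": 1, "add": 1, "inv": 162}   # Table 1, m = 81, D = 8

one = asap_schedule(OPS, LATENCY, {"mul": 1, "sqr": 1, "add": 1, "inv": 1})
two = asap_schedule(OPS, LATENCY, {"mul": 2, "sqr": 1, "add": 1, "inv": 1})
print(makespan(one), makespan(two))
```

With a second multiplier the two multiplications run almost in parallel, the second one starting a cycle later because of the single issue slot; the same mechanism is what limits the gains reported for three and four multipliers.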
Traditionally, when evaluating the performance of cryptographic implementations, emphasis was put first on the throughput of the implementation and, second, on the amount of hardware resources consumed to achieve the desired throughput. We will consider both the hardware requirements and the time constraints of the cryptographic application. Hence, we are going to use the area-time product, and the optimal implementation will reach the highest throughput while consuming the smallest area.
We designed an architecture for the HECC coprocessor using different design criteria. We varied the number of multipliers (one, two, three and four), as well as their digit-size D (D = 2, 4, 8, 16). Hence, we changed different architecture options, the processing power of the whole coprocessor and of each individual multiplier. Increasing the processing power yields a speed-up of the group operation, but also causes a growth in area. Thus, there must exist an optimum architecture, where the area-time product is minimal.
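
The speed/area tension behind this search can be illustrated for the multipliers alone with the per-component estimates of Table 1. The sketch below is only a back-of-the-envelope aid: the function name `multiplier_cost` is hypothetical, m = 81 is taken from Section 4, and the area of the adder, squarer, inverter and register file is deliberately left out.

```python
from math import ceil

M = 81  # field size in bits (Section 4)


def multiplier_cost(digit_size, count=1, m=M):
    """Return (latency in cycles, total gates) for `count` digit multipliers."""
    latency = ceil(m / digit_size)        # Table 1: ceil(m / D) clock cycles
    gates = count * 2 * digit_size * m    # Table 1: 2*D*m gates per multiplier
    return latency, gates


for n in (1, 2, 3, 4):
    for d in (2, 4, 8, 16):
        print(n, d, multiplier_cost(d, n))
```

Doubling D doubles the gate count of a multiplier but roughly halves its latency, whereas adding a multiplier doubles the gates and only helps if the schedule contains independent multiplications.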
Our goal is to implement a simple controller computing group operations with a fixed execution order. Hence, we look at a static schedule. The alternative would be to implement a finite state machine executing the schedule directly, controlling the availability of resources and deciding which operation should be executed. This solution is feasible but we consider it too complex and expensive compared to a simple controller executing the operations in a fixed order.
Recent research in hyperelliptic curve cryptosystems provides a large number of explicit formulae to choose from for computing group operations. We chose the fastest ones available to date, as given in [22].
6 Results
In this section the results of the scheduling methodology are presented. Our results are based on the following considerations: i) we chose a set of keys to schedule the operations; ii) we scheduled group addition and doubling according to the values of the key; iii) we considered a long sequence of concatenated group operations. If one implements scalar multiplication using the double-and-add algorithm, the following consecutive group operations should be scheduled: addition after doubling, doubling after doubling and doubling after addition.
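
Why exactly these three pairs occur can be seen from a short sketch of the double-and-add operation sequence. The code assumes a left-to-right variant that performs one doubling per key bit and one addition per bit equal to 1; this convention and the function names are assumptions made for illustration, and other variants skip the leading doubling.

```python
from collections import Counter


def group_op_sequence(k):
    """Left-to-right double-and-add: "D" = doubling, "A" = addition."""
    ops = []
    for bit in bin(k)[2:]:
        ops.append("D")        # one doubling per key bit
        if bit == "1":
            ops.append("A")    # one addition per bit equal to 1
    return ops


def consecutive_pairs(ops):
    """Count how often each (previous, next) pair of group operations occurs."""
    return Counter(zip(ops, ops[1:]))


ops = group_op_sequence(0b1011)
pairs = consecutive_pairs(ops)
print("".join(ops))
print(dict(pairs))
```

Since an addition is always followed by the doubling of the next bit, the pair addition after addition never appears, which is why only three cases need to be scheduled.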
We examined all the cases in order to find out whether we could schedule the second group operation in a way to gain speed and achieve a higher hardware utilization. Our results show that addition is always scheduled in the same way. As for doubling, there are two different ways to execute it, depending on the previous operation. When doubling is executed after another doubling, we can gain speed; thus we decided to integrate two options for computing doubling. We have to allow negligible extra hardware for the controller to decide which option to choose.
6.1 Time Requirements of the Group Operations
Tables 2 and 3 show the clock cycles necessary for performing a group operation in the different system configurations. Table 3 shows two different latency figures because the time necessary to perform doubling depends on the operation executed before. The leftmost figure in each cell of Table 3 is the number of clock cycles necessary to compute doubling after doubling. The rightmost one is the latency of doubling after addition.
One can see that by increasing the digit-size and the number of multipliers, the time necessary to execute a group operation decreases. For group addition our results show that the performance does not increase in some cases, even when augmenting the resources of the system. For example, focusing on the speed of addition, using three rather than four digit-size multipliers with D = 8 (Table 2, third row) does not make any performance difference. The same behavior can be observed for 2, 3 and 4 multipliers with digit-size 16. This is due to the structure of the group operation, which shows no additional parallelism.
Focusing on doubling, there is almost no performance gain in moving from 3 to 4 multipliers, no matter of what kind (Table 3, rightmost two columns). In addition, there is no performance gain in providing 2, 3 or 4 multipliers of digit-size 8 or 16. Taking only performance into account, one concludes from Tables 2 and 3 that the design option using one inverter, two multipliers (D = 16), one adder and one squarer is preferable. The architecture can perform group addition and doubling in 289 and 248 clock cycles, respectively. As stated in Section 5, the ASAP scheduling policy does not guarantee an optimum solution; this is evident in the case of 3 or 4 multipliers with D = 16, where the scheduling of doubling after addition is worse than in the case with two multipliers.
Table 2. Clock cycles per group addition.
Digit-size    Number of multipliers
              1       2       3       4
2             1259    739     664     635
4             739     479     444     435
8             479     349     335     335
16            356     289     288     288
Table 3. Clock cycles per group doubling.
Digit-size    Number of multipliers
              1            2            3            4
2             724 / 846    486 / 560    458 / 490    458 / 484
4             464 / 526    346 / 380    338 / 350    338 / 344
8             334 / 366    278 / 286    278 / 279    278 / 279
16            274 / 284    248 / 247    248 / 250    248 / 250
6.2 Parallel Computation of Group Operations
In the case of genus-2 HEC each group operation produces as output one monic polynomial of degree two and one polynomial of degree at most one. Hence, the output consists of 4 coefficients, namely four field elements. They are neither produced at the same time nor are all necessary to start the next group operation right away. This means that when one field element (one coefficient) is computed, it can be used by the next group operation. We measure the overlapping of two group operations as the difference between the time when the last field operation of the former group operation ends execution and the time when the first field operation of the latter starts.
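
Under the overlap definition given above, the time for two consecutive group operations can be estimated as the sum of their latencies minus the overlap. The short sketch below applies this to one configuration; it assumes that each latency in Tables 2 and 3 is measured from the start of the operation's first field operation to the end of its last one, which the paper does not state explicitly, and the function name is hypothetical.

```python
def combined_latency(t_former, t_latter, overlap):
    """Cycles for two back-to-back group operations overlapping by `overlap` cycles."""
    return t_former + t_latter - overlap


# One multiplier, D = 8: addition (Table 2), doubling after addition (Table 3),
# and the overlap of doubling after addition (Table 4).
total = combined_latency(479, 366, 93)
print(total)
```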
Table 4 shows the overlapping for doubling after addition. Overlapping decreases as the speed and the number of multipliers increase. Similar behavior has been observed in the other cases (doubling after doubling and addition after doubling). This decrease is due to the increase of the parallelism in the tail of the operation.
6.3 Register Allocation
Schedulers fall into two broad families: unconstrained or resource limited. We chose to bound the number of buses and arithmetic units, while the number of registers is unbounded. Once the operations are scheduled, we count the number of live registers and compute the register allocation. This option is the simplest solution and gives good results. Each register stores a field element of 81 bits. Simulations demonstrate that changing the other parameters of the system has a low impact on the number of required registers. The system needs 18 and 20 registers in the best and worst case, respectively.
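
Counting live registers after scheduling amounts to finding the maximum number of values that are alive at the same time. The following minimal sketch shows one way to do this with a sweep over lifetime intervals; the function name and the example lifetimes are hypothetical and do not come from the paper's schedules.

```python
def max_live_registers(lifetimes):
    """lifetimes: list of (first_cycle, last_cycle) per value, both inclusive."""
    events = []
    for start, end in lifetimes:
        events.append((start, 1))       # value becomes live
        events.append((end + 1, -1))    # register is free after the last use
    live = peak = 0
    for _, delta in sorted(events):     # at equal times, frees sort before allocations
        live += delta
        peak = max(peak, live)
    return peak


# Hypothetical lifetimes of four intermediate values
example = [(0, 11), (0, 12), (12, 13), (13, 14)]
print(max_live_registers(example))
```

When two group operations overlap, the lifetimes of both operations enter the same sweep, which is why the peak grows compared with a single operation.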
If a designer wants to lower the number of required registers, the number of registers can be traded for additional latency. In order to do so, the designer should avoid overlapping and start a group operation only when the previous one has finished. We noted that a single group operation uses from 8 to 10 registers, while the maximum register number is reached when two group operations overlap.
Table 4. Overlapping in clock cycles of doubling after addition.
Digit-size    Number of multipliers
              1      2      3      4
2             333    172    169    140
4             173    92     89     80
8             93     48     48     48
16            50     30     33     33
6.4 Evaluation of Different Architecture Options
In this section we report and compare the different architectures in terms of latency and area. In order to do so, we listed the use of the different multipliers as a percentage of the total time of one group operation. Thus, the figures shown in Tables 5 and 6 are computed as follows: $\frac{\#mul \cdot t_{mul}}{t_{group}}$, where $\#mul$ is the number of multiplications executed by one multiplier unit during one group operation, $t_{mul}$ is the time needed for one multiplication and $t_{group}$ is the total execution time of the group operation.
In an ideal scenario all the multipliers should be used uniformly. However, one can see that the fourth multiplier, and in some cases also the third multiplier, are used very infrequently (see Table 6, the columns corresponding to the 3rd and 4th multiplier). Hence, for most applications it will be unreasonable to provide these extra hardware units.
In Table 8 we show a comparison to find the optimal architecture. The optimal implementation will achieve the highest throughput while consuming the smallest area (contrary to some traditional cryptographic implementations, where only best performance was evaluated). The analysis uses the normalized area-time product (with respect to the lowest area-time product). Table 8 shows that the architecture using one inversion, one multiplication (D = 8), one addition and one squaring achieves the best area-time product.
To evaluate the latency of a complete scalar multiplication kD, reported in Table 7, we examined it in an average case, where the integer k of 160 bits has half of its bits equal to 1 and the rest equal to 0. This means that 80 additions and 160 doublings were performed. Half of the 160 doublings are computed after another doubling and half after an addition.
We supposed that all the different configurations of the system always work at the same frequency. This is a worst-case assumption. In fact, usually the multiplier unit dominates the frequency, and a smaller digit-size will yield a higher clock frequency and thus speed up the system.
It should be noted that we omitted the register file area in the estimation. We decided to do so after noting that the required number of registers is almost the same in all the configurations, and that the area consumed by a register can vary depending on the implementation technology.
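
The utilization formula of this section can be checked against the single-multiplier configurations, where one unit executes all multiplications of a group operation. The sketch below makes two assumptions that are not stated in the paper: the multiplication time is taken as $\lceil m/D \rceil$ cycles from Table 1 with m = 81, and percentages are truncated rather than rounded. Under these assumptions the computed values agree with the single-multiplier entries of Tables 5 and 6; the names used are hypothetical.

```python
from math import ceil

M = 81
MULTS = {"addition": 21, "doubling": 9}   # multiplications per group operation (Section 3.2)
T_GROUP = {                               # single-multiplier columns of Tables 2 and 3
    "addition": {2: 1259, 4: 739, 8: 479, 16: 356},
    "doubling": {2: 724, 4: 464, 8: 334, 16: 274},   # doubling after doubling
}


def utilization_tenths(n_mul, digit_size, t_group, m=M):
    """Multiplier utilization in tenths of a percent, truncated (assumption)."""
    t_mul = ceil(m / digit_size)          # assumed multiplication time, Table 1
    return 1000 * n_mul * t_mul // t_group


for op, cycles in T_GROUP.items():
    for d, t_group in cycles.items():
        tenths = utilization_tenths(MULTS[op], d, t_group)
        print(op, d, f"{tenths // 10}.{tenths % 10} %")
```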
Table 5. Use of the multiplier as a percentage of the total time of group addition.
Digit-size    Number of      Multiplier
              multipliers    1st       2nd       3rd       4th
2             1              68.3 %
              2              61.0 %    55.4 %
              3              55.5 %    43.2 %    30.8 %
              4              51.6 %    45.1 %    25.8 %    12.9 %
4             1              59.6 %
              2              48.2 %    43.8 %
              3              42.5 %    33.1 %    23.6 %
              4              38.6 %    33.7 %    19.3 %    9.6 %
8             1              48.2 %
              2              34.6 %    31.5 %
              3              32.8 %    22.9 %    13.1 %
              4              32.8 %    19.7 %    13.1 %    3.2 %
16            1              35.3 %
              2              22.8 %    20.7 %
              3              22.9 %    16.6 %    4.1 %
              4              22.9 %    16.6 %    4.1 %     0 %
Table 6. Use of the multiplier as a percentage of the total time of group doubling.
Digit-size    Number of      Multiplier
              multipliers    1st       2nd       3rd       4th
2             1              50.9 %
              2              42.1 %    33.7 %
              3              35.8 %    35.8 %    8.9 %
              4              35.8 %    35.8 %    8.9 %     0 %
4             1              40.7 %
              2              30.3 %    24.2 %
              3              24.8 %    24.8 %    6.2 %
              4              24.8 %    24.8 %    6.2 %     0 %
8             1              29.6 %
              2              19.7 %    15.8 %
              3              15.8 %    15.8 %    3.9 %
              4              15.8 %    15.8 %    3.9 %     0 %
16            1              19.7 %
              2              12 %      9.6 %
              3              12 %      9.6 %     0 %
              4              12 %      9.6 %     0 %       0 %
Table 7. Latency estimation in clock cycles.
Digit-size    Number of multipliers
              1         2         3         4
2             165987    113948    100345    99836
4             106527    80228     74625     74136
8             76797     63524     61844     61844
16            62969     56216     56139     56139
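
For orientation, the average-case operation mix of Section 6.4 (80 additions, 80 doublings after a doubling and 80 doublings after an addition) can be turned into a simple latency sum. The sketch below assumes that group operations run strictly back to back with no overlap, so it yields an upper estimate rather than the values of Table 7; how Table 7 was derived is not spelled out in the paper, and the function name is hypothetical.

```python
def scalar_mult_cycles_no_overlap(t_add, t_dbl_after_dbl, t_dbl_after_add,
                                  n_add=80, n_dbl=160):
    """Sum of group-operation latencies for the average-case 160-bit scalar."""
    half = n_dbl // 2   # half of the doublings follow a doubling, half an addition
    return n_add * t_add + half * t_dbl_after_dbl + half * t_dbl_after_add


# One multiplier, D = 8: 479 cycles per addition (Table 2),
# 334 / 366 cycles per doubling (Table 3).
estimate = scalar_mult_cycles_no_overlap(479, 334, 366)
print(estimate)
```

For this configuration the no-overlap sum is larger than the 76797 cycles listed in Table 7, which would be consistent with consecutive group operations being overlapped as described in Section 6.2.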
Table 8. Area-time product.
Digit-size    Number of multipliers
              1         2         3         4
2             1.3552    1.1148    1.1442    1.3000
4             1.0422    1.0447    1.2133    1.4454
8             1         1.2385    1.6063    2.0067
16            1.2277    1.8241    2.5487    3.2758
7 Conclusions and Further Research
We proposed for the first time an architecture for a HECC coprocessor using the recently developed explicit formulae for the group operations. Different options for the architecture were evaluated. These options differ in the kind (various digit-sizes) and number of multipliers.
We found that if resources are unbounded, the group addition and doubling operations of HECC execute in 288 and 248 clock cycles, respectively. However, we noted that using more than three multipliers does not help significantly, because additional multiplication units have very low utilization rates. For a realistic scenario the architecture using one inverter, one multiplier (with D = 8), one adder and one squarer achieves the best area-time product. In addition, we tested the possibility of overlapping group operations. In the case of doubling after addition, the two operations could be computed in parallel for up to 333 clock cycles. Finally we also analyzed the usage of registers, resulting in the necessity of 19 registers of 81 bits each.
In the future one should address the following: 1) usage of different policies to schedule group operations; 2) implement the register file by means of a conventional 32-bit; 3) use inversion-free formulae for HEC group operations, which could reach a higher degree of parallelism; 4) implementation using an FPGA, to examine the impact of the critical path of the operators on the throughput of the system. Hopefully our findings are of interest for the research community as well as for industry.
References
[1] A. J. Menezes and Y. H. Wu, and R. J. Zuccherato. An Ele
mentary Introduction to Hyperelliptic Curves. Personal cor
respondence, November 1996.
[2] A. Antola, G. Bertoni, L. Breveglieri, and P. Maistri. Paral
lel Architectures for Elliptic Curve CryptoProcessors over
Binary Extension Fields. In IEEE Midwest symposium on
circuit and system 03 — MWSCS03, 2003.
[3] M. Bednara, M. Daldrup, J. Shokrollahi, J. Teich, and
J. von zur Gathen. Reconfigurable Implementation of El
liptic Curve Crypto Algorithms. In The 9th Reconfigurable
Architectures Workshop (RAW02), 2002.
[4] N. Boston, T. Clancy, Y. Liow, and J. Webster. Genus Two
Hyperelliptic Curve Coprocessor. In J. c. K. K. B. S. Kaliski
and C. Paar, editors, Cryptographic Hardware and Embed
ded Systems — CHES 2002, volume LNCS 2523, pages
529–539. SpringerVerlag, 2002.
[5] H. Brunner, A. Curiger, and M. Hofstetter. On Computing
Multiplicative Inverses in GF(2m). IEEE Transactions on
Computers, 42:1010–1015, August 1993.
[6] D. Cantor. Computing in the Jacobian of a Hyperelliptic
Curve. In Mathematics of Computation, volume 48(177), pages
95–101, January 1987.
[7] P. Gaudry and R. Harley. Counting Points on Hyperelliptic
Curves over Finite Fields. In W. Bosma, editor, ANTS IV,
volume 1838 of Lecture Notes in Computer Science, pages
297–312, Berlin, 2000. Springer-Verlag.
[8] R. Govindarajan. Instruction Scheduling. In The Compiler
Design Handbook. CRC Press, 2003.
[9] N. Koblitz. Algebraic Aspects of Cryptography.
Springer-Verlag, Berlin, Germany, first edition, 1998.
[10] J. Kuroki, M. Gonda, K. Matsuo, J. Chao, and S. Tsujii.
Fast Genus Three Hyperelliptic Curve Cryptosystems.
The 2002 Symposium on Cryptography and Information
Security, Japan – SCIS 2002, Jan. 29 – Feb. 1, 2002.
[11] T. Lange. Efficient Arithmetic on Genus 2 Hyperelliptic
Curves over Finite Fields via Explicit Formulae.
Cryptology ePrint Archive, Report 2002/121, 2002.
http://eprint.iacr.org/.
[12] T. Lange. Inversion-Free Arithmetic on Genus 2
Hyperelliptic Curves. Cryptology ePrint Archive, Report 2002/147,
2002. http://eprint.iacr.org/.
[13] T. Lange. Weighted Coordinates on Genus 2 Hyperelliptic
Curves. Cryptology ePrint Archive, Report 2002/153, 2002.
http://eprint.iacr.org/.
[14] T. Lange. Formulae for Arithmetic on Genus 2
Hyperelliptic Curves, 2003. Available at
http://www.ruhr-uni-bochum.de/itsc/tanja/preprints.html.
[15] K. Matsuo, J. Chao, and S. Tsujii. Fast Genus Two
Hyperelliptic Curve Cryptosystems. In ISEC2001-31, IEICE, 2001.
[16] P. K. Mishra and P. Sarkar. Parallelizing Explicit Formula
for Arithmetic in the Jacobian of Hyperelliptic Curves. In
Advances in Cryptology – Asiacrypt 2003, volume LNCS.
Springer-Verlag, 2003.
[17] Y. Miyamoto, H. Doi, K. Matsuo, J. Chao, and S. Tsujii. A
Fast Addition Algorithm of Genus Two Hyperelliptic Curve.
In The 2002 Symposium on Cryptography and Information
Security – SCIS 2002, IEICE Japan, pages 497–502, 2002.
in Japanese.
[18] D. Mumford. Tata lectures on theta II. In Prog. Math.,
volume 43. Birkhäuser, 1984.
[19] G. Orlando and C. Paar. A High-Performance
Reconfigurable Elliptic Curve Processor for GF(2^m). In Ç. K. Koç
and C. Paar, editors, Cryptographic Hardware and
Embedded Systems – CHES 2000, volume LNCS 1965.
Springer-Verlag, 2000.
[20] J. Pelzl. Hyperelliptic Cryptosystems on Embedded
Microprocessor. Master’s thesis, Department of Electrical
Engineering and Information Sciences, Ruhr-Universitaet
Bochum, Bochum, Germany, September 2002.
[21] J. Pelzl, T. Wollinger, J. Guajardo, and C. Paar.
Hyperelliptic Curve Cryptosystems: Closing the Performance Gap to
Elliptic Curves. In Ç. K. Koç and C. Paar, editors,
Workshop on Cryptographic Hardware and Embedded Systems
– CHES 2003. Springer-Verlag, 2003.
[22] J. Pelzl, T. Wollinger, and C. Paar. High performance
arithmetic for hyperelliptic curve cryptosystems of genus
two. Cryptology ePrint Archive, Report 2003/212, 2003.
http://eprint.iacr.org/.
[23] J. Pelzl, T. Wollinger, and C. Paar. Low Cost Security:
Explicit Formulae for Genus-4 Hyperelliptic Curves. In Tenth
Annual Workshop on Selected Areas in Cryptography –
SAC 2003. Springer-Verlag, 2003.
[24] Y. Sakai and K. Sakurai. On the Practical Performance of
Hyperelliptic Curve Cryptosystems in Software
Implementation. In IEICE Transactions on Fundamentals of
Electronics, Communications and Computer Sciences, volume
E83-A, no. 4, pages 692–703, April 2000. IEICE Trans.
[25] Y. Sakai, K. Sakurai, and H. Ishizuka. Secure Hyperelliptic
Cryptosystems and their Performance. In Public Key
Cryptography, volume 1431 of Lecture Notes in Computer
Science, pages 164–181, Berlin, 1998. Springer-Verlag.
[26] V. Shoup. NTL: A Library for doing Number Theory (version
5.0c), 2001. http://www.shoup.net/ntl/index.html.
[27] N. Smart. On the performance of hyperelliptic
cryptosystems. In J. Stern, editor, Advances in Cryptology –
EUROCRYPT ’99, volume LNCS 1592, pages 165–175.
Springer-Verlag, 1999.
[28] L. Song and K. K. Parhi. Low-energy digit-serial/parallel
finite field multipliers. Journal of VLSI Signal Processing
Systems, 2(22):1–17, 1997.
[29] M. Takahashi. Improving Harley Algorithms for Jacobians
of Genus 2 Hyperelliptic Curves. In SCIS, IEICE Japan,
2002. in Japanese.
[30] N. Thériault. Index calculus attack for hyperelliptic curves
of small genus. In Advances in Cryptology – ASIACRYPT
’03, Berlin, 2003. Springer-Verlag. LNCS.
[31] T. Wollinger. Computer Architectures for Cryptosystems
Based on Hyperelliptic Curves. Master’s thesis, ECE
Department, Worcester Polytechnic Institute, Worcester,
Massachusetts, USA, May 2001.
[32] T. Wollinger and C. Paar. Hardware Architectures proposed
for Cryptosystems Based on Hyperelliptic Curves. In
Proceedings of the 9th IEEE International Conference on
Electronics, Circuits and Systems – ICECS 2002, volume III,
pages 1159–1163, September 15–18, 2002.
[33] T. Wollinger, J. Pelzl, V. Wittelsberger, C. Paar, G. Saldamli,
and Ç. K. Koç. Elliptic & hyperelliptic curves on embedded
µp. ACM Transactions in Embedded Computing Systems
(TECS), 2003. Special Issue on Embedded Systems and
Security.