Finding Optimum Parallel Coprocessor Design for Genus 2 Hyperelliptic Curve Cryptosystems.
ABSTRACT Hardware accelerators are often used in cryptographic applications for speeding up the highly arithmetic-intensive public-key primitives, e.g. in high-end smart cards. One of these emerging and very promising public-key schemes is based on hyperelliptic curve cryptosystems (HECC). In the open literature only a few considerations deal with hardware implementation issues of HECC. Our contribution appears to be the first one to propose architectures for the latest findings in efficient group arithmetic on HEC. The group operation of HECC allows parallelization at different levels: bit-level parallelization (via different digit-sizes in multipliers) and arithmetic operation-level parallelization (via replicated multipliers). We investigate the trade-offs between both parallelization options and identify speed and time-area optimized configurations. We found that a coprocessor using a single multiplier (D=8) instead of two or more is best suited. This coprocessor is able to compute group addition and doubling in 479 and 334 clock cycles, respectively. Providing more resources it is possible to achieve 288 and 248 clock cycles, respectively.
Guido Bertoni and Luca Breveglieri
Politecnico di Milano, Italy
Thomas Wollinger and Christof Paar
Communication Security Group (COSY)
Ruhr-Universitaet Bochum, Germany
Keywords: hyperelliptic curve, hardware architecture, coprocessor, parallelism, genus 2, embedded processor.
All modern security protocols, such as IPSec, SSL and TLS, use symmetric-key as well as public-key cryptographic algorithms. Hardware accelerators are often used to provide the highly arithmetic-intensive public-key cryptographic primitives. An example is high-end smart cards, where a cryptographic coprocessor takes over all the expensive (in area and time) computations.
In practical applications the most used public-key algo-
rithms are RSA and Elliptic Curve Cryptosystems (ECC).
One emerging and very promising public-key scheme is
the HyperElliptic Curve Cryptosystem (HECC). HECC has
been analyzed and implemented only recently both in soft-
ware [10, 11, 15, 17, 20, 21, 23–25, 27, 33] and in more
hardware-oriented platforms such as FPGAs [4,31,32].
The work at hand presents, for the first time, an architecture for a HECC coprocessor considering the most recent explicit formulae to compute group operations. All of the previous work implementing HECC in hardware used the original Cantor algorithm, which is outdated. Furthermore, we present and evaluate different design options for the HECC coprocessor. In order to do so, we wrote software capable of scheduling the necessary operations, resulting in an optimal architecture with respect to area and speed. Parallelizing at the bit and arithmetic operation level we found that: 1) no more than three multiplier units are useful; 2) architectures implementing one inversion and one multiplication unit are the best choice; and 3) providing sufficient resources, group addition and doubling can be performed in 288 and 248 clock cycles, respectively. Moreover, we explored the overlapping of two group operations and we analyzed the usage of registers.
The rest of the paper is organized as follows. Section 2 summarizes previous work. Section 3 gives a brief overview of the mathematical background of HECC. Section 4 presents the architecture of the HECC coprocessor and Section 5 the methodology used. Finally, we end this contribution with a discussion of our results (Section 6) and some conclusions (Section 7).
This section gives a short overview of the hardware imple-
mentations targeting HECC and of the previous research
work to parallelize hardware ECC.
The first work discussing hardware architectures for the
implementation of HECC appeared in [31,32]. The authors
describe efficient architectures to implement the necessary
field operations and polynomial arithmetic in hardware. All
of the presented architectures are speed and area optimized.
In , they also estimated that for a hypothetical clock
frequency of 20 MHz, the scalar multiplication of HECC
would take 21.4 ms using the window NAF method.
In  the authors presented the first complete hardware
implementation of a hyperelliptic curve coprocessor. This
implementation targets a genus-2 HEC over $\mathbb{F}_{2^{113}}$. The target platform is a Xilinx II FPGA. Point addition and point doubling with a clock frequency of 45 MHz take 105 µs and 90 µs, respectively. The scalar multiplication could be computed in 10.1 ms.
Note that publications [4,31,32] adopt the Cantor algo-
rithm to compute group operations. Today, there exist more
efficient algorithms to compute group addition and group
doubling, the so-called explicit formulae (for more details
see Section 3.2).
In  the authors proposed a parallelization of the ex-
plicit group operation of HECC. They developed a gen-
eral methodology for obtaining parallel algorithms. The
methodology guarantees that the obtained parallel version
requires a minimum number of rounds. They show that for
the inversion free arithmetic  using 4, 8 and 12 multipli-
ers in parallel, scalar multiplication can be carried out in 27,
14 and 10 parallel rounds, respectively. When using affine
coordinates  and 8 multipliers it can be performed in 11
rounds, including an inversion round.
to use so many multipliers in parallel as stated in . The
work at hand attempts to consider not only the minimum
number of rounds (speed), but also the necessary devices
(area) as well as practical applications.
A similar work as that presented here for HECC can be
the number of operators and different coordinate systems is
presented. In  two scalar multiplications are scheduled
in parallel on the same architecture: the two operations are
executed in different coordinate systems to improve the use
of the operators. Note that the group operations of elliptic
curves are much less complex than those of the hyperelliptic
ones. In ECC the silicon area of the possible architecture is
easily bounded since the critical path can be computed by
hand, while in the case of HECC it is much more complex.
In this section we introduce briefly the theory of HECC,
restricting attention to the material relevant for this work
only. The interested reader is referred to [1, 9] for more
background on HECC.
3.1 Definition of HECC
Let $F$ be a finite field and let $\overline{F}$ be the algebraic closure of $F$. A hyperelliptic curve $C$ of genus $g \ge 1$ over $F$ is the set of the solutions $(x, y) \in \overline{F} \times \overline{F}$ to the following equation:
$$C : y^2 + h(x)\,y = f(x)$$
The polynomial $h(x) \in F[x]$ is of degree at most $g$ and $f(x) \in F[x]$ is a monic polynomial of degree $2g + 1$. For odd characteristic it suffices to let $h(x) = 0$ and to have $f(x)$ squarefree. Such a curve $C$ is said to be non-singular if there does not exist any pair $(x, y) \in \overline{F} \times \overline{F}$ satisfying the equation of the curve $C$ and the two partial differential equations $2y + h(x) = 0$ and $h'(x)\,y - f'(x) = 0$.
The so-called divisor $D$ is defined as follows: $D = \sum m_i P_i$, to be a formal weighted sum of points $P_i$ of the curve $C$ (and the integers $m_i$ are the weights), with the additional condition that $D^{\sigma} = \sum m_i P_i^{\sigma}$ is equal to $D$ for all the automorphisms $\sigma$ of $\overline{F}$ over $F$ (see  for details).
Divisors admit a reduced form. A reduced divisor can be represented as a pair of polynomials $u(x)$, $v(x)$ [18, page 3.17]. Reduced divisors can be added (group addition), e.g. $D_3 = D_1 + D_2$, or doubled (group doubling), e.g. $D_2 = 2D_1 = D_1 + D_1$, and hence the so-called scalar multiplication $kD = D + \cdots + D$ for $k$ times is defined. The scalar multiplication $kD$ is the basic operation of HECC, that we want to implement with a coprocessor.
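As a rough illustration of the scalar multiplication kD described in this section, the following Python sketch implements the left-to-right double-and-add method on top of two abstract callbacks. The names `scalar_multiply`, `double` and `add` are illustrative and do not come from the paper, and ordinary integers under addition stand in for reduced divisors so that the sketch runs without any curve arithmetic.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def scalar_multiply(
    k: int,
    base: T,
    double: Callable[[T], T],
    add: Callable[[T, T], T],
) -> Tuple[T, int, int]:
    """Left-to-right double-and-add; returns (k*base, #doublings, #additions)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    result = base              # consumes the leading 1 bit of k
    doublings = additions = 0
    for bit in bin(k)[3:]:     # remaining bits, most significant first
        result = double(result)
        doublings += 1
        if bit == "1":
            result = add(result, base)
            additions += 1
    return result, doublings, additions


# Stand-in group: the integers under addition (a hypothetical placeholder
# for divisor arithmetic on the Jacobian of a genus-2 curve).
value, n_dbl, n_add = scalar_multiply(
    0b1011, 7, double=lambda d: d + d, add=lambda a, b: a + b
)
print(value, n_dbl, n_add)  # 77 3 2
```

In a real implementation the two callbacks would be the group doubling and group addition on reduced divisors; the operation counts returned here are what a coprocessor schedule ultimately has to execute.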
The formulae given for the group operations (addition, doubling) of HEC by Cantor  can be rewritten in explicit form, thus resulting in more efficient arithmetic. The explicit formulae were first presented in . Starting with this finding, a considerable effort of different research groups has been put into finding more efficient operations. The group operations of genus-2 curves have been studied most intensively ([7, 11–15, 17, 22, 29]), but also group operations on genus-3 curves ([10, 20, 21, 33]) and even genus-4 curves ( ) have been considered.
In the work at hand, we target our HECC coprocessor for genus-2 curves using underlying fields of characteristic two. We used the up-to-date fastest explicit formulae, as presented in , where the authors introduced a group doubling requiring a single field inversion, 9 field multiplications and 6 field squarings. Group addition can be computed with 1 field inversion, 21 field multiplications and 3 field squarings.
3.3 Security of HECC
It is widely accepted that for most cryptographic applications based on EC or HEC, the necessary group is of order at least $\approx 2^{160}$. Thus, for HECC over $\mathbb{F}_q$, we must have at least $g \cdot \log_2 |\mathbb{F}_q| \approx 160$. In particular, we will need a field order $|\mathbb{F}_q| \approx 2^{80}$ for genus-2 curves. Even the very recent attack found by Thériault  shows no progress in attacks against genus-2 HEC.
4 Architecture of the HECC coprocessor
To implement the coprocessor we chose a standard archi-
tecture, see Figure 1. It contains a register file to store tem-
porary results and outputs. The size of each register was
chosen to be the dimension of the field, namely 81 bits. The
register file has two output ports to feed the operators and
one input port to receive the result. This guarantees feasi-
bility and ease of implementation. At any given clock cycle
only one field operation can start. If the operation is unary,
such as inversion, one bus remains idle.
Figure 1. Crypto-processor architecture.
The following list is a summary of how we implemented the field operations:
• Addition: The addition of two elements requires the modulo 2 addition of the coefficients of the field elements.
• Squaring: The squaring of a field element $A = \sum_{i=0}^{m-1} a_i x^i$ is ruled by the following equation: $A^2 \equiv \sum_{i=0}^{m-1} a_i x^{2i} \bmod F(x)$. Further details can be found in .
• Multiplication: We decided to use digit multipliers, introduced in  for fields $GF(2^m)$. This kind of multiplier allows a trade-off between speed, area and power consumption. It works by processing several multiplicand coefficients at the same time. The number of coefficients processed in parallel is the digit-size $D$. Given $D$, we denote by $d = \lceil m/D \rceil$ the total number of digits in a polynomial of degree $m - 1$. Hence, $C \equiv AB \bmod F(x) = A \sum_{i=0}^{d-1} B_i x^{Di} \bmod F(x)$.
• Inversion: The inversion is computed using the algorithm proposed in . It is based on a modification of Euclid’s algorithm for computing the gcd of two polynomials. The asymptotic complexity is linear with the modulus both in time and area.

Table 1. Components of the coprocessor: area and time.
[D · m] AND & [D · m] XOR
[6 · m + log2 m] AND & [6 · m + log2 m] XOR
[2(6m + log2 m)]
2 · m

In Table 1 we give the area and latency for each arithmetic component we used. The given estimates assume 2-input gates and optimum field polynomials $F(x) = x^m + \sum_{i=0}^{t} f_i x^i$, where $m - t \ge D$.
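The field operations listed in this section can be illustrated with a small software model. The sketch below is only a bit-level Python model of the arithmetic, not the hardware described in the paper. Field elements are integers whose bits are the polynomial coefficients. Because the 81-bit field polynomial is not given in the text, the example uses the well-known AES polynomial x^8 + x^4 + x^3 + x + 1 as a stand-in. All function names are hypothetical.

```python
def gf2_reduce(value: int, modulus: int) -> int:
    """Reduce a binary polynomial modulo F(x); both are given as bit masks."""
    m = modulus.bit_length() - 1
    while value.bit_length() > m:
        value ^= modulus << (value.bit_length() - 1 - m)
    return value


def gf2_add(a: int, b: int) -> int:
    """Addition: coefficient-wise addition modulo 2, i.e. XOR."""
    return a ^ b


def gf2_square(a: int, modulus: int) -> int:
    """Squaring: move coefficient a_i from x^i to x^(2i), then reduce."""
    spread, i = 0, 0
    while a >> i:
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
        i += 1
    return gf2_reduce(spread, modulus)


def num_digits(m: int, digit_size: int) -> int:
    """d = ceil(m / D), the number of digits of a degree-(m-1) polynomial."""
    return -(-m // digit_size)


def gf2_mul_digit_serial(a: int, b: int, modulus: int, digit_size: int) -> int:
    """Digit-serial multiplication: consume D coefficients of b per step."""
    m = modulus.bit_length() - 1
    mask = (1 << digit_size) - 1
    acc = 0
    for i in reversed(range(num_digits(m, digit_size))):  # top digit first
        digit = (b >> (i * digit_size)) & mask
        acc = gf2_reduce(acc << digit_size, modulus)       # acc * x^D
        partial = 0
        for j in range(digit_size):                        # A * B_i
            if (digit >> j) & 1:
                partial ^= a << j
        acc = gf2_reduce(acc ^ partial, modulus)
    return acc


AES_POLY = 0x11B  # stand-in modulus, NOT the field polynomial of the paper
for D in (1, 2, 4, 8):
    print(D, hex(gf2_mul_digit_serial(0x57, 0x83, AES_POLY, D)))  # always 0xc1
print(num_digits(81, 8))  # 11 digit steps for an 81-bit field with D = 8
```

The loop body corresponds to one step of a digit multiplier, so a larger D means fewer steps per multiplication at the cost of more logic per step. This is the speed and area trade-off the digit-size controls.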
In this section we briefly describe our approach to finding the best suited architecture for the HECC coprocessor.
1. Input: First we evaluated the most recent findings regarding the group operation of HECC. The given formulae were then prepared for the scheduler.
2. Scheduler: Our own software library, especially developed to schedule the HECC group operations, is the heart of our methodology. The scheduler is based on the method known as Operation Scheduling  and works according to the As Soon As Possible (ASAP) policy. There is a list of operations that should be executed by the architecture. The scheduler takes one operation at a time and searches for the earliest time slot where the operation can be executed. It is constrained by the number of available resources and by the different times required to execute each operation. The same methodology is used by compilers for scheduling machine instructions. It should be noted that this methodology is heuristic and does not guarantee optimal results. To reach the optimal scheduling it is necessary to use other methods instead, see . The scheduler has the following parameters:
• HECC formulae.
• Implementation method for addition, inversion, multiplication and squaring.
• Number of multiplication units.
• Different digit-sizes for the multiplier.
• Properties of the bus.
• Memory access time.
3. Testing: The results of the scheduler were tested by applying test vectors. In order to do so we implemented HECC group operations with the NTL library .
4. Analysis: The results were analyzed and, if needed, the input architecture was changed, in order to find a better structure for the coprocessor.
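The ASAP policy of step 2 can be sketched in a few lines of Python. This is a simplified, hypothetical model and not the authors' scheduling library: it takes operations in list order, waits for their operands, picks the unit of the required kind that becomes free first, and lets only one operation start per clock cycle (the constraint stated in Section 4). The data-flow graph and the latencies (11 cycles per multiplication, i.e. ceil(81/8) for D = 8, and 1 cycle per addition) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FieldOp:
    name: str                    # name of the produced value, e.g. "t1"
    unit: str                    # kind of unit required: "mul", "add", ...
    deps: Tuple[str, ...] = ()   # names of operations whose results are consumed


def asap_schedule(
    ops: List[FieldOp], latency: Dict[str, int], units: Dict[str, int]
) -> Tuple[Dict[str, int], int]:
    """Greedy ASAP list scheduling; returns (start cycle per op, total cycles)."""
    free_at = {kind: [0] * count for kind, count in units.items()}
    issue_cycles = set()   # only one operation may start in any clock cycle
    start: Dict[str, int] = {}
    finish: Dict[str, int] = {}
    for op in ops:         # ops must be listed in dependency order
        ready = max((finish[d] for d in op.deps), default=0)
        pool = free_at[op.unit]
        idx = min(range(len(pool)), key=pool.__getitem__)  # unit free first
        cycle = max(ready, pool[idx])
        while cycle in issue_cycles:
            cycle += 1
        issue_cycles.add(cycle)
        start[op.name] = cycle
        finish[op.name] = cycle + latency[op.unit]
        pool[idx] = finish[op.name]
    return start, max(finish.values())


# Toy data flow (hypothetical, not one of the HECC formulae): t3 = a*b + c*d.
ops = [
    FieldOp("t1", "mul"),
    FieldOp("t2", "mul"),
    FieldOp("t3", "add", deps=("t1", "t2")),
]
latency = {"mul": 11, "add": 1}   # assumed latencies in clock cycles
_, one_multiplier = asap_schedule(ops, latency, {"mul": 1, "add": 1})
_, two_multipliers = asap_schedule(ops, latency, {"mul": 2, "add": 1})
print(one_multiplier, two_multipliers)  # 23 13
```

Because each operation is placed greedily and never moved afterwards, the result depends on the order of the input list. This is one way to see why such a heuristic need not produce an optimal schedule.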
Traditionally, when evaluating the performance of cryp-
tographic implementations, emphasis was put first on the
throughput of the implementation and, second, on the
amount of hardware resources consumed to achieve the de-
sired throughput. We will consider both the hardware re-
quirements and the time constraints of the cryptographic
application. Hence, we are going to use the area-time prod-
uct and the optimal implementation will reach the highest
throughput consuming the smallest area.
We designed an architecture for the HECC coprocessor
using different design criteria. We varied the number of
multipliers (one, two, three and four), as well as their digit-
size D (D = 2,4,8,16). Hence, we changed different ar-
chitecture options, the processing power of the whole co-
processor and of each individual multiplier. Increasing the
processing power yields a speed up of the group operation,
but also causes a growth in area. Thus, there must exist an optimum architecture, where the area-time product is minimal.
Our goal is to implement a simple controller computing
group operations with a fixed execution order. Hence, we
look at a static schedule. The alternative would be to imple-
ment a finite state machine executing the schedule directly,
controlling the availability of resources and deciding which
operation should be executed. This solution is feasible but
we consider it as too complex and expensive compared to a
simple controller executing the operations in a fixed order.
Recent research in hyperelliptic curve cryptosystems
provides a large number of explicit formulae to choose from
for computing group operations. We chose the up-to-date
fastest ones, as are given in .
In this section the results of the scheduling methodology are presented. Our results are based on the following considerations: i) we chose a set of keys to schedule the operations; ii) we scheduled group addition and doubling according to the values of the key; iii) we considered a long sequence of concatenated group operations. If one implements scalar multiplication using the double-and-add algorithm, the following consecutive group operations should be scheduled: addition after doubling, doubling after doubling and doubling after addition.
We examined all the cases in order to find out whether we could schedule the second group operation in a way to gain speed and achieve a higher hardware utilization. Our results show that addition is always scheduled in the same way. As for doubling, there are two different ways to execute it, depending on the previous operation. When doubling is executed after another doubling, we can gain speed; thus we decided to integrate two options for computing doubling. We have to allow negligible extra hardware for the controller to decide which option to choose.
6.1 Time Requirements of the Group Operations
Tables 2 and 3 show the clock cycles necessary for performing a group operation in the different system configurations. Table 3 shows two different latency figures because the time necessary to perform doubling depends on the operation executed before. The leftmost figure in each cell of Table 3 is the number of clock cycles necessary to compute doubling after doubling. The rightmost instead is the latency of doubling after addition.
One can see that by increasing the digit-size and the
operation decreases. For group addition our results show
that the performance does not increase in some cases, even
when augmenting the resources of the system. For exam-
ple, focusing on the speed of addition, using three rather
than four digit-size multipliers with D = 8 (Table 2, third
row) does not make any performance difference. The same
behavior can be observed for 2, 3 and 4 multipliers with
digit-size 16. This is due to the structure of the group oper-
ation, that shows no additional parallelism.
Focusing on doubling, there is almost no performance gain in moving from 3 to 4 multipliers, regardless of their digit-size (Table 3, rightmost two columns). In addition, there is no performance gain in providing 2, 3 or 4 multipliers of digit-size 8 or 16. Taking only performance into account, one concludes from Tables 2 and 3 that the design option using one inverter, two multipliers (D = 16), one adder and one squarer is preferable. This architecture can perform group addition and doubling in 289 and 248 clock cycles, respectively. As stated in Section 5, the ASAP scheduling policy does not guarantee an optimum solution; this is evident in the case of 3 or 4 multipliers with D = 16, where the scheduling of doubling after addition is worse than in the case with two multipliers.
6.2 Parallel Computation of Group Operations
In the case of genus-2 HEC each group operation produces as output one monic polynomial of degree two and one polynomial of degree at most one. Hence, the output consists of 4 coefficients, namely four field elements. They are neither produced at the same time nor are all necessary to start the next group operation right away. This means that when one field element (one coefficient) is computed, it can be used by the next group operation. We measure the overlapping of two group operations as the difference of the time when the last field operation of the former group operation ends execution minus the time when the first of the latter starts.

Table 2. Clock cycles per group addition.
Digit-size   Number of multipliers

Table 3. Clock cycles per group doubling.
Digit-size   Number of multipliers
             1            2            3            4
2            724 / 846    486 / 560    458 / 490    458 / 484
4            464 / 526    346 / 380    338 / 350    338 / 344
8            334 / 366    278 / 286    278 / 279    278 / 279
16           274 / 284    248 / 247    248 / 250    248 / 250
Table 4 shows the overlapping for doubling after addition. Overlapping decreases as the speed and the number of multipliers increase. Similar behaviors have been observed in the other cases (doubling after doubling and addition after doubling). This decrease is due to the increase of the parallelism in the tail of the operation.
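The overlap measure defined in this section is straightforward to compute once a schedule is available. The following Python sketch assumes each group operation is given as a list of (start, end) clock-cycle pairs, one per field operation. The function name and the numbers are hypothetical, and clamping the result at zero is a choice of this sketch rather than something stated in the paper.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_cycle, end_cycle) of one field operation


def overlap_cycles(former: List[Interval], latter: List[Interval]) -> int:
    """End of the former group operation's last field operation minus the
    start of the latter's first field operation (clamped at zero)."""
    former_end = max(end for _, end in former)
    latter_start = min(start for start, _ in latter)
    return max(0, former_end - latter_start)


# Hypothetical schedules: the latter operation starts as soon as the first
# output coefficient of the former is available.
former = [(0, 11), (11, 22), (22, 23)]
latter = [(12, 23), (23, 34)]
print(overlap_cycles(former, latter))  # 11
```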
6.3 Register Allocation
Schedulers fall into two broad families: unconstrained or resource limited. We chose to bound the number of buses and arithmetic units, while the number of registers is left unbounded. Once the operations are scheduled, we count the number of live registers and compute the register allocation. This option is the simplest solution and gives good results. Each register stores a field element of 81 bits. Simulations demonstrate that changing the other parameters of the system has a low impact on the number of required registers. The system needs 18 and 20 registers in the best and worst case, respectively.
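Counting live registers after scheduling can be done with a simple sweep over value lifetimes. The sketch below is an assumption about how such a count might be implemented, not the authors' tool: each value is described by the cycle in which it is produced and the cycle of its last use, and the peak number of simultaneously live values gives the number of registers needed. The lifetimes are made-up numbers.

```python
from typing import List, Tuple


def max_live_registers(lifetimes: List[Tuple[int, int]]) -> int:
    """lifetimes: (defined_at, last_used_at) clock cycles, one pair per value."""
    events = []
    for born, last_use in lifetimes:
        events.append((born, 1))        # a register becomes occupied
        events.append((last_use, -1))   # the register is released
    # At equal cycles, releases sort before allocations, so a register freed
    # in cycle t can be reused by a value produced in cycle t.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


# Hypothetical lifetimes of four intermediate field elements.
print(max_live_registers([(0, 5), (1, 3), (3, 6), (5, 8)]))  # 2
```

Overlapping two group operations extends the set of values that are live at the same time, which is consistent with the observation that the maximum register count is reached when two group operations overlap.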
If a designer wants to lower the number of required registers, he can trade the number of registers for additional latency. In order to do so, he should avoid overlapping and start a group operation only when the previous has finished. We noted that a single group operation uses from 8 to 10 registers, while the maximum register number is reached when two group operations overlap.

Table 4. Overlapping in clock cycles of doubling after addition.
Digit-size   Number of multipliers
6.4 Evaluation of Different Architecture Options
In this section we report and compare the different architectures in terms of latency and area. In order to do so, we listed the use of the different multipliers as a percentage of the total time of one group operation. Thus, the figures shown in Tables 5 and 6 are computed as follows:
$$\frac{\#mul \cdot t_{mul}}{t_{group}}$$
where $\#mul$ is the number of multiplications executed by one multiplier unit during one group operation, $t_{mul}$ is the time needed for one multiplication and $t_{group}$ is the total execution time of the group operation.
In an ideal scenario all the multipliers should be used uniformly. However, one can see that the fourth multiplier, and in some cases also the third multiplier, are used very infrequently (see Table 6, the columns corresponding to the 3rd and 4th multiplier). Hence, for most applications it will be unreasonable to provide these extra hardware units.
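The utilization figure can be reproduced numerically under one assumption that the text does not state explicitly: that a digit-serial multiplication over the 81-bit field takes ceil(81/D) clock cycles. With that assumption, the 9 field multiplications of a group doubling on a single multiplier, and the doubling-after-doubling latencies of Table 3, give values that appear to agree with the surviving Table 6 entries for D = 4 and D = 16. The function names below are illustrative.

```python
import math

FIELD_BITS = 81  # register width stated in Section 4


def multiplier_latency(digit_size: int, field_bits: int = FIELD_BITS) -> int:
    """Assumed latency of one digit-serial multiplication: ceil(m / D) cycles."""
    return math.ceil(field_bits / digit_size)


def utilization(num_mults: int, digit_size: int, group_cycles: int) -> float:
    """#mul * t_mul / t_group for one multiplier unit."""
    return num_mults * multiplier_latency(digit_size) / group_cycles


# Single multiplier, group doubling (9 field multiplications), latencies for
# doubling after doubling taken from Table 3.
for digit_size, cycles in [(4, 464), (16, 274)]:
    print(digit_size, f"{100 * utilization(9, digit_size, cycles):.1f} %")
# 4 40.7 %
# 16 19.7 %
```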
In Table 8 we show a comparison to find the optimal ar-
chitecture. The optimal implementation will achieve the
highest throughput consuming the smallest area (contrary
to some traditional cryptographic implementations, where
only best performance was evaluated). The analysis uses
the normalized area-time product (with respect to the low-
est area-time product). Table 8 shows that the architecture
using one inversion, one multiplication (D = 8), one addi-
tion and one squaring achieves the best area-time product.
To evaluate the latency of a complete scalar multiplica-
tion kD, reported in Table 7, we examined it in an average
case, where the integer k of 160 bits has half of its bits equal
to 1 and the rest equal to 0. This means that 80 and 160 ad-
ditions and doublings were performed, respectively. Half of
the 160 doublings are computed after another doubling and
half after an addition.
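The average-case operation count above translates directly into a cycle estimate. The Python sketch below simply sums the group operation latencies and gives no credit for overlapping consecutive operations, so it should be read as a rough upper estimate that need not match the values of Table 7. The example uses the single-multiplier D = 8 figures quoted in the paper: 479 cycles per addition, and 334 and 366 cycles per doubling after doubling and after addition, respectively.

```python
def scalar_mult_cycles(
    add_cycles: int,
    dbl_after_dbl_cycles: int,
    dbl_after_add_cycles: int,
    key_bits: int = 160,
) -> int:
    """Average-case cycle count for kD, ignoring overlap between operations."""
    additions = key_bits // 2    # half of the key bits are equal to 1
    doublings = key_bits         # one doubling per key bit
    return (
        additions * add_cycles
        + (doublings // 2) * dbl_after_dbl_cycles
        + (doublings // 2) * dbl_after_add_cycles
    )


print(scalar_mult_cycles(479, 334, 366))  # 94320
```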
We assumed that all the different configurations of the system always work at the same frequency. This is a worst-case assumption. In fact, usually the multiplier unit dominates the frequency, and a smaller digit-size will yield a higher clock frequency and thus speed up the system.
It should be noted that we omitted the register file area
in the estimation. We decided to do so after noting that the
required number of registers is almost the same in all the
configurations, and that the area consumed by a register can
vary depending on the implementation technology.
Table 5. Use of the multiplier as a percentage of the total time of group addition.
Digit-size   Number of multipliers
2 68.3 %
61.0 % 55.4 %
55.5 % 43.2 %
51.6 % 45.1 %
4 59.6 %
48.2 % 43.8 %
42.5 % 33.1 %
38.6 % 33.7 %
34.6 % 31.5 %
32.8 % 22.9 %
32.8 % 19.7 %
16 35.3 %
22.8 % 20.7 %
22.9 % 16.6 %
22.9 % 16.6 %
- 30.8 %

Table 6. Use of the multiplier as a percentage of the total time of group doubling.
Digit-size   Number of multipliers
2 50.9 %
4 40.7 %
16 19.7 %

Table 7. Latency estimation in clock cycles.
Digit-size   Number of multipliers

Table 8. Area-time product.
Digit-size   Number of multipliers
7 Conclusions and Further Research
We proposed for the first time an architecture for a HECC
coprocessor using the recently developed explicit formulae
for the group operations. Different options for the architec-
ture were evaluated. These options differ in the kind (vari-
ous digit-sizes) and number of multipliers.
We found out that if resources are unbounded, the group addition and doubling operations of HECC execute in 288 and 248 clock cycles, respectively. However, we noted that using more than three multipliers does not help significantly, because additional multiplication units have very low utilization rates. For a realistic scenario the architecture using one inverter, one multiplier (with D = 8), one adder and one squarer achieves the best area-time product. In addition, we tested the possibility to overlap group operations. In the case of doubling after addition, the two operations can be computed in parallel for 333 clock cycles. Finally we also analyzed the usage of registers, which resulted in a requirement of 19 registers of 81 bits each.
In the future one should address the following: 1) usage of different policies to schedule group operations; 2) implement the register file by means of a conventional 32-bit; 3) use inversion-free formulae for HEC group operations, which could reach a higher degree of parallelism; 4) implementation using an FPGA, to examine the impact of the critical path of the operators on the throughput of the system. Hopefully our findings are of interest for the research community as well as for industry.
[1] A. J. Menezes, Y. H. Wu, and R. J. Zuccherato. An Elementary Introduction to Hyperelliptic Curves. Personal correspondence, November 1996.
[2] A. Antola, G. Bertoni, L. Breveglieri, and P. Maistri. Parallel Architectures for Elliptic Curve Crypto-Processors over Binary Extension Fields. In IEEE Midwest symposium on circuit and system 03 - MWSCS03, 2003.
[3] M. Bednara, M. Daldrup, J. Shokrollahi, J. Teich, and J. von zur Gathen. Reconfigurable Implementation of Elliptic Curve Crypto Algorithms. In The 9th Reconfigurable Architectures Workshop (RAW-02), 2002.
[4] N. Boston, T. Clancy, Y. Liow, and J. Webster. Genus Two Hyperelliptic Curve Coprocessor. In B. S. Kaliski Jr., Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2002, volume LNCS 2523, pages 529–539. Springer-Verlag, 2002.
[5] H. Brunner, A. Curiger, and M. Hofstetter. On Computing Multiplicative Inverses in GF(2^m). IEEE Transactions on Computers, 42:1010–1015, August 1993.
[6] D. Cantor. Computing in Jacobian of a Hyperelliptic Curve. In Mathematics of Computation, volume 48(177), pages 95–101, January 1987.
[7] P. Gaudry and R. Harley. Counting Points on Hyperelliptic Curves over Finite Fields. In W. Bosma, editor, ANTS IV, volume 1838 of Lecture Notes in Computer Science, pages 297–312, Berlin, 2000. Springer Verlag.
[8] R. Govindarajan. Instruction scheduling. CRC Press, the compiler design handbook edition, 2003.
[9] N. Koblitz. Algebraic Aspects of Cryptography. Springer-Verlag, Berlin, Germany, first edition, 1998.
[10] J. Kuroki, M. Gonda, K. Matsuo, J. Chao, and S. Tsujii. Fast Genus Three Hyperelliptic Curve Cryptosystems. In The 2002 Symposium on Cryptography and Information Security, Japan - SCIS 2002, Jan. 29-Feb. 1 2002.
[11] T. Lange. Efficient Arithmetic on Genus 2 Hyperelliptic Curves over Finite Fields via Explicit Formulae. Cryptology ePrint Archive, Report 2002/121, 2002.
[12] T. Lange. Inversion-Free Arithmetic on Genus 2 Hyperelliptic Curves. Cryptology ePrint Archive, Report 2002/147,
[13] T. Lange. Weighted Coordinates on Genus 2 Hyperelliptic Curves. Cryptology ePrint Archive, Report 2002/153, 2002.
[14] T. Lange. Formulae for Arithmetic on Genus 2 Hyperelliptic Curves, 2003. Available at http://www.ruhr-uni-
[15] K. Matsuo, J. Chao, and S. Tsujii. Fast Genus Two Hyperelliptic Curve Cryptosystems. In ISEC2001-31, IEICE, 2001.
[16] P. K. Mishra and P. Sarkar. Parallelizing Explicit Formula for Arithmetic in the Jacobian of Hyperelliptic Curves. In Advances in Cryptology - Asiacrypt 2003, volume LNCS.
[17] Y. Miyamoto, H. Doi, K. Matsuo, J. Chao, and S. Tsuji. A Fast Addition Algorithm of Genus Two Hyperelliptic Curve. In The 2002 Symposium on Cryptography and Information Security - SCIS 2002, IEICE Japan, pages 497–502, 2002.
[18] D. Mumford. Tata lectures on theta II. In Prog. Math., volume 43. Birkhäuser, 1984.
[19] G. Orlando and C. Paar. A High-Performance Reconfigurable Elliptic Curve Processor for GF(2^m). In Ç. K. Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2000, volume LNCS 1965. Springer-
[20] J. Pelzl. Hyperelliptic Cryptosystems on Embedded Microprocessor. Master’s thesis, Department of Electrical Engineering and Information Sciences, Ruhr-Universitaet Bochum, Bochum, Germany, September 2002.
[21] J. Pelzl, T. Wollinger, J. Guajardo, and C. Paar. Hyperelliptic Curve Cryptosystems: Closing the Performance Gap to Elliptic Curves. In Ç. K. Koç and C. Paar, editors, Workshop on Cryptographic Hardware and Embedded Systems - CHES 2003. Springer-Verlag, 2003.
[22] J. Pelzl, T. Wollinger, and C. Paar. arithmetic for hyperelliptic curve cryptosystems of genus two. Cryptology ePrint Archive, Report 2003/212, 2003.
[23] J. Pelzl, T. Wollinger, and C. Paar. Low Cost Security: Explicit Formulae for Genus-4 Hyperelliptic Curves. In Tenth Annual Workshop on Selected Areas in Cryptography - SAC 2003. Springer-Verlag, 2003.
[24] Y. Sakai and K. Sakurai. On the Practical Performance of Hyperelliptic Curve Cryptosystems in Software Implementation. In IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, volume E83-A NO.4, pages 692–703, April 2000. IEICE Trans.
[25] Y. Sakai, K. Sakurai, and H. Ishizuka. Secure Hyperelliptic Cryptosystems and their Performance. In Public Key Cryptography, volume 1431 of Lecture Notes in Computer Science, pages 164–181, Berlin, 1998. Springer-Verlag.
[26] V. Shoup. NTL: A library for doing Number Theory (version 5.0c), 2001. http://www.shoup.net/ntl/index.html.
[27] N. Smart. On the performance of hyperelliptic cryptosystems. In J. Stern, editor, Advances in Cryptology - EUROCRYPT ’99, volume LNCS 1592, pages 165–175. Springer-
[28] L. Song and K. K. Parhi. Low-energy digit-serial/parallel finite field multipliers. Journal of VLSI Signal Processing Systems, 2(22):1–17, 1997.
[29] M. Takahashi. Improving Harley Algorithms for Jacobians of Genus 2 Hyperelliptic Curves. In SCIS, IEICE Japan, 2002. in Japanese.
[30] N. Thériault. Index calculus attack for hyperelliptic curves of small genus. In Advances in Cryptology - ASIACRYPT ’03, Berlin, 2003. Springer Verlag. LNCS.
[31] T. Wollinger. Computer Architectures for Cryptosystems Based on Hyperelliptic Curves. Master’s thesis, ECE Department, Worcester Polytechnic Institute, Worcester, Massachusetts, USA, May 2001.
[32] T. Wollinger and C. Paar. Hardware Architectures proposed for Cryptosystems Based on Hyperelliptic Curves. In Proceedings of the 9th IEEE International Conference on Electronics, Circuits and Systems - ICECS 2002, volume III, pages 1159–1163, September 15-18 2002.
[33] T. Wollinger, J. Pelzl, V. Wittelsberger, C. Paar, G. Saldamli, and Ç. K. Koç. Elliptic & hyperelliptic curves on embedded µp. ACM Transactions in Embedded Computing Systems (TECS), 2003. Special Issue on Embedded Systems and Se-