
Time-Area Optimized Public-Key Engines: MQ-Cryptosystems as Replacement for Elliptic Curves?∗

Andrey Bogdanov, Thomas Eisenbarth, Andy Rupp, Christopher Wolf

Horst Görtz Institute for IT-Security

Ruhr-University Bochum, Germany

{abogdanov,eisenbarth,arupp}@crypto.rub.de,

chris@Christopher-Wolf.de or cbw@hgi.rub.de

Abstract

In this paper we investigate ways to efficiently implement public-key schemes based on Multivariate Quadratic polynomials (MQ-schemes for short). In particular, such schemes are claimed to resist quantum computer attacks. We show that these schemes can have a much better time-area product than elliptic curve cryptosystems. For instance, an optimised FPGA implementation of amended TTS is estimated to be over 50 times more efficient with respect to this parameter. Moreover, we propose a general framework for implementing small-field MQ-schemes in hardware which includes a systolic architecture performing Gaussian elimination over composite binary fields.

1 Introduction

Efficient implementations of public-key schemes play a crucial role in numerous real-world security applications: some require messages to be signed in real time (as in safety-enhancing automotive applications such as car-to-car communication), others deal with thousands of signatures to be generated per second (e.g. high-performance security servers using so-called HSMs, Hardware Security Modules). In this context, software implementations, even on high-end processors, often cannot provide the performance level needed; hardware implementations are thus the only option.

In this paper we explore approaches to implementing Multivariate Quadratic-based public-key systems in hardware that meet the requirements of efficient high-performance applications. The security of the public-key cryptosystems widely deployed at the moment is based on the difficulty of solving a small class of problems: the RSA scheme relies on the difficulty of factoring large integers, while the hardness of computing discrete logarithms provides the basis for ElGamal, the Diffie-Hellman scheme and elliptic curve cryptography (ECC). Given that the security of all public-key schemes used in practice relies on such a limited set of problems that are currently considered to be hard, research on new schemes based on other classes of problems is necessary: such work provides greater diversity and hence forces cryptanalysts to spend additional effort on completely new types of problems. Moreover, it makes sure that not all “crypto-eggs” are in one basket. In this context, we want to point out that important results on the potential weaknesses of existing public-key schemes are emerging. In particular, techniques for factorisation and for solving discrete logarithms improve continually. For example, polynomial-time quantum algorithms can be used to solve both problems. Therefore, the existence of quantum computers with a few thousand qubits

∗ This is a revised version of the original paper accepted for CHES 2008.

would be a real-world threat to systems based on factoring or the discrete logarithm problem. This

emphasises the importance of research into new algorithms for asymmetric cryptography.

One proposal for secure public-key schemes is based on the difficulty of solving Multivariate Quadratic equations (the MQ-problem) over finite fields F, i.e. of finding a solution vector x ∈ F^n for a given system of m polynomial equations in n variables each:

y_1 = p_1(x_1, . . . , x_n)
y_2 = p_2(x_1, . . . , x_n)
        ...
y_m = p_m(x_1, . . . , x_n),

for given y_1, . . . , y_m ∈ F and unknown x_1, . . . , x_n ∈ F. This problem is difficult in general, namely NP-complete. An overview of this field can be found in [14].

Roughly speaking, most work on public-key hardware architectures tries either to optimise the speed of a single instance of an algorithm (e.g., high-speed ECC or RSA implementations) or to build the smallest possible realization of a scheme (e.g., a lightweight ECC engine). A major goal in high-performance applications is, however, in addition to pure time efficiency, an optimised cost-performance ratio. In the case of hardware implementations, which are often the only solution in such scenarios, cost (measured in chip area and power consumption) is roughly proportional to the number of logic elements (gates, FPGA slices) needed. A major finding of this paper is that MQ-schemes have a better time-area product than established public-key schemes. Interestingly, this also holds in comparison to elliptic curve schemes, which have the reputation of being particularly efficient.

The ﬁrst public hardware implementation of a cryptosystem based on multivariate polynomials

we are aware of is [17], where enTTS is realized. A more recent result on the evaluation of hardware

performance for Rainbow can be found in [2].

1.1 Our Contribution

Our contribution is manifold. First, a clear taxonomy of secure multivariate systems and existing attacks is given. Second, we present a systolic architecture implementing Gauss-Jordan elimination over GF(2^k) which is based on the work in [13]. The performance of this central operation is important for the overall efficiency of multivariate-based signature systems. Then, a number of concrete hardware architectures with a low time-area product are presented. Here we address both rather conservative schemes such as UOV and more aggressively designed proposals such as Rainbow or amended TTS (amTTS). For instance, an optimised implementation of amTTS is estimated to have a TA-product over 50 times lower than some of the most efficient ECC implementations. Moreover, we suggest a generic hardware architecture capable of computing signatures for the wide class of multivariate polynomial systems based on small finite fields. This generic hardware design allows us to achieve a time-area product for UOV which is somewhat smaller than that for ECC, and considerably smaller for the short-message variant of UOV.

2 Foundations of MQ-Systems

In this section, we introduce some properties and notations useful for the remainder of this article.

After brieﬂy introducing MQ-systems, we explain our choice of signature schemes and give a brief

description of them.

[Figure: the signature x = (x_1, . . . , x_n) is derived from the message y via the private maps T, P′ and S (generation); verification uses the public polynomials (p_1, . . . , p_n).]

Figure 1: Graphical Representation of the MQ-trapdoor (S, P′, T)

2.1 Mathematical Background

Let F be a finite field with q := |F| elements and define Multivariate Quadratic (MQ) polynomials p_i of the form

p_i(x_1, . . . , x_n) := Σ_{1≤j≤k≤n} γ_{i,j,k} x_j x_k + Σ_{j=1}^{n} β_{i,j} x_j + α_i,

for 1 ≤ i ≤ m and α_i, β_{i,j}, γ_{i,j,k} ∈ F (constant, linear, and quadratic terms). We now define the polynomial-vector P := (p_1, . . . , p_m), which yields the public key of these Multivariate Quadratic systems. This public vector is used for signature verification. Moreover, the private key (cf. Fig. 1) consists of the triple (S, P′, T) where S ∈ Aff(F^n), T ∈ Aff(F^m) are affine transformations and P′ ∈ MQ(F^n, F^m) is a polynomial-vector P′ := (p′_1, . . . , p′_m) with m components; each component is in the variables x′_1, . . . , x′_n. Throughout this paper, we will denote components of this private vector P′ by a prime ′. The affine transformations S and T can be represented in the form of invertible matrices M_S ∈ F^{n×n}, M_T ∈ F^{m×m} and vectors v_S ∈ F^n, v_T ∈ F^m, i.e. we have S(x) := M_S x + v_S and T(x) := M_T x + v_T, respectively. In contrast to the public polynomial vector P ∈ MQ(F^n, F^m), our design goal is that the private polynomial vector P′ allows an efficient computation of x′_1, . . . , x′_n for given y′_1, . . . , y′_m. At least for secure MQ-schemes, this is not the case if only the public key P is given. The main difference between MQ-schemes lies in the special construction of the central equations P′ and consequently the trapdoor they embed into a specific class of MQ-problems.

In this kind of scheme, the public key P is computed as the function composition of the affine transformations S : F^n → F^n, T : F^m → F^m and the central equations P′ : F^n → F^m, i.e. we have P = T ◦ P′ ◦ S. To fix notation further, we note that P, P′ ∈ MQ(F^n, F^m), i.e. both are functions from the vector space F^n to the vector space F^m. By construction, we have ∀x ∈ F^n : P(x) = T(P′(S(x))).

2.2 Signing

To sign a given y ∈ F^m, we observe that we have to invert the computation of y = P(x). Using the trapdoor information (S, P′, T), cf. Fig. 1, this is easy. First, we observe that the transformation T is a bijection. In particular, we can compute y′ = T^{−1}(y) = M_T^{−1}(y − v_T). The same is true for given x′ ∈ F^n and S ∈ Aff(F^n). Using the LU-decomposition of the matrices M_S, M_T, these computations take time O(n^2) and O(m^2), respectively. Hence, the difficulty lies in evaluating x′ = P′^{−1}(y′). We will discuss strategies for different central systems P′ in Sect. 2.4.

2.3 Veriﬁcation

In contrast to signing, the veriﬁcation step is the same for all MQ-schemes and also rather cheap,

computationally speaking: given a pair x ∈ F

n

, y ∈ F

m

, we evaluate the p olynomials

p

i

(x

1

, . . . , x

n

) :=

X

1≤j≤k≤n

γ

i,j,k

x

j

x

k

+

n

X

j=1

β

i,j

x

j

+ α

i

,

for 1 ≤ i ≤ m; 1 ≤ j ≤ k ≤ n and given α

i

, β

i,j

, γ

i,j,k

∈ F. Then, we verify that p

i

= y

i

holds

for all i ∈ {1, . . . , m}. Obviously, all operations can be eﬃciently computed. The total number of

operations takes time O(mn

2

).
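As a concrete illustration, this verification step can be modelled in a few lines of software. The choice of GF(2^8) with the AES reduction polynomial 0x11B and the tiny example system are our own assumptions for illustration; the scheme itself only requires some field GF(2^k):

```python
# Minimal software model of MQ verification over GF(2^8).
# The reduction polynomial 0x11B (the AES polynomial) is an assumption;
# the paper only requires some composite binary field GF(2^k).

def gf_mul(a, b, poly=0x11B):
    """Carry-less multiplication modulo a degree-8 polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def eval_poly(gamma, beta, alpha, x):
    """Evaluate p_i(x) = sum_{j<=k} gamma[j][k] x_j x_k + sum_j beta[j] x_j + alpha."""
    n = len(x)
    acc = alpha
    for j in range(n):
        for k in range(j, n):
            acc ^= gf_mul(gamma[j][k], gf_mul(x[j], x[k]))
        acc ^= gf_mul(beta[j], x[j])
    return acc

def verify(pub, x, y):
    """Accept iff p_i(x) = y_i for all m public polynomials."""
    return all(eval_poly(g, b, a, x) == yi for (g, b, a), yi in zip(pub, y))
```

Per polynomial this is the O(n^2) double loop, giving the O(mn^2) total cost stated above.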

2.4 Description of the Selected Systems

Based on [14] and some newer results, we have selected the following candidates as suitable for efficient implementation of signature schemes: enhanced TTS, amended TTS, Unbalanced Oil and Vinegar, and Rainbow. Systems of the big-field classes HFE (Hidden Field Equations) and MIA (Matsumoto-Imai Scheme A) and the mixed-field class ℓIC (ℓ-Invertible Cycle) [8] were excluded, as results from their software implementation show that they cannot be implemented as efficiently as schemes from the small-field classes, i.e. enTTS, amTTS, UOV and Rainbow. The proposed schemes and parameters are summarised in Table 1.

Table 1: Proposed Schemes and Parameters

Scheme                        q     n    m    τ          K      Solver
Unbalanced Oil and            256   30   10   0.003922   10     1 × K = 10
  Vinegar (UOV)                     60   20              20     1 × K = 20
Rainbow                       256   42   24   0.007828   12     2 × K = 12
enhanced TTS (v1)             256   28   20   0.000153   9      2 × K = 9
             (v2)                             0.007828   10     2 × K = 10
amended TTS                   256   34   24   0.011718   4,10   1 × K = 4, 2 × K = 10

2.4.1 Unbalanced Oil and Vinegar (UOV).

p′_i(x′_1, . . . , x′_n) := Σ_{j=1}^{n−m} Σ_{k=j}^{n} γ′_{i,j,k} x′_j x′_k    for i = 1 . . . m

Unbalanced Oil and Vinegar schemes were introduced in [10, 11]. Here we have γ′_{i,j,k} ∈ F, i.e. the polynomials p′_i are over the finite field F. In this context, the variables x′_i for 1 ≤ i ≤ n − m are called the “vinegar” variables and x′_i for n − m < i ≤ n the “oil” variables. We also write o := m for the number of oil variables and v := n − m = n − o for the number of vinegar variables. To invert UOV, we assign random values to the vinegar variables x′_1, . . . , x′_v and obtain a linear system in the oil variables x′_{v+1}, . . . , x′_n. All in all, we need to solve an m × m system and hence have K = m. The probability that we do not obtain a solution for this system is

τ_UOV = 1 − Π_{i=0}^{m−1} (q^m − q^i) / q^{m^2},

as there are q^{m^2} matrices over the finite field F with q := |F| elements and Π_{i=0}^{m−1} (q^m − q^i) invertible ones [14].
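The τ values in Table 1 can be reproduced from this counting argument. The following check is our own script, not part of the original design; it uses exact rational arithmetic so the large products introduce no rounding error:

```python
from fractions import Fraction

def tau(sizes, q=256):
    """Probability that at least one of the random K x K systems
    (K in `sizes`) over GF(q) is singular:
    1 - prod over layers of (#invertible K x K matrices) / q^(K^2)."""
    p_all_ok = Fraction(1)
    for K in sizes:
        num = 1
        for i in range(K):
            num *= q**K - q**i       # count of invertible matrices
        p_all_ok *= Fraction(num, q**(K * K))
    return 1 - p_all_ok

print(round(float(tau([10])), 6))         # UOV, one 10x10 system: 0.003922
print(round(float(tau([12, 12])), 6))     # Rainbow, two 12x12 systems: 0.007828
print(round(float(tau([10, 4, 10])), 6))  # amTTS, 10x10 + 4x4 + 10x10: 0.011718
```

The multi-layer cases anticipate Sections 2.4.2 and 2.4.3, where the same per-layer formula is applied once per solver.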

Taking the currently known attacks into account, we derive the following secure choices of parameters for a security level of 2^80:

• Small datagrams: m = 10, n = 30, τ ≈ 0.003922 and one K = 10 solver
• Hash values: m = 20, n = 60, τ ≈ 0.003922 and one K = 20 solver

The security has been evaluated using the formula O(q^{v−m−1} m^4) = O(q^{n−2m−1} m^4). Note that the first version (i.e. m = 10) can only be used with messages of fewer than 80 bits. However, such datagrams occur frequently in applications with power or bandwidth restrictions, hence we have noted this special possibility here.
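As a quick sanity check of the 2^80 level, the formula can be evaluated directly. Big-O constants are ignored here, so these are ballpark figures from our own helper, not exact attack costs:

```python
import math

def uov_security_bits(q, n, m):
    """log2 of q^(n-2m-1) * m^4, the attack-complexity estimate
    used above (big-O constants ignored)."""
    return (n - 2 * m - 1) * math.log2(q) + 4 * math.log2(m)

print(round(uov_security_bits(256, 30, 10)))  # short-message UOV: 85
print(round(uov_security_bits(256, 60, 20)))  # hash-value UOV: 169
```

Both parameter sets thus clear the 2^80 threshold, the second with a large margin.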

2.4.2 Rainbow.

Rainbow is the name for a generalisation of UOV [7]. In particular, we do not have one layer, but

several layers. This way, we can reduce the number of variables and hence obtain a faster scheme

when dealing with hash values. The general form of the Rainbow central map is given below.

p′_i(x′_1, . . . , x′_n) := Σ_{j=1}^{v_l} Σ_{k=j}^{v_{l+1}} γ′_{i,j,k} x′_j x′_k    for i = v_l . . . v_{l+1}, 1 ≤ l ≤ L

We have the coefficients γ′_{i,j,k} ∈ F, the number of layers L ∈ N, and the vinegar splits v_1 < . . . < v_{L+1} ∈ N with n = v_{L+1}. To invert Rainbow, we follow the strategy for UOV — but now layer by layer, i.e. we pick random values for x_1, . . . , x_{v_1}, solve the first layer with a (v_2 − v_1) × (v_2 − v_1) solver for x_{v_1+1}, . . . , x_{v_2}, insert the values x_1, . . . , x_{v_2} into the second layer, solve the second layer with a (v_3 − v_2) × (v_3 − v_2) solver for x_{v_2+1}, . . . , x_{v_3}, and so on until the last layer L. All in all, we need to solve sequentially L systems of size (v_l − v_{l−1}) × (v_l − v_{l−1}) for l = 2 . . . L + 1. The probability that we do not obtain a solution for one of these systems is

τ_Rainbow = 1 − Π_{l=1}^{L} Π_{i=0}^{v_{l+1}−v_l−1} (q^{v_{l+1}−v_l} − q^i) / q^{(v_{l+1}−v_l)^2},

using a similar argument as in Sec. 2.4.1.

Taking the latest attack from [3] into account, we obtain the parameters L = 2, v_1 = 18, v_2 = 30, v_3 = 42 for a security level of 2^80, i.e. a two-layer scheme with 18 initial vinegar variables and 12 equations in the first layer, and 12 new vinegar variables and 12 equations in the second layer. Hence, we need two K = 12 solvers and obtain τ ≈ 0.007828.
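The layer-by-layer inversion order can be written down as a schedule. The helper below is our own illustration (not from the paper); it reproduces the solver sizes both for the Rainbow parameters above and for the amTTS-equivalent splits of the next section:

```python
def rainbow_schedule(v):
    """Given vinegar splits v_1 < ... < v_{L+1} (with n = v_{L+1}),
    return the inversion schedule: first assign v_1 random vinegar
    values, then for each layer solve a (v_{l+1}-v_l)-dimensional LSE
    for the next block of variables."""
    steps = [("randomize", list(range(1, v[0] + 1)))]
    for l in range(len(v) - 1):
        size = v[l + 1] - v[l]
        steps.append(("solve %dx%d LSE" % (size, size),
                      list(range(v[l] + 1, v[l + 1] + 1))))
    return steps

for action, vars_ in rainbow_schedule([18, 30, 42]):
    print(action, "-> x_%d..x_%d" % (vars_[0], vars_[-1]))
```

For [18, 30, 42] this yields one randomization step followed by two 12 × 12 solves, matching the two K = 12 solvers stated above.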

2.4.3 amended TTS (amTTS).

The central polynomials P′ ∈ MQ(F^n, F^m) for m = 24, n = 34 in amTTS [6] are defined as given below:

p′_i := x′_i + α′_i x′_{σ(i)} + Σ_{j=1}^{8} γ′_{i,j} x′_{j+1} x′_{11+(i+j mod 10)},    for i = 10 . . . 19;

p′_i := x′_i + α′_i x′_{σ(i)} + γ′_{0,i} x′_1 x′_i + Σ_{j=1}^{8} γ′_{i,j} x′_{15+(i+j+4 mod 8)} x′_{π(i,j)},    for i = 20 . . . 23;

p′_i := x′_i + γ′_{0,i} x′_0 x′_i + Σ_{j=1}^{9} γ′_{i,j} x′_{24+(i+j+6 mod 10)} x′_{π(i,j)},    for i = 24 . . . 33.

We have α′_i, γ′_{i,j} ∈ F and permutations σ, π, i.e. all polynomials are over the finite field F. We see that they are similar to the equations of Rainbow (Sec. 2.4.2) — but this time with sparse polynomials. Unfortunately, no further conditions on σ, π are given in [6] — we have hence picked suitable permutations for our implementation.

To invert amTTS, we follow the same ideas as for Rainbow — except that we have to invert a 10 × 10 system twice (i = 10 . . . 19 and i = 24 . . . 33) and a 4 × 4 system once, i.e. we have K = 10 and K = 4. Due to the structure of the equations, the probability of not obtaining a solution here is the same as for a 3-layer Rainbow scheme with v_1 = 10, v_2 = 20, v_3 = 24, v_4 = 34 variables, i.e. τ_amTTS = τ_Rainbow(10, 20, 24, 34) ≈ 0.011718.

[Figure: triangular systolic array with cells processing the matrix entries a_{1,1}, . . . , a_{m,m} and right-hand sides b_1, . . . , b_m, producing the solution x_1, . . . , x_m.]

Figure 2: Signature Core Building Block: Systolic Array LSE Solver (Structure)

2.4.4 enhanced TTS (enTTS).

The overall idea of enTTS is similar to amTTS, with m = 20, n = 28. For a detailed description of enTTS see [16, 15]. According to [6], enhanced TTS is broken; hence we do not advocate its use, nor do we give a detailed description in the main part of this article. However, it was implemented in [17], so we have included it here to allow the reader a comparison between the previous implementation and ours.

3 Building Blocks for MQ-Signature Cores

Considering Section 2 we see that in order to generate a signature using an MQ-signature scheme

we need the following common operations:

• computing affine transformations (i.e. vector addition and matrix-vector multiplication),

• (partially) evaluating multivariate polynomials over GF(2^k),

• solving linear systems of equations (LSEs) over GF(2^k).

In this section we describe the main computational building blocks for realizing these operations.

Using these generic building blocks we can compose a signature core for any of the presented

MQ-schemes (cf Section 4).

3.1 A Systolic Array LSE Solver for GF(2^k)

In 1989, Hochet et al. [9] proposed a systolic architecture for Gaussian elimination over GF(p). They considered an architecture of simple processors, used as systolic cells, that are connected in a triangular network. They distinguish two different types of cells: main array cells and the boundary cells of the main diagonal.

[Figure: pivot (boundary) cell consisting of a GF(2^k) inverter and a 1-bit register, with signals E_in, E_out, Cr_in and T_out.]

Figure 3: Pivot Cell of the Systolic Array LSE Solver

Wang and Lin followed this approach and proposed an architecture in 1993 [13] for computing inverses over GF(2^k). They provided two methods to efficiently implement the Gauss-Jordan algorithm over GF(2) in hardware. Their first approach was the classical systolic array approach, similar to that of Hochet et al. It features a critical path that is independent of the size of the array. A full solution of an m × m LSE is generated after 4m cycles, and every m cycles thereafter. The solution is computed in a serial fashion.

The other approach, which we call a systolic network, allows signals to propagate through the whole architecture in a single clock cycle. This reduces the initial latency to 2m clock cycles for the first result. Of course, the critical path now depends on the size of the whole array, slowing the design down for huge systems of equations. Systolic arrays can be derived from systolic networks by putting delay elements (registers) into the signal paths between the cells.

We followed the approach presented in [13] to build an LSE solver architecture over GF(2^k). The biggest advantage of systolic architectures with regard to our application is the small number of cells compared to other architectures like SMITH [4]. For solving an m × m LSE, a systolic array consisting of only m boundary cells and m(m + 1)/2 main cells is required.

An overview of the architecture is given in Figure 2. The boundary cells shown in Figure 3 mainly comprise one inverter, which is needed for pivoting the corresponding line. Furthermore, a single 1-bit register is needed to store whether a pivot was found. The main cells shown in Figure 4 comprise one GF(2^k) register, a multiplier and an adder over GF(2^k). Furthermore, a few multiplexers are needed. If the row is not yet initialised (T_in = 0), the entering data is multiplied with the inverse of the pivot (E_in) and stored in the cell. If the pivot was zero, the element is simply stored and passed to the next row in the next clock cycle. If the row is initialised (T_in = 1), the data element a_{i,j+1} of the entering line is reduced with the stored data element and passed to the following row. Hence, one can say that the k-th row of the array performs the k-th iteration of the Gauss-Jordan algorithm.
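In software, the behaviour of the array (pivot search and inversion in the boundary cells, multiply-and-reduce in the main cells, one array row per Gauss-Jordan iteration) corresponds to the sketch below. The field GF(2^8) with reduction polynomial 0x11B is our choice for illustration; it is not mandated by the design:

```python
def gf_mul(a, b, poly=0x11B):
    """GF(2^8) multiplication, reduction polynomial assumed to be 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def gf_inv(a):
    """Inverse via a^254 = a^-1 in GF(2^8) (square-and-multiply)."""
    r, e = 1, a
    for bit in range(8):        # 254 = 0b11111110
        if (254 >> bit) & 1:
            r = gf_mul(r, e)
        e = gf_mul(e, e)
    return r

def solve_lse(A, b):
    """Gauss-Jordan over GF(2^8); returns x with A x = b, or None if A is
    singular (mirroring the 'pivot not found' flag of the boundary cells)."""
    m = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(m):
        piv = next((r for r in range(col, m) if M[r][col]), None)
        if piv is None:
            return None         # unsolvable: a new LSE must be generated
        M[col], M[piv] = M[piv], M[col]
        inv = gf_inv(M[col][col])
        M[col] = [gf_mul(inv, v) for v in M[col]]   # boundary cell: pivoting
        for r in range(m):
            if r != col and M[r][col]:
                f = M[r][col]                       # main cells: reduce row
                M[r] = [v ^ gf_mul(f, w) for v, w in zip(M[r], M[col])]
    return [row[m] for row in M]
```

This software model performs the eliminations sequentially; the systolic array pipelines exactly these operations across its rows.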

The inverters of the boundary cells contribute most of the delay time t_delay of the systolic network. Instead of introducing a full systolic array, it is already almost as helpful to simply add delay elements only between the rows. This seems to be a good trade-off between delay time and the number of registers used. We call this approach systolic lines.

As described earlier, the LSEs we generate are not always solvable. We can easily detect an unsolvable LSE by checking the state of the boundary cells after 3m clock cycles (m clock cycles for a systolic network, respectively). If one of them is not set, the system is not solvable and a new LSE needs to be generated. However, as shown in Table 1, this happens very rarely. Hence, the impact on the performance of the implementation is negligible. Table 2 shows implementation

[Figure: main cell consisting of a GF(2^k) adder, a GF(2^k) multiplier and a k-bit register, with signals E_in, E_out, T_in, T_out, D_in and D_out.]

Figure 4: Main Cell of the Systolic Array LSE Solver

results of the different types of systolic arrays for different sizes of LSEs (over GF(2^8)) on different FPGAs.

Table 2: Implementation results for different types of systolic arrays and different sizes of LSEs over GF(2^8) (t_delay in ns, F_Max in MHz)

                             ------ Size on FPGA ------    ---- Speed ----    Size on ASIC
Engine                       Slices    LUTs      FFs       t_delay   F_Max    GE (estimated)

Systolic arrays on a Spartan-3 device (XC3S1500, 300 MHz)
Systolic Array (10x10)        2,533     4,477    1,305      12.5       80       38,407
Systolic Array (12x12)        3,502     6,160    1,868      12.65      79       53,254
Systolic Array (20x20)        8,811    15,127    5,101      11.983     83      133,957

Alternative systolic arrays on a Spartan-3
Systolic Network (10x10)      2,251     4,379      461     118.473      8.4     30,272
Systolic Lines (12x12)        3,205     6,171    1,279      13.153     75       42,013

Systolic arrays on a Virtex-V device (XC5VLX50-3, 550 MHz)
Systolic Array (10x10)        1,314     3,498    1,305       4.808    207       36,136
Systolic Lines (12x12)        1,534     5,175    1,272       9.512    105       47,853
Systolic Array (20x20)        4,552    12,292    5,110       4.783    209      129,344

3.2 Matrix-Vector Multiplier and Polynomial Evaluator

For performing matrix-vector multiplication, we use the building block depicted in Figure 5. In the following we call this block a t-MVM. A t-MVM consists of t multipliers, a tree of adders of depth about log_2(t) to compute the sum of the products a_i · b_i, and an extra adder to recursively add up previously computed intermediate values that are stored in a register. Using the RST signal we can initially set the register content to zero.

To compute the matrix-vector product

A · b = ( a_{1,1}  . . .  a_{1,u} )   ( b_1 )
        (   ...            ...    ) · ( ... )
        ( a_{v,1}  . . .  a_{v,u} )   ( b_u )

using a t-MVM, where t is chosen in a way that it divides u,^1 we proceed row by row as follows:

^1 Note that in the case that t does not divide u, we can nevertheless use a t-MVM to compute the matrix-vector product by setting superfluous input signals to zero.

[Figure: t-MVM consisting of t GF(2^k) multipliers with inputs a_1, b_1, . . . , a_t, b_t feeding a GF(2^k) adder tree, followed by an extra GF(2^k) adder and a k-bit register with RST input; output signal c.]

Figure 5: Signature Core Building Block: Combined Matrix-Vector-Multiplier and Polynomial-Evaluator

We set the register content to zero using RST. Then we feed the first t elements of the first row of A into the t-MVM, i.e. we set a_1 = a_{1,1}, . . . , a_t = a_{1,t}, as well as the first t elements of the vector b. After the register content is set to Σ_{i=1}^{t} a_{1,i} b_i, we feed the next t elements of the row and the next t elements of the vector into the t-MVM. This leads to a register content corresponding to Σ_{i=1}^{2t} a_{1,i} b_i. We go on in this way until the last t elements of the row and the vector are processed and the register content equals Σ_{i=1}^{u} a_{1,i} b_i. Thus, at this point the data signal c corresponds to the first component of the matrix-vector product. Proceeding in an analogous manner yields the remaining components of the desired vector. Note that the u/t parts of the vector b are re-used in a periodic manner as input to the t-MVM. In Section 3.4 we describe a building block, called the word rotator, that provides these parts in the required order to the t-MVM without re-loading them each time, hence avoiding a waste of resources.
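The chunked accumulation just described can be mirrored in software. This is a minimal sketch of our own, with GF(2^8) and the reduction polynomial 0x11B assumed; the result must be independent of the chunk size t:

```python
def gf_mul(a, b, poly=0x11B):
    """GF(2^8) multiplication; reduction polynomial 0x11B assumed."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def t_mvm_row(row, b, t):
    """Accumulate one row of A times b in chunks of t products,
    mirroring the RST/accumulate behaviour of the t-MVM block."""
    assert len(row) % t == 0 and len(row) == len(b)
    acc = 0                                  # RST: clear the register
    for off in range(0, len(row), t):
        partial = 0
        for a_i, b_i in zip(row[off:off + t], b[off:off + t]):
            partial ^= gf_mul(a_i, b_i)      # multipliers + adder tree
        acc ^= partial                       # extra adder + register
    return acc
```

Running the same row with different values of t gives the same result, which is exactly why the hardware can trade t (area) against the number of clock cycles.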

Therefore, using a t-MVM (and an additional vector adder) it is clear how to implement the affine transformations S : F^n → F^n and T : F^m → F^m, which are important ingredients of an MQ-scheme. Note that the parameter t has a significant influence on the performance of an implementation of such a scheme and is chosen differently for our implementations (as can be seen in Section 4).

Besides realizing the required affine transformations, a t-MVM can be re-used to implement (partial) polynomial evaluation. It is quite obvious that evaluating the polynomials p′_i (belonging to the central map P′ of an MQ-scheme, cf. Section 2) with the vinegar variables involves matrix-vector multiplications as the main operations. For instance, consider a fixed polynomial p′_i(x′_1, . . . , x′_n) = Σ_{j=1}^{n−m} Σ_{k=j}^{n} γ′_{i,j,k} x′_j x′_k from the central map of UOV that we evaluate with random values b_1, . . . , b_{n−m} ∈ F for the vinegar variables x′_1, . . . , x′_{n−m}. Here we would like to compute the coefficients β_{i,0}, β_{i,n−m+1}, . . . , β_{i,n} of the linear polynomial

p′_i(b_1, . . . , b_{n−m}, x′_{n−m+1}, . . . , x′_n) = β_{i,0} + Σ_{j=n−m+1}^{n} β_{i,j} x′_j .

We immediately obtain the coefficients of the non-constant part of this linear polynomial, i.e. β_{i,n−m+1}, . . . , β_{i,n}, by computing the following matrix-vector product:

( γ′_{i,1,n−m+1}  . . .  γ′_{i,n−m,n−m+1} )   ( b_1     )   ( β_{i,n−m+1} )
(      ...                    ...         ) · (  ...    ) = (     ...     )    (1)
( γ′_{i,1,n}      . . .  γ′_{i,n−m,n}     )   ( b_{n−m} )   ( β_{i,n}     )

Also the main step for computing β_{i,0} can be written as a matrix-vector product:


[Figure: equation register consisting of a GF(2^k) adder and a GF(2^k) multiplier with inputs α_{i,j}, b_j and initial register value y′_i on the left, and a shift register of k-bit blocks a_1, . . . , a_w, a_0 fed with the coefficients β_{i,j} on the right.]

Figure 6: Signature Core Building Block: Equation Register

( γ′_{i,1,1}       0               0      . . .  0                     )   ( b_1     )   ( α_{i,1}   )
( γ′_{i,1,2}       γ′_{i,2,2}      0      . . .  0                     )   (  ...    )   (   ...     )
(    ...              ...                        ...                   ) · (  ...    ) = (   ...     )    (2)
( γ′_{i,1,n−m−1}   γ′_{i,2,n−m−1}  . . .  γ′_{i,n−m−1,n−m−1}  0        )   (  ...    )   (   ...     )
( γ′_{i,1,n−m}     γ′_{i,2,n−m}    . . .  γ′_{i,n−m,n−m}               )   ( b_{n−m} )   ( α_{i,n−m} )

Of course, we can exploit the fact that the above matrix is lower triangular, so we actually do not have to perform a full matrix-vector multiplication. This must simply be taken into account when implementing the control logic of the signature core. In order to obtain β_{i,0} from (α_{i,1}, . . . , α_{i,n−m})^T we have to perform the following additional computation:

β_{i,0} = α_{i,1} b_1 + . . . + α_{i,n−m} b_{n−m}.

This final step is performed by another unit called the equation register, which is presented in the next section.
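Equations (1), (2) and the final dot product can be checked end-to-end in software: linearising a UOV central polynomial this way must agree with evaluating the polynomial directly. The sketch below is our own illustration; the field GF(2^8) with polynomial 0x11B and the dense γ layout (gamma[j][k] for j ≤ k) are assumptions:

```python
def gf_mul(a, b, poly=0x11B):
    """GF(2^8) multiplication; reduction polynomial 0x11B assumed."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def linearize_uov_poly(gamma, b, n):
    """Partially evaluate one UOV central polynomial with vinegar values
    b (length v = n - m), following equations (1) and (2):
    returns (beta_0, [beta for the oil variables])."""
    v = len(b)
    # (1): full v-column rows give the oil-variable coefficients
    beta_oil = []
    for k in range(v, n):
        acc = 0
        for j in range(v):
            acc ^= gf_mul(gamma[j][k], b[j])
        beta_oil.append(acc)
    # (2): lower-triangular product gives alpha_{i,1..v}
    alpha = []
    for k in range(v):
        acc = 0
        for j in range(k + 1):          # triangularity: only j <= k
            acc ^= gf_mul(gamma[j][k], b[j])
        alpha.append(acc)
    # final step (equation register): beta_0 = sum_j alpha_j b_j
    beta0 = 0
    for a_j, b_j in zip(alpha, b):
        beta0 ^= gf_mul(a_j, b_j)
    return beta0, beta_oil
```

The test compares this linearization against a direct evaluation of the quadratic form for fixed vinegar and oil values.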

3.3 Equation Register

The equation register building block is shown in Figure 6. A w-ER essentially consists of w + 1 register blocks, each storing k bits, as well as one adder and one multiplier. It is used to temporarily store the parts of a linear equation until the equation has been completely generated and can be transferred to the systolic array solver.

For instance, in the case of UOV we consider linear equations of the form

p′_i(b_1, . . . , b_{n−m}, x′_{n−m+1}, . . . , x′_n) = y′_i  ⇔  Σ_{j=1}^{n−m} α_{i,j} b_j − y′_i + Σ_{j=n−m+1}^{n} β_{i,j} x′_j = 0,

where we used the notation from Section 3.2. To compute and store the constant part Σ_{j=1}^{n−m} α_{i,j} b_j − y′_i of this equation, the left-hand part of an m-ER is used (see Figure 6): the respective register is initially set to y′_i. Then the values α_{i,j} are computed one after another using a t-MVM building block and fed into the multiplier of the ER. The corresponding values b_j are provided by a t-WR building block, which is presented in the next section. Using the adder, y′_i and the products can be added up iteratively. The coefficients β_{i,j} of the linear equation are also computed consecutively by the t-MVM and fed into the shift register shown on the right-hand side of Figure 6.

[Figure: word rotator consisting of r GF(2^k) register blocks R_1, . . . , R_r connected in a ring via multiplexers (control signals CTRL-1, . . . , CTRL-r), loaded via the input bus x, with output b to the t-MVM and a SELECT component (CTRL-SELECT) providing b_j to the ER.]

Figure 7: Signature Core Building Block: Word Rotator

3.4 Word Rotator

A word cyclic shift register will in the following be referred to as a word rotator (WR). A (t, r)-WR, depicted in Figure 7, consists of r register blocks storing the u/t parts of the vector b involved in the matrix-vector products considered in Section 3.2. Each of these r register blocks stores t elements of GF(2^k); hence each register block consists of t k-bit registers. The main task of a (t, r)-WR is to provide the correct parts of the vector b to the t-MVM at all times. The r register blocks can be serially loaded using the input bus x. After loading, the r register blocks are rotated at each clock cycle. The cycle length of the rotation can be modified using the multiplexers by providing appropriate control signals. This is especially helpful for the partial polynomial evaluation where, due to the triangularity of the matrix in Equation (2), numerous operations can be saved. Here, the cycle length is ⌈j/t⌉, where j is the index of the processed row. The possibility to adjust the cycle length is also necessary in the case r > u/t, which frequently appears if we use the same (t, r)-WR, i.e., fixed parameters t and r, to implement the affine transformation T, the polynomial evaluations, and the affine transformation S. Additionally, the WR provides b_j to the ER building block, which is needed by the ER at the end of each rotation cycle. Since this b_j value always occurs in the last register block of a cycle, the selector component (right-hand side of Figure 7) can simply load it and provide it to the ER.
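The rotation behaviour can be modelled in a few lines. The class below is our own illustration, not the authors' RTL; it shows how a shortened cycle length rotates only the leading blocks and leaves the trailing ones untouched:

```python
from collections import deque

class WordRotator:
    """(t, r)-word rotator sketch: r register blocks of t field elements
    each; rotate(cycle_len) advances one clock, rotating only the first
    cycle_len blocks (as selected by the multiplexers in the triangular
    case of Equation (2))."""

    def __init__(self, blocks):
        self.blocks = deque(blocks)      # each block: tuple of t elements

    def head(self):
        """The block currently feeding the t-MVM."""
        return self.blocks[0]

    def rotate(self, cycle_len):
        # rotate the first cycle_len blocks by one, keep the rest in place
        front = [self.blocks.popleft() for _ in range(cycle_len)]
        front = front[1:] + front[:1]
        for blk in reversed(front):
            self.blocks.appendleft(blk)
```

With cycle_len equal to r this is a plain circular shift; with a smaller value, the unused tail blocks stay parked, which is what saves operations on the triangular matrix rows.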

4 Performance Estimations of Small-Field MQ-Schemes in Hardware

We implemented the most crucial building blocks of the architecture described in Section 3 (systolic structures, word rotators, matrix-vector multipliers of different sizes). In this section, estimations of the hardware performance of the whole architecture are derived from those implementation results. The power of the approach and the efficiency of MQ-schemes in hardware are demonstrated using the examples of UOV, Rainbow, enTTS and amTTS as specified in Section 2.

Side note: The volume of data that needs to be imported into the hardware engine for MQ-schemes may seem too high to be realistic in some applications. However, the contents of the matrices and the polynomial coefficients (i.e. the private key) do not necessarily have to be imported from the outside world or from a large on-board memory. Instead, they can be generated online in the engine using a cryptographically strong pseudo-random number generator, requiring only a small, cryptographically strong secret, i.e. some random bits.
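This side note can be made concrete in a few lines. The choice of SHAKE-128 as the expanding function is our assumption for illustration; the paper only asks for a cryptographically strong PRNG:

```python
import hashlib

def coefficient_stream(seed: bytes, nbytes: int) -> bytes:
    """Expand a short secret seed into `nbytes` pseudo-random GF(2^8)
    coefficients. SHAKE-128 is an illustrative choice of strong PRNG,
    not the authors' specified construction."""
    return hashlib.shake_128(seed).digest(nbytes)

# e.g. the ~60*60 matrix entries of the long-message UOV affine map S
# can be regenerated on demand from a 16-byte secret:
entries = coefficient_stream(b"16-byte-secret!!", 60 * 60)
```

Because the stream is deterministic in the seed, the engine can regenerate exactly the same private-key material for every signature without storing the matrices themselves.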


4.1 UOV

We treat two parameter sets for UOV as shown in Table 3: n = 60, m = 20 (long-message UOV) as well as n = 30, m = 10 (short-message UOV). In UOV signature generation, there are three basic operations: linearizing the polynomials, solving the resulting equation system, and an affine transform to obtain the signature. The most time-consuming operation of UOV is the partial evaluation of the polynomials p′_i, since their coefficients are nearly random. However, as already mentioned in the previous section, for some polynomials approximately one half of the coefficients are zero. This somewhat simplifies the task of linearization.

For the linearization of the polynomials in long-message UOV, 40 random bytes are generated first to invert the central mapping. To do this, we use a 20-MVM, a (20,3)-WR, and a 20-ER. Each polynomial takes about 100 clock cycles (40 clocks to calculate the linear terms and another 60 to compute the constants, see (1) and (2)) and yields a linear equation in 20 variables. As there are 20 polynomials, this step requires about 2000 clock cycles. After this, the 20 × 20 linear system over GF(2^8) is solved using a 20 × 20 systolic array, which returns its solution after about 4 × 20 = 80 clock cycles. Then the 20-byte solution is concatenated with the randomly generated 40 bytes and the result is passed through the affine transformation, whose major part is a matrix-vector multiplication with a 60 × 60-byte matrix. To perform this operation, we re-use the 20-MVM and the (20,3)-WR. This requires about 180 cycles of the 20-MVM, with 20 bytes of matrix entries input in each cycle.
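A quick back-of-the-envelope check (ours, not taken from the paper's netlist): the per-step cycle counts above, combined with the two clock domains introduced at the end of this subsection, reproduce the long-message UOV latencies reported in Table 3.

```python
# Cycle bookkeeping for long-message UOV (n = 60, m = 20).
m, n = 20, 60

cycles_lin    = m * (40 + 60)   # per polynomial: 40 cycles (linear terms) + 60 (constants)
cycles_gauss  = 4 * m           # 20x20 systolic array: ~4*20 = 80 cycles
cycles_affine = (n // 20) * n   # 60x60 affine map on a 20-MVM: 3 bands of 60 cycles each

def latency_us(f_gauss_mhz, f_rest_mhz):
    """Total signing latency in microseconds, with the Gaussian elimination
    engine and the remaining components in separate clock domains."""
    return cycles_gauss / f_gauss_mhz + (cycles_lin + cycles_affine) / f_rest_mhz

xc5 = latency_us(200, 400)  # XC5VLX50-3: 5.85 us, matching Table 3
xc3 = latency_us(80, 160)   # XC3S1500: 14.625 us, matching Table 3
```

That the simple sum of step latencies matches the table suggests the steps are executed sequentially with little overlap in this design.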

The short-message UOV has a very similar structure. More precisely, one needs a 10-MVM, a (10,3)-WR, a 10-ER and a 10 × 10 systolic array. The design requires approximately 500 cycles for the partial evaluation of the polynomials, about 40 cycles to solve the resulting 10 × 10 LSE over GF(2^8), as well as another 90 cycles for the final affine map.

Note that the critical path of the Gaussian elimination engine is much longer than those of the remaining building blocks, so this block represents the performance bottleneck in terms of frequency and hardware complexity. For this reason we decided to clock different components of the design at different frequencies. On the XC5VLX50-3 device, the Gaussian elimination engine is clocked at 200 MHz and the rest at 400 MHz; on the XC3S1500 device, the Gaussian elimination component is clocked at about 80 MHz and the remaining engines at 160 MHz. See Table 3 for our estimations.
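As a functional reference for what the systolic array computes (not a model of the hardware itself), here is a plain software sketch of Gaussian elimination over GF(2^8) with non-zero pivoting. The reduction polynomial is our assumption: the paper does not fix one, so we use the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B).

```python
# Software reference model of the LSE-solving step: solve A*x = b over
# GF(2^8), with the AES reduction polynomial 0x11B assumed.

def gf_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Carry-less multiplication modulo the chosen degree-8 polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def gf_inv(a: int) -> int:
    """Inverse of a non-zero element via a^(2^8 - 2) = a^254."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def solve_gf256(A, b):
    """Gauss-Jordan elimination with pivot search; returns x or None if
    the system is singular."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        piv = next((r for r in range(col, n) if M[r][col]), None)
        if piv is None:
            return None
        M[col], M[piv] = M[piv], M[col]
        inv = gf_inv(M[col][col])
        M[col] = [gf_mul(inv, v) for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [vr ^ gf_mul(f, vc) for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]
```

In characteristic 2, subtraction of rows is a plain XOR, which is part of what makes the hardware elimination engine so compact.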

4.2 Rainbow

In the version of Rainbow we consider, the message length is 24 bytes. That is, a 24-byte matrix-vector multiplication has to be performed first. One can take a 6-MVM and a (6,7)-WR, which require about 96 clock cycles for this computation. Then the first 18 variables of x'_i are fixed randomly and the first 12 polynomials are partially evaluated, which requires about 864 clock cycles. The results are stored in a 12-ER. After this, the 12 × 12 system of linear equations is solved. This requires a 12 × 12 systolic array over GF(2^8), which outputs the solution after 48 clock cycles. Then the last 12 polynomials are linearised using the same matrix-vector multiplier and word rotator, based on the 18 random values previously chosen and the 12-byte solution; this needs about 1800 clock cycles. It is followed by another run of the 12 × 12 systolic array with the same execution time of about 48 clock cycles. At the end, roughly 294 more cycles are spent on the final affine transform of the 42-byte vector. See Table 3 for some concrete performance figures in this case.

4.3 enTTS and amTTS

Like in Rainbow, for enTTS two vector-matrix multiplications are needed, at the beginning and at the end of the operation, with 20- and 28-byte vectors respectively. We take a 10-MVM and a (10,3)-WR for this. These operations require 40 and 84 clock cycles, respectively. One 9-ER is required. Two 10 × 10 linear systems over GF(2^8) need to be solved, requiring about 40 clock cycles each. Calculating the linearization of the polynomials can be significantly optimised in time compared to generic UOV or Rainbow, which can drastically reduce the time-area product. This behaviour is due to the special selection of polynomials, where only a small proportion of the coefficients is non-zero.

After choosing 7 variables randomly, 10 linear equations have to be generated. For each of these equations, only a few multiplications in GF(2^8) have to be performed, and they can be done in parallel. This requires about 20 clock cycles. After this, another variable is fixed and a further set of 10 polynomials is partially evaluated, requiring about 20 further cycles.
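The effect of sparsity can be modelled in a few lines. The field polynomial (AES's 0x11B), the helper `linearise_row`, and the term layout below are our illustrative assumptions, not the actual enTTS polynomials:

```python
# Toy model of generating one sparse linear equation: with only a handful of
# non-zero cross terms per polynomial, fixing the "vinegar" values needs just
# a few GF(2^8) multiplications, which the hardware performs in parallel.

def gf_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """GF(2^8) multiplication; reduction polynomial 0x11B is an assumption."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def linearise_row(sparse_terms, vinegar):
    """sparse_terms: list of (coeff, vinegar_index, unknown_index) cross terms.
    Returns {unknown_index: coefficient} after fixing the vinegar values."""
    row = {}
    for c, vi, ui in sparse_terms:
        row[ui] = row.get(ui, 0) ^ gf_mul(c, vinegar[vi])
    return row
```

With, say, four non-zero terms per equation instead of hundreds, the roughly 100 cycles per polynomial of generic UOV shrink to a couple of cycles, consistent with the 20-cycle figures above.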

In amTTS, which is quite similar to enTTS, two affine maps with 24- and 34-byte vectors are performed with a 12-MVM and a (12,3)-WR, taking 48 and 102 clock cycles, respectively. Two 10 × 10 and one 4 × 4 linear systems have to be solved, which requires a 10 × 10 systolic array (twice 40 and once 16 clock cycles). Moreover, a 10-ER is needed. The three steps of the partial evaluation of the polynomials require roughly 40 clock cycles in this case. See Table 3 for our estimations on enTTS and amTTS.

Table 3: Comparison of hardware implementations for ECC and our performance estimations for MQ-schemes based on the implementations of the major building blocks (F = frequency, T = time, L = LUTs, S = slices, FF = flip-flops, A = area, XC3 = XC3S1500, XC5 = XC5VLX50-3)

Implementation              F, MHz    T, µs    S / L / FF              A, kGE   S·T [S·ms]
ECC-163, [1], XC2V200       100       41       - / 8,300 / 1,100       -        85.1
ECC-163, CMOS               167       21       -                       36       -
ECC-163, [12], XCV200E-7    48        68.9     - / 25,763 / 7,467      -        447.9
UOV(60,20), XC3             80/160    14.625   9821 / 16694 / 5665     149      143.6
UOV(60,20), XC5             200/400   5.85     5334 / 13437 / 5774     143      31.2
UOV(30,10), XC3             80/160    4.188    3060 / 5304 / 1649      46       12.8
UOV(30,10), XC5             200/400   1.675    1585 / 4098 / 1649      43       2.7
Rainbow(42,24), XC3         80/160    7.781    4123 / 7173 / 2332      63       32.1
Rainbow(42,24), XC5         200/400   5.595    2000 / 5626 / 2330      59       11.2
enTTS(28,20), [17], CMOS    80#       200      -                       22       -
enTTS(28,20), XC3           80/160    2.025    3060 / 5304 / 1649      46       6.2
enTTS(28,20), XC5           200/400   0.81     1585 / 4098 / 1649      43       1.2
amTTS(34,24), XC3           80/160    2.438    3139 / 5434 / 1697      48       7.7
amTTS(34,24), XC5           200/400   0.975    1659 / 4200 / 1697      42       1.6

# For comparison purposes we assume that the design can be clocked with up to 80 MHz.
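For the MQ rows, the S·T column is simply the slice count times the latency, converted from microseconds to milliseconds. A quick cross-check of Table 3 (the row labels are ours; the ECC rows report no slice counts and are omitted):

```python
# Reproducing the time-area products of Table 3 for the MQ-scheme rows.
rows = {  # name: (slices S, time T in us, reported S*T in slice-ms)
    "UOV(60,20), XC3":     (9821, 14.625, 143.6),
    "UOV(60,20), XC5":     (5334,  5.85,   31.2),
    "UOV(30,10), XC3":     (3060,  4.188,  12.8),
    "UOV(30,10), XC5":     (1585,  1.675,   2.7),
    "Rainbow(42,24), XC3": (4123,  7.781,  32.1),
    "Rainbow(42,24), XC5": (2000,  5.595,  11.2),
    "enTTS(28,20), XC3":   (3060,  2.025,   6.2),
    "enTTS(28,20), XC5":   (1585,  0.81,    1.2),
    "amTTS(34,24), XC3":   (3139,  2.438,   7.7),
    "amTTS(34,24), XC5":   (1659,  0.975,   1.6),
}

def ta_product(slices: int, time_us: float) -> float:
    """Time-area product in slice-milliseconds."""
    return slices * time_us / 1000.0
```

All ten computed products agree with the tabulated values to within rounding.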

5 Comparison and Conclusions

Our implementation results (as well as the estimations for the optimisations in the case of enTTS and amTTS) are compared to scalar multiplication in the group of points of elliptic curves over GF(2^k) with field bitlengths in the range of 160 bits (corresponding to a security level of 2^80), see Table 3. A good survey of hardware implementations for ECC can be found in [5].

Even the most conservative design, i.e. long-message UOV, can outperform some of the most efficient ECC implementations in terms of TA-product on some hardware platforms. More hardware-friendly designs such as the short-message UOV or Rainbow provide a considerable advantage over ECC. The more aggressively designed enTTS and amTTS allow for extremely efficient implementations with a more than 70 or 50 times lower TA-product, respectively. Though the metric we use is not optimal, the results indicate that MQ-schemes perform better than elliptic curves in hardware with respect to the TA-product and are hence an interesting option in cost- or size-sensitive areas.

Acknowledgements. The authors would like to thank our colleague Christof Paar for fruitful discussions and helpful remarks, as well as Sundar Balasubramanian, Harold Carter (University of Cincinnati, USA) and Jintai Ding (University of Cincinnati, USA and Technical University of Darmstadt, Germany) for exchanging ideas while working on another paper about MQ-schemes.


References

[1] B. Ansari and M. Anwar Hasan. High performance architecture of elliptic curve scalar multiplication.

Technical report, CACR, January 2006.

[2] S. Balasubramanian, A. Bogdanov, A. Rupp, J. Ding, and H. W. Carter. Fast multivariate signature generation in hardware: The case of Rainbow. In ASAP 2008, to appear.

[3] O. Billet and H. Gilbert. Cryptanalysis of Rainbow. In SCN 2006, volume 4116 of LNCS, pages 336–347. Springer, 2006.

[4] A. Bogdanov, M. Mertens, C. Paar, J. Pelzl, and A. Rupp. A parallel hardware architecture for fast Gaussian elimination over GF(2). In FCCM 2006, 2006.

[5] G. Meurice de Dormale and J.-J. Quisquater. High-speed hardware implementations of elliptic curve

cryptography: A survey. Journal of Systems Architecture, 53:72–84, 2007.

[6] J. Ding, L. Hu, B.-Y. Yang, and J.-M. Chen. Note on design criteria for rainbow-type multivariates.

Cryptology ePrint Archive http://eprint.iacr.org, Report 2006/307, 2006.

[7] J. Ding and D. Schmidt. Rainbow, a new multivariable polynomial signature scheme. In ACNS 2005,

volume 3531 of LNCS, pages 164–175. Springer, 2005.

[8] J. Ding, C. Wolf, and B.-Y. Yang. ℓ-invertible cycles for multivariate quadratic public key cryptogra-

phy. In PKC 2007, volume 4450 of LNCS, pages 266–281, Springer, 2007.

[9] B. Hochet, P. Quinton, and Y. Robert. Systolic Gaussian Elimination over GF (p) with Partial

Pivoting. IEEE Transactions on Computers, 38(9):1321–1324, 1989.

[10] A. Kipnis, J. Patarin, and L. Goubin. Unbalanced Oil and Vinegar signature schemes. In EUROCRYPT 1999, volume 1592 of LNCS. Springer, 1999.

[11] A. Kipnis, J. Patarin, and L. Goubin. Unbalanced Oil and Vinegar signature schemes — extended

version, 2003. 17 pages, citeseer/231623.html, 2003-06-11.

[12] C. Shu, K. Gaj, and T. El-Ghazawi. Low latency elliptic curve cryptography accelerators for NIST curves on binary fields. In IEEE FPT'05, 2005.

[13] C.L. Wang and J.L. Lin. A Systolic Architecture for Computing Inverses and Divisions in Finite Fields GF(2^m). IEEE Transactions on Computers, 42(9):1141–1146, 1993.

[14] C. Wolf and B. Preneel. Taxonomy of public key schemes based on the problem of multivariate quadratic equations. Cryptology ePrint Archive http://eprint.iacr.org, Report 2005/077, 12th of May 2005.

[15] B.-Y. Yang and J.-M. Chen. Rank attacks and defence in Tame-like multivariate PKC's. Cryptology ePrint Archive http://eprint.iacr.org, Report 2004/061, 29th of September 2004.

[16] B.-Y. Yang and J.-M. Chen. Building secure tame-like multivariate public-key cryptosystems: The

new TTS. In ACISP 2005, volume 3574 of LNCS, pages 518–531. Springer, July 2005.

[17] B.-Y. Yang, D. C.-M. Cheng, B.-R. Chen, and J.-M. Chen. Implementing minimized multivariate

public-key cryptosystems on low-resource embedded systems. In SPC 2006, volume 3934 of LNCS,

pages 73–88. Springer, 2006.
