# Compact FPGA Implementations of the Five SHA-3 Finalists

**ABSTRACT** Good performance on different platforms is an important criterion for the selection of the future SHA-3 standard. In this paper, we consider compact implementations of BLAKE, Grøstl, JH, Keccak and Skein on recent FPGA devices. Our results bring an interesting complement to existing analyses, as most previous works on FPGA implementations of the SHA-3 candidates were optimized for high-throughput applications. Following recent guidelines for the fair comparison of hardware architectures, we put forward clear trends for the selection of the future standard. First, compact FPGA implementations of Keccak are less efficient than their high-throughput counterparts. Second, Grøstl shows interesting performance in this setting, in particular in terms of throughput-over-area ratio. Third, the remaining candidates are comparably suitable for compact FPGA implementations, with some slight contrasts (in area cost and throughput).




Stéphanie Kerckhof1, François Durvaux1, Nicolas Veyrat-Charvillon1, Francesco Regazzoni1, Guerric Meurice de Dormale2, François-Xavier Standaert1

1 Université catholique de Louvain, UCL Crypto Group, B-1348 Louvain-la-Neuve, Belgium.
{stephanie.kerckhof, francois.durvaux, nicolas.veyrat, francesco.regazzoni, fstandae}@uclouvain.be

2 MuElec, Belgium. gm@muelec.com

Abstract. Good performance on different platforms is an important criterion for the selection of the future SHA-3 standard. In this paper, we consider compact implementations of BLAKE, Grøstl, JH, Keccak and Skein on recent FPGA devices. Our results bring an interesting complement to existing analyses, as most previous works on FPGA implementations of the SHA-3 candidates were optimized for high-throughput applications. Following recent guidelines for the fair comparison of hardware architectures, we put forward clear trends for the selection of the future standard. First, compact FPGA implementations of Keccak are less efficient than their high-throughput counterparts. Second, the remaining candidates are comparably suitable for compact FPGA implementations, with some slight contrasts (both in area cost and throughput). Finally, our implementations provide performance in the same range as a similarly designed AES implementation.

Introduction

The SHA-3 competition was announced by NIST on November 2, 2007. Its goal is to develop a new cryptographic hash algorithm, addressing the concerns raised by recent cryptanalysis results against SHA-1 and SHA-2. As for the AES competition, a number of criteria have been retained for the selection of the final algorithm. Security against cryptanalysis naturally comes in the first place, but good performance on a wide range of platforms is another important condition. In this paper, we consider the hardware performance of the SHA-3 finalists on recent FPGA devices.

In this respect, an important observation is that most previous works on hardware implementations of the SHA-3 candidates focused on expensive, high-throughput architectures, e.g. [17,23]. On the one hand, this is natural, as such implementations provide a direct snapshot of the cost of the elementary operations of the different algorithms. On the other hand, fully unrolled and pipelined architectures may sometimes hide a part of the algorithms' complexity that is better revealed in compact implementations. Namely, when trying to design more serial architectures, the possibility to share resources, the regularity of the algorithms, and the simplicity of addressing memories are additional factors that may influence the final performance. In other words, compact implementations do not only depend on the cost of each elementary operation needed in an algorithm, but also on the number of different operations and the way they interact. Besides, investigating such implementations is also interesting from an application point of view, as the resources available for cryptographic functionalities in hardware systems can be very limited. Consequently, the evaluation of this constrained scenario is generally an important step in better understanding the implementation potential of an algorithm.

As extensively discussed in recent years, the evaluation of hardware architectures is inherently difficult, in view of the number of parameters that may influence their performance. The differences can be extreme when changing technologies. For example, ASIC and FPGA implementations have very different ways to deal with memories and registers, which generally imply different design choices [14,16]. In a similar way, comparing FPGA implementations from different manufacturers can only lead to rough intuitions about their respective efficiency. In fact, even comparing different architectures on the same FPGA is difficult, as carefully discussed in Saar Drimer's PhD dissertation [12]. Obviously, this does not mean that performance comparisons are impossible, but simply that they have to be considered with care. In other words, it is important to go beyond the quantified results obtained in performance tables, and to analyze the different metrics they provide (area cost, clock cycles, register use, throughput, ...) in a comprehensive manner.

Following these observations, the goal of this paper is to compare the five SHA-3 finalists on the basis of their compact FPGA implementations. In order to allow as fair a comparison as possible, we applied the approach described by Gaj et al. at CHES 2010 [14]. Namely, the IP cores were designed according to similar architectural choices and identical interface protocols. In particular, our results are based on the a priori decision to rely on a 64-bit datapath (see Section 3 for the details). As for the optimization goals, we targeted implementations in the hundreds-of-slices range (slices being the FPGA's basic resources) in the first place, additionally aiming for throughputs in the hundreds of Mbit/s range, in accordance with the usual characteristics of a security IP core. In other words, we did not aim for the lowest-cost implementations (e.g. with an 8-bit datapath), but rather investigated how efficiently the different SHA-3 finalists allow sharing resources and addressing memories, under optimization goals that we believe reflective of the application scenarios where reconfigurable computing is useful.

As a result, and to the best of our knowledge, we obtain the first complete study of compact FPGA implementations for the SHA-3 finalists. For some of the algorithms, the obtained results are the only available ones for such optimization goals. For the others, they at least compare to the previously reported ones, sometimes bringing major improvements. For comparison purposes, we additionally provide the results of an AES implementation based on the same framework. Eventually, we take advantage of our results to discuss and compare the five investigated algorithms. While none of the remaining candidates leads to dramatically poor performance, this discussion allows us to contrast the previous conclusions obtained from high-throughput implementations. In particular, we put forward that the clear advantage of Keccak in a high-throughput FPGA implementation context vanishes in a low-area one.

1 SHA-3 finalists

This section provides a quick overview of the five SHA-3 finalists. We refer to the original submissions for the detailed algorithm descriptions.

BLAKE. BLAKE [3] is built on previously studied components, chosen for their complementarity. The iteration mode is HAIFA, an improved version of the Merkle-Damgård paradigm proposed by Biham and Dunkelman [10]. It provides resistance to long-message second-preimage attacks, and explicitly handles hashing with a salt and a "number of bits hashed so far" counter. The internal structure is the local wide-pipe, which was already used within the LAKE hash function [4]. The compression algorithm is a modified version of Bernstein's stream cipher ChaCha [5], which is easily parallelizable. The two main instances of BLAKE are BLAKE-256 and BLAKE-512. They respectively work with 32- and 64-bit words, and produce 256- and 512-bit digests. The compression function of BLAKE relies heavily on the function G, which consists of additions, XOR operations and rotations. It works with four variables: a, b, c and d. It is called 112 and 128 times for the 32- and 64-bit versions, respectively.
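As an illustration, one call of G has the following structure (a sketch only: the real function selects its two message-plus-constant inputs, `mx` and `my` here, through a round-dependent permutation sigma that is omitted for brevity; the rotation distances 32, 25, 16 and 11 are those of the 64-bit version):

```python
MASK = (1 << 64) - 1  # BLAKE-512 works on 64-bit words

def rotr(x, n):
    """Rotate a 64-bit word right by n bits."""
    return ((x >> n) | (x << (64 - n))) & MASK

def g(a, b, c, d, mx, my):
    """One G call: only additions, XORs and rotations are needed.

    mx and my stand for the two (message XOR constant) words that the
    sigma permutation would select; sigma is omitted in this sketch."""
    a = (a + b + mx) & MASK
    d = rotr(d ^ a, 32)
    c = (c + d) & MASK
    b = rotr(b ^ c, 25)
    a = (a + b + my) & MASK
    d = rotr(d ^ a, 16)
    c = (c + d) & MASK
    b = rotr(b ^ c, 11)
    return a, b, c, d
```

Note that the variable a is updated by two additions of three inputs, while b, c and d each need a single operation pair; this asymmetry matters for the serialized scheduling discussed in Section 4.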

Grøstl. Grøstl [15] is an iterated hash function with a compression function built from two fixed, large, distinct but very similar permutations p and q. These are constructed using the wide-trail design strategy. The hash function is based on a byte-oriented SP-network which borrows components from the AES [11], described by the transforms AddRoundConstant, SubBytes, ShiftBytes and MixBytes. Grøstl is a so-called wide-pipe construction, where the size of the internal state (represented by two 8 × 16-byte matrices) is significantly larger than the size of the output. The specification was last updated in March 2011.
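The XOR structure around p and q, as defined in the Grøstl submission, can be sketched as follows (the permutations themselves are left as placeholder callables, since their round functions are beyond the scope of this overview):

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def grostl_compress(h, m, P, Q):
    """Grostl compression function: f(h, m) = P(h XOR m) XOR Q(m) XOR h."""
    return xor_bytes(xor_bytes(P(xor_bytes(h, m)), Q(m)), h)

def grostl_output(x, P, n):
    """Output transformation: truncate P(x) XOR x to its last n bytes,
    reflecting the wide-pipe design (internal state wider than output)."""
    return xor_bytes(P(x), x)[-n:]
```

Plugging in identity permutations makes f collapse to the all-zero string, which exposes the XOR structure (a sanity check only, not a meaningful instantiation).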

JH. JH [24] essentially exploits two techniques: a new compression function structure, and a generalized AES design methodology which provides a simple approach to obtain large block ciphers from small components. The compression function proposed for JH is composed as follows. Half of the 1024-bit hash value H(i−1) is XORed with the 512-bit message block M(i). The result of this operation is passed through a bijective function e8, a 42-round block cipher with a constant key. The output of e8 is then once again XORed with M(i). This paper considers the round-3 version of the JH specifications submitted to NIST, in which the number of rounds has been increased from 35.5 to 42.
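This outer structure can be sketched as follows (a structural sketch under the assumption that the message is XORed into the first half of the state on input and into the second half on output; e8 is passed in as a placeholder callable, since the 42-round permutation itself is not reproduced here):

```python
def jh_compress(h_prev, m, e8):
    """JH compression sketch: XOR the message block into one half of the
    state, apply the bijective function e8, then XOR the message block
    into the other half of the result."""
    assert len(h_prev) == 2 * len(m)
    half = len(m)
    s = bytes(x ^ y for x, y in zip(h_prev[:half], m)) + h_prev[half:]
    s = e8(s)
    return s[:half] + bytes(x ^ y for x, y in zip(s[half:], m))
```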


Keccak. Keccak [6] is a family of sponge functions [7], characterized by two parameters: a bitrate r and a capacity c. The sponge construction uses r + c bits of state and essentially works in two steps. In a first, absorbing phase, r bits are updated by XORing them with message bits and applying the Keccak permutation (called f). Next, during the squeezing phase, r bits are output after each application of the same permutation. The remaining c bits are not directly affected by message bits, nor taken as output. The version of the Keccak function proposed as SHA-3 standard operates on a 1600-bit state, organized in 64-bit words. The function f is iterated a number of times determined by the size of the state, and is composed of five operations. Theta consists of a parity computation, a rotation by one position, and a bitwise XOR. Rho is a rotation by an offset which depends on the word position. Pi is a permutation. Chi consists of bitwise XOR, NOT and AND gates. Finally, iota is a round-constant addition.
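As an example of these operations, theta can be sketched on a 5×5 array of 64-bit lanes (a sketch with simplified indexing conventions relative to the specification):

```python
MASK64 = (1 << 64) - 1

def rotl64(x, n):
    """Rotate a 64-bit lane left by n bits."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64

def theta(a):
    """Keccak theta: compute column parities, combine each lane's two
    neighbouring parities (one rotated by a single bit), and XOR the
    result into every lane of the column."""
    c = [a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4] for x in range(5)]
    d = [c[(x - 1) % 5] ^ rotl64(c[(x + 1) % 5], 1) for x in range(5)]
    return [[a[x][y] ^ d[x] for y in range(5)] for x in range(5)]
```

Since theta is built entirely from XORs and rotations, it is linear over GF(2): applying it to the XOR of two states equals the XOR of its two outputs.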

Skein. Skein [2] is built out of a tweakable block cipher [20], which allows hashing configuration data along with the input text in every block, and makes every instance of the compression function unique. The underlying primitive of Skein is the Threefish block cipher: it contains no S-box, and implements a non-linear layer using a combination of 64-bit rotations, XORs and additions (i.e. operations that are very efficient on 64-bit processors). The Unique Block Iteration (UBI) chaining mode uses Threefish to build a compression function that maps an arbitrary input size to a fixed output size. Skein supports internal state sizes of 256, 512 and 1024 bits, and arbitrary output sizes. The proposal was last updated in October 2010 (version 1.3) [13].
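The core of this non-linear layer is the Threefish MIX operation, sketched below (`r` stands for one of the round- and position-dependent rotation constants from the Skein specification, which are not reproduced here):

```python
MASK64 = (1 << 64) - 1

def mix(x0, x1, r):
    """Threefish MIX on a pair of 64-bit words: one modular addition,
    one rotation and one XOR -- no S-box involved."""
    y0 = (x0 + x1) & MASK64
    y1 = (((x1 << r) | (x1 >> (64 - r))) & MASK64) ^ y0
    return y0, y1

def unmix(y0, y1, r):
    """Inverse of MIX, showing the operation is cheaply invertible."""
    t = y1 ^ y0
    x1 = ((t >> r) | (t << (64 - r))) & MASK64
    x0 = (y0 - x1) & MASK64
    return x0, x1
```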

2 Related works

Table 1 provides a partial overview of existing low-area FPGA implementations of the SHA-3 candidates, as reported in the SHA-3 Zoo [1]. Namely, since our following results were obtained for Virtex-6 and Spartan-6 devices, we only list the most relevant implementations on similar FPGAs.

BLAKE has been implemented in two different ways. The first one, designed by Aumasson et al. [3], consists of the core functionality (CF) with one G function. This implementation offers a light version of the algorithm, but does not really exploit FPGA specificities. On the other hand, the second BLAKE implementation, by Beuchat et al. [9], is a fully autonomous (FA) implementation, designed to perfectly fit the Xilinx FPGA architecture: the slice's carry-chain logic is exploited to build an adder/XOR operator within the same slices. The authors also harness the 18-kbit embedded memory blocks (MB) to implement the register file and store the micro-code of the control unit. Table 1 shows Spartan-3 (S3) and Virtex-5 (V5) implementation results.

Jungk et al. [18,19] chose to implement the Grøstl algorithm on a Spartan-3 device. They provide a fully autonomous implementation including padding. The similarity between Grøstl and the AES is exploited, and AES-specific optimizations presented in previous works are applied. The table only reports the best and most recent results from [18]. Also, only serial implementations of p and q are considered, because they better match our low-area optimization goal.

| Reference | Algorithm | Scope | FPGA |
|---|---|---|---|
| Aumasson et al. [3] | BLAKE-32 | CF | V5 |
| Beuchat et al. [9] | BLAKE-32 | FA | S3 |
| Beuchat et al. [9] | BLAKE-32 | FA | V5 |
| Aumasson et al. [3] | BLAKE-64 | CF | V5 |
| Beuchat et al. [9] | BLAKE-64 | FA | S3 |
| Beuchat et al. [9] | BLAKE-64 | FA | V5 |
| Jungk et al. [19] | Grøstl-256 | FA* | S3 |
| Jungk et al. [18] | Grøstl-256 | FA* | S3 |
| Jungk et al. [18] | Grøstl-512 | FA* | S3 |
| Homsirikamol et al. [17] | JH-256 | FA | V5 |
| Homsirikamol et al. [17] | JH-512 | FA | V5 |
| Bertoni et al. [8] | Keccak-256 | EM | V5 |
| Namin et al. [22] | Skein-256 | CF | AS3 |

Table 1: Existing compact FPGA implementations of third-round SHA-3 candidates (* padding included; the Skein-256 result is reported in Altera ALUTs rather than slices).

No low-area implementation of JH has been proposed up to now. For comparison purposes, the implementation proposed by Homsirikamol et al. [17] may be mentioned: it is the high-speed FPGA implementation with the lowest area cost reported in the literature.

A low-area implementation of the Keccak algorithm is given by Bertoni et al. [8]. In this implementation, the hash function is implemented as a small-area coprocessor using system (external) memory (EM). In the best case, with a 64-bit memory, the function takes approximately 5000 clock cycles to compute. With a 32-bit memory, this number increases up to 9000 clock cycles.

Finally, Namin et al. [22] presented a low-area implementation of Skein. It provides the core functionality, and is evaluated on an Altera Stratix-III (AS3) FPGA.

3 Methodology

As seen in the previous section, only a few low-area FPGA implementations of the SHA-3 candidates exist up to now. Furthermore, those implementations often lack similar specifications, which makes them difficult to compare. Therefore, we propose low-area designs of the five third-round candidates and evaluate their performance. In order to allow a fair comparison between the different implementations, we followed the methodology described by Gaj et al. [14], which suggests using a uniform interface and architecture, and defines some performance metrics.


First of all, we decided to primarily focus on the SHA-3 candidate variants with the 512-bit digest output size, as they correspond to the most challenging scenario for compact implementations, and may be the most informative for comparison purposes. For completeness, we also report the implementation results of the 256-bit versions in appendix; they are based on essentially similar architectures. Next, since we are implementing low-area designs, we limited the internal datapath to 64-bit bus widths. This is a natural choice, as most of the presented algorithms are designed to operate well on 64-bit processors. Trying to decrease the bus size further tends to be cumbersome, and provides a limited area improvement at the expense of a significantly decreased throughput. In addition, we specified a common interface for all our designs, with an input message width of 64 bits, as this is a commonly encountered bus size in hardware. Bigger bus sizes would usually require adding a parallelizer in front of the module, which consumes resources. All our cores have been designed to be fully autonomous, which helps in comparing the total resources needed by each candidate.

Drimer showed in [12] that implementation results are subject to great variations, depending on the implementation options. Furthermore, comparing different implementations with each other can be misleading if not done with care. We therefore specified fixed implementation options and architecture choices for all our implementations. We chose to work on Virtex-6 and Spartan-6 FPGAs, specifically a xc6vlx75t with speed grade -1 and a xc6slx9 with speed grade -2, which are the most constrained FPGAs in their respective families in terms of number of available logic elements. Note that the selection of a high-performance device is not in contradiction with compact implementations, as we typically envision applications in which the hash functionality can only consume a small fraction of the FPGA resources. Also, we believe it is interesting to detail implementation results exploiting the latest FPGA structures, as these advanced structures will typically be available in future low-cost FPGAs too. In other words, we expect this choice to better reflect the evolution of reconfigurable hardware devices. Besides, and as will be illustrated by the implementation tables in Section 5, the results for Virtex-6 and Spartan-6 devices do not significantly modify our conclusions regarding the suitability of the SHA-3 finalists for compact FPGA implementations.

We did not use any dedicated FPGA resources such as block RAMs or DSPs. It is indeed easier to compare implementations when they are all expressed in terms of slices, rather than in a combination of several factors. Additionally, the use of block RAMs is often not optimal, as they are too big for our actual needs. All the implementations take advantage of the particular LUT capabilities of the Virtex-6 and Spartan-6, and use shift registers and/or distributed RAMs (or ROMs). The different modules are however always inferred, so that portability to other devices remains possible, even if not optimal. The designs were implemented using ISE 12.1, for two different sets of parameters. These two sets are predefined in the ISE Design Goals and Strategies project options, and are specified as "Area Reduction with Physical Synthesis" and "Timing Performance without IOB Packing".

We have made the assumption that padding is performed outside of our cores, for the same reasons as in [14]. The padding functions are very similar from one hash function to another, and would mainly result in the same absolute area overhead. Additionally, the complexity of the padding function depends on the granularity of the message (bit, byte, word, ...) considered in each application. Finally, the performance metric we use in this text is always the throughput for long messages (as defined in [14]). We do not report the throughput for short messages, but the information needed to compute it is present in the result tables of Section 5.

4 Architectures

This section presents the different compact architectures we developed. Because of space constraints, we mainly focus on the description of their block diagrams.

BLAKE. The BLAKE algorithm is implemented as a narrow, pipelined datapath design, illustrated in Figure 1. The overall organization is similar to the implementation proposed by Beuchat et al. [9].

BLAKE has a large 16-word state matrix v, but each operation works with only two of its elements. Hence, the datapath does not need to be larger than 32 or 64 bits, respectively for the BLAKE-256 and BLAKE-512 implementations. The operations are quite simple: they consist of additions, XORs and rotations. This allows us to design a small ALU embedding all the required operators in parallel, followed by a multiplexer. The way the ALU is built allows computing XOR-rotation and XOR-addition operations in one clock cycle.

Our BLAKE implementation uses distributed RAM to store intermediate values, message blocks and c constants. Using this kind of memory offers some advantages. Beyond effective slice occupation, it allows the controller to randomly access different values, which is required: message blocks and c constants are chosen according to the elements of a permutation matrix, and elements of the inner state matrix are selected in different orders during the column and diagonal steps.

The 4-input multiplexer in front of the RAM is used to load message blocks (m), salt (s) and counter (t) through the Message input, to load the initialization vector (IV), to write back the ALU results thanks to the feedback loop, and to automatically set the salt to zero if the user does not specify any value. Loading the salt, or initializing it to zero, takes 4 clock cycles. Loading the initialization vector takes 8 clock cycles. These first two steps are performed once per message. The two following steps, loading the counter and the message block, take 18 clock cycles and are carried out for each new message block.

The scheduling is made so that, for each call of the round function G (as described in Section 1), the variable a is computed in two clock cycles, because it needs two additions between three different inputs. The three other variables (b, c and d) are computed in one clock cycle thanks to the feedback loop on the ALU. As a result, one call of the G function needs 10 clock cycles. To avoid pipeline bubbles between the column and diagonal steps, the ordering of the G functions during the diagonal step is changed to G4, G7, G6 and G5. The BLAKE-64 version needs 16 (rounds) × 8 (G calls) × 10 = 1280 clock cycles to process one block through the round function, and 4 more to empty the pipeline. The initialization and finalization steps each need 20 clock cycles. Completely hashing one message block thus takes 18 + 1284 + 40 = 1342 clock cycles. Finally, the hash value is directly read on the output of the RAM, which takes 8 clock cycles.
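The cycle budget above can be cross-checked with a few lines of arithmetic (all numbers taken directly from the text):

```python
# 16 rounds x 8 G calls x 10 cycles, plus 4 cycles to empty the pipeline
round_cycles = 16 * 8 * 10 + 4
# 18 cycles to load the counter and message block, plus 20 + 20 cycles
# for the initialization and finalization steps
total = 18 + round_cycles + 20 + 20
print(total)  # 1342 cycles per BLAKE-64 message block
```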

As expected, these results are very close to those announced by Beuchat et al. [9] after adjustment (they considered 14 rounds for the BLAKE-64 version rather than 16), since the overall architectures are very similar.

Fig. 1: BLAKE architecture

Grøstl. The 64-bit architecture of the Grøstl algorithm is depicted in Figure 2. This pipelined datapath implements the p and q permutation rounds in an interleaved fashion (to avoid data-dependency problems). The last round function (Ω) is implemented with the same datapath, and only resorts to p. The difference between p and q lies in slightly different AddRoundConstant constants and ShiftBytes shift patterns. Besides the main AES-like functions, several additional circuits are needed. A layer of multiplexers and bitwise XORs is required at the beginning and at the end of the datapath. They implement algorithm initialization, the additions necessary at the beginning and end of each round, and internal and external data loading. Two distinct RAMs are used to store the p and q state matrices and input message mi (RAM QPM), and the hash result (RAM H). RAM QPM is a 64 × 64-bit dual-port RAM. One RAM slot is used to store the message mi, and three other slots are used to store the current and next p and q states (the slots are used as a circular buffer). RAM H is a 32 × 64-bit dual-port RAM that stores the current and next H (as well as the final result).


The four main operations of each p or q round are implemented in the following way. The ShiftBytes operation comes first. It is implemented by accessing bytes of different columns instead of a single column (as if RAM QPM was a collection of eight 8-bit RAMs), to save a memory in the middle of the datapath. Different memory access patterns (meaning different initializations of the address counters) are required to implement the p and q ShiftBytes, as well as no shift (for post-addition with H and hash unloading). The constants of AddRoundConstant are computed thanks to a few small integer counters (corresponding to the row and round numbers) and all-zero or all-one constants. The addition of those constants to the data is a simple bitwise XOR. The eight S-boxes of SubBytes are simply implemented as eight 8 × 8-bit ROMs (efficiently implemented in FPGAs based on 6-input look-up tables). Finally, the MixBytes operation is similar to the AES MixColumn, except that 8 × 6 different 8-bit F2 multiplications by small constants are required, and that eight 64-bit partial products have to be added together. We implemented it as a large XOR tree, with the multipliers hardcoded as 8-bit XORs and the partial products XORed together.
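These constant multiplications are multiplications in GF(2^8); Grøstl uses the same reduction polynomial as the AES, x^8 + x^4 + x^3 + x + 1. A generic field multiplier can be sketched as follows (in the actual datapath, each multiplication by a fixed small constant is instead hardcoded as a few XORs):

```python
def xtime(b):
    """Multiply a field element by x, reducing modulo
    x^8 + x^4 + x^3 + x + 1 (0x11b, the AES/Grostl polynomial)."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def gf_mul(a, b):
    """Shift-and-add multiplication of two GF(2^8) elements."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a = xtime(a)
        b >>= 1
    return acc & 0xFF
```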

Hashing a 1024-bit chunk of a message takes around 450 cycles: 16 (loading of mi) + 14 (rounds) × 2 (interleaved p and q) × 16 (columns of the state matrix) + 8 (ending). The last operation Ω requires around 350 cycles: 14 (rounds) × (16 (columns) + 6 (pipeline flush)) + 8 (ending) + 8 (hash output).

Roughly speaking, the most resource-consuming parts of the architecture are MixBytes (accounting for 30% of the final cost), the S-boxes (25%) and the control of the dual-port RAMs (25%). Note that most pipeline registers are taken from already consumed slices, and hence do not increase the slice count of the implementation.

Fig. 2: Grøstl architecture

JH. The JH architecture is illustrated in Figure 3 and is composed as follows. Two 16×32-bit single-port distributed RAMs (hash RAMs) are used to store the intermediate hash values. Those RAMs are first initialized in 16 clock cycles with IV values coming from a 16×64-bit distributed ROM,¹ and are afterwards updated with the output of r8 or of the XOR operation. r8 performs the round functions, and is composed of sixteen 5×4 S-boxes, eight linear functions and a permutation. As the permutation layer always puts in correspondence two consecutive nibbles with a nibble from the first half and another from the second half of the permuted state, the output of r8 can be split into two 32-bit words, one coming from the first half and the other from the second half of the intermediate hash value. An address controller (addr contr), composed of two 16×4-bit dual-port distributed RAMs, is then used to reach the wanted location in each hash RAM, at each cycle. Rotations before and after r8 are needed to correctly organize the intermediate hash values in the two hash RAMs.

¹ The IV ROM contains the initial value H(0), and not H(−1) as defined in the JH specifications. That way, we save 688 cycles of initialization and only lose a couple of slices.

A similar path is designed for the constants generation. Two 16×8-bit single-port distributed RAMs (cst RAMs) are used to store the intermediate constant values. The function r6 performs a round function on 16 bits of the constant state. The same address controller as for the hash RAMs is used for the cst RAMs.

Finally, a group/de-group block is used to re-organize the input message. As JH has been designed to achieve efficient bit-sliced software implementations, a grouping of bits into 4-bit elements has been defined as the first step of the JH bijective function e8. Similarly, a de-grouping is performed in the last step of e8. While those grouping and de-grouping phases have no impact on high-speed hardware implementations (as they result only in routing), this is not the case anymore for low-area architectures. Indeed, those steps require 16 additional clock cycles per message block, as well as more complex controls to access the single-port RAMs. To avoid this, we chose to always work on a grouped hash, and therefore to perform the data re-organization on the message with the group/de-group block. The same component is also used to re-organize the final hash value before sending it to the user.

Fig. 3: JH architecture


Our implementation of JH needs 16×42 clock cycles to compute the 42 rounds, and 16 additional ones to perform the final XOR operation. In total, 688 clock cycles are required to process a 512-bit message block, 16 are needed for RAM initialization, and 20 additional clock cycles are used for the finalization step (4 to empty the pipeline and 16 to output the hash from the group/de-group component).

Keccak. Our architecture, depicted in Figure 4, implements the Keccak version proposed as the SHA-3 standard. It works on a state of 1600 bits, organized into 25 words of 64 bits each. The whole algorithm does not use complex operations, but only XORs, rotations, negations and additions. The basic operations are performed on the 64-bit words, thus our implementation has a 64-bit internal datapath.

We maintained the same organization as Bertoni et al., where the computation is split into three main steps: the first performs part of the theta transformation by calculating the parity of the columns, the second completes the theta transformation and performs the rho and pi transformations, and the third computes the chi and iota steps. This structure requires a memory of 50 words of 64 bits, needed to store the state and the intermediate values at the end of the pi transformation.
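The split of theta into a column-parity step and a completion step can be written compactly in software (a transcription of the theta definition from the Keccak specification, not of the authors' datapath):

```python
MASK = (1 << 64) - 1

def rol64(v, r):
    """64-bit rotate left."""
    r %= 64
    return ((v << r) | (v >> (64 - r))) & MASK if r else v & MASK

def theta(A):
    """theta on a 5x5 state of 64-bit lanes, indexed A[x][y]."""
    # step 1: column parities (what the first computation step produces)
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    # step 2: combine neighbouring parities and xor them into every lane
    D = [C[(x - 1) % 5] ^ rol64(C[(x + 1) % 5], 1) for x in range(5)]
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]

zero = [[0] * 5 for _ in range(5)]
assert theta(zero) == zero
```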

To allow parallel read/write operations and to simplify access to the state, we organized the whole memory into two distinct asynchronous-read single-port rams of 32×64 bits (ram a and ram b), and we reserved ram b to store the output of the pi transformation.

Internally, our architecture has 5 registers of 64 bits, connected so as to create a word-oriented rotator. During the theta transformation, the registers store the results of the computed parities. The rotator makes it possible to quickly position the correct word for computing the second part of theta, as well as for computing the chi transformation.

Fig.4: Keccak Architecture


The most crucial part of Keccak is the rho transformation, which consists of rotating words by an offset that depends on the word index. We implemented this step efficiently in fpgas by explicitly instantiating a 64-bit barrel rotator and by storing the rotation offsets in a dedicated lookup table. Using a single barrel rotator makes it possible to significantly reduce the area requirements. While this negatively affects performance (since all 25 words of the state need to be processed by the same component), it allows reaching an overall cost comparable to that of the other algorithms, as will be detailed in the next section.
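The offset lookup table does not need to be hand-filled: the rho offsets follow the triangular-number rule of the Keccak reference. A software model of the table and the barrel rotator (a sketch, not the authors' RTL) could be:

```python
MASK = (1 << 64) - 1

def rol64(v, r):
    """Model of the 64-bit barrel rotator."""
    r %= 64
    return ((v << r) | (v >> (64 - r))) & MASK if r else v & MASK

def rho_offsets():
    """Rho rotation offsets per lane (x, y), generated as in the Keccak
    reference: offset (t+1)(t+2)/2 mod 64, visiting lanes in pi order."""
    off = {(0, 0): 0}          # lane (0, 0) is never rotated
    x, y = 1, 0
    for t in range(24):
        off[(x, y)] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return off

OFFSETS = rho_offsets()
assert len(OFFSETS) == 25 and OFFSETS[(1, 0)] == 1
assert rol64(1, OFFSETS[(0, 1)]) == 1 << 36   # known offset r[0][1] = 36
```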

Our implementation of Keccak requires 88 clock cycles to compute a single round. Since Keccak-1600 has 24 rounds, the total number of cycles required to hash a message is 2112, to which should be added the initial xor with the current state (25 cycles, repeated for each block), the load of the message (9 cycles), and the offloading of the final result (8 cycles).
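As a sanity check, the cycle counts quoted above combine as follows (arithmetic only, using the figures stated in the text):

```python
# Keccak cycle-count bookkeeping for this architecture.
CYCLES_PER_ROUND = 88
ROUNDS = 24
round_cycles = CYCLES_PER_ROUND * ROUNDS
assert round_cycles == 2112

XOR_IN = 25                    # xor of the message block into the state
per_block = round_cycles + XOR_IN   # cycles per message block

LOAD, OFFLOAD = 9, 8           # message load / final-result offload
total_one_block = per_block + LOAD + OFFLOAD
```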

Fig.5: Skein Architecture

Skein. Our implementation only contains the minimal set of operations necessary for the round computations. In order to provide acceptable performance and memory requirements, the operations are not broken up all the way down to the basic addition, exclusive-or and rotate operations, but rather realize the mix and subkey addition steps. The architecture is illustrated in Figure 5. The initial ubi value is obtained through an 8×64-bit rom (iv), which avoids hashing the first configuration block. Key extension is performed on-the-fly using some simple arithmetic and a 64-bit register (extend). One 17×64-bit ram memory (key/msg ram) is used to store both the message block, in view of the next ubi chaining, and the keys used for the current block. The hash variables can be stored in two different 4×64-bit rams (hash ram), since the permute layer never swaps even and odd words. The permute operation itself is implicitly


computed using arithmetic on memory addresses. The mix operations take two 64-bit values (mix) and require 4 cycles per round. The subkey addition acts on 64-bit values (add), requiring 8 cycles every 4 rounds. Subkeys are computed just before addition, with the help of the tweak registers (subkey and tweak). Finally, a 64-bit xor is used for ubi chaining. After the completion of the round operations, the hash digest is read from the key register. Given the variable management in this architecture, only single-port rams are needed, rather than the more expensive dual-port rams, and all of them are accessed asynchronously. When hashing a message, the operator first has to load the initialization vector, taking 9 cycles, followed by 457 cycles per 512-bit message block. Finally, one last block has to be processed before the hash value is output, leading to an overhead of 466 additional cycles.
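The claim that the permute layer never swaps even and odd words can be checked directly on the word permutation of Skein-512 (constants from the Skein specification); this is what allows the two 4×64-bit hash rams to be driven by address arithmetic alone:

```python
# Skein-512 word permutation (Skein specification): after each round,
# output word i is taken from input word PERM[i].
PERM = [2, 1, 4, 7, 6, 5, 0, 3]

# Even positions only ever receive even words (and odd positions odd words),
# so even and odd hash words can live in two separate single-port rams.
assert all(i % 2 == PERM[i] % 2 for i in range(8))

# Within the "even" ram the permute is just an address rotation 0->2->4->6->0,
# computable with arithmetic on memory addresses instead of moving data.
even_sources = [PERM[i] // 2 for i in range(0, 8, 2)]
assert even_sources == [1, 2, 3, 0]
```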

5 Implementation results & discussion

The complete implementation results for our different architectures are given in Tables 2 and 3 for Virtex-6 and Spartan-6 devices, respectively. As expected, one can notice the strong impact of the two sets of options we considered (i.e. area and timing). Still, a number of important intuitions can be extracted.

| Properties | blake | Grøstl | jh | Keccak | Skein | aes |
|---|---|---|---|---|---|---|
| Input block message size (bits) | 1024 | 1024 | 512 | 576 | 512 | 128 |
| Clock cycles per block | 1342 | 448 | 688 | 2137 | 458 | 44 |
| Clock cycles overhead (pre/post) | 12/8 | 24/354 | 16/20 | 9/8 | 9/466 | 8/0 |
| *Area optimized:* | | | | | | |
| Number of LUTs | 701 | 912 | 789 | 519 | 770 | 658 |
| Number of Registers | 371 | 556 | 411 | 429 | 158 | 364 |
| Number of Slices | 192 | 260 | 240 | 144 | 240 | 205 |
| Frequency (MHz) | 240 | 280 | 288 | 250 | 160 | 222 |
| Throughput (Mbit/s) | 183 | 640 | 214 | 68 | 179 | 646 |
| *Timing optimized:* | | | | | | |
| Number of LUTs | 810 | 966 | 1034 | 610 | 1039 | 845 |
| Number of Registers | 541 | 571 | 463 | 533 | 506 | 524 |
| Number of Slices | 215 | 293 | 304 | 188 | 291 | 236 |
| Frequency (MHz) | 304 | 330 | 299 | 285 | 200 | 250 |
| Throughput (Mbit/s) | 232 | 754 | 222 | 77 | 223 | 727 |

Table 2: Implementation results for the 5 sha-3 candidates on Virtex-6 (512-bit digests).
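The throughput rows in these tables follow directly from the other rows: for long messages, throughput = (block size × frequency) / (cycles per block), with the pre/post overheads amortized away. A quick check against two of the timing-optimized Virtex-6 entries:

```python
def throughput_mbps(block_bits, cycles_per_block, freq_mhz):
    """Long-message throughput: per-block overheads amortize away."""
    return block_bits * freq_mhz / cycles_per_block

# Grøstl, timing-optimized on Virtex-6: 1024-bit blocks, 448 cycles, 330 MHz.
assert round(throughput_mbps(1024, 448, 330)) == 754
# blake, timing-optimized on Virtex-6: 1024-bit blocks, 1342 cycles, 304 MHz.
assert round(throughput_mbps(1024, 1342, 304)) == 232
```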

In the first place, and compared to previous works, we see that our implementation results for blake are quite close to the previous ones of Beuchat et al. The main difference is our exploitation of distributed memories (reported in the slice count) rather than embedded memory blocks. By contrast, for all the other algorithms, our results bring some interesting novelty. In particular, for Keccak, the previous architecture of Bertoni et al. was using only three internal


| Properties | blake | Grøstl | jh | Keccak | Skein | aes |
|---|---|---|---|---|---|---|
| Input block message size (bits) | 1024 | 1024 | 512 | 576 | 512 | 128 |
| Clock cycles per block | 1342 | 448 | 688 | 2137 | 458 | 44 |
| Clock cycles overhead (pre/post) | 12/8 | 24/354 | 16/20 | 9/8 | 9/466 | 8/0 |
| *Area optimized:* | | | | | | |
| Number of LUTs | 719 | 912 | 737 | 525 | 888 | 685 |
| Number of Registers | 370 | 574 | 338 | 433 | 249 | 365 |
| Number of Slices | 230 | 343 | 260 | 193 | 292 | 232 |
| Frequency (MHz) | 135 | 240 | 113 | 166 | 91 | 125 |
| Throughput (Mbit/s) | 103 | 548 | 84 | 45 | 102 | 364 |
| *Timing optimized:* | | | | | | |
| Number of LUTs | 856 | 766 | 1106 | 640 | 1059 | 852 |
| Number of Registers | 594 | 759 | 646 | 476 | 395 | 529 |
| Number of Slices | 303 | 281 | 362 | 216 | 351 | 274 |
| Frequency (MHz) | 150 | 265 | 175 | 166 | 111 | 154 |
| Throughput (Mbit/s) | 114 | 605 | 130 | 45 | 124 | 448 |

Table 3: Implementation results for the 5 sha-3 candidates on Spartan-6 (512-bit digests).

registers, because of its compact asic-oriented flavor. This came at the cost of weak performance, in the range of 5000 clock cycles per hash block. We paid significant attention to taking advantage of the fpga structure, in particular its distributed rams. As a result, we reduced the number of clock cycles by a factor of more than two. As for the three remaining algorithms, no similar results were known to date, which makes them interesting as first milestones.

Next, these tables also lead to a number of comments regarding the different algorithms and their compact fpga implementations. First, one can notice that Grøstl compares favorably with all the other candidates (although not by a big margin). While it has quite expensive components, interleaving the p and q functions allows reducing the logic resources. More importantly, this algorithm processes blocks of 1024 bits and has a quite limited cycle count, which leads to a significantly higher throughput than our other implementations.

blake and jh also achieve reasonable throughput, but do not reach the level of performance of Grøstl in this case study. For blake, the input blocks are still 1024-bit wide, but our implementation requires three times more cycles per block. For jh, it is rather the smaller input block size that is the cause.

Skein provides interesting performance too. Its most noticeable limitation is a lower clock frequency, which could be improved by better pipelining the additions involved in our design. As a first step, we exploited the carry-propagate adders that are efficiently implemented in Xilinx fpgas. But this is not a theoretical limitation of the algorithm: one could reasonably assume that further optimization efforts would bring the frequency to the level of the other candidates.

Finally, Keccak presents the poorest performance. This is an interesting result in view of the excellent behavior of this algorithm in a high throughput implementation context [14]. Further optimizations could be investigated in order to reduce the number of clock cycles, but this would come at the cost of a larger datapath (hence, a higher slice count). Also, even considering a very optimistic 50 cycles per round, the throughput of Keccak would remain 6 times smaller than that of Grøstl. This suggests that, compared to the other finalists, Keccak is inherently less suitable for compact fpga implementations. The main reason for this observation relates to the different rotations used in this algorithm (which come for free in unrolled implementations but may turn out to be expensive in compact ones) and to the large state that needs to be addressed multiple times when hashing a block.

Unsurprisingly, the main difference between the Virtex-6 and Spartan-6 implementations consists in a slightly larger number of slices, most likely due to the more constraining fpga structure, and a reduction in frequency due to the lower performance of the Spartan-6 fpgas.

In addition to these results, Table 4 in appendix provides the implementation results for the 256-bit digest versions of the hash algorithms, on Virtex-6. In general, these smaller variants do not lead to significantly different conclusions. One important reason for this observation is that, when using distributed rams in an implementation, reducing the size of a state does not directly imply a gain in slices for a compact implementation (as only the depth of the memories is affected in this case). Nevertheless, this move towards smaller digests is positive for Keccak, because of a larger bitrate r. By contrast, for blake, the processing of 512-bit blocks does not come with a sufficient reduction of the number of rounds, hence leading to smaller throughputs. As for Grøstl, the number of rounds is also reduced by less than a factor of 2, but the smaller number of columns in the state matrix allows keeping a higher throughput.
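The larger bitrate r mentioned for Keccak follows from the sponge parameters: with the capacity c set to twice the digest size, r = 1600 − c, so moving from 512- to 256-bit digests nearly doubles the number of message bits absorbed per call to the permutation:

```python
STATE_BITS = 1600  # width of the Keccak-f permutation

def bitrate(digest_bits):
    # Sponge parameters: capacity c = 2 * digest size, rate r = b - c.
    return STATE_BITS - 2 * digest_bits

assert bitrate(512) == 576    # matches the 576-bit input block size above
assert bitrate(256) == 1088   # nearly twice as many bits per permutation call
```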

To conclude this work, we finally report the performance results for an aes-128 implementation, with “on-the-fly” key scheduling, based on a 32-bit architecture. This implementation is best compared with the 256-bit versions of the sha-3 candidates (because of its 128-bit key). One can notice that the slice count and throughput fall in the same range.

References

1. The sha-3 zoo. http://ehash.iaik.tugraz.at/wiki/The_SHA-3_Zoo.
2. The Skein hash function family. http://www.skein-hash.info/.
3. Jean-Philippe Aumasson, Luca Henzen, Willi Meier, and Raphael C.-W. Phan. sha-3 proposal blake (version 1.4), 2011. http://131002.net/blake/.
4. Jean-Philippe Aumasson, Willi Meier, and Raphael C.-W. Phan. The hash function family lake. In FSE, pages 36–53, 2008.
5. Daniel J. Bernstein. ChaCha, a variant of Salsa20. Workshop Record of SASC 2008: The State of the Art of Stream Ciphers, 2008. http://cr.yp.to/chacha.html#chacha-paper.
6. G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche. The Keccak sha-3 submission. Submission to NIST (Round 3), 2011.
