Compact FPGA Implementations of the Five SHA-3 Finalists.
ABSTRACT Allowing good performance on different platforms is an important criterion for the selection of the future SHA-3 standard. In this paper, we consider compact implementations of BLAKE, Grøstl, JH, Keccak and Skein on recent FPGA devices. Our results bring an interesting complement to existing analyses, as most previous works on FPGA implementations of the SHA-3 candidates were optimized for high-throughput applications. Following recent guidelines for the fair comparison of hardware architectures, we put forward clear trends for the selection of the future standard. First, compact FPGA implementations of Keccak are less efficient than their high-throughput counterparts. Second, Grøstl shows interesting performance in this setting, in particular in terms of throughput-over-area ratio. Third, the remaining candidates are comparably suitable for compact FPGA implementations, with some slight contrasts (in area cost and throughput).

Compact FPGA Implementations
of the Five SHA-3 Finalists
Stéphanie Kerckhof¹, François Durvaux¹,
Nicolas Veyrat-Charvillon¹, Francesco Regazzoni¹,
Guerric Meurice de Dormale², François-Xavier Standaert¹.
¹ Université catholique de Louvain, UCL Crypto Group,
B-1348 Louvain-la-Neuve, Belgium.
{stephanie.kerckhof, francois.durvaux, nicolas.veyrat,
francesco.regazzoni, fstandae}@uclouvain.be
² MuElec, Belgium. gm@muelec.com
Abstract. Allowing good performance on different platforms is an important
criterion for the selection of the future SHA-3 standard. In this paper, we
consider compact implementations of BLAKE, Grøstl, JH, Keccak and Skein on
recent FPGA devices. Our results bring an interesting complement to existing
analyses, as most previous works on FPGA implementations of the SHA-3
candidates were optimized for high-throughput applications. Following recent
guidelines for the fair comparison of hardware architectures, we put forward
clear trends for the selection of the future standard. First, compact FPGA
implementations of Keccak are less efficient than their high-throughput
counterparts. Second, the remaining candidates are comparably suitable for
compact FPGA implementations, with some slight contrasts (both in area cost
and throughput). Finally, our implementations provide performance in the same
range as a similarly designed AES implementation.
Introduction
The SHA-3 competition was announced by NIST on November 2, 2007. Its goal is
to develop a new cryptographic hash algorithm, addressing the concerns raised
by recent cryptanalysis results against SHA-1 and SHA-2. As for the AES
competition, a number of criteria have been retained for the selection of the
final algorithm. Security against cryptanalysis naturally comes first, but
good performance on a wide range of platforms is another important condition.
In this paper, we consider the hardware performance of the SHA-3 finalists on
recent FPGA devices.
In this respect, an important observation is that most previous works on
hardware implementations of the SHA-3 candidates focused on expensive,
high-throughput architectures, e.g. [17,23]. On the one hand, this is natural,
as such implementations provide a direct snapshot of the elementary
operations' cost for the different algorithms. On the other hand, fully
unrolled and pipelined architectures may sometimes hide a part of the
algorithms' complexity that is better revealed in compact implementations.
Namely, when trying to design more serial architectures, the possibility to
share resources, the regularity of the algorithms, and the simplicity of
addressing memories are additional factors that may influence the final
performance. In other words, compact implementations do not only depend on the
cost of each elementary operation needed in an algorithm, but also on the
number of different operations and the way they interact. Besides,
investigating such implementations is also interesting from an application
point of view, as the resources available for cryptographic functionalities in
hardware systems can be very limited. Consequently, the evaluation of this
constrained scenario is generally an important step in better understanding
the implementation potential of an algorithm.
As extensively discussed in recent years, the evaluation of hardware
architectures is inherently difficult, in view of the number of parameters
that may influence their performance. The differences can be extreme when
changing technologies. For example, ASIC and FPGA implementations deal with
memories and registers in very different ways, which generally implies
different design choices [14,16]. In a similar way, comparing FPGA
implementations based on devices from different manufacturers can only lead to
rough intuitions about their respective efficiency. In fact, even comparing
different architectures on the same FPGA is difficult, as carefully discussed
in Saar Drimer's PhD dissertation [12]. Obviously, this does not mean that
performance comparisons are impossible, but simply that they have to be
considered with care. In other words, it is important to go beyond the
quantified results obtained by performance tables, and to analyze the
different metrics they provide (area cost, clock cycles, register use,
throughput, ...) in a comprehensive manner.
Following these observations, the goal of this paper is to compare the five
SHA-3 finalists on the basis of their compact FPGA implementations. In order
to allow as fair a comparison as possible, we applied the approach described
by Gaj et al. at CHES 2010 [14]. Namely, the IP cores were designed according
to similar architectural choices and identical interface protocols. In
particular, our results are based on the a priori decision to rely on a 64-bit
datapath (see Section 3 for the details). As for the optimization goals, we
first targeted implementations in the range of hundreds of slices (slices
being the FPGAs' basic resources), additionally aiming for throughputs in the
range of hundreds of Mbits/s, in accordance with the usual characteristics of
a security IP core. In other words, we did not aim for the lowest-cost
implementations (e.g. with an 8-bit datapath), and rather investigated how
efficiently the different SHA-3 finalists allow sharing resources and
addressing memories, under optimization goals that we believe reflect the
application scenarios where reconfigurable computing is useful.
As a result, and to the best of our knowledge, we obtain the first complete
study of compact FPGA implementations for the SHA-3 finalists. For some of the
algorithms, the obtained results are the only available ones for such
optimization goals. For the others, they at least compare to the previously
reported ones, sometimes bringing major improvements. For comparison purposes,
we additionally provide the implementation results of an AES implementation
based on the same framework. Eventually, we take advantage of our results to
discuss and compare the five investigated algorithms. While none of the
remaining candidates leads to dramatically poor performance, this discussion
allows us to contrast the previous conclusions obtained from high-throughput
implementations. In particular, we put forward that the clear advantage of
Keccak in a high-throughput FPGA implementation context vanishes in a low-area
one.
1 SHA-3 finalists
This section provides a quick overview of the five SHA-3 finalists. We refer
to the original submissions for the detailed algorithm descriptions.
BLAKE. BLAKE [3] is built on previously studied components, chosen for their
complementarity. The iteration mode is HAIFA, an improved version of the
Merkle-Damgård paradigm proposed by Biham and Dunkelman [10]. It provides
resistance to long-message second preimage attacks, and explicitly handles
hashing with a salt and a “number of bits hashed so far” counter. The internal
structure is the local wide-pipe, which was already used within the LAKE hash
function [4]. The compression algorithm is a modified version of Bernstein's
stream cipher ChaCha [5], which is easily parallelizable. The two main
instances of BLAKE are BLAKE-256 and BLAKE-512. They respectively work with
32- and 64-bit words, and produce 256- and 512-bit digests. The compression
function of BLAKE relies heavily on the function G, which consists of
additions, XOR operations and rotations. It works with four variables: a, b,
c and d. It is called 112 and 128 times, respectively, for the 32- and 64-bit
versions.
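The structure of G can be sketched in a few lines of Python. This is a minimal sketch for the 64-bit variant, assuming the rotation distances 32, 25, 16 and 11 of the BLAKE-512 specification; the names `g512`, `mc0` and `mc1` are hypothetical, and the two message-XOR-constant words are assumed to be selected by the caller according to the round permutation.

```python
MASK64 = (1 << 64) - 1  # the 64-bit variant works on 64-bit words


def rotr64(x, n):
    """Rotate a 64-bit word right by n positions (0 < n < 64)."""
    return ((x >> n) | (x << (64 - n))) & MASK64


def g512(a, b, c, d, mc0, mc1):
    """One call of G on the four variables a, b, c, d.

    mc0 and mc1 stand for the two (message word XOR constant) inputs,
    assumed to be pre-selected by the caller.
    """
    a = (a + b + mc0) & MASK64   # first update of a: two additions
    d = rotr64(d ^ a, 32)
    c = (c + d) & MASK64
    b = rotr64(b ^ c, 25)
    a = (a + b + mc1) & MASK64   # second update of a
    d = rotr64(d ^ a, 16)
    c = (c + d) & MASK64
    b = rotr64(b ^ c, 11)
    return a, b, c, d


def g_calls(rounds):
    """Each round applies G to 4 columns and then to 4 diagonals."""
    return rounds * 8


print(g_calls(14), g_calls(16))
```

With 14 rounds for the 32-bit version and 16 rounds for the 64-bit version, `g_calls` reproduces the 112 and 128 calls quoted above.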
Grøstl. Grøstl [15] is an iterated hash function with a compression function
built from two fixed, large, distinct but very similar permutations P and Q.
These are constructed using the wide-trail design strategy. The hash function
is based on a byte-oriented SP-network which borrows components from the
AES [11], described by the transforms AddRoundConstant, SubBytes, ShiftBytes
and MixBytes. Grøstl is a so-called wide-pipe construction where the size of
the internal state (represented by two 8 × 16-byte matrices) is significantly
larger than the size of the output. The specification was last updated in
March 2011.
JH. JH [24] essentially exploits two techniques: a new compression function
structure and a generalized AES design methodology, which provides a simple
approach to obtain large block ciphers from small components. The compression
function proposed for JH is composed as follows. Half of a 1024-bit hash value
H(i−1) is XORed with a 512-bit message block M(i). The result of this
operation is passed through a bijective function E8, which is a 42-round block
cipher with constant key. The output of E8 is then once again XORed with M(i).
This paper considers the round 3 version of the JH specifications submitted to
NIST, in which the number of rounds has been increased from 35.5 to 42.
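The compression structure described above can be illustrated as follows. This is only a sketch: it assumes that the message is XORed into the first half of the state before E8 and into the second half after it (our reading of the JH specification, which the text above does not spell out), and `e8` is passed in as a parameter, so the identity function used in the example is a placeholder and not the real 42-round permutation.

```python
def xor_bytes(x, y):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))


def jh_compress(h_prev, m, e8):
    """Structure of the JH compression function.

    h_prev: 128-byte (1024-bit) chaining value H(i-1)
    m:      64-byte (512-bit) message block M(i)
    e8:     bijective function on 128-byte strings (placeholder here)
    """
    assert len(h_prev) == 128 and len(m) == 64
    # Assumption: M(i) goes into the first half before E8 ...
    pre = xor_bytes(h_prev[:64], m) + h_prev[64:]
    mid = e8(pre)
    # ... and into the second half after E8.
    return mid[:64] + xor_bytes(mid[64:], m)


# Placeholder permutation, only to exercise the data flow.
identity = lambda s: s
block = bytes(range(64))
h_next = jh_compress(bytes(128), block, identity)
```

With the identity placeholder and an all-zero chaining value, both halves of `h_next` equal the message block, which makes the two XOR points visible.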
Keccak. Keccak [6] is a family of sponge functions [7], characterized by two
parameters: a bitrate r, and a capacity c. The sponge construction uses r + c
bits of state and essentially works in two steps. In a first absorbing phase,
r bits are updated by XORing them with message bits and applying the Keccak
permutation (called f). Next, during the squeezing phase, r bits are output
after each application of the same permutation. The remaining c bits are not
directly affected by message bits, nor taken as output. The version of the
Keccak function proposed as SHA-3 standard operates on a 1600-bit state,
organized in words. The function f is iterated a number of times determined by
the size of the state, and it is composed of five operations. Theta consists
of a parity computation, a rotation by one position, and a bitwise XOR. Rho is
a rotation by an offset which depends on the word position. Pi is a
permutation. Chi consists of bitwise XOR, NOT and AND gates. Finally, Iota is
a round constant addition.
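The five operations and the absorb/squeeze phases can be sketched compactly in software. The code below is an illustrative model, not the hardware design of this paper; the function names are ours, and the lane layout (`lanes[x][y]`, little-endian bytes) follows the usual software convention. The padding suffix is left as a parameter: the cross-check at the end uses the FIPS 202 SHA3-256 parameters (rate of 136 bytes, suffix 0x06) only because Python's standard library offers that function for comparison.

```python
import hashlib

MASK64 = (1 << 64) - 1


def rol64(x, n):
    """Rotate a 64-bit word left by n positions."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64


def keccak_f1600(lanes):
    """24 rounds of Keccak-f[1600] on a 5x5 array lanes[x][y] of 64-bit words."""
    lanes = [column[:] for column in lanes]
    lfsr = 1
    for _ in range(24):
        # Theta: column parities, rotation by one position, XOR
        parity = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
                  for x in range(5)]
        for x in range(5):
            d = parity[(x + 4) % 5] ^ rol64(parity[(x + 1) % 5], 1)
            for y in range(5):
                lanes[x][y] ^= d
        # Rho and Pi: position-dependent rotation, then word permutation
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # Chi: XOR, NOT and AND along each row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # Iota: round constant generated by an 8-bit LFSR
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def permute_bytes(state):
    """Apply Keccak-f[1600] to a 200-byte state."""
    lanes = [[int.from_bytes(state[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
              for y in range(5)] for x in range(5)]
    lanes = keccak_f1600(lanes)
    out = bytearray(200)
    for x in range(5):
        for y in range(5):
            out[8 * (x + 5 * y):8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return out


def sponge(message, rate_bytes, out_bytes, suffix):
    """Absorb `message` r bits at a time, then squeeze out_bytes <= rate_bytes bytes."""
    state = bytearray(200)                       # r + c = 1600 bits
    padded = bytearray(message) + bytes([suffix])
    padded += bytes(-len(padded) % rate_bytes)   # zero-fill up to a full block
    padded[-1] ^= 0x80                           # final padding bit
    for offset in range(0, len(padded), rate_bytes):
        for i in range(rate_bytes):              # the c remaining bits are untouched
            state[i] ^= padded[offset + i]
        state = permute_bytes(state)
    return bytes(state[:out_bytes])


# Cross-check of the permutation against the standard library (FIPS 202 parameters).
assert sponge(b"abc", 136, 32, 0x06) == hashlib.sha3_256(b"abc").digest()
```

The Rho and Pi steps are merged into a single walk over the 24 non-central words, which is also how the two steps are grouped in the architecture of Section 4.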
Skein. Skein [2] is built out of a tweakable block cipher [20] which allows
hashing configuration data along with the input text in every block, and makes
every instance of the compression function unique. The underlying primitive of
Skein is the Threefish block cipher: it contains no S-box and implements a
nonlinear layer using a combination of 64-bit rotations, XORs and additions
(i.e. operations that are very efficient on 64-bit processors). The Unique
Block Iteration (UBI) chaining mode uses Threefish to build a compression
function that maps an arbitrary input size to a fixed output size. Skein
supports internal state sizes of 256, 512 and 1024 bits, and arbitrary output
sizes. The proposal was last updated in October 2010 (version 1.3) [13].
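The combination of additions, rotations and XORs can be illustrated with the Threefish MIX operation on a pair of 64-bit words. This is a minimal sketch: the rotation amount depends on the round and on the word position in the real cipher, so it is passed here as a parameter `rot`, and the value 14 in the example is chosen only for illustration.

```python
MASK64 = (1 << 64) - 1


def rotl64(x, n):
    """Rotate a 64-bit word left by n positions (0 < n < 64)."""
    return ((x << n) | (x >> (64 - n))) & MASK64


def mix(x0, x1, rot):
    """One MIX step: a modular addition, a rotation and an XOR."""
    y0 = (x0 + x1) & MASK64
    y1 = rotl64(x1, rot) ^ y0
    return y0, y1


print(mix(1, 2, 14))  # (3, 32771)
```

The carry propagation of the 64-bit addition is the only source of nonlinearity, which is why the text above notes the absence of S-boxes.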
2 Related works
Table 1 provides a partial overview of existing low-area FPGA implementations
of the SHA-3 candidates, as reported in the SHA-3 Zoo [1]. Since our results
were obtained for Virtex-6 and Spartan-6 devices, we list only the most
relevant implementations on similar FPGAs.
BLAKE has been implemented in two different ways. The first one, designed by
Aumasson et al. [3], consists of the core functionality (CF) with one G
function. This implementation offers a light version of the algorithm but does
not really exploit FPGA specificities. The second BLAKE implementation, by
Beuchat et al. [9], is a fully autonomous implementation (FA) designed to fit
the Xilinx FPGA architecture closely: the slice's carry-chain logic is
exploited to build an adder/XOR operator within the same slices. The authors
also harness the 18-kbit embedded memory blocks (MB) to implement the register
file and store the microcode of the control unit. Table 1 shows Spartan-3 (S3)
and Virtex-5 (V5) implementation results.
Jungk et al. [18,19] chose to implement the Grøstl algorithm on a Spartan-3
device. They provide a fully autonomous implementation including padding. The
similarity between Grøstl and the AES is exploited, and AES-specific
optimizations presented in previous works are applied. The table only reports
the best and most recent results from [18]. Also, only serial implementations
of P and Q are considered, because they better match our low-area optimization
goal.

Reference                 Algorithm   Scope  FPGA  Area      Reg.  MB  Clk   Freq.  Thr.
                                                   [slices]            cyc.  [MHz]  [Mbps]
Aumasson et al. [3]       BLAKE-32    CF     V5    390       -     -   -     91     575
Beuchat et al. [9]        BLAKE-32    FA     S3    124       -     2   844   190    115
Beuchat et al. [9]        BLAKE-32    FA     V5    56        -     2   844   372    225
Aumasson et al. [3]       BLAKE-64    CF     V5    939       -     -   -     59     533
Beuchat et al. [9]        BLAKE-64    FA     S3    229       -     3   1164  158    138
Beuchat et al. [9]        BLAKE-64    FA     V5    108       -     3   1164  358    314
Jungk et al. [19]         Grøstl-256  FA*    S3    2486      -     0   -     63     404
Jungk et al. [18]         Grøstl-256  FA*    S3    1276      -     0   -     60     192
Jungk et al. [18]         Grøstl-512  FA*    S3    2110      -     0   -     63     144
Homsirikamol et al. [17]  JH-256      FA     V5    1018      -     -   36    381    5416
Homsirikamol et al. [17]  JH-512      FA     V5    1104      -     -   36    395    5610
Bertoni et al. [8]        Keccak-256  EM     V5    444       227   -   3870  265    70
Namin et al. [22]         Skein-256   CF     AS3   1385**    1858  -   72    574    161

Table 1: Existing compact FPGA implementations of third-round SHA-3 candidates
(* padding included, ** Altera ALUTs).
No low-area implementation of JH has been proposed up to now. In order to have
a comparison, the implementation proposed by Homsirikamol et al. [17] may be
mentioned. It is the high-speed FPGA implementation that has the lowest area
cost reported in the literature.
A low-area implementation of the Keccak algorithm is given by Bertoni et
al. [8]. In this implementation, the hash function is implemented as a
small-area coprocessor using system (external) memory (EM). In the best case,
with a 64-bit memory, the function takes approximately 5000 clock cycles to
compute. With a 32-bit memory, this number increases up to 9000 clock cycles.
Finally, Namin et al. [22] presented a low-area implementation of Skein. It
provides the core functionality and is evaluated on an Altera Stratix-III
(AS3) FPGA.
3 Methodology
As seen in the previous section, only a few low-area FPGA implementations of
the SHA-3 candidates exist so far. Furthermore, those implementations often
lack similar specifications, which makes them difficult to compare. Therefore,
we propose to implement low-area designs of the five third-round candidates
and to evaluate their performance. In order to have a fair comparison between
the different implementations, we followed the methodology described by Gaj et
al. [14], which suggests using a uniform interface and architecture, and
defines some performance metrics.
First of all, we decided to primarily focus on the SHA-3 candidate variants
with the 512-bit digest output size, as they correspond to the most
challenging scenario for compact implementations, and may be the most
informative for comparison purposes. For completeness, we also report the
implementation results of the 256-bit versions in the appendix; they are based
on essentially similar architectures. Next, since we are implementing low-area
designs, we limited the internal datapath to 64-bit bus widths. This is a
natural choice, as most presented algorithms are designed to operate well on
64-bit processors. Therefore, trying to decrease the bus size tends to be
cumbersome and provides a limited area improvement at the expense of a
significantly decreased throughput. In addition, we specified a common
interface for all our designs, in which we chose to have an input message
width of 64 bits, as this is a commonly encountered bus size in hardware.
Bigger bus sizes would most of the time require adding a parallelizer in front
of the module, which is resource-consuming. All our cores have been designed
to be fully autonomous, which helps in comparing the total resources needed by
each candidate.
Drimer showed in [12] that implementation results are subject to great
variations, depending on the implementation options. Furthermore, comparing
different implementations with each other can be irrelevant if not done with
careful consideration. We therefore specified fixed implementation options and
architecture choices for all our implementations. We chose to work on Virtex-6
and Spartan-6 FPGAs, specifically an xc6vlx75t with speed grade -1 and an
xc6slx9 with speed grade -2, which are the most constraining FPGAs in their
respective families in terms of number of available logic elements. Note that
the selection of a high-performance device is not in contradiction with
compact implementations, as we typically envision applications in which the
hash functionality can only consume a small fraction of the FPGA resources.
Also, we believe it is interesting to detail implementation results exploiting
the latest FPGA structures, as these advanced structures will typically be
available in future low-cost FPGAs too. In other words, we expect this choice
to better reflect the evolution of reconfigurable hardware devices. Besides,
and as will be illustrated by the implementation tables in Section 5, the
results for Virtex-6 and Spartan-6 devices do not significantly modify our
conclusions regarding the suitability of the SHA-3 finalists for compact FPGA
implementations.
We did not use any dedicated FPGA resources such as block RAMs or DSPs. It is
indeed easier to compare implementations when they are all represented in
terms of slices rather than in a combination of several factors. Additionally,
the use of block RAMs is often not optimal, as they are too big for our actual
needs. All the implementations take advantage of the particular LUT
capabilities of the Virtex-6 and Spartan-6, and use shift registers and/or
distributed RAMs (or ROMs). The different modules are however always inferred,
so that portability to other devices is possible, even if not optimal. The
designs were implemented using ISE 12.1 with two different sets of parameters.
Those two sets are predefined sets available in the ISE Design Goals and
Strategies project options and are specified as “Area Reduction with Physical
Synthesis” and “Timing Performance without IOB Packing”.
We have made the assumption that padding is performed outside of our cores,
for the same reasons as in [14]. The padding functions are very similar from
one hash function to another and will mainly result in the same absolute area
overhead. Additionally, the complexity of the padding function will depend on
the granularity of the message (bits, bytes, words, ...) considered in each
application.
Finally, the performance metric we use in this text is always the throughput
for long messages (as defined in [14]). We did not specify the throughput for
short messages, but the information needed to compute it is present in the
result tables of Section 5.
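As a reminder of how this metric behaves, the sketch below computes the long-message throughput as the block size divided by the time needed to process one additional block, which is our understanding of the definition in [14]. The function name is ours. The example values are the two BLAKE-32 rows of Beuchat et al. in Table 1 (844 cycles per block), assuming the 512-bit block size of BLAKE-32.

```python
def long_message_throughput_mbps(block_bits, cycles_per_block, freq_mhz):
    """Throughput for long messages, in Mbit/s.

    Per-message costs (initialization, finalization) are ignored because
    they are amortized over many blocks.
    """
    return block_bits * freq_mhz / cycles_per_block


spartan3 = long_message_throughput_mbps(512, 844, 190)
virtex5 = long_message_throughput_mbps(512, 844, 372)
print(int(spartan3), int(virtex5))  # 115 225
```

Both values match the throughput column of Table 1 for these two rows. For short messages, the initialization and finalization cycles would have to be added to the denominator.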
4 Architectures
This section presents the different compact architectures we developed. Because
of space constraints, we mainly focus on the description of their block diagrams.
BLAKE. The BLAKE algorithm is implemented as a narrow pipelined datapath
design. The architecture of BLAKE is illustrated in Figure 1. The overall
organization is similar to the implementation proposed by Beuchat et al.
BLAKE has a large 16-word state matrix v, but each operation works with only
two elements of it. Hence, the datapath does not need to be larger than 32 or
64 bits, respectively for the BLAKE-256 and BLAKE-512 implementations. The
operations are quite simple: they consist of additions, XORs and rotations.
This allows us to design a small ALU embedding all the required operators in
parallel, followed by a multiplexer. The way the ALU is built allows computing
XOR-rotation and XOR-addition operations in one clock cycle.
Our BLAKE implementation uses distributed RAM memory to store intermediate
values, message blocks and c constants. Using this kind of memory offers some
advantages. Beyond efficient slice occupation, the controller must be able to
access different values in random order. Indeed, message blocks and c
constants are chosen according to elements of a permutation matrix.
Furthermore, elements of the inner state matrix are selected in different
orders during column and diagonal steps.
The 4-input multiplexer in front of the RAM memory is used to load message
blocks (m), salt (s) and counter (t) through the Message input, to load the
initialization vector (IV), to write the ALU results thanks to the feedback
loop, and to set the salt to zero automatically if the user does not specify
any value. Loading the salt or initializing it to zero takes 4 clock cycles.
Loading the initialization vector takes 8 clock cycles. These first two steps
are performed once per message. The two following steps, which are loading the
counter and the message block, take 18 clock cycles and are carried out for
each new message block.
The scheduling is made so that, for each call of the round function G (as
described in Section 1), the variable a is computed in two clock cycles,
because it needs two additions between three different inputs. The three other
variables (b, c, and d) are computed in one clock cycle thanks to the feedback
loop on the ALU. As a result, one call of the G function needs 10 clock cycles
to be executed. To avoid pipeline bubbles between column and diagonal steps,
the ordering of G functions during the diagonal step is changed to G4, G7, G6
and G5. The BLAKE-64 version needs 16 (rounds) × 8 (G calls) × 10 = 1280 clock
cycles to process one block through the round function, and 4 more to empty
the pipeline. The initialization and the finalization steps each need 20 clock
cycles. Hence, completely hashing one message block takes 18 + 1284 + 40 =
1342 clock cycles. Finally, the hash value is directly read on the output of
the RAM and takes 8 clock cycles to be entirely read.
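The cycle budget above can be re-derived step by step. The short script below only restates the figures quoted in the text; the variable names are ours, and the factor 2 reflects the fact that G updates each of its four variables twice.

```python
# Cycles for one G call: each of a, b, c, d is updated twice;
# a needs 2 cycles per update, while b, c and d need 1 cycle each.
UPDATES_PER_G = 2
cycles_per_g = UPDATES_PER_G * (2 + 1 + 1 + 1)

ROUNDS = 16                   # BLAKE-64
G_CALLS_PER_ROUND = 8         # 4 column steps + 4 diagonal steps
PIPELINE_FLUSH = 4
LOAD_COUNTER_AND_BLOCK = 18
INIT_AND_FINAL = 20 + 20

round_function = ROUNDS * G_CALLS_PER_ROUND * cycles_per_g
total = LOAD_COUNTER_AND_BLOCK + round_function + PIPELINE_FLUSH + INIT_AND_FINAL
print(cycles_per_g, round_function, total)  # 10 1280 1342
```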
As expected, these results are very close to those announced by Beuchat et
al. [9] after adjustment (they considered 14 rounds for the BLAKE-64 version
rather than 16), since the overall architectures are very similar.
Fig. 1: BLAKE architecture
Grøstl. The 64-bit architecture of the Grøstl algorithm is depicted in
Figure 2. This pipelined datapath implements the P and Q permutation rounds in
an interleaved fashion (to avoid data dependency problems). The last round
function (Ω) is implemented with the same datapath and only resorts to P. The
difference between P and Q lies in slightly different AddRoundConstant
constants and ShiftBytes shift patterns. Besides the main AES-like functions,
there are several additional circuits. A layer of multiplexers and bitwise
XORs is required at the beginning and at the end of the datapath. They
implement the algorithm initialization, the additions necessary at the
beginning and end of each round, and internal and external data loading. Two
distinct RAMs are used: one stores the P and Q state matrices and the input
message mi (RAM QPM), the other stores the hash result (RAM H). RAM QPM is a
64 × 64-bit dual-port RAM. One RAM slot is used to store the message mi, and
three other slots are used to store the current and next P and Q states (slots
are used as a circular buffer). RAM H is a 32 × 64-bit dual-port RAM that
stores the current and next H (as well as the final result).
The four main operations of each P or Q round are implemented in the following
way. The ShiftBytes operation comes first. It is implemented by accessing
bytes of different columns instead of a single column (as if RAM QPM was a
collection of eight 8-bit RAMs), to save a memory in the middle of the
datapath. Different memory access patterns (meaning different initializations
of the address counters) are required to implement the P and Q ShiftBytes, as
well as no shift (for the post-addition with H and hash unloading). The
constants of AddRoundConstant are computed thanks to a few small-size integer
counters (corresponding to the row and round numbers) and all-zero or all-one
constants. The addition of those constants to the data is a simple bitwise XOR
operation. The eight S-boxes of SubBytes are simply implemented as eight 8 ×
8-bit ROMs (efficiently implemented in FPGAs based on 6-input look-up tables).
Finally, the MixBytes operation is similar to the AES MixColumn, except that
8 × 6 different 8-bit F2 multiplications by small constants are required, and
that eight 64-bit partial products have to be added together. We implemented
it as a large XOR tree, with multipliers hardcoded as 8-bit XORs and partial
products XORed together.
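To make the "multipliers hardcoded as XORs" idea concrete, the sketch below applies MixBytes to one 8-byte column in software. It assumes the circulant matrix circ(02, 02, 03, 04, 05, 03, 05, 07) and the AES reduction polynomial, as we recall them from the Grøstl specification; the helper names are ours, and the code models the arithmetic only, not the XOR-tree layout of our datapath.

```python
from functools import reduce
from operator import xor


def xtime(b):
    """Multiply a byte by 02 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def mul_small(b, k):
    """Multiply byte b by a small constant k (1..7) using only xtime and XOR."""
    b2 = xtime(b)
    b4 = xtime(b2)
    return (b if k & 1 else 0) ^ (b2 if k & 2 else 0) ^ (b4 if k & 4 else 0)


# First row of the circulant MixBytes matrix (assumed from the specification).
MIX_ROW = (2, 2, 3, 4, 5, 3, 5, 7)


def mix_bytes_column(col):
    """Multiply one 8-byte state column by the circulant matrix."""
    return [reduce(xor, (mul_small(col[j], MIX_ROW[(j - i) % 8]) for j in range(8)))
            for i in range(8)]


print(mix_bytes_column([1, 0, 0, 0, 0, 0, 0, 0]))  # [2, 7, 5, 3, 5, 4, 3, 2]
```

Since every constant is at most 07, each product is an XOR of at most three shifted copies of the input byte, which is what makes a pure XOR tree practical.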
Hashing a 1024-bit chunk of a message takes around 450 cycles: 16 (loading of
mi) + 14 (rounds) × 2 (interleaved P and Q) × 16 (columns of state matrix)
+ 8 (ending). The last operation Ω requires around 350 cycles: 14 (rounds) ×
(16 (columns) + 6 (pipeline flush)) + 8 (ending) + 8 (hash output).
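Evaluating the two expressions term by term gives the exact counts behind these rounded figures. The constant names below are ours; the values are those listed in the text.

```python
ROUNDS = 14
COLUMNS = 16          # columns of one state matrix
LOAD_MESSAGE = 16
ENDING = 8
PIPELINE_FLUSH = 6
HASH_OUTPUT = 8

# One message block: P and Q rounds are interleaved, so no flush is needed.
block_cycles = LOAD_MESSAGE + ROUNDS * 2 * COLUMNS + ENDING
# Final transformation: only P is computed, so the pipeline is flushed each round.
final_cycles = ROUNDS * (COLUMNS + PIPELINE_FLUSH) + ENDING + HASH_OUTPUT
print(block_cycles, final_cycles)  # 472 324
```

The comparison shows the benefit of interleaving: the final transformation processes half as many columns per round, yet saves far less than half of the cycles because of the per-round flush.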
Roughly speaking, the most resource-consuming parts of the architecture are
MixBytes (accounting for 30 % of the final cost), the S-boxes (25 %) and the
control of the dual-port RAMs (25 %). Note that most pipeline registers are
taken from already consumed slices, and hence do not increase the slice count
of the implementation.
Fig. 2: Grøstl architecture
JH. The JH architecture is illustrated in Figure 3 and is composed as follows.
Two 16 × 32-bit single-port distributed RAMs (HASH RAM) are used to store the
intermediate hash values. Those RAMs are first initialized in 16 clock cycles
with IV values coming from a 16 × 64-bit distributed ROM (see footnote 1) and
are afterwards updated with the output of R8 or the output of the XOR
operation. R8 performs the round function and is composed of sixteen 5 × 4
S-boxes, eight linear functions and a permutation. As the permutation layer
always puts in correspondence two consecutive nibbles with a nibble from the
first half and another from the second half of the permuted state, the output
of R8 can be split into two 32-bit words, one coming from the first half and
the other from the second half of the intermediate hash value. An address
controller (ADDR CONTR), composed of two 16 × 4-bit dual-port distributed
RAMs, is then used to reach the wanted location in each HASH RAM at each
cycle. Rotations before and after R8 are needed to organize the intermediate
hash values correctly in the two HASH RAMs.
Footnote 1: the IV ROM contains the initial value H(0) and not H(−1) as
defined in the JH specifications. That way, we save 688 cycles of
initialization and only lose a couple of slices.
A similar path is designed for the generation of constants. Two 16 × 8-bit
single-port distributed RAMs (CST RAMs) are used to store the intermediate
values of the constants. The function R6 performs a round function on 16 bits
of the constant state. The same address controller as for the HASH RAMs is
used for the CST RAMs.
Finally, a group/de-group block is used to reorganize the input message. As JH
has been designed to achieve efficient bit-slice software implementations, a
grouping of bits into 4-bit elements has been defined as the first step of the
JH bijective function E8. Similarly, a de-grouping is performed in the last
step of E8. While those grouping and de-grouping phases have no impact on
high-speed hardware implementations (as they result only in routing), this is
not the case anymore for low-area architectures. Indeed, those steps require
16 additional clock cycles per message block, as well as more complex controls
to access the single-port RAMs. To avoid this, we chose to always work on a
grouped hash and therefore to perform the data organization on the message
with the group/de-group block. The same component is also used to reorganize
the final hash value before sending it to the user.
Fig. 3: JH architecture
Our implementation of jh needs 16×42 clock cycles to compute the 42 rounds
and 16 additional ones to perform the final xor operation. In total, 688 clock
cycles are required to process a 512-bit message block. In addition, 16 clock
cycles are used for ram initialization and 20 for the finalization step (4 to
empty the pipeline and 16 to output the hash from the group/de-group component).
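The cycle budget above can be checked with a few lines of arithmetic. The sketch below only uses the figures quoted in the text; the helper name jh_cycles, and the assumption that initialization and finalization are paid once per message, are ours.

```python
# Cycle budget of the compact jh core, using only the figures quoted in the text.
ROUNDS = 42             # rounds per message block
CYCLES_PER_ROUND = 16   # clock cycles spent on each round
FINAL_XOR = 16          # clock cycles for the final xor operation
INIT = 16               # ram initialization (assumed to be paid once per message)
FINALIZATION = 4 + 16   # empty the pipeline, then output the hash

CYCLES_PER_BLOCK = ROUNDS * CYCLES_PER_ROUND + FINAL_XOR


def jh_cycles(num_blocks: int) -> int:
    """Total cycles for a message of num_blocks 512-bit blocks (hypothetical helper)."""
    return INIT + num_blocks * CYCLES_PER_BLOCK + FINALIZATION


print(CYCLES_PER_BLOCK)  # 688
print(jh_cycles(1))      # 724
```

For long messages the fixed 36 cycles of initialization and finalization become negligible, so the per-block count dominates the throughput.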
Keccak. Our architecture, depicted in Figure 4, implements the Keccak version
proposed as sha3 standard. It works on a state of 1600 bits organized into 25
words of 64 bits each. The whole algorithm does not use complex operations,
but only xors, rotations, negations and ands. The basic operations are
performed on the 64-bit words, thus our implementation has a 64-bit internal
datapath.
We maintained the same organization as Bertoni et al., where the computation
was split into three main steps: the first, which does part of the theta
transformation by calculating the parity of the columns; the second, which
completes the theta transformation and performs the rho and pi transformations;
and the third, which computes the chi and iota steps. This structure requires
a memory of 50 words of 64 bits, which are needed to store the state and
the intermediate values at the end of the pi transformation.
To allow parallel read/write operations and to simplify access to the state,
we organized the whole memory into two distinct asynchronous-read single-port
rams of 32×64 bits (ram a and ram b), and we reserved ram b to store the
output of the pi transformation.
Internally, our architecture has 5 registers of 64 bits, connected so as to
create a word-oriented rotator. During the theta transformation, the registers
store the results of the computed parities. The rotator makes it possible to
quickly position the correct word for computing the second part of theta, as
well as for computing the chi transformation.
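To make the three-step organization concrete, the following sketch models one Keccak round functionally, with one array per memory (25 + 25 = 50 words). It follows the standard round definition of the Keccak specification and is not cycle-accurate; the names ram_a, ram_b and parity are ours and only loosely mirror the memories and registers described above.

```python
MASK = (1 << 64) - 1


def rol(word: int, n: int) -> int:
    """Rotate a 64-bit word left by n positions."""
    n %= 64
    return ((word << n) | (word >> (64 - n))) & MASK


def rho_offsets():
    """Rotation offsets r[x][y], from the recurrence of the Keccak specification."""
    r = [[0] * 5 for _ in range(5)]
    x, y = 1, 0
    for t in range(24):
        r[x][y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return r


def round_constants():
    """The 24 iota constants, from the LFSR of the Keccak specification."""
    constants, lfsr = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                rc ^= 1 << ((1 << j) - 1)
        constants.append(rc)
    return constants


R = rho_offsets()
RC = round_constants()


def keccak_round(ram_a, rc):
    """One round on a 5x5 array of 64-bit words, indexed as ram_a[x][y]."""
    # Step 1: first part of theta, the five column parities (one per register).
    parity = [ram_a[x][0] ^ ram_a[x][1] ^ ram_a[x][2] ^ ram_a[x][3] ^ ram_a[x][4]
              for x in range(5)]
    # Step 2: complete theta, then rho and pi; results go to the second memory.
    ram_b = [[0] * 5 for _ in range(5)]
    for x in range(5):
        d = parity[(x + 4) % 5] ^ rol(parity[(x + 1) % 5], 1)
        for y in range(5):
            ram_b[y][(2 * x + 3 * y) % 5] = rol(ram_a[x][y] ^ d, R[x][y])
    # Step 3: chi, then iota on word (0, 0); results go back to the first memory.
    out = [[ram_b[x][y] ^ (~ram_b[(x + 1) % 5][y] & ram_b[(x + 2) % 5][y] & MASK)
            for y in range(5)] for x in range(5)]
    out[0][0] ^= rc
    return out


def keccak_f1600(state):
    """Apply the 24 rounds to a 5x5 state of 64-bit words."""
    for rc in RC:
        state = keccak_round(state, rc)
    return state
```

Step 2 reads each word of the state once and writes it to a different location, which is why a second memory of 25 words is convenient for holding the output of pi.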
Fig.4: Keccak Architecture
The most crucial part of Keccak is the rho transformation, which consists of
rotating each word by an offset that depends on its index. We implemented
this step efficiently in fpga by explicitly instantiating a 64-bit barrel
rotator and by storing the rotation offsets in a dedicated look-up table. Using
a single barrel rotator makes it possible to significantly reduce the area
requirements. While this negatively affects the performance (since all 25 words
of the state need to be processed by the same component), it allows reaching an
overall cost that is comparable to that of the other algorithms, as will be
detailed in the next section.
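As an illustration of this step, the sketch below models a shared 64-bit barrel rotator driven by a table of offsets. The staged structure (one power-of-two stage per offset bit) is one common way of building a barrel rotator and is our assumption, not a description of the actual circuit; the offsets are those of the Keccak specification, and the names RHO_LUT, barrel_rotate and rho are ours.

```python
MASK = (1 << 64) - 1

# Rotation offsets of rho, indexed as RHO_LUT[x][y] (values of the Keccak specification).
RHO_LUT = [
    [0, 36, 3, 41, 18],
    [1, 44, 10, 45, 2],
    [62, 6, 43, 15, 61],
    [28, 55, 25, 21, 56],
    [27, 20, 39, 8, 14],
]


def barrel_rotate(word: int, offset: int) -> int:
    """Rotate a 64-bit word left by offset, one power-of-two stage per offset bit."""
    for stage in range(6):            # 6 stages cover offsets 0..63
        shift = 1 << stage
        if (offset >> stage) & 1:     # in hardware, each stage is a row of 2-to-1 multiplexers
            word = ((word << shift) | (word >> (64 - shift))) & MASK
    return word


def rho(state):
    """Apply rho by sending all 25 words through the same rotator, one after the other."""
    return [[barrel_rotate(state[x][y], RHO_LUT[x][y]) for y in range(5)]
            for x in range(5)]
```

Because the rotator is shared, the 25 words are processed sequentially, which is the source of the cycle-count penalty mentioned above.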
Our implementation of Keccak requires 88 clock cycles to compute a single
round. Since Keccak1600 has 24 rounds, the total number of cycles required
to process a message block is 2112, to which should be added the initial xor
with the current state (25 cycles, repeated for each block), the load of the
message (9 cycles), and the offloading of the final result (8 cycles).
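These figures can be combined as follows. The sketch uses only the numbers quoted in the text; the remark that the load and offload counts match a 64-bit datapath is our own observation.

```python
ROUNDS = 24
CYCLES_PER_ROUND = 88
XOR_WITH_STATE = 25    # repeated for each block
LOAD_MESSAGE = 9       # cycles to load the message
OFFLOAD_RESULT = 8     # cycles to output the final result

permutation_cycles = ROUNDS * CYCLES_PER_ROUND
cycles_per_block = permutation_cycles + XOR_WITH_STATE

print(permutation_cycles)  # 2112
print(cycles_per_block)    # 2137

# Observation (ours): with 64-bit words, a 576-bit block is 9 words and a
# 512-bit digest is 8 words, which matches the load and offload counts.
assert 576 // 64 == LOAD_MESSAGE
assert 512 // 64 == OFFLOAD_RESULT
```

The resulting 2137 cycles per block is the value that enters the throughput figures of the next section.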
Fig.5: Skein Architecture
Skein. Our implementation only contains the minimal set of operations necessary
to realize the round computations. In order to provide acceptable performance
and memory requirements, the operations are not broken up all the way
down to the basic addition, exclusive-or and rotate operations, but rather
realize the mix and subkey addition steps. The architecture is illustrated in
Figure 5. The initial ubi value is obtained through an 8×64-bit rom (iv), which
avoids hashing the first configuration block. Key extension is performed
on-the-fly using some simple arithmetic and a 64-bit register (extend). One
17×64-bit ram memory (key/msg ram) is used to store both the message block in
view of the next ubi chaining, and the keys used for the current block. The hash
variables can be memorized in two different 4×64-bit rams (hash ram), since the
permute layer never swaps even and odd words. The permute operation itself is
implicitly computed using arithmetic on memory addresses. The mix operations
take two 64-bit values (mix), and require 4 cycles per round. The subkey
addition acts on 64-bit values (add), requiring 8 cycles every 4 rounds. Subkeys
are computed just before addition, with the help of the tweak registers (subkey
and tweak). Finally, a 64-bit xor is used for ubi chaining. After the completion
of round operations, the hash digest is read from the key register. Given the
variable management in this architecture, only single-port rams are needed,
rather than the more expensive dual-port rams. All these are used
asynchronously. When hashing a message, the operator first has to load the
initialization vector, taking 9 cycles, followed by 457 cycles per 512-bit
message block. Finally, one last block has to be processed before the hash value
is output, leading to an overhead of 466 additional cycles.
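The remark that the permute layer can be computed on addresses rather than on data can be illustrated as follows. The sketch uses the word permutation of Threefish-512 (the block cipher underlying Skein-512); the function names and the even/odd address split are our assumptions, meant only to show the principle, not the actual address logic of the design.

```python
# Word permutation of Threefish-512: after a round, word i takes the value
# previously held by word PI[i]. Even indices map to even, odd to odd.
PI = [2, 1, 4, 7, 6, 5, 0, 3]


def permute_data(words):
    """Reference behaviour: physically move the eight 64-bit words."""
    return [words[PI[i]] for i in range(8)]


def permute_addresses(addr):
    """Address-only variant: addr[i] is where logical word i is stored."""
    return [addr[PI[i]] for i in range(8)]


def split_address(a):
    """Map a word address to (ram index, location): even words in ram 0, odd in ram 1."""
    return a % 2, a // 2


# The stored words never move; only the address map is updated after each round.
stored = [10, 11, 12, 13, 14, 15, 16, 17]
addr = list(range(8))
for _ in range(4):
    addr = permute_addresses(addr)
print(addr)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Since the permutation has order 4, the address map returns to the identity every four rounds, and because it preserves parity, a word never has to migrate between the two 4-word rams.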
5 Implementation results & discussion
The complete implementation results for our different architectures are given in
Tables 2 and 3 for Virtex-6 and Spartan-6 devices, respectively. As expected,
one can notice the strong impact of the two sets of options we considered (i.e.
area and timing). Still, a number of important intuitions can be extracted.
                                       blake  Grøstl      JH  Keccak   Skein     AES
Properties
  Input block message size              1024    1024     512     576     512     128
  Clock cycles per block                1342     448     688    2137     458      44
  Clock cycles overhead (pre/post)      12/8  24/354   16/20     9/8   9/466     8/0
Area
  Number of LUTs                         701     912     789     519     770     658
  Number of Registers                    371     556     411     429     158     364
  Number of Slices                       192     260     240     144     240     205
  Frequency (MHz)                        240     280     288     250     160     222
  Throughput (Mbit/s)                    183     640     214      68     179     646
Timing
  Number of LUTs                         810     966    1034     610    1039     845
  Number of Registers                    541     571     463     533     506     524
  Number of Slices                       215     293     304     188     291     236
  Frequency (MHz)                        304     330     299     285     200     250
  Throughput (Mbit/s)                    232     754     222      77     223     727
Table 2: Implementation results for the 5 sha3 candidates on Virtex-6 (512-bit
digests).
                                       blake  Grøstl      JH  Keccak   Skein     AES
Properties
  Input block message size              1024    1024     512     576     512     128
  Clock cycles per block                1342     448     688    2137     458      44
  Clock cycles overhead (pre/post)      12/8  24/354   16/20     9/8   9/466     8/0
Area
  Number of LUTs                         719     912     737     525     888     685
  Number of Registers                    370     574     338     433     249     365
  Number of Slices                       230     343     260     193     292     232
  Frequency (MHz)                        135     240     113     166      91     125
  Throughput (Mbit/s)                    103     548      84      45     102     364
Timing
  Number of LUTs                         856     766    1106     640    1059     852
  Number of Registers                    594     759     646     476     395     529
  Number of Slices                       303     281     362     216     351     274
  Frequency (MHz)                        150     265     175     166     111     154
  Throughput (Mbit/s)                    114     605     130      45     124     448
Table 3: Implementation results for the 5 sha3 candidates on Spartan-6 (512-bit
digests).
In the first place, and compared to previous works, we see that our
implementation results for blake are quite close to the previous ones of Beuchat
et al. The main difference is our exploitation of distributed memories (reported
in the slice count) rather than embedded memory blocks. By contrast, for all the
other algorithms, our results bring some interesting novelty. In particular, for
Keccak, the previous architecture of Bertoni et al. was using only three internal
registers, because of its compact asic-oriented flavor. This was at the cost of
weak performance, in the range of 5000 clock cycles per hash block. We paid
significant attention to taking advantage of the fpga structure, in particular
its distributed rams. As a result, we reduced the number of clock cycles by a
factor of more than two. As for the three remaining algorithms, no similar
results were known to date, which makes them interesting as first milestones.
Next, these tables also lead to a number of comments regarding the different
algorithms and their compact fpga implementations. First, one can notice that
Grøstl compares favorably with all the other candidates (although not by a big
margin). While it has quite expensive components, interleaving the p and q
functions allows reducing the logic resources. More importantly, this algorithm
processes blocks of 1024 bits and has a quite limited cycle count, which leads to
significantly higher throughput than our other implementations.
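The link between block size, cycle count and throughput can be made explicit with a short calculation. The formula below (block size times frequency over cycles per block, ignoring the pre/post overhead) is our reading of how the throughput rows relate to the other rows of Table 2, and the function name is ours; the input values are the area-optimized Virtex-6 figures.

```python
def throughput_mbit_s(block_bits: int, cycles_per_block: int, freq_mhz: float) -> float:
    """Long-message throughput in Mbit/s: bits per block times blocks per microsecond."""
    return block_bits * freq_mhz / cycles_per_block


# (block size in bits, cycles per block, frequency in MHz), area-optimized, Virtex-6
designs = {
    "blake": (1024, 1342, 240),
    "Grøstl": (1024, 448, 280),
    "jh": (512, 688, 288),
    "Keccak": (576, 2137, 250),
    "Skein": (512, 458, 160),
}

for name, params in designs.items():
    print(f"{name:8s} {throughput_mbit_s(*params):6.1f} Mbit/s")
```

With these inputs Grøstl reaches exactly 640 Mbit/s, and the other candidates land within about one Mbit/s of the tabulated values; the small residual differences are presumably due to the frequencies being rounded in the table.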
blake and jh also achieve reasonable throughput, but do not reach the level
of performance of Grøstl in this case study. For blake, the input blocks are
still 1024 bits wide, but our implementation requires three times more cycles
per block. For jh, it is rather the reduction of the input block size that is
the cause.
Skein provides interesting performance too. Its most noticeable limitation
is a lower clock frequency, which could be improved by better pipelining the
additions involved in our design. As a first step, we exploited the
carry-propagate adders that are efficiently implemented in Xilinx fpgas. But
this is not a theoretical limitation of the algorithm. One could reasonably
assume that further optimization efforts would increase the frequency to the
level of the other candidates.
Finally, Keccak presents the poorest performance. This is an interesting
result in view of the excellent behavior of this algorithm in a high-throughput
implementation context [14]. Further optimizations could be investigated in
order to reduce the number of clock cycles. But this would be at the cost of a
larger datapath (hence, a higher slice count). Also, even considering a very
optimistic 50 cycles per round, the throughput of Keccak would remain 6 times
smaller than that of Grøstl. This suggests that, compared to the other
finalists, Keccak is inherently less suitable for compact fpga implementations.
The main reason for this observation relates to the different rotations used in
this algorithm (which come for free in unrolled implementations but may turn out
to be expensive in compact ones) and to the large state that needs to be
addressed multiple times when hashing a block.
Unsurprisingly, the main difference between the Virtex-6 and Spartan-6
implementations consists in a slightly larger number of slices, most likely due
to the more constraining fpga, and a reduction in frequency due to the lower
performance of the Spartan-6 fpgas.
In addition to these results, Table 4 in the appendix provides the
implementation results for the 256-bit digest versions of the hash algorithms,
on Virtex-6. In general, these smaller variants do not lead to significantly
different conclusions. One important reason for this observation is that, when
using distributed rams in an implementation, reducing the size of a state does
not directly imply a gain in slices for a compact implementation (as only the
depth of the memories is affected in this case). Nevertheless, this move towards
smaller digests is positive for Keccak, because of a larger bitrate r. By
contrast, for blake, the processing of 512-bit blocks does not come with a
sufficient reduction of the number of rounds, hence leading to smaller
throughputs. As for Grøstl, the number of rounds is also reduced by less than a
factor 2, but the smaller number of columns in the state matrix allows keeping a
higher throughput.
To conclude this work, we finally reported the performance results for an
aes-128 implementation, with “on-the-fly” key scheduling, based on a 32-bit
architecture. This implementation is best compared with the 256-bit versions of
the sha3 candidates (because of a 128-bit key). One can notice that the slice
count and throughput are in the same range.
References
1. The sha3 zoo. http://ehash.iaik.tugraz.at/wiki/The_SHA3_Zoo.
2. The skein hash function family. http://www.skeinhash.info/.
3. Jean-Philippe Aumasson, Luca Henzen, Willi Meier, and Raphael C.-W. Phan.
Sha3 proposal blake (version 1.4), 2011. http://131002.net/blake/.
4. Jean-Philippe Aumasson, Willi Meier, and Raphael C.-W. Phan. The hash function
family lake. In FSE, pages 36–53, 2008.
5. Daniel J. Bernstein. Chacha, a variant of salsa20. Workshop Record of SASC 2008:
The State of the Art of Stream Ciphers, 2008. http://cr.yp.to/chacha.html#chachapaper.
6. G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche. The keccak sha3
submission. Submission to NIST (Round 3), 2011.
 Available from FrançoisXavier Standaert · May 31, 2014
 Available from eu.org