Citation: Ma, K.-M.; Le, D.-H.; Pham, C.-K.; Hoang, T.-T. Design of an SoC Based on 32-Bit RISC-V Processor with Low-Latency Lightweight Cryptographic Cores in FPGA. Future Internet 2023, 15, 186. https://doi.org/10.3390/fi15050186
Academic Editor: Wei Yu
Received: 1 May 2023
Revised: 16 May 2023
Accepted: 18 May 2023
Published: 19 May 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
future internet
Article
Design of an SoC Based on 32-Bit RISC-V Processor with
Low-Latency Lightweight Cryptographic Cores in FPGA
Khai-Minh Ma 1, Duc-Hung Le 1,*, Cong-Kha Pham 2 and Trong-Thuc Hoang 2
1 Faculty of Electronics and Telecommunications, The University of Science, Vietnam National University Ho Chi Minh City, Ho Chi Minh City 700000, Vietnam
2 Department of Computer and Network Engineering, The University of Electro-Communications (UEC), Tokyo 182-8585, Japan
* Correspondence: ldhung@hcmus.edu.vn
Abstract: The security of Internet of Things (IoT) devices has in recent years created interest in developing implementations of lightweight cryptographic algorithms for such systems. Additionally, open-source hardware and field-programmable gate arrays (FPGAs) are gaining traction via newly developed tools, frameworks, and HDLs. This enables new methods of creating hardware and systems faster, more simply, and more efficiently. In this paper, the implementation of a system-on-chip (SoC) based on a 32-bit RISC-V processor with lightweight cryptographic accelerator cores in an FPGA, built with an open-source integration framework, is presented. The system consists of a 32-bit VexRiscv processor, written in SpinalHDL, and lightweight cryptographic accelerator cores for the PRINCE block cipher, the PRESENT-80 block cipher, the ChaCha stream cipher, and the SHA3-512 hash function, written in Verilog HDL and optimized for low latency with fewer clock cycles. The primary aim of this work was to develop a customized SoC platform with a register-controlled bus suitable for integrating lightweight cryptographic cores into compact embedded systems that require encryption functionality. Additionally, custom firmware was developed to verify the functionality of the SoC with all integrated accelerator cores and to evaluate the speed of cryptographic processing. The proposed system was successfully implemented on a Xilinx Nexys4 DDR FPGA development board. The resource usage of the system in the FPGA was low, at 11,830 LUTs and 9552 FFs. The proposed system is applicable to enhancing the security of Internet of Things systems.
Keywords: system-on-chip; FPGA; RISC-V; VexRiscv; lightweight cryptography
1. Introduction
In recent years, RISC-V [1], a free and open-source instruction set architecture (ISA) for microprocessors, has received considerable attention. Based on the reduced instruction set computer (RISC) design principle, RISC-V is compact, scalable, and highly configurable. These distinguishing features make it appealing to open-source communities in the academic and commercial sectors. The development of a standalone processor, however, is insufficient. The computational tasks handled by processors have become increasingly complex, surpassing their general-purpose computing capabilities, while still being required to run highly efficiently. Accelerator cores, which are capable of managing such complex and intensive tasks, reduce the execution time and save energy compared to running the same tasks on the microprocessor. A number of studies on RISC-V system development with accelerator cores have been performed, with applications including digital signal processing [2], artificial intelligence [3], and the implementation of mathematical algorithms.
On the other hand, due to the rapid expansion of the Internet of Things (IoT), the extensive real-world and real-time data transferred between peripheral devices and nodes are potential cyber-attack targets. The obvious and effective countermeasure is to encrypt the data. As a consequence, and as a result of numerous advancements over the past few years, lightweight cryptographic algorithms are emerging as practical cryptographic solutions. These lightweight cryptographic algorithms have a small memory footprint and low computational complexity, allowing them to be implemented in devices with limited resources. In addition to RISC-V, the need for encryption in these peripheral devices has spurred the development of new hardware and platforms with improved energy efficiency, performance, connectivity, and security. In recent years, many new tools, toolchains, frameworks, and HDLs have been developed, driven by the growth of open-source hardware and the popularity of field-programmable gate arrays (FPGAs). This enables the development of new methods for creating systems that are quicker, simpler, and more efficient.
In this paper, we present the implementation and results of a system-on-chip (SoC) based on a 32-bit RISC-V processor with lightweight cryptographic accelerator cores in an FPGA. The proposed system was configured with a 32-bit VexRiscv processor, implementing the RV32IM instruction sets, and lightweight cryptographic accelerator cores for the PRINCE block cipher, the PRESENT-80 block cipher, the ChaCha stream cipher, and the SHA3-512 hash function. The selection of these algorithms was driven by a desire to experiment with recently proposed, promising lightweight cryptographic algorithms. The cores were also designed to complete in a small number of clock cycles, resulting in low latency. The focus of this work was to integrate and provide a complete system-on-chip implementation that combines a RISC-V processor with cryptographic accelerators, with minimal compromise. This was achieved using recently developed toolchains, frameworks, and HDLs to build the system. A notable contribution of this work is the use of the Configuration/Status Register (CSR) bus to connect customized and optimized cores such as PRINCE, PRESENT-80, ChaCha, and SHA-3 with VexRiscv, forming a high-performance SoC with few clock cycles per operation and low logic resource usage. Compared to software implementations of the cryptographic algorithms, the system, which uses 11,830 look-up tables (LUTs) and 9552 flip-flops (FFs), reduces the execution time by a factor of 100 to over 4400. The design was deployed on a Xilinx Nexys4 DDR FPGA board using the Vivado toolchain, and firmware was also developed to utilize all lightweight cryptographic accelerator cores. The objective of this paper was to provide a simple and efficient SoC design using CSR configuration and open-source resources.
The remainder of this paper is organized as follows. Section 2 provides some background on relevant works around the related topics of RISC-V and cryptography. Sections 3 and 4 cover the relevant fundamental subjects, including the VexRiscv core, lightweight cryptographic algorithms, and the hash function. The implementation process is depicted in Section 5. The experiment and validation results, along with discussions and comparisons with other works, are analyzed in Section 6. Finally, Section 7 concludes the paper.
2. Related Works
In this section, we examine a variety of studies and topics that have recently seen a remarkable expansion in research. The first is the emergence of RISC-V, which has resulted in many open-source projects and much academic research [4] focusing on RISC-V-based processor implementations. The growth and expansion of the ISA have been described as “inevitable” [5], as adoption was already happening. Some of the work even progressed to the stage where it was market-ready [6]. Works such as [7] provide freely available courses along with comprehensive instructions and labs for educational purposes, promoting the RISC-V ecosystem. Applications for these RISC-V processors vary from the Internet of Things, such as security monitoring systems [8] with real-time detection and tracking and edge computing platforms based on callability [9], to neural networks, artificial intelligence, and more.
An effective DNN (deep neural network)-application-focused RISC-V processor was proposed by Zhang H. et al. [10]. The work demonstrated promising capabilities in effectively executing DNN tasks while minimizing power consumption. This design approach ensured that the processor was well suited for edge devices, where power efficiency is crucial for extending battery life and enabling real-time processing. Lim S.-H. et al. [11] also proposed a DNN operation accelerator based on a virtual platform of RISC-V and successfully processed the darknet CNN model. This integration enabled the efficient execution of complex convolutional operations, resulting in enhanced performance and reduced computational overhead. Another work by Gamino del Río I. proposed modifying the architecture of a RISC pipelined processor to eliminate the execution time overhead introduced by code instrumentation [12]. A RISC-V processor was implemented in VHDL and synthesized in an FPGA, which enabled non-intrusive tracing while executing instrumented code without introducing additional delays. Robotics applications were also researched by Lee J., who implemented the robotics operating system on top of a RISC-V processor in an FPGA [13].
The categories even extended to some special applications, such as space environments, where D. A. Santos et al. presented a low-cost fault-tolerant implementation of the RISC-V architecture that reduced error propagation and was aimed at space applications [14]. A similar implementation was NOEL-V, which was compatible with the AMBA AHB 2.0 bus and could be deployed efficiently in an FPGA and ASIC [15].
The second topic that is related to this work and has received a similar amount of attention is cryptographic algorithms, especially lightweight cryptographic algorithms on RISC-V. These are algorithms designed to be implemented in resource-constrained devices. El-hajj, M. et al. investigated which cryptographic algorithms are adequate for these systems by evaluating and benchmarking more than 39 symmetric block ciphers [16]. The work by Hao Cheng et al. [17] provided a complete fundamental analysis, design, implementation results, hardware, and software of an ISE (instruction set extension) for 10 lightweight cryptography algorithms. In addition to pure software implementations in the C programming language [18], the preferred implementation approach appears to involve an ISE. The work [19] introduced a lightweight ISE designed to support the ChaCha stream cipher in RISC-V architectures, which achieved a speedup of at least 5.4 times compared to the OpenSSL baseline and 3.4 times compared to an ISA-optimized implementation. The study [20] focused on achieving secure and efficient execution of AES by providing separate ISEs for 32-bit and 64-bit architectures, which demonstrated significant performance improvements for AES-128 block encryption. Furthermore, the authors explored how the proposed standard bit manipulation extension in RISC-V can be effectively utilized for the efficient implementation of AES-GCM (Galois/Counter Mode). The GIFT family of block ciphers was utilized in various NIST candidates but required optimization techniques such as bit-slicing and fix-slicing for optimal performance. The researchers of [21] developed assembly implementations for GIFT-64 and GIFT-128 using the RV32I ISA, evaluated their performance on the HiFive1 development board, and achieved clock cycle reductions of 88.69% (GIFT-64) and 95.05% (GIFT-128) using fix-slicing with the key pre-computation technique. ASCON, a lightweight cryptographic algorithm recently standardized by NIST, was implemented by Altınay Ö. on the base RV32I processor. The proposed work [22] implemented non-standard RISC-V instructions, and an end-to-end test environment was formed by extending the GNU Compiler Collection and the Spike RISC-V ISA Simulator.
Another work [23] proposed a new and efficient validation platform by deploying a cryptographic SoC as a virtual prototype using a hybrid hardware and software design strategy. Compared to RTL simulation, the custom virtual prototype demonstrated significant performance advantages, performing approximately 10–450 times faster while maintaining a simulation error of only about 4%.
Among the numerous related research projects and works, we have identified a smaller subset that is more closely connected and comparable to our work. First, the works [24,25] presented standalone implementations of the PRINCE block cipher in an FPGA, aiming for low resource usage. For PRESENT-80, the work [26] integrated a crypto co-processor for both AES and PRESENT-80 in an FPGA-based SoC platform with an emphasis on energy efficiency and performance. The implementation of ChaCha [27] by Nurat At et al. enhanced efficiency by interleaving independent tasks and minimizing data dependencies. Concerning the SHA-3 hash functions, since NIST’s announcement of SHA-3, there has been a surge in research and numerous hardware implementations dedicated to this algorithm, including [28,29]. The current state-of-the-art, well-optimized ASIC implementation [30] in 7 nm TSMC technology also falls into this category. Evaluations between these works and our results will be discussed in Section 7.
Similarly to our work, two earlier works [31,32] designed and incorporated algorithm acceleration processing into the SoC, but in two distinct ways: one by optimizing hardware instructions and the other by optimizing software instructions.
3. RISC-V Processor
3.1. RISC-V ISA
RISC-V is an open-source ISA which has been widely accepted and adopted in many community projects. Many resources and tools, such as compilers and debuggers, have also been developed, forming a diverse open-source ecosystem. The RISC-V project started in 2010 at UC Berkeley. Compared to ARM and x86, a RISC-V processor has the following advantages:
• Free: RISC-V ISA and its surrounding development projects are mostly open-source.
• Simple: RISC-V is much smaller than other commercial ISAs.
• Modular: RISC-V has a small standard base ISA with multiple standard extensions.
• Stable: The base and many extensions of the ISA are already standardized and frozen. No significant changes are expected.
• Extensible: Specialized instructions can be added, based on extensions.
The base instruction sets defined by RISC-V are RV32I, RV64I, and RV128I, supporting 32-bit, 64-bit, and 128-bit architectures, respectively, with around 40 instructions. RISC-V specifies the encoding, control flow, register sizes, memory, addressing, logic manipulation, and ancillaries for the processor. While RV64I is suited for large, complex systems, and RV128I can serve as a theoretical instruction set for a future 128-bit processor, RV32I is the most suitable for small embedded systems such as IoT devices. In addition, the architecture of the processor can be extended and specialized through standardized extension instruction sets such as M, A, F, D, Q, and G (general purpose, i.e., IMAFD). In recent years, many implementations of RISC-V processors have been published in response to the open-hardware movement. Some notable community-supported processors implementing the ISA include the RocketChip [33] from UC Berkeley, the BlackParrot [34], and the SiFive E31 [6].
3.2. VexRiscv Processor
VexRiscv is a 32-bit RISC-V processor implementing the RV32IMAC instruction sets and is optimized for FPGAs. VexRiscv is written in SpinalHDL, a new open-source high-level hardware description language with its own library of primitives, which is a domain-specific language (DSL) based on the Scala programming language. Since SpinalHDL allows object-oriented and functional programming to be used to elaborate the hardware, VexRiscv has a modular design, with almost all of the components being optional plugins. These include instruction and data caches and a debug extension, as well as the number of pipeline stages, interrupts, exception handling, and a memory management unit. These customization abilities make VexRiscv an ideal platform for developing an SoC with hardware accelerators. The proposed SoC in this study, which consists of a VexRiscv processor integrated with lightweight cryptographic accelerator components and other fundamental peripherals, was constructed using the LiteX framework.
4. Cryptographic Algorithms
4.1. PRINCE
PRINCE [35] is a lightweight symmetric block cipher with a substitution–permutation network (SPN) structure and is based on the so-called FX construction. It was designed to target low latency and unrolled hardware implementations. According to its authors, when compared with the advanced encryption standard (AES), PRINCE was able to operate at much higher frequencies while utilizing less area under the same timing constraints and technologies.
The block size of PRINCE is 64 bits, and the key size is 128 bits. The key is split into two 64-bit keys denoted as $k_0$ and $k_1$,

$$k = k_0 \,\|\, k_1 \tag{1}$$

A sub-key $k_0'$ is then derived from $k_0$, extending the key to 192 bits. Here $\ggg$ denotes a right rotation and $\gg$ a logical right shift of the 64-bit word.

$$(k_0 \,\|\, k_0' \,\|\, k_1) := \left(k_0 \,\|\, (k_0 \ggg 1) \oplus (k_0 \gg 63) \,\|\, k_1\right) \tag{2}$$

The input is XORed with $k_0$ and then processed via a core function, PRINCE$_{core}$, using $k_1$. The output of the PRINCE$_{core}$ function is XORed with $k_0'$ to produce the final output. Decryption is conducted by exchanging $k_0$ and $k_0'$ and using $k_1$ XORed with a constant denoted as $\alpha$ in the core function. This overall procedure is depicted in Figure 1.
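The key expansion and whitening steps can be sketched in a few lines of Python. This is only an illustration of Equations (1) and (2), not the Verilog core described in this paper: the function names `derive_k0_prime` and `fx_encrypt` are hypothetical, the key is assumed to be packed as a 128-bit integer with $k_0$ in the upper half, and the core function is passed in by the caller because PRINCE$_{core}$ itself is not reproduced here.

```python
MASK64 = (1 << 64) - 1


def derive_k0_prime(k0: int) -> int:
    """Sub-key k0' = (k0 rotated right by 1) XOR (k0 shifted right by 63)."""
    rotated = ((k0 >> 1) | (k0 << 63)) & MASK64  # 64-bit rotate right by one
    return rotated ^ (k0 >> 63)                   # XOR with the former top bit


def fx_encrypt(block: int, key128: int, core) -> int:
    """FX-style whitening around a caller-supplied core(block, k1) function."""
    k0 = (key128 >> 64) & MASK64  # assumed packing: k = k0 || k1, k0 on top
    k1 = key128 & MASK64
    return core(block ^ k0, k1) ^ derive_k0_prime(k0)


def identity_core(block: int, k1: int) -> int:
    """Placeholder used only to exercise the wrapper; this is NOT PRINCE_core."""
    return block
```

With a real core in place of `identity_core`, decryption would reuse the same wrapper with $k_0$ and $k_0'$ exchanged, as described above.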
Figure 1. PRINCE encryption process of a plaintext message m to a ciphertext c, using $k_0$ and $k_0'$ as whitening keys.
PRINCE$_{core}$ performs the encryption process, depicted in Figure 2, in twelve rounds, divided into five “forward” rounds, five “backward” rounds, and a middle round (that is counted as two). A forward round consists of a non-linear substitution layer S, a linear permutation layer M, and then an XOR operation with the round constant and $k_1$. The “backward” rounds are the inverses of the “forward” rounds, with different round constants.
Figure 2. Encryption process of PRINCE$_{core}$.
In terms of hardware implementation, the target is to have a pipeline architecture with
combinational sub-modules for the S layer and M layer. The algorithm was designed to be
unrolled, which means that it should not contain any conditional loops. The key expansion
and addition function (XOR) should then be simple to implement.
4.2. PRESENT-80
PRESENT [36] is an ultra-lightweight block cipher, developed in 2007 as a collaboration between Ruhr-University, Germany, and the Technical University of Denmark. Like PRINCE, the algorithm is based on the SPN structure; it consists of 31 rounds. The block size is 64 bits, and the key lengths supported by PRESENT are 80 bits and 128 bits. In the original paper [36], the 80-bit version was recommended over the 128-bit version, as it is more than adequate for low-security applications such as small sensor systems, hence the name PRESENT-80.
Each round of PRESENT-80 consists of an XOR operation with the round key $K_i$ (for $1 \le i \le 32$), a substitution step with a non-linear S-box layer, and a linear bitwise permutation step using the pLayer. The 32nd round key is used for a final key addition after the 31 rounds. For the XOR operation, the round keys are generated from the 80-bit supplied key, which is stored in the key register, denoted as $K = k_{79}k_{78}\ldots k_0$. The round key $K_i$ at round $i$ is the 64 leftmost bits of the current contents of $K$. After the extraction of the round key, the register $K = k_{79}k_{78}\ldots k_0$ is updated as follows, as demonstrated in (3).
• Rotate the key register K by 61 bit positions to the left.
• Substitute the four leftmost bits using the S-box.
• XOR the bits $[k_{19}k_{18}k_{17}k_{16}k_{15}]$ with the current round counter (5 bits).
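$$
\begin{aligned}
1.\;& [k_{79}k_{78}\ldots k_{1}k_{0}] = [k_{18}k_{17}\ldots k_{20}k_{19}] \\
2.\;& [k_{79}k_{78}k_{77}k_{76}] = S[k_{79}k_{78}k_{77}k_{76}] \\
3.\;& [k_{19}k_{18}k_{17}k_{16}k_{15}] = [k_{19}k_{18}k_{17}k_{16}k_{15}] \oplus \mathrm{round\_counter}
\end{aligned}
\tag{3}
$$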
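The three update steps in (3) can be sketched as follows. This is a minimal software illustration under stated assumptions, not the hardware key scheduler of this work: the names `update_key_register` and `round_keys` are hypothetical, the 80-bit register is held in a Python integer with $k_{79}$ as the most significant bit, and the S-box values are taken from the published PRESENT specification.

```python
# PRESENT 4-bit S-box, as given in the published specification.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
MASK80 = (1 << 80) - 1


def update_key_register(k: int, round_counter: int) -> int:
    """Apply steps 1-3 of (3) to the 80-bit key register."""
    k = ((k << 61) | (k >> 19)) & MASK80              # 1. rotate left by 61
    k = (SBOX[k >> 76] << 76) | (k & ((1 << 76) - 1))  # 2. S-box on k79..k76
    return k ^ ((round_counter & 0x1F) << 15)          # 3. XOR into k19..k15


def round_keys(key80: int, count: int = 32) -> list[int]:
    """Return K_1..K_count, each the 64 leftmost bits of the register."""
    keys = []
    k = key80 & MASK80
    for i in range(1, count + 1):
        keys.append(k >> 16)            # extract the 64 leftmost bits
        k = update_key_register(k, i)   # then update with round counter i
    return keys
```

For example, with an all-zero key the first round key is zero, and the first update sets the top nibble to S(0) = 0xC and bit 15 from the round counter.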
PRESENT uses a 4-bit to 4-bit S-box; the input x and the output S(x), in hexadecimal notation, are given in Figure 3. To substitute the full 64-bit cipher state, 16 of these S-boxes are applied in parallel.
Figure 3. The substitution function of PRESENT.
The bitwise permutation of PRESENT is given in Figure 4, where i is the bit position of the cipher state that is moved to the new position P(i).
Figure 4. The permutation position table of PRESENT.
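As an illustration of how the substitution and permutation layers act on the 64-bit state, the sketch below models both in Python. It is a software model under stated assumptions, not the combinational hardware modules of this work: the function names are hypothetical, the S-box values are those of the published PRESENT specification, and the permutation table is generated from the closed form P(i) = 16·i mod 63 (with P(63) = 63) given in that specification rather than copied from Figure 4.

```python
# PRESENT 4-bit S-box, as given in the published specification.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def sbox_layer(state: int) -> int:
    """Apply the 4-bit S-box to each of the 16 nibbles of the state."""
    out = 0
    for nibble in range(16):
        shift = 4 * nibble
        out |= SBOX[(state >> shift) & 0xF] << shift
    return out


def p_position(i: int) -> int:
    """New position P(i) of state bit i."""
    return 63 if i == 63 else (16 * i) % 63


def p_layer(state: int) -> int:
    """Move every bit i of the 64-bit state to position P(i)."""
    out = 0
    for i in range(64):
        out |= ((state >> i) & 1) << p_position(i)
    return out


def present_round(state: int, round_key: int) -> int:
    """One round: add the round key, then substitute, then permute."""
    return p_layer(sbox_layer(state ^ round_key))
```

In hardware the pLayer costs no logic at all, since it is pure wiring; the loop here only exists because software has to move the bits one at a time.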
Simplicity and suitability for hardware implementation were cited as the key design goals of PRESENT. PRESENT demands roughly the same hardware resources for both encryption and decryption. The S-box, the permutation pLayer, and the FSM for round key scheduling are the three primary components that need to be designed in the implementation. The S-box and pLayer can be implemented in hardware as a bit manipulation module, and the S-box can be reused in the round key generation process. For the decryption process, the inverted S-box and inverted pLayer also need to be designed.
4.3. ChaCha
ChaCha [37] is a stream cipher, more specifically, a family of stream ciphers based on a variant of Salsa20 [38,39], with a modified round function that increases the amount of diffusion per round. Both ChaCha and Salsa20 are built on a pseudorandom function based on ARX (Add-Rotate-XOR) operations: 32-bit addition, rotation, and bitwise addition (XOR). The core function maps a 256-bit key, a 64-bit nonce, and a 64-bit counter into a randomly accessible 512-bit block of the keystream. The internal state of ChaCha is formed by 16 32-bit words arranged as a 4 × 4 matrix, denoted as S.
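$$
S = \begin{pmatrix}
\mathtt{61707865} & \mathtt{3320646E} & \mathtt{79622D32} & \mathtt{6B206574} \\
\mathrm{key}[0] & \mathrm{key}[1] & \mathrm{key}[2] & \mathrm{key}[3] \\
\mathrm{key}[4] & \mathrm{key}[5] & \mathrm{key}[6] & \mathrm{key}[7] \\
\mathrm{counter}[0] & \mathrm{counter}[1] & \mathrm{nonce}[0] & \mathrm{nonce}[1]
\end{pmatrix}
\tag{4}
$$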
The first four words of the state hold the constant “expand 32-byte k”. The following
eight words hold the key, the next two hold the block counter, and the last two hold the
nonce. Theoretically, ChaCha can generate a keystream of up to $2^{70}$ bytes, i.e., $2^{64}$
blocks of 512 bits each. The algorithm implemented in this work follows the original paper of
ChaCha (and Salsa20), using two words for the counter and two words for the nonce, as mentioned
above. This differs from the ChaCha20 variant used in the standardized IETF protocol in
RFC 8439 [40], where the state has one word for the counter and three for the nonce.
Variants are named after the even number of rounds applied; for example, ChaCha8 and ChaCha20
apply 8 and 20 rounds, respectively. For each round in ChaCha, the core operation is
the quarter round “QR(a, b, c, d)”, which takes four words of the state S as input and produces
four output words. The quarter round performs the ARX operations on the 32-bit words
(a, b, c, and d), with “<<<” as the notation for bitwise left rotation, as shown below.
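$$
\begin{aligned}
a &\mathrel{+}= b; & d &\mathrel{\oplus}= a; & d &\mathrel{\lll}= 16; \\
c &\mathrel{+}= d; & b &\mathrel{\oplus}= c; & b &\mathrel{\lll}= 12; \\
a &\mathrel{+}= b; & d &\mathrel{\oplus}= a; & d &\mathrel{\lll}= 8; \\
c &\mathrel{+}= d; & b &\mathrel{\oplus}= c; & b &\mathrel{\lll}= 7;
\end{aligned}
\tag{5}
$$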
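As a software reference for (5), the quarter round can be sketched in Python as follows. This is
a minimal illustrative model rather than the Verilog module of the accelerator; the names
`rotl32` and `quarter_round` are chosen here for illustration.

```python
MASK32 = 0xFFFFFFFF  # keeps every intermediate value within 32 bits


def rotl32(x, n):
    """Rotate the 32-bit word x left by n bit positions (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def quarter_round(a, b, c, d):
    """Apply the four ARX steps of (5) and return the updated (a, b, c, d)."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d
```

The quarter round is identical in the original and the IETF variants, so the test vector in
Section 2.1.1 of RFC 8439 applies: the inputs (0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567)
map to (0xea2a92f4, 0xcb1cf8ce, 0x4581472e, 0x5881c4bb).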
Four quarter rounds performed together on S form one round of the algorithm. Counting
rounds from one, odd-numbered rounds apply the quarter round function to each of the four
columns of the 4 × 4 state matrix, and even-numbered rounds apply it to each of the four
diagonals. Figure 5 lists the 32-bit words of the state S (4) as a 4 × 4 table, indexed from 0
to 15. The four quarter rounds in an odd round operate on those words as defined in (6).
QR(0, 4, 8, 12);
QR(1, 5, 9, 13);
QR(2, 6, 10, 14);
QR(3, 7, 11, 15);
(6)
Likewise, the four quarter rounds in an even round operate on those words as in (7).
An odd round followed by an even round is called a double round.
QR(0, 5, 10, 15);
QR(1, 6, 11, 12);
QR(2, 7, 8, 13);
QR(3, 4, 9, 14);
(7)
For hardware implementation, the design of the quarter round module alone would
significantly speed up the algorithm compared to software implementation. This is because
each addition, XOR, and rotation operation would already cost multiple cycles when being
executed in the processor using standard instructions. The complexity of one round in
the accelerator is expected to be low as it only has 16 additions, 16 XORs, and 16
constant-distance rotations of 32-bit words.
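To make the round structure concrete, the following Python sketch builds the state of (4) with
the 64-bit counter and 64-bit nonce layout, applies the column and diagonal rounds of (6) and
(7), and produces one 512-bit keystream block. It is a software model for illustration, not the
accelerator design: the function names are hypothetical, and the final word-wise addition of the
initial state and the little-endian serialization are assumed from the original ChaCha
specification.

```python
import struct

MASK32 = 0xFFFFFFFF
CONSTANTS = (0x61707865, 0x3320646E, 0x79622D32, 0x6B206574)  # "expand 32-byte k"


def rotl32(x, n):
    """Rotate the 32-bit word x left by n bit positions."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def quarter_round(s, a, b, c, d):
    """Apply QR(a, b, c, d) in place on the 16-word state list s."""
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 7)


def double_round(s):
    """One odd (column) round (6) followed by one even (diagonal) round (7)."""
    for idx in ((0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)):
        quarter_round(s, *idx)
    for idx in ((0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)):
        quarter_round(s, *idx)


def chacha_block(key, counter, nonce, rounds=20):
    """Return one 64-byte keystream block for a 32-byte key and an 8-byte nonce."""
    state = list(CONSTANTS)
    state += struct.unpack("<8I", key)                          # key[0..7]
    state += [counter & MASK32, (counter >> 32) & MASK32]       # counter[0..1]
    state += struct.unpack("<2I", nonce)                        # nonce[0..1]
    working = list(state)
    for _ in range(rounds // 2):
        double_round(working)
    # Feed-forward: add the initial state word by word, then serialize.
    out = [(w + s) & MASK32 for w, s in zip(working, state)]
    return struct.pack("<16I", *out)
```

Calling `chacha_block(key, counter, nonce, rounds=8)` would model ChaCha8 in the same way; each
call performs `rounds // 2` double rounds, i.e., `4 * rounds` quarter rounds in total.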
Figure 5. The indexed representation of the 4 × 4 matrix state S, ranging from 0 to 15.
4.4. SHA3-512
A cryptographic hash function is a mathematical algorithm that maps “message” data
of arbitrary size to a fixed-size bit array, the “digest”. It is a one-way function: it is
practically infeasible to invert the computation to obtain the original message other than
by brute force. SHA-3 [41] is the latest member of the Secure Hash Algorithm family of
standards by NIST, with the predecessors being SHA-2 and SHA-1. NIST defined four hash
function instances in the SHA-3 standard for different digest lengths: SHA3-224, SHA3-256,
SHA3-384, and SHA3-512.
SHA-3 is a subset of a broader cryptographic primitive family called Keccak and is
based on a design approach called the sponge construction, which is built around a
fixed-width permutation (or random function). This allows inputting, or “absorbing”,
any amount of data and outputting, or “squeezing”, any amount of data. Figure 6 below
describes the sponge construction in the SHA-3 hash functions.
Figure 6. The sponge construction, SHA-3 standard: permutation-based hash and extendable
output functions.
In Figure 6, for SHA3-512 to output 512 bits of digest, the input message in the form
of a bit string N is padded using the pad10*1 padding function to a multiple of
576 bits, which is called the rate, denoted as r. The value c is called the capacity, and the
state of SHA-3 is defined as a bit string of length b = r + c, initialized to all zero bits.
The c value in SHA3-512 is 1024 bits, which makes b 1600 bits. The padded input is then
“absorbed” one r-bit segment at a time, with the 24-round block transformation f applied to
the state after each segment; the overall function is Keccak[1024](M || 01, 512). The number of
“absorbing” operations depends on the length of the input, specifically, the number of 576-bit
segments of the padded input. For example, if the input message is 500 bits long, it is padded
to 576 bits, and only one “absorbing” operation is needed. In the “squeezing” stage, segments
of length r are appended to the bit string Z until the required number of output bits is
reached, which is 512 bits for SHA3-512. Z is then truncated to the output length and becomes
the digest of the message, completing the hash operation.
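The relation between message length and the number of “absorbing” operations can be checked with
a short calculation. The Python sketch below is illustrative: the helper name `absorb_blocks` is
hypothetical, and it assumes, following FIPS 202, that the two suffix bits “01” are appended
before pad10*1, which itself adds at least two bits. The call to `hashlib.sha3_512` from the
Python standard library is only used to confirm the digest length.

```python
import hashlib

STATE_BITS = 1600                        # b, the width of the Keccak state
CAPACITY_BITS = 1024                     # c for SHA3-512
RATE_BITS = STATE_BITS - CAPACITY_BITS   # r = b - c = 576


def absorb_blocks(message_bits):
    """Number of r-bit segments absorbed for a message of the given bit length."""
    # Suffix "01" (2 bits) plus the minimal pad10*1 pattern "11" (2 bits).
    minimum_padded_bits = message_bits + 4
    # Round up to the next multiple of the rate.
    return -(-minimum_padded_bits // RATE_BITS)


# Digest length reported by the standard-library SHA3-512 implementation, in bits.
DIGEST_BITS = hashlib.sha3_512(b"").digest_size * 8
```

With these definitions, a 500-bit message needs one absorbing operation, as in the example
above, while a 573-bit message no longer fits into one 576-bit segment once the four mandatory
extra bits are added and needs two.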
Note that SHA-3 is not considered a lightweight hash function, and the standardization
publication from NIST does not address this either. However, there have been some
works [42] on lightweight cryptographic algorithms based on the sponge construction and on
instances of Keccak, which underlie SHA-3. The SHA3-512 hash function accelerator core was
integrated into the SoC to diversify the types of algorithms that can be applied.
5. Implementations
5.1. The PRINCE Accelerator Core
The integrated PRINCE cipher accelerator core in the SoC was designed using Verilog
HDL. Figure 7 presents the overall internal architecture of the implementation with a
simple but flexible interface, and the connection diagram with the Configuration/Status
Register (CSR) peripheral bus of the proposed SoC.
Figure 7. The PRINCE accelerator architecture with bus interface.
Accessing and communicating with the hardware core is achieved by reading and writing
data values to registers at specified addresses on the peripheral bus, which are connected
to the data ports of the accelerator. The PRINCE accelerator core has the base address
0x3000 on the CSR peripheral bus, assigned by the SoC builder. With this simple
“register-controlled interface” scheme, two dedicated functions control the accelerator:
• “prince_write_to_address(address, data)”: This function prepares the “write_data”
and “address” for the accelerator core and then drives the “cs” and “we” signals of the
core for the writing process. It writes a 32-bit data or configuration value to an
8-bit address in the PRINCE accelerator core.
• “prince_read_from_address(address)”: This function prepares the “address”, ignores the
“write_data”, and drives the “cs” and “we” signals of the core for the reading process,
after which the output of the accelerator core is read. It reads 32-bit data from an
8-bit address in the PRINCE accelerator core.
The internal register address map of the accelerator used in the software interface is
shown in Table 1.
Table 1. Internal register address mapping for the PRINCE accelerator.
Name Address Description
Key Input 0x10–0x13 128-bit key input registers
Block Input 0x20–0x21 64-bit message input registers
Result 0x30–0x31 64-bit cipher output registers
Control 0x08 Accelerator control bit
Status 0x09 Accelerator status bit
Configuration 0x0A Accelerator configuration bit
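The register map in Table 1 suggests how software would split wide operands into 32-bit
register writes. The Python sketch below models only that bookkeeping against a mock register
file; it performs no PRINCE computation and is not the firmware of this work. The class
`MockPrinceRegisters`, the helper names, and in particular the word order (most significant
32-bit word at the lowest address) are assumptions made for illustration. The meaning of the
individual control, status, and configuration bits is not modeled.

```python
# Offsets taken from Table 1.
KEY_BASE, BLOCK_BASE, RESULT_BASE = 0x10, 0x20, 0x30
CONTROL, STATUS, CONFIG = 0x08, 0x09, 0x0A


class MockPrinceRegisters:
    """Stand-in for the accelerator: 32-bit registers behind 8-bit addresses."""

    def __init__(self):
        self.regs = {}

    def write(self, address, data):
        # Mirrors the role of prince_write_to_address(address, data).
        self.regs[address & 0xFF] = data & 0xFFFFFFFF

    def read(self, address):
        # Mirrors the role of prince_read_from_address(address).
        return self.regs.get(address & 0xFF, 0)


def split_words(value, n_words):
    """Split an integer into n_words 32-bit words, most significant first (assumed order)."""
    return [(value >> (32 * (n_words - 1 - i))) & 0xFFFFFFFF for i in range(n_words)]


def load_operands(bus, key, block):
    """Write a 128-bit key to 0x10-0x13 and a 64-bit block to 0x20-0x21."""
    for i, word in enumerate(split_words(key, 4)):
        bus.write(KEY_BASE + i, word)
    for i, word in enumerate(split_words(block, 2)):
        bus.write(BLOCK_BASE + i, word)


def read_result(bus):
    """Reassemble the 64-bit result from the two registers at 0x30-0x31."""
    return (bus.read(RESULT_BASE) << 32) | bus.read(RESULT_BASE + 1)
```

A complete driver would additionally write the control register at 0x08 to start an operation
and poll the status register at 0x09 before calling `read_result`.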
5.2. The PRESENT-80
The integrated PRESENT-80 accelerator core of the SoC was designed and configured
the same way as the PRINCE core. The connection diagram between the PRESENT-80
accelerator and the SoC is shown in Figure 8. Internally, the accelerator consists of a key
scheduler for generating round keys and two separate sub-modules for the encryption and
decryption process.
Figure 8. The PRESENT-80 accelerator architecture with bus interface.
This hierarchical structure simplified the hardware design of the accelerator, which
could then be accessed and configured at the address of 0x2800 in the CSR peripheral bus
after being integrated. The offset addresses are listed in Table 2.
Table 2. Internal register address mapping for the PRESENT-80 core.
Name Address Description
Configuration 0x00 Configuration of encryption or decryption
Key Input 0x01–0x03 80-bit key input register
Data Input 0x04–0x05 64-bit data input register
Cipher Output 0x06–0x07 64-bit cipher output register
5.3. The ChaCha Accelerator Core
The accelerator core for ChaCha used in the SoC was also written in Verilog HDL and
connected using the same principle. The connection diagram between the ChaCha core and
the SoC bus is shown in Figure 9. This implementation provides some configurable elements,
such as the number of rounds, the optional 128-bit key input, and the direct encryption
output of a 512-bit message block (XORed with the 512-bit block of the keystream). The
accelerator core can be accessed at the address 0x0200 on the CSR peripheral bus. The
offset addresses are listed in Table 3.
Figure 9. The ChaCha accelerator architecture with bus interface.
Table 3. Internal register address mapping for the ChaCha accelerator.
Name Address Description
Key Input 0x10–0x17 256-bit key input registers
Nonce Input 0x20–0x21 64-bit number-use-once registers
Data Input 0x40–0x4F Optional 512-bit message input registers
Data Output 0x80–0x8F 512-bit keystream output registers
Control 0x08 Accelerator control bit
Status 0x09 Accelerator status bit
Configuration 0x30 Accelerator configuration bit
5.4. The SHA3-512 Accelerator Core
The SHA3-512 accelerator core used in this work was slightly modified from an
open-source project [43]. The modification was made to the padding module of the
accelerator. Since the core in [43] was developed for Keccak, its padding specification was
to append to the input message a 1 bit, followed by a pre-determined number of 0 bits, and
a final 1 bit. This padding scheme changed when Keccak was standardized as SHA-3: in
hexadecimal form, a 0x06 byte is appended, followed by a number of 0x00 bytes, ending
with a 0x80 byte. There was no change in terms of resources after this modification.
The connection diagram between the SHA-3 core and the SoC is shown in Figure 10. A
wrapper was then written in Verilog HDL using a simplified interface similar to the PRINCE
and ChaCha cores. The core was assigned the address 0x4000 on the CSR bus, and the
internal register address mapping is shown in Table 4 below.
Figure 10. The SHA3-512 accelerator architecture with bus interface.
Table 4. Internal register address mapping for the SHA3-512 core.
Name Address Description
Reset 0x00 Set and reset input
Input 0x01 32-bit input register
Byte Number 0x02 Set index of last byte
Input Last 0x03 Set the last input and start
Status 0x09 Accelerator status bit
Hash Output 0x10–0x1F 512-bit hash output registers
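The padding change described for the SHA3-512 core can be illustrated at the byte level. The
Python sketch below is a software model, not the modified Verilog padding module: the function
name `pad_message` and its parameters are hypothetical, and byte-aligned messages are assumed.
The first padding byte is 0x06 for SHA-3 and 0x01 for the original Keccak padding; when only
one padding byte fits, the first and last bytes coincide and are combined with a bitwise OR.

```python
RATE_BYTES = 576 // 8  # 72-byte rate of SHA3-512


def pad_message(message, first_byte=0x06, rate_bytes=RATE_BYTES):
    """Pad a byte-aligned message to a multiple of the rate.

    first_byte=0x06 gives the SHA-3 pattern (0x06, 0x00..., 0x80);
    first_byte=0x01 gives the original Keccak pattern (0x01, 0x00..., 0x80).
    """
    # At least one padding byte is always added, at most a full block.
    pad_len = rate_bytes - (len(message) % rate_bytes)
    padding = bytearray(pad_len)
    padding[0] |= first_byte
    padding[-1] |= 0x80
    return bytes(message) + bytes(padding)
```

For example, an empty message is padded to a single 72-byte block that starts with 0x06 and
ends with 0x80, whereas a 71-byte message receives the single padding byte 0x86.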
5.5. The System-on-Chip
In this paper, LiteX [44] was used as the SoC building framework, interconnecting
components and invoking suitable toolchains to synthesize and deploy the design in an
FPGA. The VexRiscv processor core with the Wishbone bus configuration was the first
component to be initialized in the system, along with other peripherals, forming a basic
SoC. All accelerator core designs in Verilog HDL were added later in the process. LiteX also
supports generating essential project files and constraints for the respective FPGA platform
toolchain, which was Vivado in this work.
Utilizing Migen, a Python-based fragmented hardware description language, the
LiteX framework enables hardware cores and SoC systems to be designed with ease,
experimenting with various digital design architectures and implementing them in various
FPGA hardware platforms. By using the framework alongside the hardware cores library
provided by the LiteX open-source community, including VexRiscv, a more-accessible
building process of large and complex SoCs can be made, improving the portability and
flexibility of the design process.
The SoC was generated with the configuration listed below. After the SpinalHDL generation
process for the VexRiscv processor was called, the framework compiled and elaborated the
system design based on the building script and then started the FPGA synthesis process to
build the bitstream.
• Single-core 32-bit VexRiscv processor (RV32IM), 32-bit Wishbone bus with 4 GB address
space, 8 KB L1 cache (4 KB data cache and 4 KB instruction cache), and 8 KB L2 cache.
• Peripherals: UART, SPI, GPIO.
• Custom cryptographic accelerator cores for PRINCE, PRESENT-80, ChaCha, and SHA3-512.
The proposed SoC architecture composed of a VexRiscv processor and lightweight
cryptographic cores is presented in Figure 11. As briefly mentioned for the PRINCE
accelerator core in Section 5.1, the Configuration/Status Register bus in LiteX is a mechanism
for reading and writing values for the configuration registers of various intellectual property
(IP) cores within the FPGA. It should not be conflated with CSRs as defined by the RISC-V
ISA. This bus is used to control and monitor the behavior of the IP cores, such as setting
up clock frequencies, enabling or disabling features, and reading status information. The
CSR bus is an important feature of the LiteX SoC builder tool in building the proposed
SoC, as it provides a simple and flexible interface to the IP cores, making it easy to
configure and control various components, including all of the lightweight cryptographic
accelerator cores.
Figure 11. SoC architecture with lightweight cryptographic and hash function accelerator cores.
6. Experimental Results
6.1. FPGA Implementation Results
IO configuration and project constraints for the Nexys4 DDR FPGA development
board were generated alongside the Verilog RTL for the SoC as a “.xdc” file. The design
was synthesized and implemented using Vivado 2021.2, with the resource utilization
shown in Table 5. There were no DRC violations, and the STA (static timing analysis)
report was clean. The clock source for the Nexys4 DDR was an onboard 100 MHz
crystal, whereas the VexRiscv core SoC was driven by an MMCM (Xilinx Mixed-Mode
Clock Manager) at 75 MHz. The frequency of 75 MHz, rather than the highest achievable
frequency, was chosen to better represent the system as an IoT device, since it offers a
compromise between efficiency and performance, which is crucial for such devices. At
75 MHz, the SoC can already perform the necessary computations, which will
be discussed later, while still being able to conserve energy if needed.
Table 5. Synthesis result for Xilinx Nexys4 DDR XC7A100TCSG324-1 FPGA.
Top Design * VexRiscv PRINCE PRESENT-80 ChaCha SHA3-512
LUTs 11,830 2025 1561 489 2890 3003
Logic LUTs 11,684 2025 1561 489 2890 3003
LUTs as RAMs 144 0 0 0 0 0
SRLs 2 0 0 0 0 0
FFs 9552 1279 648 901 1960 2291
RAMB 36 9 0 0 0 0
DSP 4 4 0 0 0 0
* The total utilized resources consist of accelerator cores, also shown in the table above, and many other components
that make up the system (e.g., the internal bus, basic peripherals).
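The breakdown in Table 5 can be summarized with a short calculation. The Python sketch below
only re-adds the LUT figures from the table; the variable names are illustrative, and the
“other” category is simply the remainder attributed to the bus, peripherals, and glue logic
mentioned in the table footnote.

```python
# LUT counts copied from Table 5.
LUTS = {
    "Top Design": 11830,
    "VexRiscv": 2025,
    "PRINCE": 1561,
    "PRESENT-80": 489,
    "ChaCha": 2890,
    "SHA3-512": 3003,
}
ACCELERATORS = ("PRINCE", "PRESENT-80", "ChaCha", "SHA3-512")

# LUTs used by the four cryptographic accelerator cores together.
accelerator_luts = sum(LUTS[name] for name in ACCELERATORS)

# Remainder of the top design after removing the processor and the accelerators.
other_luts = LUTS["Top Design"] - LUTS["VexRiscv"] - accelerator_luts


def share(luts):
    """Percentage of the top-design LUT count."""
    return 100.0 * luts / LUTS["Top Design"]


if __name__ == "__main__":
    print(f"accelerators: {accelerator_luts} LUTs ({share(accelerator_luts):.1f}%)")
    print(f"VexRiscv:     {LUTS['VexRiscv']} LUTs ({share(LUTS['VexRiscv']):.1f}%)")
    print(f"other:        {other_luts} LUTs ({share(other_luts):.1f}%)")
```

By this count, the four accelerator cores account for 7943 of the 11,830 LUTs, roughly two
thirds of the design, with 1862 LUTs left for the remaining system components.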
6.2. Accelerator Core Function Verifications
All cryptographic accelerator cores connected to the SoC through the peripheral bus
were evaluated using custom firmware, developed and compiled using the standard and
open-source GNU Compiler Collection (GCC) RISC-V toolchain.
The firmware/software interface for each cryptographic component was developed
using the address mapping information from Tables 1–4. The complete cipher functions and
verification test cases for the PRINCE, PRESENT-80, ChaCha, and SHA3-512 algorithms
are also included. The verification results of the PRINCE, PRESENT-80, ChaCha, and
SHA-3 cores are displayed in Figures 12–16, respectively. This evaluation confirmed that all
functional components operated correctly.
Figure 12. Custom firmware with interactive console, running in the SoC.
Figure 14. Verification result for the PRESENT-80 block cipher. Two simple test cases verified the
encryption and decryption function of the accelerator core and correctly returned the expected data.
Figure 13.
Verification result for the block