Conference Paper

A Script-Based Cycle-True Verification Framework to Speed-Up Hardware and Software Co-Design of System-on-Chip exploiting RISC-V Architecture


Abstract

The complexity of heterogeneous Systems-on-Chip has grown considerably in the last decades, and the effort necessary to set up a verification workflow has increased as well. The verification phase takes on average 57% of the project time, and in recent years several solutions aimed at automating that task have been developed. Some relevant works in this field automate the VLSI design flow from synthesis to Place-And-Route and Layout-Vs-Schematic design checks, but leave software design out of the automated verification loop. Our work focuses on the early stages of the design phase, where designers make software and hardware choices to explore a larger design space. In this work, we present a flexible, Make-based framework to build up verification and design environments. It aids the development of Systems-on-Chip running RISC-V processors, automating software compilation, cycle-true simulations and post-synthesis analyses. It exploits the parallelism of the Make build tool to ensure consistency of results, provide flow reproducibility, and accelerate design space exploration using different flow recipes provided by the designer. Its modular structure allows each task to be performed with various third-party tools and makes the workflow execution chain customizable. Using the proposed framework, we show how the reduced designer effort increases design productivity. Indeed, the time needed to build up a validated development environment is substantially reduced, since a few configuration properties are enough to set up all the tools used in the workflow.
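As a rough illustration of the dependency-driven execution described in the abstract, the sketch below models Make-style target resolution in Python; the target names, configuration keys, and flow steps are hypothetical, not taken from the actual framework.

```python
# Hypothetical sketch of Make-style dependency resolution driving a
# hardware/software co-design flow: every step runs only after its
# prerequisites, and a single configuration parameterises all tools.

CONFIG = {"core": "rv32imc", "simulator": "questasim", "flow": "post_synth"}

# Each target lists its prerequisites, mirroring a Makefile's dependency graph.
TARGETS = {
    "compile_sw": [],        # cross-compile bare-metal RISC-V software
    "synthesize": [],        # produce the post-synthesis netlist
    "simulate":   ["compile_sw", "synthesize"],  # cycle-true simulation
    "analyse":    ["simulate"],                  # post-synthesis analyses
}

def run(target, done=None):
    """Run `target` after its prerequisites, executing each step only once."""
    done = [] if done is None else done
    for dep in TARGETS[target]:
        run(dep, done)
    if target not in done:
        done.append(target)   # a real flow would invoke the tool here
    return done

order = run("analyse")
# order is ['compile_sw', 'synthesize', 'simulate', 'analyse']
```

Independent targets ("compile_sw", "synthesize") have no ordering constraint between them, which is exactly what lets Make run them in parallel with `-j`.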


... To perform the simulation, we used Siemens QuestaSim Core (QuestaSim) 22.4. This simulation leveraged the framework presented in [62], [63]. Using the Xilinx unsim pre- […]:
1) The Verilog post-implementation netlist of the GPU@SAT IP core as the DUT
2) A simulated SystemVerilog Random Access Memory (RAM) to handle load and store operations from the GPU
3) A SystemVerilog module for managing AXI4-Lite write transactions towards the control interface
A schematic view is presented in Figure 7. ...
Article
Full-text available
Field Programmable Gate Arrays are extensively utilized across numerous domains, including telecommunications, cryptography, Machine Learning, and safety-critical applications. In critical applications, FPGAs are often exposed to environmental challenges, such as Single Event Upsets, which can affect the reliability of the design. Therefore, assessing the robustness of FPGA-based systems becomes crucial to ensure their correct functionality under normal operation and mitigate potential risks. This can be done by developing a fault injection environment to simulate the effect of Single Event Effects over the specified Device Under Test. Firstly, this paper briefly reviews other related works, highlighting the differences between the various approaches. Then, our novel netlist-level fault injection tool called RADSAFiE is introduced. The tool is characterized by a Python-based User Interface that allows users to modify an input netlist by inserting fault injection logic primitives. The user can select between two interfaces: a Graphical User Interface or a Command Line Interface, both taking an HDL netlist as input. We discuss the supported fault types and highlight the compatibility with different FPGA vendors. We also conduct code profiling on the application and analyze, for specific operations, both execution time and memory usage. Specifically, we focus on a use case that features the Soft Graphic Processing Unit IP provided by IngeniArs S.r.l., namely GPU@SAT. We present performance results across netlists of different sizes and offer a concise overview of the simulation environment. Finally, in the last section of the paper, we hint at possible future development and conclude the work.
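To illustrate the general idea of netlist-level fault injection described above (not RADSAFiE's actual API or netlist format), the following sketch evaluates a toy gate-level netlist with and without a stuck-at fault forced on an internal net, the way an inserted saboteur primitive would override the driven value.

```python
# Toy gate-level netlist: (output_net, gate_type, input_nets) tuples in
# topological order. All names and the fault model are illustrative only.

NETLIST = [
    ("n1", "AND", ("a", "b")),
    ("n2", "XOR", ("n1", "c")),
    ("y",  "OR",  ("n2", "a")),
]

GATES = {"AND": lambda x, y: x & y,
         "XOR": lambda x, y: x ^ y,
         "OR":  lambda x, y: x | y}

def evaluate(inputs, fault=None):
    """Evaluate the netlist; `fault=(net, value)` forces a stuck-at fault."""
    nets = dict(inputs)
    for out, gate, ins in NETLIST:
        nets[out] = GATES[gate](*(nets[i] for i in ins))
        if fault and out == fault[0]:
            nets[out] = fault[1]   # saboteur overrides the driven value
    return nets["y"]

golden = evaluate({"a": 0, "b": 1, "c": 1})             # fault-free run
faulty = evaluate({"a": 0, "b": 1, "c": 1}, ("n2", 0))  # n2 stuck-at-0
```

Comparing the faulty output against the golden run is the basic observation step of any fault-injection campaign: a mismatch means the injected fault propagated to an observable output.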
... The target of this work, which extends our previous publication [8], is the workflow of heterogeneous digital designs described in Figure 1. The development cycle of these designs is very long. ...
Article
Full-text available
The complexity of digital designs has increased exponentially in the last decades. Heterogeneous Systems-on-Chip integrate many different hardware components which require a reliable and scalable verification environment. The effort to set up such environments has increased as well and plays a significant role in digital design projects, taking more than 50% of the total project time. Several solutions have been developed with the goal of automating this task, integrating various steps of the Very Large Scale Integration design flow, but without addressing the exploration of the design space on both the software and hardware sides. Early in the co-design phase, designers break down the system into hardware and software parts, taking into account different choices to explore the design space. This work describes the use of a framework for automating the verification of such choices, considering both hardware and software development flows. The framework automates the compilation of software, cycle-true simulations and analyses on synthesised netlists. It accelerates design space exploration by exploiting the GNU Make tool, and we focus on ensuring consistency of results and providing a mechanism to obtain reproducibility of the design flow. In design teams, this last feature increases cooperation and knowledge sharing from the single expert to the whole team. Using flow recipes, designers can configure various third-party tools integrated into the modular structure of the framework, and make workflow execution customisable. We demonstrate how the developed framework can be used to speed up the setup of the evaluation flow of an Elliptic-Curve-Cryptography accelerator, performing post-synthesis analyses. The framework can be easily configured, taking approximately 30 min, instead of a few days, to build up an environment to assess the accelerator performance and its resistance to simple power analysis side-channel attacks.
... The hardware structure of ICU4SAT is shown in Fig. 3. It has been developed thanks to the framework presented in [19]. In the following, we briefly describe its building blocks. ...
... In Table V, we present the system throughputs for each operating mode. The measurements have been carried out exploiting the system presented in [40], where an emulation environment capable of performing a low-level post-synthesis simulation of code execution on a RISC-V-based system is carefully described. The test has been performed simulating the entire system, loading the memory with bare-metal code, aiming at measuring the maximum capabilities of the HW system. ...
Article
Full-text available
This article presents a cryptographic hardware (HW) accelerator supporting multiple advanced encryption standard (AES)-based block cipher modes, including the more advanced cipher-based MAC (CMAC), counter with CBC-MAC (CCM), Galois counter mode (GCM), and XOR-encrypt-XOR-based tweaked-codebook mode with ciphertext stealing (XTS) modes. The proposed design implements advanced and innovative features in HW, such as AES key secure management, on-chip clock randomization, and access privilege mechanisms. The system has been tested in a RISC-V-based system-on-chip (SoC), specifically designed for this purpose, on an UltraScale+ Xilinx FPGA, analyzing resource and power consumption, together with system performance. The cryptoprocessor has then been synthesized on a 7-nm CMOS standard-cell technology; performance, complexity, and power consumption figures are analyzed and compared with the state of the art. The proposed cryptoprocessor is ready to be embedded within the innovative European Processor Initiative (EPI) chip.
Article
Full-text available
In spite of decades of work, design verification remains a highly expensive and time-consuming component of electronic system development. With the advent of System-on-Chip (SoC) architectures, verification has morphed into an activity that spans the entire life-cycle, making use of diverse platforms, tools, and technologies. This paper provides a tutorial overview of the state of the practice in verification in the context of verification of modern SoC designs, current industrial trends, and key research challenges.
Article
Full-text available
The Internet of Things (IoT) refers to a pervasive presence of interconnected and uniquely identifiable physical devices. These devices’ goal is to gather data and drive actions in order to improve productivity, and ultimately reduce or eliminate reliance on human intervention for data acquisition, interpretation and use. The proliferation of these connected low-power devices will result in a data explosion that will significantly increase data transmission costs with respect to energy consumption and latency. Edge computing reduces these costs by performing computations at the edge nodes, prior to data transmission, to interpret and/or utilize the data. While much research has focused on the IoT’s connected nature and communication challenges, the challenges of IoT embedded computing with respect to device microprocessors has received much less attention. This article explores IoT applications’ execution characteristics from a microarchitectural perspective and the microarchitectural characteristics that will enable efficient and effective edge computing. To tractably represent a wide variety of next-generation IoT applications, we present a broad IoT application classification methodology based on application functions, to enable quicker workload characterizations for IoT microprocessors. We then survey and discuss potential microarchitectural optimizations and computing paradigms that will enable the design of right-provisioned microprocessors that are efficient, configurable, extensible, and scalable. Our work provides a foundation for the analysis and design of a diverse set of microprocessor architectures for next-generation IoT devices.
Article
Full-text available
Industry is building larger, more complex, manycore processors on the back of strong institutional knowledge, but academic projects face difficulties in replicating that scale. To alleviate these difficulties and to develop and share knowledge, the community needs open architecture frameworks for simulation, synthesis, and software exploration which support extensibility, scalability, and configurability, alongside an established base of verification tools and supported software. In this paper we present OpenPiton, an open source framework for building scalable architecture research prototypes from 1 core to 500 million cores. OpenPiton is the world's first open source, general-purpose, multithreaded manycore processor and framework. OpenPiton leverages the industry hardened OpenSPARC T1 core with modifications and builds upon it with a scratch-built, scalable uncore creating a flexible, modern manycore design. In addition, OpenPiton provides synthesis and backend scripts for ASIC and FPGA to enable other researchers to bring their designs to implementation. OpenPiton provides a complete verification infrastructure of over 8000 tests, is supported by mature software tools, runs full-stack multiuser Debian Linux, and is written in industry standard Verilog. Multiple implementations of OpenPiton have been created including a taped-out 25-core implementation in IBM's 32nm process and multiple Xilinx FPGA prototypes.
Article
Full-text available
As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and application level. Further, we review both discrete and fused CPU-GPU systems, and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). We believe that this paper will provide insights into the working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.
Article
Full-text available
Because of technology scaling, power dissipation is today's major performance limiter. Moreover, the traditional way to achieve power efficiency, application-specific designs, is prohibitively expensive. These power and cost issues necessitate rethinking digital design. To reduce design costs, we need to stop building chip instances, and start making chip generators instead. Domain-specific chip generators are templates that codify designer knowledge and design trade-offs to create different application-optimized chips.
Conference Paper
Full-text available
This paper is meant to be a short introduction to a new paradigm for systems on chip (SoC) design. The premises are that a component-based design methodology will prevail in the future, to support component re-use in a plug-and-play fashion. At the same time, SoCs will have to provide a functionally-correct, reliable operation of the interacting components. The physical interconnections on chip will be a limiting factor for performance and energy consumption.
Article
Full-text available
The development of application-specific instruction-set processors (ASIP) is currently the exclusive domain of the semiconductor houses and core vendors. This is due to the fact that building such an architecture is a difficult task that requires expertise in different domains: application software development tools, processor hardware implementation, and system integration and verification. This paper presents a retargetable framework for ASIP design which is based on machine descriptions in the LISA language. From that, software development tools can be generated automatically including high-level language C compiler, assembler, linker, simulator, and debugger frontend. Moreover, for architecture implementation, synthesizable hardware description language code can be derived, which can then be processed by standard synthesis tools. Implementation results for a low-power ASIP for digital video broadcasting terrestrial acquisition and tracking algorithms designed with the presented methodology are given. To show the quality of the generated software development tools, they are compared in speed and functionality with commercially available tools of state-of-the-art digital signal processor and μC architectures
Article
Synthesis and simulation of a hardware design can take hours before results are available, even for small changes. In contrast, software development embraced live programming to boost productivity. This article proposes LiveHD, an open-source incremental framework for hardware synthesis and simulation that provides feedback within seconds. Three principles for incremental design automation are presented. LiveHD uses a unified VLSI data model, LGraph, to support the implementation of incremental principles for synthesis and simulation. LiveHD also employs a tree-like high-level intermediate representation to interface with modern hardware description languages. We present early results comparing with commercial and open-source tools. LiveHD can provide feedback for synthesis, placement and routing in under 30 s for most changes tested, with negligible QoR impact. For incremental simulation, LiveHD is capable of obtaining any simulation cycle in under 2 s for a 256-core RISC-V design.
Article
The open-source RISC-V instruction set architecture (ISA) is gaining traction, both in industry and academia. The ISA is designed to scale from microcontrollers to server-class processors. Furthermore, openness promotes the availability of various open-source and commercial implementations. Our main contribution in this paper is a thorough power, performance, and efficiency analysis of the RISC-V ISA targeting baseline "application class" functionality, i.e., supporting the Linux OS and its application environment, based on our open-source single-issue in-order implementation of the 64-bit ISA variant (RV64GC) called Ariane. Our analysis is based on a detailed power and efficiency analysis of the RISC-V ISA extracted from silicon measurements and calibrated simulation of an Ariane instance (RV64IMC) taped-out in GlobalFoundries 22FDX technology. Ariane runs at up to 1.7 GHz and achieves up to 40 Gop/s/W energy efficiency, which is superior to similar cores presented in the literature. We provide insight into the interplay between functionality required for application-class execution (e.g., virtual memory, caches, and multiple modes of privileged operation) and energy cost. We also compare Ariane with RISCY, a simpler and slower microcontroller-class core. Our analysis confirms that supporting application-class execution implies a nonnegligible energy-efficiency loss and that compute performance is more cost-effectively boosted by instruction extensions (e.g., packed SIMD) rather than high-frequency operation.
Article
Modern networks have critical security needs, and a suitable level of protection and performance is usually achieved with the use of dedicated hardware cryptographic cores. Although the Advanced Encryption Standard (AES) is considered the best approach when symmetric cryptography is required, one of its main weaknesses lies in its measurable power consumption. Side-Channel Attacks (SCAs) use this emitted power to analyse and revert the mathematical steps and extract the encryption key. Nowadays, several dedicated pieces of equipment and workstations exist for the analysis of SCA weaknesses and the evaluation of the related countermeasures, but they can present significant drawbacks, such as a high cost for the instrumentation or, in the case of cheaper instrumentation, the need to underclock the physical circuit implementing the AES cipher in order to adapt the circuit clock frequency to the power sampling rate of ADCs or the oscilloscope bandwidth. In this work, we propose a methodology for Correlation and Differential Power Analysis against hardware implementations of an AES core, relying only on a simulative approach. Our solution extracts simulated power traces from a gate-level netlist and then elaborates them using mathematical-statistical procedures. The main advantage of our solution is that it allows emulating a real attack scenario based on emitted power analysis without requiring any additional physical circuit or dedicated equipment for power sample acquisition, and without modifying the working conditions of the target application context (such as the circuit clock frequency). Thus, our approach can be used to validate and benchmark any SCA countermeasure during an early step of the design, thereby shortening the evaluation phase and helping designers to find the best solution during a preliminary phase, potentially without additional costs.
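The correlation step of such a simulative attack can be sketched as follows. This is a toy model, not the paper's actual procedure: a 4-bit S-box stands in for the AES S-box, and Hamming-weight leakage plus Gaussian noise stands in for simulated gate-level power traces.

```python
# Toy Correlation Power Analysis: recover a 4-bit key from simulated traces
# by ranking key guesses with the Pearson correlation between hypothetical
# leakage and the observed (simulated) power. All parameters are illustrative.
import random

# 4-bit S-box (the PRESENT cipher S-box) standing in for the AES S-box.
SBOX = [12, 5, 6, 11, 9, 0, 10, 13, 3, 14, 15, 8, 4, 7, 1, 2]
hw = lambda v: bin(v).count("1")   # Hamming-weight leakage model

random.seed(0)
SECRET_KEY = 9
plaintexts = [random.randrange(16) for _ in range(300)]
# Simulated power samples: leakage of SBOX(p ^ key) plus Gaussian noise.
traces = [hw(SBOX[p ^ SECRET_KEY]) + random.gauss(0, 0.4) for p in plaintexts]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Rank every key guess by how well its hypothetical leakage fits the traces.
scores = {k: abs(pearson([hw(SBOX[p ^ k]) for p in plaintexts], traces))
          for k in range(16)}
best_guess = max(scores, key=scores.get)
```

The correct key produces the highest correlation because only its hypothesis matches the data-dependent part of the power; the same ranking principle applies whether the traces come from an oscilloscope or, as in the paper, from gate-level simulation.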
Article
Innovations like domain-specific hardware, enhanced security, open instruction sets, and agile chip development will lead the way.
Conference Paper
Modern networks have critical security needs, and a suitable level of protection and performance is usually achieved with the use of dedicated hardware cryptographic cores. Although the Advanced Encryption Standard (AES) is considered the best approach when symmetric cryptography is required, one of its main weaknesses lies in its measurable power consumption. Side Channel Attacks (SCAs) use this emitted power to analyze and revert the mathematical steps and extract the encryption key. In this work we propose a simulated methodology based on Correlation and Differential Power Analysis. Our solution extracts the simulated power from a gate-level implementation of the AES core and elaborates it using mathematical-statistical procedures. An SCA countermeasure can then be evaluated without the need for any physical circuit. Each solution can be benchmarked during an early step of the design, thereby shortening the evaluation phase and helping designers to find the best solution during a preliminary phase. The cost of our approach is lower compared to any kind of analysis that requires the silicon chip to evaluate SCA protection.
Article
Ultra-low-power embedded systems have recently started the move to multi-core designs. Aggressive voltage scaling techniques have the potential to reduce the power consumption within the admitted envelope, but memory operations on standard six-transistor static RAM (6T-SRAM) cells become unreliable at low voltages. While standard cell memory (SCM) overcomes this limitation, it has much lower area density than SRAM, and thus it is too costly. On the other hand, several applications have inherent tolerance to computation errors, and executing such workloads with approximation has already proven a viable way to reduce energy consumption. In this work we propose a novel HW/SW approach to design energy-efficient ultra-low-power systems which combines the key ideas of approximate computing and hybrid memory systems featuring both SCM and 6T-SRAM. We introduce novel hardware support to split error-tolerant data so as to host the most significant bits (MSBs) in the SCM and the least significant bits (LSBs) in the 6T-SRAM. This allows the memory system to be powered at a low voltage while ensuring correct operation, by binding possible (bit-flip) errors to the LSBs only. In addition, by organizing 6T-SRAM banks into multiple independent voltage domains, we enable fine-grained, software-controlled voltage switching policies. At the software level, we propose language constructs to specify which regions of code and which variables are tolerant to approximation, plus compiler support to optimize data placement. Experimental results show that our proposal can reduce the energy consumption of the memory system by 47% on average, while always complying with the result accuracy required by practical application constraints.
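The MSB/LSB splitting idea can be illustrated with a small model; the word width, partition size, and flip probability below are arbitrary choices, not values from the paper.

```python
# Model of MSB/LSB data splitting: the upper bits of an error-tolerant word
# live in reliable storage (SCM), the lower bits in low-voltage SRAM where
# bit flips may occur, so the worst-case error is bounded by the LSB field.
import random

LSB_BITS = 4   # lower 4 bits of a 16-bit word go to unreliable 6T-SRAM

def store_and_load(value, flip_prob, rng):
    """Model a store/load where only the LSB partition can suffer bit flips."""
    msb = value >> LSB_BITS                  # reliable SCM partition
    lsb = value & ((1 << LSB_BITS) - 1)      # low-voltage 6T-SRAM partition
    for bit in range(LSB_BITS):              # errors are bound to LSBs only
        if rng.random() < flip_prob:
            lsb ^= 1 << bit                  # flip this unreliable bit
    return (msb << LSB_BITS) | lsb

rng = random.Random(42)
value = 51200
loaded = store_and_load(value, flip_prob=0.3, rng=rng)
max_err = (1 << LSB_BITS) - 1                # worst-case absolute error: 15
```

Because the MSB partition is untouched by the flips, `abs(loaded - value)` can never exceed `max_err`, which is exactly the accuracy guarantee the hybrid memory scheme relies on.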
Article
The final phase of CMOS technology scaling provides continued increases in already vast transistor counts, but only minimal improvements in energy efficiency, thus requiring innovation in circuits and architectures. However, even huge teams are struggling to complete large, complex designs on schedule using traditional rigid development flows. This article presents an agile hardware development methodology, which the authors adopted for 11 RISC-V microprocessor tape-outs on modern 28-nm and 45-nm CMOS processes in the past five years. The authors discuss how this approach enabled small teams to build energy-efficient, cost-effective, and industry-competitive high-performance microprocessors in a matter of months. Their agile methodology relies on rapid iterative improvement of fabricatable prototypes using hardware generators written in Chisel, a new hardware description language embedded in a modern programming language. The parameterized generators construct highly customized systems based on the free, open, and extensible RISC-V platform. The authors present a case study of one such prototype featuring a RISC-V vector microprocessor integrated with a switched-capacitor DC-DC converter alongside an adaptive clock generator in a 28-nm, fully depleted silicon-on-insulator process.
Conference Paper
This paper presents Heracles, an open-source, functional, parameterized, synthesizable multicore system toolkit. Such a multi/many-core design platform is a powerful and versatile research and teaching tool for architectural exploration and hardware-software co-design. The Heracles toolkit comprises the soft hardware (HDL) modules, application compiler, and graphical user interface. It is designed with a high degree of modularity to support fast exploration of future multicore processors of different topologies, routing schemes, processing elements (cores), and memory system organizations. It is a component-based framework with parameterized interfaces and strong emphasis on module reusability. The compiler toolchain is used to map C or C++ based applications onto the processing units. The GUI allows the user to quickly configure and launch a system instance for easy factorial development and evaluation. Hardware modules are implemented in synthesizable Verilog and are FPGA platform independent. The Heracles tool is freely available under the open-source MIT license at: http://projects.csail.mit.edu/heracles
Article
In this paper we introduce Chisel, a new hardware construction language that supports advanced hardware design using highly parameterized generators and layered domain-specific hardware languages. By embedding Chisel in the Scala programming language, we raise the level of hardware design abstraction by providing concepts including object orientation, functional programming, parameterized types, and type inference. Chisel can generate a high-speed C++-based cycle-accurate software simulator, or low-level Verilog designed to map to either FPGAs or to a standard ASIC flow for synthesis. This paper presents Chisel, its embedding in Scala, hardware examples, and results for C++ simulation, Verilog emulation and ASIC synthesis.
Article
The growing complexity of digital architectures strongly impacts their verification phase, which becomes critical. In fact, without giving up cycle accuracy, the state of the art appears to lack fast frameworks allowing multi-parametric simulations at a fine granularity level for accurate verification and design space exploration purposes. In this paper, we propose SysCgrid, a parallel, fast and cycle-accurate SystemC simulation framework. SysCgrid is designed to provide automatic generation and parallel execution of multi-parametric simulations with minimum effort by hardware architects. This framework is conceived to run on a cluster/grid computing infrastructure, exploiting the Message Passing Interface (MPI) to automatically distribute the multi-parametric simulation set over more than one node. The achieved results on a Network-on-Chip (NoC) architecture verification demonstrate the good scalability of SysCgrid, both in the case of shared and non-shared memory computing infrastructures, achieving impressive speed-ups with respect to traditional single-run simulation approaches.
Article
Current systems-on-chip (SoCs) execute applications that demand extensive parallel processing; thus, the amount of processors, memories and application-specific signal processing cores is rapidly increasing. In these new multi-processor SoCs (MPSoCs), one of the most critical elements regarding overall efficiency is on-chip interconnections. Network-on-chip (NoC) provides a structured way of realizing interconnections on silicon, and obviates the limitations of bus-based solutions. NoCs can have regular or ad hoc topologies and can be tuned by a large set of parameters. Simulation and functional validation are essential to assess the correctness and performance of MPSoC architectures. We present a flexible hardware-software emulation framework implemented on an FPGA that is specially designed to suitably explore, evaluate and compare a wide range of NoC solutions with a very limited effort. Our experimental results show a speed-up of four orders of magnitude with respect to cycle-accurate HDL simulation, while retaining cycle accuracy and flexibility of software simulators. Finally, we propose a validation flow for MPSoCs based on our flexible NoC emulation framework, which allows designers to explore and optimize a range of solutions, as well as quickly characterize performance figures and identify possible limitations in their on-chip interconnection architectures.
Byoc: A "bring your own core" framework for heterogeneous-ISA research
  • J Balkind
Sonicboom: The 3rd generation berkeley out-of-order machine
  • J Zhao
  • B Korpan
  • A Gonzalez
  • K Asanovic
The rocket chip generator
  • K Asanović