Conference Paper

Is RISC-V ready for HPC prime-time: Evaluating the 64-core Sophon SG2042 RISC-V CPU

... To our knowledge, no other work in the literature has investigated the performance of language models on a RISC-V architecture similar to the one we use in this work. Brown et al. [29] evaluated the parallel workloads from the RAJAPerf benchmarking suite on the SOPHON SG2042 SoC and compared the performance with state-of-the-art ARM and x86 architectures. The authors explored the RVV v0.7.1 capabilities using the XuanTie GCC compiler provided by the vendor to enable single-precision autovectorization on the RAJAPerf kernels. ...
... However, many other factors may influence this result when comparing two different platforms. Both [29] and [6] point out the challenges of developing and running vectorized code on RISC-V due to the immaturity of tooling and hardware. ...
... The three related works we found [29,6,30] evaluate the RVV capabilities of CPUs similar to the one used in this work. However, only [29] investigated the SOPHON SG2042, with its complex NUMA design, but they did not focus on evaluating RVV performance. ...
Conference Paper
The rising usage of compute-intensive AI applications with fast response time requirements, such as text generation using large language models, underscores the need for more efficient and versatile hardware solutions. This drives the exploration of emerging architectures like RISC-V, which has the potential to deliver strong performance within tight power constraints. The recent commercial release of processors with RISC-V Vector (RVV) silicon-enabled extensions further amplifies the significance of RISC-V architectures, offering enhanced capabilities for parallel processing and accelerating tasks critical to large language models and other AI applications. This work aims to evaluate the BERT and GPT-2 language models inference performance on the SOPHON SG2042 64-core RISC-V architecture with silicon-enabled RVV v0.7.1. We benchmarked the models with and without RVV, using OpenBLAS and BLIS as BLAS backends for PyTorch to enable vectorization. Enabling RVV in OpenBLAS improved the inference performance by up to 40% in some cases.
... Not only does this processor provide significantly more cores than existing, SoC-based, mass-produced RISC-V CPUs, but furthermore the T-Head XuanTie C920 cores themselves have been designed for high performance. Consequently, this new RISC-V CPU is very interesting to the HPC community, and previous work [2] found that, for the RAJAPerf suite [3], it delivers a considerable performance uplift compared to existing commodity available RISC-V CPUs, but struggled to match a set of x86-based CPUs that are commonplace in HPC machines. In this paper we leverage NASA's NAS Parallel Benchmark (NPB) suite to undertake more in-depth performance characterisation of the SG2042. ...
... In [2] the authors benchmarked the SG2042 using the RAJAPerf suite, however this was across a large number of individual kernels and from the results it was difficult to isolate and identify individual performance patterns. By contrast, in this paper we characterise and explore each individual benchmark of the NPB suite to better classify the performance properties of the SG2042. ...
... Consistently, the U74 of the VisionFive V2 performs closest to the C920, but is still only delivering between 21% and 38% of the performance of the C920. Whilst the VisionFive V1 and SiFive U740 both contain the same U74 core as the VisionFive V2, they are significantly slower, and this is broadly in agreement with [2]. The C906 of the Allwinner D1 is outperformed by the C920 and the U74 of the V2 quite considerably. ...
Preprint
Whilst RISC-V has grown phenomenally quickly in embedded computing, it is yet to gain significant traction in High Performance Computing (HPC). However, as we move further into the exascale era, the flexibility offered by RISC-V has the potential to be very beneficial in future supercomputers, especially as the community places an increased emphasis on decarbonising its workloads. Sophon's SG2042 is the first mass-produced, commodity available, high-core-count RISC-V CPU designed for high performance workloads. First released in summer 2023, and at the time of writing now becoming widely available, a key question is whether this is a realistic proposition for HPC applications. In this paper we use NASA's NAS Parallel Benchmark (NPB) suite to characterise performance of the SG2042 against other CPUs implementing the RISC-V, x86-64, and AArch64 ISAs. We find that the SG2042 consistently outperforms all other RISC-V solutions, delivering between a 2.6 and 16.7 times performance improvement at the single core level. When compared against the x86-64 and AArch64 CPUs, which are commonplace for high performance workloads, we find that the SG2042 performs comparatively well with computationally bound algorithms but decreases in relative performance when the algorithms are memory bandwidth or latency bound. Based on this work, we identify that performance of the SG2042's memory subsystem is the greatest bottleneck.
... Containing 64 cores, the SG2042 RISC-V CPU is aimed at high performance workloads and consequently this is a far more serious proposition for the HPC community than previous RISC-V hardware. Previous work [3] found that whilst each core of the SG2042 was a considerable improvement beyond other existing RISC-V hardware, the CPU struggled to match the performance of x86-based CPUs that are commonplace in HPC. However, from this previous work it was not possible to understand whether the issues that were observed were inherent to the CPU itself or to the rest of the system, the Milk-V Pioneer. ...
... Prior work [5] developed a tool that would backport RVV v1.0 to RVV 0.7.1, enabling a wider range of compilers such as LLVM to vectorise for the CPU. [3] undertook the first benchmarking of the Sophon SG2042 and the authors leveraged the RAJAPerf suite to compare and contrast against x86 based CPUs. However, the RAJAPerf suite comprises a large number of kernels and the results were such that whilst it was possible to generally observe that the Sophon SG2042 fell short of the performance delivered by these other CPUs, it was difficult to identify specific patterns and results behind the performance that was observed. ...
Preprint
Whilst RISC-V has become popular in fields such as embedded computing, it is yet to find mainstream success in High Performance Computing (HPC). However, the 64-core RISC-V Sophon SG2042 is a potential game changer as it provides a commodity available CPU with much higher core count than existing technologies. In this work we benchmark the SG2042 CPU hosted in an experimental, dual-socket system to explore the performance properties of the CPU when running a common HPC benchmark suite across sockets. Earlier benchmarks found that, on the Milk-V Pioneer workstation, whilst the SG2042 performs well for compute bound codes, it struggles when pressure is placed on the memory subsystem. The performance results reported here confirm that, even on a different system, these memory performance limitations are still present and hence inherent in the CPU. However, a multi-socket configuration does enable the CPU to scale to a larger number of threads which, in the main, delivers an improvement in performance, and so this is a realistic system configuration for the HPC community.
... While a number of closed-source superscalar, OoO RISC-V cores have been designed (e.g. the products from Ventana [17], SiFive [16], Semidynamics [14] or SOPHGO [5]), the landscape of open-source cores is much more sparsely populated. Existing open-source RISC-V cores with superscalar and OoO capabilities, such as BOOM [21] and Xiangshan [18], have been developed using Chisel, a hardware description language (HDL) that, while powerful, is not widely adopted in the industry. ...
Preprint
Open-source RISC-V cores are increasingly demanded in domains like automotive and space, where achieving high instructions per cycle (IPC) through superscalar and out-of-order (OoO) execution is crucial. However, high-performance open-source RISC-V cores face adoption challenges: some (e.g. BOOM, Xiangshan) are developed in Chisel with limited support from industrial electronic design automation (EDA) tools. Others, like the XuanTie C910 core, use proprietary interfaces and protocols, including non-standard AXI protocol extensions, interrupts, and debug support. In this work, we present a modified version of the OoO C910 core to achieve full RISC-V standard compliance in its debug, interrupt, and memory interfaces. We also introduce CVA6S+, an enhanced version of the dual-issue, industry-supported open-source CVA6 core. CVA6S+ achieves 34.4% performance improvement over CVA6 core. We conduct a detailed performance, area, power, and energy analysis on the superscalar out-of-order C910, superscalar in-order CVA6S+ and vanilla, single-issue in-order CVA6, all implemented in a 22nm technology and integrated into Cheshire, an open-source modular SoC. We examine the performance and efficiency of different microarchitectures using the same ISA, SoC, and implementation with identical technology, tools, and methodologies. The area and performance rankings of CVA6, CVA6S+, and C910 follow expected trends: compared to the scalar CVA6, CVA6S+ shows an area increase of 6% and an IPC improvement of 34.4%, while C910 exhibits a 75% increase in area and a 119.5% improvement in IPC. However, efficiency analysis reveals that CVA6S+ leads in area efficiency (GOPS/mm2), while the C910 is highly competitive in energy efficiency (GOPS/W). This challenges the common belief that high performance in superscalar and out-of-order cores inherently comes at a significant cost in area and energy efficiency.
... The pseudocode for the optimized algorithm for a lower triangular non-transposed matrix is given in Algorithm 4. It traverses the matrix "bottom-up". For the last columns, calculations are performed using the baseline algorithm (lines 4-14), and for the remaining columns, they are performed by an optimized algorithm that traverses along diagonals (lines 15-31). Then the next unknown x[i] is calculated. ...
Preprint
The rapid development of RISC-V instruction set architecture presents new opportunities and challenges for software developers. Is it sufficient to simply recompile high-performance software optimized for x86-64 onto RISC-V CPUs? Are current compilers capable of effectively optimizing C and C++ codes or is it necessary to use intrinsics or assembler? Can we analyze and improve performance without well-developed profiling tools? Do standard optimization techniques work? Are there specific RISC-V features that need to be considered? These and other questions require careful consideration. In this paper, we present our experience optimizing four BLAS algorithms for band matrix operations on RISC-V processors. We demonstrate how RISC-V-optimized implementations of OpenBLAS algorithms can be significantly accelerated through improved vectorization of computationally intensive loops. Experiments on Lichee Pi 4A and Banana Pi BPI-F3 devices using RVV 0.7.1 and RVV 1.0 vector instruction sets respectively, show speedups of 1.5x to 10x depending on the operation compared to the OpenBLAS baseline. In particular, the successful use of vector register grouping with RVV can lead to significant performance improvements.
... However, the Milk-V Pioneer's availability as the first desktop-grade system with 64 cores creates new opportunities for further testing and evaluation. Milk-V's Pioneer has been used in [7] for a comparison study with x86 CPUs using the RAJAPerf benchmarking suite [6]. A study on software support for academic and industrial applications is available in [8]. ...
Preprint
In recent years, interest in RISC-V computing architectures has moved from academic to mainstream, especially in the field of High Performance Computing where energy limitations are increasingly a point of concern. The results presented in this paper are part of a longer-term evaluation of RISC-V's viability for HPC applications. In this work, we use the Octo-Tiger multi-physics, multi-scale, 3D adaptive mesh refinement astrophysics application as the basis for our analysis. We report on our experience in porting this modern C++ code (which is built upon several open-source libraries such as HPX and Kokkos) to RISC-V. We also compare the application's performance, scalability, and power consumption on RISC-V to an A64FX system.
... It can be seen from Table 1 that the SG2042 delivers considerably higher performance than the previous commodity available RISC-V hardware; however, benchmarking [1] has shown that it is still between four and eight times slower than modern x86-based hardware commonly used in HPC. ...
Preprint
Funded by the UK ExCALIBUR H&ES exascale programme, since early 2022 we have provided a RISC-V testbed for HPC to offer free access for scientific software developers to experiment with RISC-V for their workloads. Based upon our experiences of providing access to RISC-V for the HPC community, and our involvement with the RISC-V community at large, in this extended abstract we summarise the current state of RISC-V for HPC and consider the high priority areas that should be addressed to help drive adoption.
Article
Automatic vectorization is an important technique for compilers to improve the parallelism of programs. With the widespread usage of SIMD (Single Instruction Multiple Data) extensions in modern processors, automatic vectorization has become a hot topic in the research of compiler techniques. Accurately evaluating the effectiveness of automatic vectorization in typical compilers is quite valuable for compiler optimization and design. This paper evaluates the effectiveness of automatic vectorization, analyzes the limitations of automatic vectorization and their main causes, and improves the automatic vectorization technology. This paper first classifies the programs by two main factors: program characteristics and transformation methods. Then, it evaluates the effectiveness of automatic vectorization in three well-known compilers (GCC, LLVM, and ICC, including multiple versions from the last 5 years) through the TSVC (Test Suite for Vectorizing Compilers) benchmark. Furthermore, this paper analyzes the limitations of automatic vectorization based on source code analysis, and introduces the differences between academic research and engineering practice in automatic vectorization and their main causes. Finally, it gives some suggestions as to how to improve automatic vectorization capability.
Chapter
Leveraging vectorisation, the ability for a CPU to apply operations to multiple elements of data concurrently, is critical for high performance workloads. However, at the time of writing, commercially available physical RISC-V hardware that provides the RISC-V vector extension (RVV) only supports version 0.7.1, which is incompatible with the latest ratified version 1.0. The challenge is that upstream compiler toolchains, such as Clang, only target the ratified v1.0 and do not support the older v0.7.1. Because v1.0 is not compatible with v0.7.1, the only way to program vectorised code is to use a vendor-provided, older compiler. In this paper we introduce the rvv-rollback tool which translates assembly code generated by the compiler using vector extension v1.0 instructions to v0.7.1. We utilise this tool to compare vectorisation performance of the vendor-provided GNU 8.4 compiler (supports v0.7.1) against LLVM 15.0 (supports only v1.0), where we found that the LLVM compiler is capable of auto-vectorising more computational kernels, and delivers greater performance than GNU in most, but not all, cases. We also tested LLVM vectorisation with vector length agnostic and specific settings, and observed cases with significant difference in performance.
Keywords: RISC-V vector extension, HPC, Clang, RVV Rollback
Article
Internet-of-Things (IoT) opens a new world of possibilities for both personal and industrial applications. At the heart of an IoT device, the processor is the core component. Hence, as an open and free instruction set architecture, RISC-V is gaining huge popularity for IoT. A large ecosystem is available around RISC-V, including various RTL implementations at one end and high-speed instruction set simulators (ISSs) at the other end. These ISSs facilitate functional verification of RTL implementations as well as early SW development to some extent. However, being designed predominantly for speed, they can hardly be extended to support further system-level use cases such as design space exploration, power/timing/performance validation or analysis of complex HW/SW interactions. In this paper, we propose and implement a RISC-V based Virtual Prototype (VP) with the goal of filling this gap. We provide a 32 and 64 bit RISC-V core supporting the IMAC instruction set with different privilege levels, the RISC-V CLINT and PLIC interrupt controllers and an essential set of peripherals. We support simulation of (mixed 32 and 64 bit) multi-core platforms, provide SW debug and coverage measurement capabilities and support the FreeRTOS and Zephyr operating systems. The VP is designed as an extensible and configurable platform with a generic bus system and implemented in standard-compliant SystemC and TLM-2.0. The latter point is very important, since it allows leveraging the cutting-edge SystemC-based modeling techniques needed for the mentioned use cases. Our VP allows a significantly faster simulation compared to RTL, while being more accurate than existing ISSs. Finally, our RISC-V VP is fully open source (MIT licence) to help expand the RISC-V ecosystem and stimulate further research and development.