Performance characterisation of the 64-core
SG2042 RISC-V CPU for HPC
Nick Brown [0000-0003-2925-7275] and Maurice Jamieson [0000-0003-1626-4871]
EPCC at the University of Edinburgh, 47 Potterrow, Edinburgh, UK
Abstract. Whilst RISC-V has grown phenomenally quickly in embed-
ded computing, it is yet to gain significant traction in High Performance
Computing (HPC). However, as we move further into the exascale era,
the flexibility offered by RISC-V has the potential to be very beneficial in
future supercomputers especially as the community places an increased
emphasis on decarbonising its workloads. Sophon’s SG2042 is the first
mass produced, commodity available, high-core count RISC-V CPU de-
signed for high performance workloads. First released in summer 2023,
and at the time of writing now becoming widely available, a key question
is whether this is a realistic proposition for HPC applications.
In this paper we use NASA’s NAS Parallel Benchmark (NPB) suite to
characterise performance of the SG2042 against other CPUs implement-
ing the RISC-V, x86-64, and AArch64 ISAs. We find that the SG2042
consistently outperforms all other RISC-V solutions, delivering between
a 2.6 and 16.7 times performance improvement at the single core level. When
compared against the x86-64 and AArch64 CPUs, which are common-
place for high performance workloads, we find that the SG2042 performs
comparatively well with computationally bound algorithms but decreases
in relative performance when the algorithms are memory bandwidth or
latency bound. Based on this work, we identify that performance of the
SG2042’s memory subsystem is the greatest bottleneck.
Keywords: RISC-V · Sophon SG2042 · NAS Parallel Benchmark suite
(NPB) · High Performance Computing (HPC)
1 Introduction
RISC-V is an open Instruction Set Architecture (ISA) that, since it was first
released over a decade ago, has gained significant traction. At the time of writ-
ing it was recently announced that over 13 billion RISC-V CPU cores have been
manufactured, but many of these are in embedded computing such as automo-
tive, space, and micro-controllers. RISC-V has yet to become commonplace in
High Performance Computing (HPC), but as the HPC community moves further
into the exascale era and there is an increased emphasis on decarbonisation of
workloads, we need to consider how to best deliver both increased performance
and greater energy efficiency. To this end, there is renewed interest in new
hardware solutions, and technologies built atop RISC-V have strong potential
here as they can offer specialisation whilst still providing a common software
ecosystem.
Sophon’s SG2042 is the first high core count commodity available RISC-V
CPU designed for high performance workloads. First released in summer 2023,
this mass produced, 64-core RISC-V CPU is aimed at high performance work-
loads. Not only does this processor provide significantly more cores than existing,
SoC based, mass produced RISC-V CPUs, but furthermore the T-Head XuanTie
C920 cores themselves have been designed for high performance. Consequently
this new RISC-V CPU is very interesting to the HPC community and previous
work [2] found that, for the RAJAPerf suite [3], it delivers a considerable per-
formance uplift compared to existing commodity available RISC-V CPUs, but
struggles to match a set of x86-based CPUs that are commonplace in HPC ma-
chines. In this paper we leverage NASA’s NAS Parallel Benchmark (NPB) suite
to undertake more in depth performance characterisation of the SG2042. Run-
ning this suite across CPUs that implement the RISC-V, x86-64 and AArch64
ISAs, where in the latter two categories we have selected CPUs that are
used in production supercomputers, we are able to better understand the types
of workloads that the SG2042 suits and where it might fall short.
2 Background
2.1 The Sophon SG2042
The Sophon SG2042 CPU is a 64-core processor running at 2GHz and organised
in clusters of four XuanTie C920 cores. Each 64-bit core, designed by T-Head for
high performance workloads, adopts a 12-stage, out-of-order, multiple-issue
superscalar pipeline [7]. Implementing the RV64GCV instruc-
tion set, the C920 has three decode, four rename/dispatch, eight issue/execute
and two load/store execution units. Version 0.7.1 of the vectorisation standard
extension (RVV v0.7.1) is supported [11], with a vector width of 128 bits. Each
C920 core contains 64KB of L1 instruction (I) and data (D) cache, 1MB of L2
cache which is shared between the cluster of four cores, and 64MB of L3 system
cache which is shared by all cores in the package. The SG2042 also provides four
DDR4-3200 memory controllers, and 32 lanes of PCI-E Gen4. The CPU we use
for the benchmarking in this paper is contained in a Pioneer Box by Milk-V
which has 128GB of DDR4 RAM.
The SG2042’s C920 core only provides RVV v0.7.1 which is not supported by
mainline GCC or LLVM. To this end, T-Head have provided their own fork of the
GNU compiler (XuanTie GCC) which has been optimised for their processors
and supports RVV v0.7.1. It has been found [5] that GCC 8.4, which is part
of their 20210618 release, provides the best auto-vectorisation capability and so
this is the version we use for the benchmarking experiments undertaken in this
paper. Their version of the compiler generates Vector Length Specific (VLS)
RVV assembly which specifically targets the 128-bit vector width of the C920.
All codes are compiled at optimisation level three, and all reported results are
averaged over five runs. At the time of execution each benchmark run reported
in this paper was making exclusive use of the machine.
In [2] the authors benchmarked the SG2042 using the RAJAPerf suite, how-
ever this was across a large number of individual kernels and from the results
it was difficult to isolate and identify individual performance patterns. By con-
trast, in this paper we characterise and explore each individual benchmark of
the NPB suite to better classify the performance properties of the SG2042.
2.2 NAS Parallel Benchmarks (NPB) suite
The NAS Parallel Benchmark (NPB) suite [1] is a collection of benchmarks de-
veloped by NASA’s Advanced Supercomputing (NAS) division to characterise
HPC systems, especially for Computational Fluid Dynamics (CFD) applications.
The suite was first released in the mid 1990s and, in this paper, we leverage its original eight bench-
marks, which comprise five kernels and three pseudo applications.
The kernels capture key algorithmic patterns that are ubiquitous throughout
HPC codes and test key performance characteristics that are important across
many workloads. The pseudo applications combine multiple kernels to provide
more complicated workloads. All these benchmarks are configured using a vari-
ety of problem sizes known as classes. There are a variety of implementations of
the suite provided by NAS, including the OpenMP and MPI versions that we use
here, and throughout this paper we use the official code without any modifications.
Table 1: Summary of memory behaviour for NPB benchmarks on a Xeon Plat-
inum 8170
Benchmark | Clock ticks cache stall | Clock ticks DDR stall | Time DDR bandwidth bound
Integer Sort (IS) | 35% | 0% | 16%
Multi Grid (MG) | 34% | 20% | 88%
Embarrassingly Parallel (EP) | 11% | 0% | 0%
Conjugate Gradient (CG) | 19% | 18% | 0%
Fast Fourier Transform (FT) | 13% | 9% | 18%
Block Tridiagonal (BT) | 8% | 9% | 0%
LU Gauss Seidel (LU) | 12% | 11% | 0%
Scalar Pentadiagonal (SP) | 20% | 21% | 0%
Table 1 summarises, for each benchmark in the suite, the memory behaviour
when run using OpenMP on all 26 physical cores of a Xeon Platinum 8170. The
Clock ticks cache stall and Clock ticks DDR stall columns report how often the
CPU was stalled on cache and main memory accesses respectively, and the Time
DDR bandwidth bound column reports the percentage of execution time that
there was a high DDR bandwidth utilisation.
The IS kernel tests indirect, random memory accesses, which can be seen to
stall the CPU for a significant fraction of clock ticks on cache accesses. It can also be seen that
the MG kernel is heavily memory bound both in terms of time stalled on cache
and main memory accesses, and also the percentage of execution time where
DDR is under high utilisation. By contrast, the EP benchmark is designed to
test compute performance and there are far fewer cycles stalled on memory ac-
cess, and no time spent with high DDR bandwidth utilisation. CG comprises
irregular memory access and nearest neighbour communication, which results in
around 37% of clock ticks stalled on cache or DDR accesses, and the FT bench-
mark requires all-to-all communications between ranks to undertake a parallel
transposition of data. For FT it can be seen that whilst only 22% of
clock ticks are stalled, which is lower than the other kernels apart from EP, the
kernel is utilising a high DDR bandwidth for 18% of the time.
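
To make these two behaviours concrete, the sketch below is purely illustrative (it is not the NPB source) and contrasts an IS-style indirect gather, whose unpredictable accesses expose cache and memory latency, with an MG-style streaming stencil whose regular sweeps instead stress sustained DDR bandwidth.

/* IS-style pattern: the data-dependent, effectively random indices mean
   most loads miss in cache, so performance is dominated by memory latency. */
void indirect_gather(double *out, const double *in, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[idx[i]];   /* idx[i] is unpredictable */
}

/* MG-style pattern: unit-stride sweeps over large arrays are easy to
   prefetch, so once the arrays exceed the last-level cache the limit
   becomes sustained DDR bandwidth rather than latency. */
void streaming_stencil(double *out, const double *in, int n) {
    for (int i = 1; i < n - 1; i++)
        out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
}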
The BT, LU and SP pseudo application benchmarks are more complicated
than the five NPB kernels, and represent common, real-world, HPC use-cases.
All three of these pseudo applications compute a finite difference solution to the
3D compressible Navier Stokes equations, where the LU benchmark solves this
via a block-lower, block-upper triangular approximation based upon the Gauss-Seidel
iterative method [8]. The BT and SP benchmarks solve the same problem as LU,
but base their solution on a Beam-Warming approximation. In BT the resulting
equations are block-tridiagonal whereas in SP they are fully diagonalised [8]. Both
these systems are solved using Gaussian elimination. It can be seen from Table
1 that, out of these three pseudo applications, BT stalls the least on memory
accesses and SP the most.
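
As a point of reference for the line solves these pseudo applications perform, the following is a scalar Thomas-algorithm tridiagonal solve; it is a simplified illustration rather than the NPB code, since BT and SP solve block-tridiagonal and scalar pentadiagonal systems along each grid line, but the forward-elimination and back-substitution structure is representative.

/* Scalar Thomas algorithm: solves a tridiagonal system with lower diagonal a,
   main diagonal b, upper diagonal c and right-hand side d (overwritten with
   the solution). A simplified stand-in for the per-line solves in BT/SP. */
void thomas_solve(int n, const double *a, double *b, const double *c, double *d) {
    /* Forward elimination */
    for (int i = 1; i < n; i++) {
        double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    /* Back substitution */
    d[n - 1] /= b[n - 1];
    for (int i = n - 2; i >= 0; i--)
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
}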
3 RISC-V core comparison
In this section we compare performance of existing commodity RISC-V solu-
tions. Due to the difference in core counts between RISC-V CPUs, we focus
here on single core performance to understand how the XuanTie C920 core of
the Sophon SG2042 performs against other widely available RISC-V cores. We
compare against the U74 core [9] which is contained in the JH7110 and JH7100
SoCs of the VisionFive V2 and V1 respectively, and both of these boards contain
8GB of DRAM. We also compare against the SiFive Freedom U740 SoC, also
containing the U74 core and 16GB of DDR, and the T-Head XuanTie C906 [7]
in the AllWinner D1 SoC with 1GB of memory.
Table 2 reports a single core performance comparison between these RISC-V
technologies, for the five NPB kernels at class B, with performance reported in
million operations per second (Mop/s), where a higher number is better. The
percentage in parentheses is the performance that a single core of each CPU delivers
compared to a single C920 core of the SG2042. It can be seen that, ir-
respective of the kernel, the C920 significantly outperforms all other RISC-V
technologies. Consistently, the U74 of the VisionFive V2 performs closest to the
C920, but is still only delivering between 21% and 38% the performance of the
C920. Whilst the VisionFive V1 and SiFive U740 both contain the same U74
core as the VisionFive V2, they are significantly slower and this is broadly in
agreement with [2].
Table 2: Single core comparison between RISC-V technologies with performance
reported in Mop/s (higher is better) using the NPB kernels running at class B. In
parentheses is the percentage performance delivered compared to the C920 core of the
SG2042.
Benchmark | SG2042 | VisionFive V2 | VisionFive V1 | SiFive U740 | All Winner D1
IS | 60.6 | 17.84 (29%) | 6.36 (10%) | 9.09 (15%) | 5.41 (9%)
MG | 1210.05 | 288.65 (24%) | 72.31 (6%) | 90.28 (7%) | 163.19 (13%)
EP | 31.35 | 12.01 (38%) | 7.55 (24%) | 9.08 (29%) | 9.23 (29%)
CG | 205.25 | 43.61 (21%) | 21.96 (11%) | 20.09 (10%) | 12.99 (6%)
FT | 857.64 | 245.99 (29%) | 88.35 (10%) | 116.59 (14%) | DNR
The C906 of the All Winner D1 is outperformed by the C920 and the U74
of the V2 quite considerably. However, this is the cheapest of the SoCs con-
sidered here, and the C906 outperforms the V1 and U740 for the EP and MG
benchmarks. Given the performance profile of the benchmarks reported in Table
1, this suggests that the raw compute power of the C906 is similar to that of
the U74 and the memory bandwidth is greater on the All Winner D1 than the
VisionFive V1 and SiFive U740. However, for those benchmarks with more com-
plex, irregular, memory patterns such as IS and CG, the C906 seems to struggle
compared to the other RISC-V cores. Incidentally, it was not possible to run
the FT benchmark on the All Winner D1 due to the limited 1GB of memory
becoming exhausted.
In this section we therefore conclude that the C920 core of the SG2042 sig-
nificantly outperforms all other commodity available RISC-V CPU cores. Whilst
this is in agreement with [2], in this section we have compared against a wider
range of RISC-V CPUs than [2] and for specific algorithmic patterns that are
very commonly found in HPC codes, especially for CFD. Considering that they
use the same U74 core, it is surprising that the VisionFive V2 outperforms the
V1 and U740 by such a large margin, but again this is in agreement with [2] and
[6], and one of the reasons for this is that the V2 is running at 1.5GHz compared
to 1.2GHz for both the V1 and U740.
4 Comparing the SG2042 against other architectures
In Section 3 we compared the performance of the SG2042’s C920 core against
other RISC-V commodity available CPU cores. Whilst it is interesting to explore
performance against RISC-V CPUs, and indeed the C920 delivers impressive
performance compared to other RISC-V hardware, to understand whether the
SG2042 is a contender for HPC it is far more instructive to benchmark against
CPUs of other architectures that are commonly used for HPC workloads.
Table 3: Summary of CPUs that are benchmarked in this section
CPU | ISA | Part | Base clock | Number of cores | Vector
AMD EPYC | x86-64 | EPYC 7742 | 2.25GHz | 64 | AVX2
Intel Skylake | x86-64 | Xeon Platinum 8170 | 2.1GHz | 26 | AVX512
Marvell ThunderX2 | ARMv8.1 | CN9980 | 2GHz | 32 | NEON
Sophon SG2042 | RV64GCV | SG2042 | 2GHz | 64 | RVV v0.7.1
In this section we compare against CPUs of other architectures that are com-
monplace in HPC, and these are summarised in Table 3. The AMD EPYC is from the
Rome series of AMD CPUs, containing the Zen 2 microarchitecture, and we run
this on ARCHER2, an HPE Cray EX and the UK national supercomputer. Similarly
to the SG2042, the AMD EPYC contains 64 physical cores across four NUMA
regions, each with 16 cores, but has eight instead of four memory controllers
and memory channels. Each core in the AMD EPYC contains 32KB of I and D
L1 cache, 512 KB of L2 cache, and there is 16MB of L3 cache shared between
four cores. Providing AVX2, the EPYC 7742 has 256-bit wide vector registers,
which is double that of the SG2042, but is capable of processing two AVX-256
instructions per cycle. Each ARCHER2 node contains 256GB of DDR4 memory. We use
GCC version 11.2 when compiling on ARCHER2. Simultaneous Multithreading
(SMT) is disabled for our runs, which is the default configuration on ARCHER2.
We also compare against an Intel Skylake Xeon Platinum 8170, which is the
same CPU used to profile the NPB benchmarks in Table 1. This Skylake-SP
CPU contains 26 cores, each with 32KB of I and D L1 cache, 1MB of L2 cache
and 1.375MB of L3 cache (the latter shared across all cores). The Skylake sup-
ports AVX512, double and quadruple the width of the EPYC 7742 and SG2042
respectively, and each Skylake core has two FPUs. The machine we run on has
192GB of DDR4 memory, and we use GCC version 8.4.
Lastly, we compare against the CN9980 Marvell ThunderX2 which contains
32 cores implementing the ARMv8.1 (AArch64) ISA via the Vulcan microar-
chitecture. Each core contains 32KB of I and D L1 cache, as well as 256KB of
L2. There is a total of 32MB L3 cache, 1MB per core, shared by the entire chip.
NEON is supported, which provides 128-bit wide vector registers and this is
interesting because it matches the vector width of the C920 core in the SG2042.
Similarly to the Skylake, the Marvell ThunderX2 has two FPUs per core. This
is the CPU used in Fulhame, an HPE Apollo 70 system with 128GB of DDR
per node; we use GCC version 9.2 and SMT is also disabled in our runs.
For the performance comparison undertaken in this section, we run class C
of the NAS Parallel Benchmarks over multiple cores of the CPUs
using the OpenMP implementations of the benchmarks [4]. Each thread is
mapped to an individual physical CPU core, all reported results are averaged
over five runs and all codes are built at optimisation level three.
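
Thread pinning of this kind can be requested through the standard OpenMP affinity controls (for example OMP_PLACES=cores with OMP_PROC_BIND=close) and verified at runtime; the small check below is illustrative rather than part of the benchmark suite.

#include <omp.h>
#include <stdio.h>

/* Prints the OpenMP place each thread is bound to, so that one physical
   core per thread (and no SMT sharing) can be confirmed before a run. */
int main(void) {
    #pragma omp parallel
    {
        #pragma omp critical
        printf("thread %d of %d bound to place %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_get_place_num(), omp_get_num_places());
    }
    return 0;
}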
4.1 Integer Sort (IS)
As described in Section 2.2, the Integer Sort (IS) benchmark is concerned with
integer comparison and indirect, random memory access performance. Figure
1 illustrates the performance results for this benchmark across our CPUs of
interest, reported in Mop/s (higher is better). It can be seen that the SG2042
performs considerably worse than all other CPUs, with performance plateauing
at 16 cores. By contrast, the ThunderX2 and AMD EPYC deliver similar
performance until the 32 cores of the ThunderX2 are exhausted. The Skylake
performs better than all the other CPUs, but is limited by its lower core count,
where the ThunderX2 catches up to the Skylake at 32 cores and the EPYC
outperforms it at 64 cores.
Fig. 1: IS benchmark performance (higher is better) parallelised via OpenMP
It can be seen from Figure 1 that the SG2042 struggles significantly with
this benchmark where, as was seen in Table 1, the irregular, random memory
accesses result in a comparatively large amount of time stalled due to cache
accesses whilst DDR bandwidth utilisation is high for only a small fraction of the runtime.
A hypothesis is that this could be due to the cache hierarchy, where the Skylake,
which performs the best, has the largest L2 cache, 1MB per core, compared to
256KB (per core, 1MB shared between four cores) for the SG2042, 256KB for
the ThunderX2 and 512KB for the AMD EPYC. The surprise here is in the
performance difference between the SG2042 and the ThunderX2, as per core
they both have the same amount of L2 and L3 cache.
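
A simple way to separate latency effects from bandwidth effects would be a pointer-chasing microbenchmark; the sketch below is illustrative only (it is not one of the measurements reported in this paper) and estimates the average access latency for a working set larger than the 64MB L3 of the SG2042.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M pointers, a 128MB working set */

/* Pointer chase: every load depends on the previous one, so the time per
   hop approximates the average memory access latency at this working-set
   size. Sattolo's shuffle builds a single random cycle so the hardware
   prefetchers cannot predict the next address. */
int main(void) {
    size_t *next = malloc(N * sizeof(size_t));
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (long hop = 0; hop < N; hop++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("average latency: %.1f ns per access (checksum %zu)\n", ns / N, p);
    free(next);
    return 0;
}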
4.2 Multi Grid (MG)
It was illustrated in Section 2.2 that the Multi Grid (MG) benchmark is heavily
memory bandwidth bound, and results of executing this benchmark kernel on
the CPUs of interest are illustrated in Figure 2. It can be seen that the AMD
EPYC provides by far the best performance, with the Skylake and ThunderX2
delivering similar performance and both plateauing at 16 cores where memory
bandwidth is likely saturated. By contrast, the SG2042 lags the other CPUs
considerably, also plateauing between 16 and 32 cores but then with a performance
increase at 64 cores.
Fig. 2: MG benchmark performance (higher is better) parallelised via OpenMP
The memory configuration of the CPUs partially helps to explain the relative
performance reported in Figure 2. The AMD EPYC has 8 memory controllers
and 8 memory channels, connected to DDR4-3200 memory. By contrast, the
Skylake and ThunderX2 both only have 2 memory controllers and are both
connected to DDR4-2666 albeit with the ThunderX2 having 8 memory channels
compared to 6 memory channels in the Skylake. The SG2042 has four memory
controllers and only four memory channels, connected to DDR4-3200. Whilst
there are fewer memory channels on the SG2042 than the other CPUs, it also
has double the memory controllers of the Skylake and ThunderX2 CPUs, and
also faster memory, yet it lags behind those CPUs in performance. Details
around the memory subsystem on the SG2042 are difficult to come by, but it is
our hypothesis that the memory controllers on the SG2042 are considerably less
advanced than the other CPUs considered in this section.
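
One way to probe this hypothesis independently of NPB would be a STREAM-style triad microbenchmark; the sketch below is illustrative only (it is not one of the measurements reported in this paper), but its sustained bandwidth would be expected to plateau at low thread counts on the SG2042 if the memory controllers are indeed the limiting factor.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 26)   /* three 512MB arrays, far larger than any L3 here */

/* STREAM-style triad a[i] = b[i] + s*c[i], counting two reads and one
   write per element when reporting the achieved bandwidth. */
int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));

    /* Parallel first-touch initialisation so pages are spread across NUMA
       regions in the same way the triad loop will access them. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];
    double t1 = omp_get_wtime();

    double gbytes = 3.0 * N * sizeof(double) / 1e9;
    printf("%d threads: %.1f GB/s\n", omp_get_max_threads(), gbytes / (t1 - t0));
    free(a); free(b); free(c);
    return 0;
}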
The behaviour of the MG benchmark also helps explain one of the anomalies
of the IS benchmark performance. It was our hypothesis that the L2 and L3
cache design was in part governing performance of the SG2042 compared to other
CPUs, however the ThunderX2 also has the same cache design but was faster
than the SG2042 for that benchmark. However, as seen for the MG benchmark,
the SG2042 is also severely memory bandwidth bound and this likely explains the
gap in performance between the SG2042 and ThunderX2 for the IS benchmark.
4.3 Embarrassingly Parallel (EP)
The Embarrassingly Parallel (EP) benchmark is compute bound, and results of
this on our CPUs are illustrated in Figure 3. It can be seen that across the CPUs
being tested, there are two groups; the SG2042 and ThunderX2 share very similar
performance but with the SG2042’s 64 cores then making a significant difference
compared to the 32 cores of the ThunderX2. The EPYC and Skylake both deliver
similar performance, which is greater than the SG2042 and ThunderX2, but
the 26 cores of the Skylake are a disadvantage against the SG2042 which then
significantly outperforms the Skylake at 64 cores. The AMD EPYC performs
best out of all the CPUs, especially at the larger core counts.
Fig. 3: EP benchmark performance (higher is better) parallelised via OpenMP
This performance behaviour is in stark contrast to the IS and MG bench-
marks, and demonstrates that the SG2042 and ThunderX2 deliver very similar
compute performance at the same number of cores. This makes some sense given
that they both provide 128-bit vectorisation, albeit with the ThunderX2 having
two FPUs per core compared to one on the SG2042. By contrast, the Skylake
and AMD EPYC CPUs provide wider vectorisation, 512-bit and 256-bit respec-
tively and this in part helps explain the performance difference between these
two groups. However, given the Skylake provides AVX512, and the AMD EPYC
only AVX2, and that the ThunderX2 has two FPUs per core and the SG2042
only one, clearly the GCC compiler is not able to fully vectorise the code of this
benchmark and make full use of the FPUs.
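
The loops in question are independent-iteration reductions of the kind sketched below (an illustration, not the EP source); with 128-bit RVV or NEON each vector operation covers two doubles, with AVX2 four and with AVX-512 eight, which is why vector width and the number of FPUs together set the ceiling for compute-bound kernels when the compiler vectorises successfully.

/* A compute-bound, trivially vectorisable reduction: iterations are
   independent, so the compiler can pack 2 (128-bit), 4 (256-bit) or
   8 (512-bit) doubles per vector operation depending on the ISA. */
double sum_of_squares(const double *x, long n) {
    double acc = 0.0;
    #pragma omp simd reduction(+:acc)
    for (long i = 0; i < n; i++)
        acc += x[i] * x[i];
    return acc;
}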
Given that, at the largest number of cores each technology provides, the
SG2042 performs second best for this benchmark out of all CPUs in our com-
parison, this demonstrates that for compute bound problems the large core count
of the SG2042 is beneficial.
4.4 Conjugate Gradient (CG)
As can be seen from Table 1, the Conjugate Gradient (CG) benchmark also
spends considerable time stalled on cache and DDR memory accesses, and this
is because it comprises irregular memory access and nearest neighbour com-
munications. Figure 4 illustrates the performance of this benchmark kernel across
our CPUs. Given the performance of the IS and MG benchmarks it is no sur-
prise that the SG2042 falls short of the other technologies, but it is closer to the
ThunderX2 than we had expected, delivering around 50% of the performance of the
ThunderX2 at 32 cores.
Fig. 4: CG benchmark performance (higher is better) parallelised via OpenMP
Potentially, what is making the difference here is in the size of the L3 cache,
where the AMD EPYC has 16MB L3 shared between four cores (4MB per core),
whereas the Skylake has 1.375MB L3 cache per core shared across all cores (po-
tentially helped by the larger 1MB L2 cache). By contrast, both the ThunderX2
and SG2042 have the same 256KB of L2 and 1MB of L3 cache per core.
This would help explain the performance differences, with the memory
bandwidth limitations of the SG2042 causing additional overhead which
reduces performance further.
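
The irregular access at the heart of CG is a sparse matrix-vector product; the CSR-format sketch below (illustrative, not the NPB source) shows the indirect reads of the vector x which defeat the cache when the sparsity pattern is random, compounding the memory subsystem limitations discussed above.

/* Sparse matrix-vector product in CSR format. The gather x[col_idx[j]]
   follows the (random) sparsity pattern, so cache behaviour and memory
   latency matter far more than peak floating point throughput. */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}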
4.5 Fast Fourier Transform (FT)
The Fast Fourier Transform (FT) benchmark requires all-to-all communication
between ranks, and the performance of this benchmark can be seen in Figure
5. As described in Table 1, for this benchmark on the Skylake there was some
stalling due to cache and DDR access (13% and 9% respectively) but also for
18% of the time DDR was under high utilisation. Once again, the SG2042 is
significantly slower than the other CPUs, with the ThunderX2 sitting around
half way between the performance of the x86 CPUs and the SG2042, and this is
likely for the same reasons explored for the CG benchmark.
Fig. 5: FT benchmark performance (higher is better) parallelised via OpenMP
4.6 Pseudo Applications
Table 4 reports performance for the BT, LU and SP benchmarks and this is
expressed as how many times faster each CPU is than the SG2042. Given the
findings that the SG2042 struggles to perform when there is increased pressure
on the memory subsystem, and based upon the stall numbers reported in Ta-
ble 1, it was our expectation that the SG2042 would perform best for the BT
benchmark and worst for the SP benchmark with the LU benchmark in be-
tween. This is broadly the case based upon the figures in Table 4, where each
number reports how many times faster each of the other CPUs is for
each pseudo application at a specific core count. It can be seen that the three other
CPUs significantly outperform the SG2042 for the three pseudo applications.
Table 4: For each pseudo application, the number of times faster a specific CPU
is than the SG2042 at the given number of cores
Number of cores | BT benchmark (EPYC / Skylake / ThunderX2) | LU benchmark (EPYC / Skylake / ThunderX2) | SP benchmark (EPYC / Skylake / ThunderX2)
16 | 3.23 / 3.28 / 2.43 | 3.65 / 4.15 / 2.86 | 5.01 / 3.91 / 3.65
26 | 3.57 / 2.97 / 2.69 | 3.20 / 3.16 / 2.62 | 6.25 / 3.48 / 3.57
32 | 3.68 / - / 2.64 | 3.40 / - / 2.94 | 5.26 / - / 3.22
64 | 4.19 / - / - | 2.95 / - / - | 4.22 / - / -
5 MPI vs OpenMP on the Sophon SG2042
In Section 4, the NPB benchmarks were all run using the official NAS OpenMP
implementation. This is sensible given that execution is occurring within a single
memory space, however a question is whether, for best performance, one should
write their parallel code using OpenMP or MPI within a node. These two mod-
els are very different: OpenMP follows a thread-based approach where, by
default, all threads share the same memory area and explicit marshalling and
protection of shared memory is required. In contrast, when using MPI, tasks run
as independent processes which share no data and instead communicate with
each other via explicit messages.
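
To make the contrast concrete, the sketch below (simplified, and not the NPB implementations themselves) expresses the same global dot product in both models: the OpenMP threads share the arrays and reduce into a single variable, whereas each MPI rank owns only its local slice and the result is assembled through an explicit collective.

#include <mpi.h>

/* OpenMP: threads share the arrays and the reduction variable. */
double dot_openmp(const double *a, const double *b, long n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* MPI: each rank holds only its local slice; partial sums are combined
   with an explicit all-reduce message exchange. */
double dot_mpi(const double *a_local, const double *b_local, long n_local) {
    double local = 0.0, global = 0.0;
    for (long i = 0; i < n_local; i++)
        local += a_local[i] * b_local[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}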
As there are both OpenMP and MPI implementations of the NPB benchmark
suite, we are able to undertake a direct comparison. This is illustrated in Figure 6
which reports the percentage performance delivered by the MPI implementation
compared to the OpenMP version, for each benchmark at different numbers of
cores. It can be seen that there are significant differences in performance between
OpenMP and MPI for some configurations.
It can be observed, in Figure 6, that over one or two cores the benchmarks do
not benefit from MPI compared to OpenMP. However, from four cores upwards
the CG and FT benchmarks run faster when using MPI than OpenMP. At 64
cores, all the MPI implementations run faster than their OpenMP counterparts.
When exploring MPI against OpenMP on other architectures, MPI was always
slower on the AMD EPYC and this was also true on the Skylake apart from the
CG benchmark where MPI was on average two times faster than the OpenMP
implementation. When profiling the CG MPI benchmark on the Skylake, it was
found that the clock ticks stalled due to cache accesses reduced from 19% in the OpenMP
implementation to 5.5% with the MPI version. The MPI implementation ex-
perienced no clock ticks stalled due to DDR accesses, down from 18% for the
OpenMP version.
Fig. 6: Percentage performance of the MPI-based NPB benchmarks compared to the
OpenMP benchmark implementations
We surmise that the way these benchmarks are implemented means that the
structure imposed by the MPI implementation tends to put less pressure on the
memory subsystem when undertaking communications. For codes with heavy
inter-core communication such as CG (nearest neighbour point to point) and
FT (all-to-all collective), this can be beneficial on the SG2042.
6 Conclusions
The Sophon SG2042 is an impressive RISC-V CPU and, using NASA’s NAS Par-
allel Benchmark (NPB) suite, we have demonstrated that for benchmarks that
closely represent ubiquitous HPC algorithms, especially in CFD, it significantly
outperforms existing RISC-V solutions. When compared against CPUs which
implement other ISAs and whose use is widespread for high performance work-
loads, the Sophon SG2042 is outperformed by the AMD EPYC by between 1.77
and 15.06 times, the Intel Skylake between 0.59 and 5.98 times, and the Marvell
ThunderX2 between 0.59 and 5.91 times. The SG2042 is most competitive for
computationally bound codes, and its high core count ultimately outperformed
the Skylake and ThunderX2 for the EP benchmark. However, the SG2042 CPU
struggled with algorithms that are memory bandwidth or latency bound.
From this work we conclude that the memory subsystem of the SG2042 is a
bottleneck, and Sophon recently announced the SG2044 which is reported to have
three times the DDR memory bandwidth [10], as well as implementing RVV v1.0.
This has the potential to provide a very significant performance improvement
over the SG2042 for many of the benchmarks explored in this paper, and the
improved memory performance will likely ameliorate the bottlenecks we have
observed with the high core count continuing to deliver good computational
performance. It is therefore our conclusion that the Sophon SG family of RISC-
V CPUs has strong potential in HPC, and whilst the SG2042
is an impressive first generation, based on announcements made by Sophon it
looks likely that the SG2044 will address the key performance challenges that we
have observed in this paper and make for a compelling RISC-V product family.
7 Acknowledgements
This work has been funded by the ExCALIBUR H&ES RISC-V testbed. This
work used the ARCHER2 UK National Supercomputing Service (https://www.archer2.ac.uk).
The Fulhame HPE Apollo 70 system is supplied to EPCC as part of the Cata-
lyst UK programme. For the purpose of open access, the author has applied
a Creative Commons Attribution (CC BY) licence to any Author Accepted
Manuscript version arising from this submission.
References
1. Benchmarks, N.P.: NAS Parallel Benchmarks. CG and IS (2006)
2. Brown, N., Jamieson, M., Lee, J., Wang, P.: Is RISC-V ready for HPC prime-time: Evaluating the 64-core Sophon SG2042 RISC-V CPU. In: Proceedings of the SC'23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. pp. 1566–1574 (2023)
3. Hornung, R.D., Hones, H.E.: RAJA Performance Suite. Tech. rep., Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States) (2017)
4. Jin, H.Q., Frumkin, M., Yan, J.: The OpenMP implementation of NAS Parallel Benchmarks and its performance (1999)
5. Lee, J.K., Jamieson, M., Brown, N.: Backporting RISC-V vector assembly. In: International Conference on High Performance Computing. pp. 433–443. Springer (2023)
6. Lee, J.K., Jamieson, M., Brown, N., Jesus, R.: Test-driving RISC-V vector hardware for HPC. In: International Conference on High Performance Computing. pp. 419–432. Springer (2023)
7. Open XuanTie C906 (2023), https://xrvm.com/cpu-details?id=4056751997003636736
8. Saphir, W., Van der Wijngaart, R.F., Woo, A., Yarrow, M.: New implementations and results for the NAS Parallel Benchmarks 2. In: PPSC. Citeseer (1997)
9. SiFive U74-MC Core Complex Manual (2021), https://starfivetech.com/uploads/u74mc_core_complex_manual_21G1.pdf
10. Sophgo RISC-V roadmap (2024), https://github.com/RISCVtestbed/riscvtestbed.github.io/blob/main/assets/files/hpcasia24/hpc_asia_wang.pdf
11. C920: Specifications (2023), https://xuantie.t-head.cn/product/xuantie/4082464366237126656