Is RISC-V ready for HPC prime-time: Evaluating the 64-core
Sophon SG2042 RISC-V CPU
Nick Brown
n.brown@ed.ac.uk
EPCC at the University of Edinburgh
Edinburgh, UK
Maurice Jamieson
EPCC at the University of Edinburgh
Edinburgh, UK
Joseph Lee
EPCC at the University of Edinburgh
Edinburgh, UK
ABSTRACT
The Sophon SG2042 is the world’s first commodity 64-core RISC-V CPU for high performance workloads, and an important question is whether the SG2042 has the potential to encourage the HPC community to embrace RISC-V.
In this paper we undertake a performance exploration of the SG2042 against existing RISC-V hardware and the high performance x86 CPUs in use by modern supercomputers. Leveraging the RAJAPerf benchmarking suite, we discover that, on average, the SG2042 delivers, per core, between five and ten times the performance of the nearest widely available RISC-V hardware. We found that, on average, the x86 high performance CPUs under test outperform the SG2042 by between four and eight times for multi-threaded workloads, although some individual kernels do perform faster on the SG2042. The result of this work is a performance study that not only contrasts this new RISC-V CPU against existing technologies, but furthermore shares performance best practice.
KEYWORDS
RISC-V, HPC benchmarking, Sophon SG2042, XuanTie C920, RAJAPerf
1 INTRODUCTION
In recent years RISC-V has become a well-established open ISA
standard with over 10 billion RISC-V CPU cores having been man-
ufactured at the time of writing. Whilst RISC-V has gained trac-
tion in embedded computing fields such as automotive, space, and
micro-controllers, it can also potentially offer benefits to a variety
of other areas of computing including High Performance Comput-
ing (HPC).
As the HPC community moves further into the exascale era,
an important question is what technologies will power our future
supercomputers. To this end there have been numerous RISC-V
software activities focused on HPC; however, a major challenge has
been the availability of physical hardware which is capable of sup-
porting such workloads. There are several RISC-V HPC hardware
efforts in development, but apart from a few prototypes the HPC
community has only been able to obtain widespread access to em-
bedded Single Board Computers (SBC), often with more of a focus
on energy efficiency and small core counts than high performance.
Consequently, whilst some preparatory work is being undertaken
around RISC-V for HPC [12] [5] [14], it is difficult to encourage the
HPC community to fully embrace the potential of RISC-V without
serious hardware propositions.
However, RISC-V is moving very rapidly and in the summer of
2023 the 64-core Sophon SG2042 CPU was released. This is the
world’s first publicly available, mass produced, 64-core RISC-V CPU
aimed at high performance workloads. Not only does this proces-
sor provide a step change in terms of the number of cores available,
but furthermore these cores are T-Head XuanTie C920s which are
marketed as being designed for high performance. Consequently
this new RISC-V CPU is very interesting to the HPC community
and has the potential to bring RISC-V to high performance work-
loads. In this paper we benchmark the SG2042 using the popular
RAJAPerf HPC performance suite, comparing this CPU to exist-
ing RISC-V processors to understand whether the promised step
change in capability is delivered compared to other RISC-V com-
modity options. We then explore how configuration and tool choices
impact the performance of workloads on the SG2042 before com-
paring this 64-core CPU against high performance x86 CPUs that
are commonly used in supercomputers. Our ultimate aim is to an-
swer the central question of whether the SG2042 means that RISC-
V can now be considered a more serious proposition for HPC work-
loads.
This paper is structured as follows: the background to this work
is described in Section 2 where we present technical details of the
Sophon SG2042 CPU and host machine used in this work as well as
the RAJAPerf benchmark suite. Section 3 then reports the results of
our benchmarking activities, first describing comparisons against
existing RISC-V CPUs in Section 3.1, then exploring the impact of
configurations and tools on performance in Section 3.2, before Sec-
tion 3.3 compares against high performance x86 CPUs common in
HPC machines. Lastly, Section 4 draws conclusions and discusses
further work.
It should be stressed that this is an independent benchmarking
study of the Sophon SG2042 and the authors of this work have
no links to the manufacturers of the CPU or cores. To the best of
our knowledge this is the world’s first independent benchmarking
study of a high performance 64-core RISC-V CPU.
2 BACKGROUND
In early 2023, RISC-V International identified HPC as a strategic
priority area for growth and from this, combined with activities
such as the recently ratified vector extension and numerous HPC
software efforts which are porting key HPC libraries and tools, it
is evident that momentum in this space is growing rapidly. Activ-
ities around the world, such as the European eProcessor project
[2], the thousand core Esperanto CPUs [3], and the multi-vendor
RISE project [2] which aims to develop critical software compo-
nent support for RISC-V, have the potential to popularise RISC-V
in high-end computing, including HPC, and ultimately enable the
community to build supercomputers around this technology. Fur-
thermore, early application studies report favourably on the bene-
fits that RISC-V can deliver to high performance workloads.
However, whilst there are numerous companies working on pro-
totype high performance RISC-V hardware, to date choice has been
extremely limited when looking to run workloads on commodity
available RISC-V hardware. There has been an unfortunate choice
between either using soft-cores (software descriptions of a CPU that run on an FPGA, whose clock frequency tends to be far lower than that of a physical CPU) or physical cores designed more for energy efficiency and embedded workloads. Irrespective, whilst
these solutions enable experimentation with RISC-V, they do not
provide the capabilities required for production high performance
workloads on the architecture. Consequently, whilst there is inter-
est in the HPC community around RISC-V, it is yet to embrace the
technology.
2.1 The Sophon SG2042 CPU
The Sophon SG2042 CPU is a 64-core processor running at 2GHz
and organised in clusters of four XuanTie C920 cores. Each 64-bit
core, designed by T-Head for high performance, adopts a 12-stage out-of-order multiple-issue superscalar pipeline
design. Providing the RV64GCV instruction set, the C920 has three
decode, four rename/dispatch, eight issue/execute, and two load/store execution units. Version 0.7.1 of the vectorisation standard
extension (RVV v0.7.1) is supported, with a vector width of 128
bits supporting data types FP16, FP32, INT8, INT16, INT32, and
INT64. However, it should be highlighted that there are conflict-
ing specifications, for instance the T-Head datasheet [17] lists the
C920 as supporting FP64 vectorisation, whereas other material [16]
does not. This is an important consideration because double preci-
sion floating point underlies the vast majority of high performance
workloads, therefore a core that is able to vectorise these opera-
tions will likely provide much greater performance for the HPC
community.
Each C920 core contains 64KB of L1 instruction (I) and data (D)
cache, 1MB of L2 cache which is shared between the cluster of four
cores, and 64MB of L3 system cache which is shared by all cores in
the package. The SG2042 also provides four DDR4-3200 memory
controllers, and 32 lanes of PCI-E Gen4. The CPU we use for the
benchmarking in this paper is mounted in a Pioneer Box by Milk-V
which contains 32GB of RAM and a 1TB SSD.
An important consideration for HPC workloads is that of vec-
torisation and due to the C920 cores only supporting RVV v0.7.1,
compiler support is a challenge. The current upstream version of
the RISC-V GNU compiler does not provide support for any ver-
sion of the vector extension. Whilst the GNU repository contains
an rvv-next branch [15] whose purpose is to support RVV v1.0, at
the time of writing this is not actively maintained. Furthermore
there was an rvv-0.7.1 branch which targeted RVV v0.7.1 but this
has been deleted. Due to this lack of support in mainline GCC, T-
Head, the chip division of Alibaba, have provided their own fork of
the GNU compiler (XuanTie GCC) which has been optimised for
their processors.
T-Head’s compiler supports both RVV v0.7.1 and their own custom extensions. Whilst several versions of this compiler have been provided, it has been found [10] that GCC 8.4, which is part of their 20210618 release, provides the best auto-vectorisation
capability, and so this is the version we have selected for the bench-
marking experiments undertaken in this paper. Their version of
the compiler generates Vector Length Specific (VLS) RVV assembly
which specifically targets the 128-bit vector width of the C920. All
kernels are compiled at optimisation level three, and all reported
results are averaged over five runs.
2.2 RAJAPerf
The RAJA performance suite [6] is designed to explore the perfor-
mance of loop-based computational kernels, which are common
in HPC applications. Whilst it was initially developed as a tool
to benchmark the performance of the RAJA parallel programming
framework [1], the suite has been extended and developed to sup-
port a variety of different targets, such as OpenMP. This popu-
lar benchmarking suite has been used extensively for testing HPC
hardware and tools, and consequently RAJAPerf has been ported
into numerous languages [8] and used for a variety of studies [7]
[9].
The benchmark comprises 64 kernels, which are categorised
into six classes:
Algorithm: Contains six kernels which undertake basic algorithmic activities such as memory copies, the sorting of data, and reductions.
Apps: Comprises thirteen kernels which represent common components of HPC applications, such as an FIR filter, data packing and unpacking for halo exchanges, 3D diffusion and convection by partial assembly, and solving Laplace’s equation for diffusion in 1D.
Basic: Represents foundational mathematical functions via sixteen kernels. These include DAXPY, matrix multiplication, integer reduction, and calculation of PI by reduction.
Lcals: The Livermore Compiler Analysis Loop Suite, which is a collection of eleven loop-based kernels including tridiagonal elimination, calculation of differences, and calculations of minimums and maximums.
Polybench: Thirteen polyhedral kernels, which include two and three matrix multiplications, matrix transposition and vector multiplication, a 2D Jacobi stencil computation, and an alternating direction implicit solver.
Stream: Five kernels that focus on memory bandwidth and the corresponding computation; these are based upon simple vectorisable functions.
Given the differences between these classes, it is interesting to
explore how different features of the hardware or software un-
der test perform with specific aspects. In a previous study [10],
RAJAPerf was used to compare the performance of the four-core StarFive VisionFive V2 against the single-core Allwinner D1. Whilst these processors are designed for embedded workloads, the Allwinner D1 contains the XuanTie C906 core which, although it is
designed for energy efficiency rather than performance [13], pro-
vides support for the RVV v0.7.1 extension. In that study it was
found that whilst the U74 core in the VisionFive V2 tended to out-
perform the C906 for scalar workloads, when enabling vectorisa-
tion the C906 then most often outperformed the U74. However,
these machines provide small core count CPUs containing cores
which are not designed for high performance workloads. Conse-
quently, whilst [10] was an interesting study, the hardware under
test was never a realistic option for production HPC workloads.
3 BENCHMARKING
3.1 Comparison against RISC-V processors
In this section we compare performance of the Sophon SG2042
against a StarFive VisionFive V1 and StarFive VisionFive V2. The
V1 contains the JH7100 SoC, whereas the V2 contains the JH7110.
Both these SoCs are built around the 64-bit RISC-V SiFive U74
core, with the JH7100 containing two and the JH7110 containing
four cores. The SoCs are listed as running at 1.5GHz and the U74
cores contain 32KB D and 32KB I L1 cache, with both SoC models
also containing 2MB of L2 cache shared between the cores. How-
ever, only RV64GC is provided by t he SiFive U74, and consequently
there is no support for the RISC-V vector extension.
Figure 1 reports a performance comparison of the VisionFive
V2 and V1 against the SG2042 for double (FP64) and single (FP32)
precision. The numbers reported are relative to the performance of
the V2 running the RAJAPerf benchmark suite at double precision
as a baseline. Zero on the graph indicates the same performance,
positive numbers are the number of times that the configuration
is faster than the baseline, and negative numbers are the number
of times slower. The bars in the graph report an average across
the specific RAJAPerf class, as described in Section 2.2, and the
whiskers report the max-min range, with the top of the whisker
being the maximum speedup compared to the baseline and the
bottom of the whisker the minimum speedup (or maximum slow
down).
It can be seen from Figure 1 that a single C920 core of the SG2042
outperforms the U74 core of the V2 and V1 at both double and sin-
gle precision. At double precision the C920 core delivers on aver-
age between 4.3 and 6.5 times the performance achieved when run-
ning at double precision on the U74 in the V2. Furthermore, at sin-
gle precision the C920 achieves between 5.6 and 11.8 times the per-
formance on average across the benchmarks. This is an impressive
performance gain, and there were no kernels that ran slower on the
C920 core than the U74. The performance of some kernels was very
impressive on the C920, for instance the memory set benchmark
from the algorithm group ran 40 times faster in FP32 and 18 times
faster in FP64 than on the U74.
It should be highlighted that we are running the benchmarks
on these cores at their best possible configuration. Effectively this
means that we leverage vectorisation on the SG2042’s C920, whereas
this is not supported by the U74 and hence unavailable on the V1
or V2. As described in Section 2.1, the documentation is conflict-
ing around whether the C920 provides FP64 vectorisation; however,
the results in Figure 1 demonstrate a noteworthy performance dif-
ference between FP32 and FP64 on the SG2042, suggesting that
in fact FP64 is not supported by C920 vector operations. By com-
parison, the performance difference between running double and
single precision on the V2 is far less.
Figure 1: Single core comparison baselined against StarFive VisionFive V2 running in double precision (FP64), against V1 and SG2042

An aspect of the results in Figure 1 that surprised us is how much slower the VisionFive V1 is than the V2. Considering that we are running RAJAPerf over a single core only, so the dual versus quad core nature of these machines does not matter, and that they both contain the same U74 core, then one would assume that perfor-
mance should be fairly comparable. However, at double precision
the V1 is between three and six times slower than the V2, and at
single precision between one and three times slower. Whilst we
hypothesised that the V1 might be running at a slower clock fre-
quency than the V2, even though they are both listed as running
at 1.5GHz in the data sheets, there is no documentation or output
on the machine that confirms this. As a performance comparison
between the V1 and V2 is not the objective of this paper, we leave
the endeavour of explaining this phenomenon to future work.
It can be seen from Figure 1 that the performance obtained by
a single C920 core of the SG2042 is impressive compared to exist-
ing, publicly available, commodity RISC-V hardware. This core is
described by T-Head as a high performance RISC-V processor, and
the benchmarking results reported in this section demonstrate that
it delivers a large improvement in performance across the entire
benchmarking suite against the U74 which would previously have
been considered the best choice of widely available RISC-V CPU
to experiment with HPC workloads upon.
3.2 SG2042 performance exploration
In Section 3.1 we compared the SG2042’s XuanTie C920 core against
other RISC-V commodity hardware that, until now, might be seen
as the best choice to experiment upon with high performance work-
loads. However, in addition to single core performance the SG2042
is also significantly ahead of the V1’s JH7100 and V2’s JH7110 SoCs
in terms of the number of cores. Consequently, it is interesting to
explore this facet in more detail to understand the performance
properties at different configurations, and in this section we also
explore the vectorisation of the C920 core.
Table 1 reports the speed up and parallel efficiency obtained across the classes of the RAJAPerf benchmark suite when scaling the number of threads. Speed up is the execution time on one thread divided by the execution time on n threads, and parallel efficiency is the speed up divided by the number of threads; this ranges from 0 to 1, where 1 is optimal. Throughout our runs we set the OMP_PROC_BIND environment variable to true to ensure that threads cannot migrate during execution. These multi-threaded runs are undertaken in single precision, FP32.

| Threads | Algorithm Speedup | Algorithm PE | Apps Speedup | Apps PE | Basic Speedup | Basic PE | Lcals Speedup | Lcals PE | Polybench Speedup | Polybench PE | Stream Speedup | Stream PE |
| 2 | 1.19 | 0.60 | 0.66 | 0.33 | 1.02 | 0.51 | 1.61 | 0.81 | 1.86 | 0.93 | 1.00 | 0.50 |
| 4 | 1.12 | 0.28 | 1.14 | 0.29 | 1.81 | 0.45 | 1.82 | 0.45 | 3.46 | 0.86 | 0.97 | 0.24 |
| 8 | 2.02 | 0.25 | 2.27 | 0.28 | 3.55 | 0.44 | 3.27 | 0.41 | 7.72 | 0.96 | 1.88 | 0.24 |
| 16 | 4.64 | 0.29 | 4.31 | 0.27 | 6.92 | 0.43 | 6.86 | 0.43 | 15.39 | 0.96 | 4.31 | 0.27 |
| 32 | 1.11 | 0.03 | 1.86 | 0.06 | 0.22 | 0.007 | 4.38 | 0.14 | 14.09 | 0.44 | 0.82 | 0.03 |
| 64 | 0.97 | 0.02 | 4.10 | 0.06 | 12.33 | 0.19 | 14.89 | 0.23 | 40.42 | 0.63 | 1.77 | 0.03 |
Table 1: Speed up and parallel efficiency for benchmark classes as we scale the number of threads, using block allocation of threads to cores where threads map contiguously to CPU cores
In the experiment reported in Table 1 we assigned threads to
cores contiguously in a block allocation approach. For instance,
thread one is mapped to core one, thread two mapped to core two,
and thread three mapped to core three. It can be seen from these
results that although threading is generally beneficial at smaller
thread counts, the apps class ran slower with two threads com-
pared with one. However, as the thread count increases, the parallel efficiency decreases significantly for some benchmark classes,
and in some cases execution time is actually greater on 32 or 64
threads than on one.
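Because this scaling behaviour depends heavily on where threads actually land, it is worth verifying the binding at runtime. The following is a small sketch of our own (assuming Linux, glibc, and an OpenMP-capable compiler such as g++ with -fopenmp) that prints the core each thread is executing on:

```cpp
#include <sched.h>   // sched_getcpu() (glibc, Linux-specific)
#include <omp.h>
#include <cstdio>

// Sketch: report which CPU core each OpenMP thread is running on, so that a
// placement policy set via OMP_PROC_BIND (and, for example, OMP_PLACES) can
// be checked before launching the benchmarks.
int main() {
    #pragma omp parallel
    {
        const int tid  = omp_get_thread_num();
        const int core = sched_getcpu();
        #pragma omp critical
        std::printf("thread %2d -> core %2d\n", tid, core);
    }
    return 0;
}
```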
To explain some of the lacklustre scaling seen in Table 1 it was
our hypothesis that the thread placement we had adopted was caus-
ing many of the performance issues. This is because there is one
DDR memory controller per NUMA region, and so it was likely
that a large contributor to this poor performance was a bottleneck
on individual controllers because our threads were filling up each
NUMA region contiguously, thus some NUMA regions contained
a large number of active threads whereas others potentially none
for medium thread counts. Using the lscpu tool, we discovered that
the SG2042 contains four NUMA regions. Unusually for physical
CPUs (i.e. not SMT), the core ids are not contiguous in a NUMA re-
gion but instead eight consecutive cores reside in a NUMA region,
then there is a gap of eight and the following eight are also in the same
NUMA region. Consequently, cores 0-7 and 16-23 are in NUMA re-
gion 0, 8-15 and 24-31 are in NUMA region 1, 32-39 and 48-55 are
in NUMA region 2, and 40-47 and 56-63 are in NUMA region 3.
We therefore ran an experiment which involved allocating
threads cyclically across the NUMA regions. For example, four threads
are mapped to cores 0, 8, 32, and 40. Beyond this number of threads
the placement within a NUMA region is contiguous, for instance
eight threads are placed onto cores 0, 8, 32, 40, 1, 9, 33, and 41. Ta-
ble 2 reports the speed up and parallel efficiency results with this
cyclic placement and it can be seen that, compared to the block alloca-
tion that was used for experiments in Table 1, this placement policy
delivers significantly improved scaling in general. Furthermore, it
can be seen that at 64 threads the cyclic allocation policy outper-
forms the block policy, apart from with the stream class, which is
surprising as one would assume that because all the cores are al-
located then this cycling across the NUMA regions would cease to
be beneficial.
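Our reading of this cyclic policy is easy to express programmatically. The sketch below is an illustration of our own, hard-coding the NUMA layout reported by lscpu for the SG2042, that generates the resulting core ordering; the cluster-aware variant discussed next additionally cycles across the four-core clusters inside each region.

```cpp
#include <array>
#include <cstdio>
#include <vector>

// Illustration only: generate the thread-to-core order for the cyclic policy
// of Table 2, i.e. cycle across the SG2042's four NUMA regions and fill each
// region contiguously. Layout from lscpu: region 0 = cores 0-7 and 16-23,
// region 1 = 8-15 and 24-31, region 2 = 32-39 and 48-55, region 3 = 40-47 and 56-63.
int main() {
    std::array<std::vector<int>, 4> region;
    for (int core = 0; core < 64; ++core) {
        const int block = core / 8;                  // eight-core blocks
        const int r = 2 * (block / 4) + (block % 2); // block -> NUMA region
        region[r].push_back(core);
    }

    std::vector<int> order;                          // placement order
    for (int i = 0; i < 16; ++i)                     // sixteen cores per region
        for (int r = 0; r < 4; ++r)
            order.push_back(region[r][i]);

    for (int i = 0; i < 8; ++i)                      // prints: 0 8 32 40 1 9 33 41
        std::printf("%d ", order[i]);
    std::printf("...\n");
    return 0;
}
```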
As described in Section 2.1, the SG2042’s C920 cores are organ-
ised in clusters of four with each cluster containing 1MB of shared
L2 cache. Consequently, we ran another placement experiment to
explore whether, within a NUMA region, also working cyclically
across these clusters would provide improved performance as do-
ing so could make better use of the shared L2 cache for smaller to
medium thread counts. For example, using this placement then 8
threads would be mapped to cores 0, 8, 32, 40, 16, 24, 48, and 56.
Table 3 reports the speed up and parallel efficiency for this clus-
ter placement policy, and it can be seen that up to and including 32
threads such a policy delivers a noticeable improvement compared
to the previous cyclic policy of placing cyclically across NUMA re-
gions but contiguously within a region. At 64 threads, cyclic placement tends to be more beneficial; however, as all the CPU cores are active, the benefit of this cluster-aware placement will be limited because the L2 cache will be shared between four active cores regardless. Therefore, these experiments demonstrate that, for performance, it is important to consider the NUMA regions and four-core clusters when adopting a thread placement.
Figure 2: Maximum single core speedup for each benchmark
class when enabling vectorisation on C920 of SG2042
When considering the performance that the SG2042 will deliver,
another important consideration is whether to enable vectorisa-
tion or not. Figure 2 reports the difference in performance when running on a single core with vectorisation enabled for FP32 and FP64, compared to each of these configurations running scalar-only. The
bars depict the average across the benchmark class, with the top
| Threads | Algorithm Speedup | Algorithm PE | Apps Speedup | Apps PE | Basic Speedup | Basic PE | Lcals Speedup | Lcals PE | Polybench Speedup | Polybench PE | Stream Speedup | Stream PE |
| 2 | 1.52 | 0.76 | 0.70 | 0.35 | 1.06 | 0.53 | 1.81 | 0.91 | 2.11 | 1.06 | 1.93 | 0.96 |
| 4 | 3.21 | 0.80 | 1.37 | 0.34 | 2.09 | 0.52 | 3.61 | 0.90 | 4.11 | 1.03 | 4.19 | 1.05 |
| 8 | 4.72 | 0.59 | 2.64 | 0.33 | 3.96 | 0.49 | 6.08 | 0.76 | 8.15 | 1.02 | 4.46 | 0.56 |
| 16 | 4.55 | 0.28 | 4.32 | 0.27 | 6.97 | 0.44 | 7.12 | 0.45 | 15.07 | 0.94 | 4.19 | 0.26 |
| 32 | 6.10 | 0.19 | 6.32 | 0.20 | 13.11 | 0.41 | 14.84 | 0.46 | 30.05 | 0.94 | 13.91 | 0.43 |
| 64 | 2.09 | 0.03 | 4.31 | 0.07 | 17.29 | 0.27 | 26.53 | 0.41 | 57.93 | 0.91 | 1.62 | 0.03 |
Table 2: Speed up and parallel efficiency for benchmark classes as we scale the number of threads, using cyclic allocation of threads to cores where threads cycle round NUMA regions and are then allocated contiguously in a region
| Threads | Algorithm Speedup | Algorithm PE | Apps Speedup | Apps PE | Basic Speedup | Basic PE | Lcals Speedup | Lcals PE | Polybench Speedup | Polybench PE | Stream Speedup | Stream PE |
| 2 | 1.52 | 0.76 | 0.70 | 0.35 | 1.06 | 0.53 | 1.81 | 0.91 | 2.11 | 1.06 | 1.93 | 0.96 |
| 4 | 3.21 | 0.80 | 1.37 | 0.34 | 2.09 | 0.52 | 3.61 | 0.90 | 4.11 | 1.03 | 4.19 | 1.05 |
| 8 | 6.37 | 0.80 | 2.71 | 0.34 | 4.16 | 0.52 | 7.15 | 0.89 | 8.23 | 1.03 | 11.20 | 1.40 |
| 16 | 10.54 | 0.66 | 5.13 | 0.32 | 8.09 | 0.51 | 13.55 | 0.85 | 16.51 | 1.03 | 11.60 | 0.73 |
| 32 | 12.72 | 0.40 | 8.77 | 0.27 | 14.05 | 0.44 | 21.29 | 0.67 | 31.76 | 0.99 | 15.18 | 0.47 |
| 64 | 1.98 | 0.03 | 3.69 | 0.06 | 17.30 | 0.27 | 17.70 | 0.28 | 58.26 | 0.91 | 1.51 | 0.02 |
Table 3: Speed up and parallel efficiency for benchmark classes as we scale the number of threads, using cluster-aware cyclic allocation of threads to cores where threads cycle round NUMA regions and cycle round inside each NUMA region cyclically across the clusters of four cores
of the whiskers being the greatest speed up and the bottom the
lowest. Zero means that performance is the same, one means that
performance is one time faster (e.g. double), whereas negative in-
dicates it is slower (e.g. minus one indicates it is twice as slow).
It can be seen that enabling vectorisation for FP64 delivers very
marginal benefit and this suggests that vectorisation of FP64 is not
supported by the XuanTie C920 core. Some benefit of FP64 vectori-
sation with the basic class can be observed, but it is just one kernel
which operates on integers that is driving this average upwards.
From Figure 2 it can be observed that there is a greater benefit in
enabling vectorisation for single precision, FP32. This benefit does
vary considerably across the different kernels in each class, and
hence the average for each class is fairly low, but one can see from
the whiskers on the plot that there are some kernels that strongly
benefit from vectorisation such as those in the stream class. Whilst
some individual kernels ran slower with vectorisation enabled, for
FP32 these are in the minority and the benefits across the suite
outweigh them. There are more kernels that run slower with vec-
torisation for FP64, however as can be seen from the whiskers in
Figure 2 the overhead of even the worst performing kernels tends
to be small. Therefore from these results it is our recommendation
that vectorisation should be enabled where possible when compil-
ing for the SG2042.
Another important consideration is whether to use GCC or Clang
when compiling vectorised kernels for the SG2042. Indeed, Clang
is often able to automatically vectorise a wider variety of code [4]
for RISC-V than GCC, where [11] found that out of the 64 kernels
in the RAJAPerf benchmark suite only 30 were auto-vectorised by
GCC and out of those 30 the scalar code path was executed for 7 of
these at runtime. By comparison, Clang was able to auto-vectorise
59 kernels with only 3 of these following the scalar path at runtime.
This is one of the reasons why the average benefit across the bench-
mark classes when enabling vectorisation is fairly low in Figure 2,
because GCC is unable to vectorise many of these individual bench-
mark kernels. Indeed, the stream class is unique as GCC is able to
vectorise all of its constituent kernels, and this class demonstrated by
far the largest average improvement when enabling vectorisation.
It is not only because of this improved ability to auto-vectorise
that it would be useful to compile with Clang, but furthermore, in
addition to supporting Vector Length Specific (VLS) RVV assembly generation, Clang can also generate Vector Length Agnostic (VLA) assembly.
GCC by comparison only generates VLS RVV assembly, and there-
fore Clang provides more flexibility around how one might lever-
age the hardware and it is also interesting to explore the perfor-
mance differences that selecting one of VLA or VLS will make.
However, the Clang compiler only supports RVV v1.0, whereas
the SG2042’s C920 cores provide only RVV v0.7.1, which is incom-
patible. Consequently, it is not possible to use Clang directly to
compile code targeting the C920’s RVV because the assembly will
be incompatible. To enable experimentation with Clang we lever-
aged the RVV-rollback [11] tool which operates upon RVV v1.0
assembly and backports it to RVV v0.7.1. Whilst this
tool is experimental, and we would not expect users to use it for
their production HPC codes on the SG2042, it does enable us to ex-
plore the use of Clang, and the different configurations it provides,
for our benchmark on the CPU.
Figure 3: Clang VLA and VLS single core comparison against using GCC for selected Polybench kernels in FP32

Figure 3 reports a single core performance comparison when using Clang in VLA and VLS mode baselined against GCC8.3 for selected Polybench kernels in FP32. Zero is the same performance, a positive number meaning Clang VLA or VLS is faster, and a neg-
ative number slower. Out of these kernels it should be noted that
GCC is unable to auto-vectorise the Warshall and Heat3D kernels,
and furthermore whilst Jacobi1D and Jacobi2D are vectorised by
GCC the scalar code path is chosen for execution at runtime.
By contrast, Clang is able to vectorise all the kernels but the
2MM, 3MM, and GEMM kernels execute in scalar mode only, and
switching to Clang delivers worse performance for these three ker-
nels. For most of the other kernels there is a benefit to compiling
with Clang, and VLS tends to outperform VLA mode. However, a
surprise was that the Jacobi2D kernel is slower with Clang com-
pared to its GCC counterpart, which is contrary to the findings of
[11]; however, that study was running on the Allwinner D1’s far
simpler C906 core.
We deduce from this experiment that VLS tends to outperform
VLA on the C920 and it is unfortunate that Clang only supports
RVV v1.0 as using this compiler would likely be beneficial for many
codes running on the SG2042. However, for optimal performance
it is desirable to experiment with both Clang and GCC, on a kernel
by kernel basis.
3.3 x86 performance comparison
Until this point we have explored performance of the SG2042 against
other commodity RISC-V CPUs, and explored the SG2042’s multi-
threading and vectorisation behaviour. Our experiments have demon-
strated that this CPU significantly outperforms existing widely avail-
able RISC-V hardware, but for HPC workloads that is not in fact the competition. Instead, the critical question that must be answered
for the HPC community is how the SG2042 compares against other
CPUs that are in use in current generation supercomput-
ers. Consequently, in this section we explore performance of the
SG2042 against the x86 CPUs that are summarised in Table 4. We
only execute on physical cores of these x86 CPUs as all SMT is
disabled by default.
The AMD Rome EPYC 7742 CPU is found in ARCHER2, which
is a Cray EX and the UK national supercomputer. Similar to the
SG2042, the EPYC 7742 contains 64 physical cores across four NUMA
regions, each with 16 cores, but has eight instead of four memory
controllers. Each core contains 32KB of I and 32KB of D L1 cache,
512 KB of L2 cache, and there is 16MB of L3 cache shared between
four cores. Providing AVX2, the EPYC 7742 has 256-bit wide vector
registers, which is double that of the SG2042, and supports vectori-
sation of FP64.
| CPU | Part | Clock | Cores | Vector |
| AMD Rome | EPYC 7742 | 2.25GHz | 64 | AVX2 |
| Intel Broadwell | Xeon E5-2695 | 2.1GHz | 18 | AVX2 |
| Intel Icelake | Xeon 6330 | 2.0GHz | 28 | AVX512 |
| Intel Sandybridge | Xeon E5-2609 | 2.40GHz | 4 | AVX |
Table 4: Summary of x86 CPUs used to compare against the SG2042
The Intel Broadwell CPU is in Cirrus, an SGI/HPE 8600 Cluster,
and the 18 physical cores reside in one NUMA region. This Xeon
E5-2695 provides 32KB of I and 32KB of D L1 cache, 256KB of L2
cache which is on average the same per-core as the SG2042, and
45MB of L3 cache shared across the cores. Similarly to the AMD
Rome CPU, the Xeon E5-2695 supports AVX2 and there are four
memory controllers. The Intel Icelake CPU is the newest CPU that
we compare against and all 28 physical cores are in a single NUMA
region with 8 memory controllers. This Xeon 6330 has 32KB I and
48KB D L1 cache, 1MB L2 cache per core, which is four times that
of the SG2042, and 43MB shared L3 cache. Providing AVX512, the Xeon 6330 has 512-bit wide vector registers.
The Intel Sandybridge is the oldest CPU compared against in
this study and the only x86 CPU we consider that is not being
actively used in a current generation supercomputer. Released in
2012, it is interesting to explore how the performance of a decade
old x86 CPU compares against the SG2042. Providing only four
physical cores, each one has 64KB of I and 64KB of D L1 cache,
as well as 256KB of L2 cache and shared 10MB L3 cache. This E5-
2609 supports only AVX and therefore the vector register lengths
are the same, 128-bit, as the SG2042, although FP64 is supported
by AVX.
In all experiments conducted in this section we bind to physical
cores of the x86 system and disable hyperthreading. We use GCC
version 8.3 on all systems apart from ARCHER2, where GCC ver-
sion 11.2 is used because that is the nearest available version. Com-
pilation is undertaken at optimisation level O3 throughout. On all
systems we execute over the most performant number of threads,
on all the x86 systems this was found to be the same as the number
of physical cores, whereas for the SG2042 it was demonstrated in
Section 3.2 that for some benchmark classes 32 threads provided
better performance compared to 64 threads.
Figure 4 reports single core performance running the bench-
mark suite at FP64 for the x86 CPUs baselined against the SG2042.
This graph is organised the same way as the RISC-V commodity
hardware comparison graph, where the bars are the average num-
ber of times faster or slower across the class, and whiskers range
from the maximum to the minimum. It can be seen that all x86
cores tend to outperform the C920 apart from the Sandybridge
core which on average performs slower for stream and algorithm
benchmark classes. The AMD Rome and Intel Icelake CPUs tend
to outperform the Intel Broadwell, which is understandable given
that the Broadwell is the older of the three.

Figure 4: FP64 single core comparison against x86, reporting number of times faster or slower than the baseline SG2042

Figure 5 reports results from the same experiment using FP32 and it can be seen that the
AMD Rome CPU is fairly lacklustre when executing at single preci-
sion compared to double, whereas the Intel processors on average
perform just as well, and indeed the Sandybridge outperforms the
C920 on average in each class when using FP32.
Figure 5: FP32 single core comparison against x86, reporting
number of times faster or slower than the baseline SG2042
However, the average bars in Figure 5 do not provide a complete
picture. As described in Section 3.2, the C920 only supports vectori-
sation for FP32 and indeed it can be seen from the whiskers in Fig-
ures 5 and 4 that the maximum times faster is less for many bench-
mark classes at FP32 than FP64. Furthermore, at FP32 there are more kernels that run slower on the x86 CPUs than on the C920. These kernels are where auto-vectorisation is be-
ing applied effectively, and indeed it can be seen that for the lcals
benchmark class there is at least one kernel on all the x86 CPUs
that performs slower than the C920.
To this point we have explored single-core performance between
the SG2042 and x86 CPUs. However, the SG2042 contains 64 cores
which is an impressive number and only matched by the AMD
Rome CPU. It is therefore instructive to undertake a performance
comparison when multi-threading to understand the total perfor-
mance that each CPU can deliver.

Figure 6: FP64 multithreaded comparison against x86, reporting number of times faster or slower than the baseline SG2042

Figure 6 reports a performance comparison of the x86 CPUs against the SG2042 for double preci-
sion, FP64. It can be seen that the basic, lcals, polybench, and stream
classes benefit most from a greater number of cores and for these
the SG2042 on average outperforms the Intel Sandybridge CPU.
The AMD Rome, Intel Broadwell, and Intel Icelake on average out-
perform the SG2042, and in many cases by a significant amount.
Figure 7: FP32 multithreaded comparison against x86, re-
porting number of times faster or slower than the baseline
SG2042
Figure 7 reports the multi-threaded performance comparison
for FP32 and these results contain the largest variance. To aid in
readability we have limited the vertical axis and labelled whisker
values that exceed this. The SG2042 tends to perform marginally more competitively against the x86 CPUs for multi-threaded FP32
than FP64, although the polybench class is an anomaly as it performs much better on the three newest x86 CPUs than on the SG2042, and considerably worse on average on the Intel Sandybridge.
4 CONCLUSIONS AND FURTHER WORK
In this paper we have undertaken an independent performance ex-
ploration of the Sophon SG2042 CPU. As the world’s first widely
available large core-count RISC-V CPU that is advertised as target-
ing high performance workloads, this processor could potentially
drive significantly increased interest in, and adoption of, RISC-V by
the HPC community. However, a critical question is whether the
SG2042 actually delivers the performance that it promises and how
this compares against the existing x86 CPUs that are ubiquitous in
current generation supercomputers.
We demonstrated that the XuanTie C920 core outperforms the
U74, which is present in the StarFive VisionFive V2, on average be-
tween five and ten times for single precision and between three
and six times for double precision workloads. Based upon further
exploration it was found that vectorisation is most effective with
FP32 compared to FP64, which explains the difference in performance between the C920 and U74 for double and single preci-
sion, suggesting that indeed the C920 does not support vectorised
FP64. Furthermore, we found that Vector Length Specific (VLS)
mode tends to perform better than Vector Length Agnostic (VLA)
although the choice between GCC and Clang varies on a kernel-by-kernel basis and is largely influenced by
which tool can most effectively auto-vectorise the code and exe-
cute the vector code path.
The SG2042 contains 64 cores and we explored speed up and
parallel efficiency when threading over these. It was found that it
is important to map threads to cores with the architecture in mind,
and doing this cyclically across both NUMA regions and clusters of four cores tended to be optimal, especially up to and including 32 threads, because one is distributing across each of the four memory
controllers, one per NUMA region, and the 1MB of shared L2 cache
per cluster.
We compared performance against four high performance x86
CPUs, three of which are common in current generation supercom-
puters and an older processor which was found in the previous
generation. It was discovered that, in the main, the high perfor-
mance x86 CPUs outperformed the SG2042. For a single core com-
parison, on average the AMD Rome performed three times faster, the Broadwell four times, the Ice Lake four times, and the Sandybridge twice for FP32; for FP64 these numbers were four times, four times, five times, and 20% faster respectively.
When comparing multi-threaded performance against the x86
CPUs, the 64 cores of the SG2042 outperformed the 4 cores of the
Sandybridge on average across all the benchmark classes running
at both FP32 and FP64. The equal core count and more powerful
individual cores of the AMD Rome meant that this processor out-
performed the SG2042 by eight and five times for FP32
and FP64 respectively. Even though it provided a lower core count,
the 18-core Broadwell outperformed the SG2042 on average by six
and four times for single and double precision respectively. Lastly,
the 28-core IceLake, which was the newest CPU that we compared
against, outperformed the SG2042 by six and eight times for FP32
and FP64. The fact that the greatest performance difference between the x86 CPUs and the SG2042 was with single precision surprised us, as we had found that the SG2042’s C920 delivers better performance for
FP32 compared to FP64. However, clearly the x86 CPUs, especially
the Intel Xeons, are also making good use of the reduced precision.
For further work we believe that it would be instructive to ex-
plore distributed memory performance on systems built around
the SG2042, especially the performance that can be delivered us-
ing MPI. As the SG2042 becomes more widely available, clusters of networked machines containing this processor will appear and would be an ideal system to un-
dertake such experiments upon. Whilst networking performance
would also be driven by the auxiliaries coupled with the CPU, not
least the network adaptor, understanding what the options are in
this regard would be beneficial in taking a more general view of
whether machines built using the SG2042, and future CPUs in this
family, would be capable of executing large scale HPC workloads.
We conclude that the SG2042 is a very exciting RISC-V technol-
ogy and provides a significant step change over currently available
commodity RISC-V hardware. Whilst performance is yet to match
that of x86, it should be highlighted that the RISC-V vendors have
come an extremely long way in a short time, and by contrast the
x86 CPUs that we tested against have a very long heritage and
benefit from a great many more person years of effort in their de-
velopment. As T-Head produces new high performance cores and
OEMs integrate these into new CPUs, for the next generation of
high performance RISC-V processors it would be very useful to
have RVV v1.0 provided as this will deliver the opportunity to use
mainline GCC and Clang for compiling vectorised code. Further-
more, provision of FP64 vectorisation, wider vector registers, in-
creased L1 cache, and more memory controllers per NUMA region
would also likely deliver significant performance advantages and
help close the gap with x86 high performance processors.
ACKNOWLEDGMENTS
This work has been funded by the ExCALIBUR H&ES RISC-V testbed.
This work used the ARCHER2 UK National Supercomputing Ser-
vice (https://www.archer2.ac.uk). This work used the Cirrus UK
National Tier-2 HPC Service at EPCC (http://www.cirrus.ac.uk) funded
by the University of Edinburgh and EPSRC (EP/P020267/1). We
thank PerfXLab for access to the SG2042 used in this work.
REFERENCES
[1] David A Beckingsale, Jason Burmark, Rich Hornung, Holger Jones, William
Killian, Adam J Kunen, Olga Pearce, Peter Robinson, Brian S Ryujin, and
Thomas RW Scogland. 2019. RAJA: Portable performance for large-scale scien-
tific applications. In 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). IEEE, 71–81.
[2] eProcessor project 2023. eProcessor: an open source full stack ecosystem. Retrieved
Aug 16, 2023 from https://eprocessor.eu/
[3] Esperanto Technologies 2023. Esperanto: Outstanding solutions for Generative AI
and HPC. Retrieved Aug 16, 2023 from https://www.esperanto.ai/
[4] Jing Ge Feng, Ye Ping He, and Qiu Ming Tao. 2021. Evaluation of compilers’
capability of automatic vectorization based on source code analysis. Scientific
Programming 2021 (2021), 1–15.
[5] Vladimir Herdt, Daniel Große, Pascal Pieper, and Rolf Drechsler. 2020. RISC-V
based virtual prototype: An extensible and configurable platform for the system-
level. Journal of Systems Architecture 109 (2020), 101756.
[6] Richard D Hornung and Holger E Hones. 2017. RAJA Performance Suite. Technical Report. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States).
[7] Ricardo Jesus and Michèle Weiland. 2023. A Study on the Performance Implications of AArch64 Atomics. In High Performance Computing: 38th International Conference, ISC High Performance 2023, Hamburg, Germany, May 21–25, 2023, Proceedings. Springer Nature, 279.
[8] Ricardo Jesus and Michèle Weiland. 2022. ChapelPerf: A Performance Suite for
Chapel. In The Annual Chapel Implementers and Users Workshop.
[9] Tiago Trevisan Jost, Yves Durand, Christian Fabre, Albert Cohen, and Frédéric
Pérrot. 2021. Seamless compiler integration of variable precision floating-point
arithmetic. In 2021 IEEE/ACM International Symposium on Code Generation and
Optimization (CGO). IEEE, 65–76.
[10] Joseph KL Lee, Maurice Jamieson, and Nick Brown. 2023. Backporting RISC-V vector assembly. arXiv preprint arXiv:2304.10324 (2023).
[11] Joseph KL Lee, Maurice Jamieson, Nick Brown, and Ricardo Jesus. 2023. Test-
driving RISC-V Vector hardware for HPC. arXiv preprint arXiv:2304.10319
(2023).
[12] Filippo Mantovani, Pablo Vizcaino, Fabio Banchelli, Marta Garcia-Gasulla, Roger
Ferrer, Giorgos Ieronymakis, Nikos Dimou, Vassilis Papaefstathiou, and Jesus
Labarta. 2023. Software Development Vehicles to enable extended and early co-
design: a RISC-V and HPC case of study. arXiv preprint arXiv:2306.01797 (2023).
[13] Open chip community 2023. Open XuanTie C906. Retrieved Aug 16, 2023 from
https://xrvm.com/cpu-details?id=4056751997003636736
[14] Borja Perez, Alexander Fell, and John D Davis. 2021. Coyote: An open source
simulation tool to enable RISC-V in HPC. In 2021 Design, Automation & Test in
Europe Conference & Exhibition (DATE). IEEE, 130–135.
[15] rvv-next 2023. GNU compiler collection. Retrieved Aug 16, 2023 from https://github.com/riscv-collab/riscv-gnu-toolchain/tree/rvv-next
[16] T-Head 2023. C920: Specifications. Retrieved Aug 16, 2023 from
https://xuantie.t-head.cn/product/xuantie/4082464366237126656
[17] T-Head. 2023. The T-Head XuanTie C920 Processor Datasheet.
https://xrvm.com/cpu-details?id=4108967096845668352.