Figure 2 - available via license: Creative Commons Attribution 4.0 International
Source publication
The Sophon SG2042 is the world's first commodity 64-core RISC-V CPU for high performance workloads, and an important question is whether the SG2042 has the potential to encourage the HPC community to embrace RISC-V. In this paper we undertake a performance exploration of the SG2042 against existing RISC-V hardware and high performance x86 CPUs in...
Contexts in source publication
Context 1
... considering the performance that the SG2042 will deliver, another important consideration is whether to enable vectorisation or not. Figure 2 reports the difference in performance, when running on a single core, enabling vectorisation for FP32 and FP64 compared to each of these configurations running scalar-only. The bars depict the average across the benchmark class, with the top ...
[Table 2: Speed up and parallel efficiency for benchmark classes as we scale the number of threads, using cyclic allocation of threads to cores where threads cycle round NUMA regions and are then allocated contiguously in a region]
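The Table 2 caption in this excerpt mentions speed up, parallel efficiency, and a cyclic allocation of threads to cores. The following is a minimal Python sketch of those three ideas, not the authors' implementation. It assumes a hypothetical layout in which every NUMA region holds the same number of cores and core ids are numbered contiguously within a region; the excerpt does not give the SG2042's actual region count or core numbering, and the function names are illustrative only.

```python
def cyclic_placement(num_threads, num_regions, cores_per_region):
    """Return the core id for each thread under a cyclic policy.

    Threads cycle round the NUMA regions first; within a region they are
    allocated contiguously. ASSUMPTION: region r owns core ids
    r * cores_per_region .. (r + 1) * cores_per_region - 1.
    """
    if num_threads > num_regions * cores_per_region:
        raise ValueError("more threads than cores")
    placement = []
    for thread in range(num_threads):
        region = thread % num_regions   # cycle round the regions
        slot = thread // num_regions    # next free core inside that region
        placement.append(region * cores_per_region + slot)
    return placement


def speedup(time_one_thread, time_n_threads):
    """Speed up relative to the single-threaded run time."""
    return time_one_thread / time_n_threads


def parallel_efficiency(time_one_thread, time_n_threads, num_threads):
    """Speed up divided by the thread count (1.0 would be ideal scaling)."""
    return speedup(time_one_thread, time_n_threads) / num_threads
```

With the hypothetical layout of 4 regions of 16 cores, `cyclic_placement(6, 4, 16)` places the first four threads on one core in each region before returning to the first region for the fifth thread.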
Context 2
... Figure 2 it can be observed that there is a greater benefit in enabling vectorisation for single precision, FP32. This benefit does vary considerably across the different kernels in each class, and hence the average for each class is fairly low, but one can see from the whiskers on the plot that there are some kernels that strongly benefit from vectorisation such as those in the stream class. ...
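The way a per-class average can look modest while the whiskers show large gains for individual kernels can be illustrated with a small Python sketch. It assumes the vectorisation benefit of a kernel is measured as scalar run time divided by vectorised run time, which the excerpt does not state explicitly, and that the whiskers mark the best and worst kernel in each class. The class names, kernel names, and timings below are invented for illustration and are not results from the paper.

```python
from statistics import mean


def summarise_vectorisation(timings):
    """Summarise the vectorisation benefit for each benchmark class.

    `timings` maps class name -> {kernel name: (scalar_time, vector_time)}.
    A ratio above 1.0 means the vectorised build ran faster.
    """
    summary = {}
    for benchmark_class, kernels in timings.items():
        ratios = [scalar / vector for scalar, vector in kernels.values()]
        summary[benchmark_class] = {
            "mean": mean(ratios),  # height of the bar
            "min": min(ratios),    # lower whisker (worst kernel)
            "max": max(ratios),    # upper whisker (best kernel)
        }
    return summary


# Illustrative, made-up timings in seconds: (scalar, vectorised).
example_timings = {
    "stream": {"kernel_a": (4.0, 1.0), "kernel_b": (6.0, 2.0)},
    "basic": {"kernel_c": (2.0, 2.0), "kernel_d": (1.0, 2.0)},
}
example_summary = summarise_vectorisation(example_timings)
```

In this invented data the "basic" class averages below 1.0 because one kernel slows down, while the "stream" class benefits on every kernel, which mirrors the qualitative pattern described in the text.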
Context 3
... some individual kernels ran slower with vectorisation enabled, for FP32 these are in the minority and the benefits across the suite outweigh them. There are more kernels that run slower with vectorisation for FP64; however, as can be seen from the whiskers in Figure 2, the overhead of even the worst performing kernels tends to be small. Therefore, from these results, our recommendation is that vectorisation should be enabled where possible when compiling for the SG2042. ...
Context 4
... comparison, Clang was able to auto-vectorise 59 kernels with only 3 of these following the scalar path at runtime. This is one of the reasons why the average benefit across the benchmark classes when enabling vectorisation is fairly low in Figure 2, because GCC is unable to vectorise many of these individual benchmark kernels. Indeed, the stream class is unique as GCC is able to vectorise all of its constituent kernels, and this demonstrated by far the largest average improvement when enabling vectorisation. ...