An Empirical Study of Hyper-Threading in High
Performance Computing Clusters
Tau Leng, Rizwan Ali, Jenwei Hsieh, Victor Mashayekhi, Reza Rooholamini
Dell Computer Corp.
U.S.A.
Abstract
The effects of Intel Hyper-Threading technology on system performance vary
according to the type of applications the system is running, and Hyper-Threading
affects High Performance Computing (HPC) clusters similarly. The characteristics of
the applications run on a cluster determine whether Hyper-Threading will help or
hinder performance. In addition, the operating system's support for scheduling tasks
with Hyper-Threading enabled is an important factor in the overall performance of
the system.
In this paper, we used an experimental approach to demonstrate the performance
gain or degradation of various parallel benchmarks running on a Linux cluster. The
results of these benchmarks show that performance varies as a function of the
number of nodes and the number of processors per node. Furthermore, we used a
performance analysis tool to determine the cause of these performance differences
when Hyper-Threading was enabled versus disabled. Our analysis shows the
correlation between the cluster performance and the program characteristics, such as
computational type, cache and memory usage, and message-passing properties. We
conclude the paper by providing guidance on how to best apply Hyper-Threading
technology to application classes.
1. Introduction
Intel’s Hyper-Threading technology makes a single physical processor appear as two
logical processors. The physical processor resources are shared and the architectural
state is duplicated for the two logical processors [1]. The premise is that this
duplication allows a single physical processor to execute instructions from different
threads in parallel rather than in serial, and therefore, could lead to better processor
utilization and overall performance.
1.1. Level of Parallelism
Two levels of parallelism have been addressed in modern computer processor
design to improve performance. Instruction-level parallelism (ILP) refers to
techniques for increasing the number of instructions executed each clock cycle.
Although the multiple execution units in a processor can execute multiple
instructions at the same time, the dependencies that exist among instructions make
it a challenge to find enough instructions to execute simultaneously. Several
mechanisms have been implemented to increase ILP. For example, "out-of-order
execution" is a technique of evaluating a set of instructions and sending them for
execution in parallel, regardless of their original order as defined by the program,
while still preserving the dependencies among the instructions.
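As a brief illustration (ours, not from the original text), the two C loops below show
why dependencies limit ILP: the first loop carries a dependency from one iteration to
the next, so its additions must effectively execute serially, while the second loop's
iterations are independent and an out-of-order processor can overlap them across its
execution units.

    /* Illustrative sketch of ILP and data dependencies. */
    #include <stddef.h>

    /* Each iteration depends on the previous one (acc feeds back into
     * itself), so the additions form a serial chain. */
    double reduce_serial(const double *a, size_t n) {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)
            acc += a[i];
        return acc;
    }

    /* The iterations are independent, so an out-of-order core can issue
     * several of these additions per cycle. */
    void add_independent(const double *a, const double *b, double *c, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }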
Thread-Level Parallelism (TLP), on the other hand, enables a processor or
multiprocessor system to concurrently run multiple threads from an application or
from multiple, independent programs. SMT, or Simultaneous Multi-Threading
technology, upon which Hyper-Threading is based, permits a processor to exploit
both ILP and TLP. Multiple threads can run on an SMT processor, and the processor
will dynamically allocate resources between the threads, enabling a processor to
adapt to the varying requirements of the workload. Intel’s Hyper-Threading
implements SMT in such a way that each logical processor maintains a separate
architectural state, which consists of general-purpose, control, machine state, and
advanced programmable interrupt controller (APIC) registers [1]. The chip real
estate required for the architectural states is negligible compared to the total die size.
Thus, threads or separate programs using separate architectural states must share
most of the physical processor resources, such as the trace cache, L2-L3 unified caches,
translation lookaside buffer, execution units, branch history table, branch target
buffer, control logic, and buses. This simultaneous sharing of resources between two
threads creates a potential for performance degradation.
1.2. Multithreading and Message-passing Applications
In general, processors enabled with Hyper-Threading technology can improve the
performance of applications with a high degree of parallelism. Previous studies have
shown that Hyper-Threading technology improves multi-threaded applications'
performance by 10 to 30 percent, depending on the characteristics of the
applications [2]. These studies also suggest that the potential gain is only obtained if
the application is multi-threaded, by whatever parallelization technique. A
multithreaded program can create multiple processes, or threads, at a time without
requiring multiple copies of the program to be running on the computer.
With the addition of Hyper-Threading support in Linux kernels 2.4.9-31 and above,
Linux cluster practitioners have started to assess its performance impact on their
applications. In our area of interest, high performance computing (HPC) clusters,
applications are commonly implemented by using standard message-passing
interface, such as MPI or PVM. Applications developed under the message-passing
programming model usually employ a mechanism, "mpirun" for example, to spawn
multiple processes and map them to the processors in the system. Parallelism is
achieved through the message-passing interface among the processes, which
coordinates the parallel tasks. Unlike multithreaded programs, in which the values of
application variables are shared by all the threads, a message-passing application
runs as a collective of autonomous processes, each with its own local memory.
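As a minimal sketch of this model (our illustration, not code from the paper), the
following MPI program is launched as N autonomous processes by mpirun; each
process computes on its own local memory, and results are combined only through
explicit messages:

    /* Minimal MPI sketch: autonomous processes with private memory,
     * coordinating only through explicit message passing. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, local, total;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of processes spawned */

        local = rank + 1;  /* each process works on its own local data */

        /* Unlike threads, no variables are shared; partial results are
         * combined by a message-passing reduction. */
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum over %d processes = %d\n", size, total);
        MPI_Finalize();
        return 0;
    }

With Hyper-Threading enabled, the same binary could simply be launched with twice
as many processes per node (for example, "mpirun -np 4" on a dual-processor node),
which is exactly the scenario examined in this paper.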
This type of application can also benefit from Hyper-Threading technology in the
sense that the number of processes spawned can be doubled and the parallel tasks
can potentially execute faster. Applying Hyper-Threading and doubling the
processes that simultaneously run on the cluster will increase the utilization rate of
the processors' execution resources, and therefore performance can improve. On the
other hand, overhead might be introduced in the following ways:

- Logical processors may compete for access to the caches, and thus could
  generate more cache misses.
- More processes running on the same node may create additional memory
  contention.
- More processes on each node increase the communication traffic (message
  passing) between nodes, which can oversubscribe the communication
  capacity of the shared memory, the I/O bus, or the interconnect network,
  and thus create performance bottlenecks.
Whether the performance benefits of Hyper-Threading – better resource utilization –
can nullify these overhead conditions depends on the application’s characteristics.
In this paper, we have used an experimental approach to demonstrate the impact of
Hyper-Threading on a Linux cluster using various MPI benchmark programs, and
discussed how well this new technology adapts to HPC clusters for improving
performance. In the next section, we describe the cluster configuration and the
performance tool used in our experiments. Section 3 introduces the performance
benchmarks and their results, along with the performance analysis; in that section,
we use a performance tool called Vtune™ to analyze the system behavior while
running the benchmark programs on the cluster, so that the causes of performance
gain or degradation from applying Hyper-Threading can be understood. Section 4 is
the conclusion.
2. Experimental Environment
2.1 The Cluster Configuration
Our testing environment is based on a cluster consisting of 32 Dell PowerEdge 2650
servers interconnected with Myrinet. Each PowerEdge 2650 has two Intel Xeon
processors running at 2.4 GHz with 512KB L2 cache, 2GB of DDR-RAM (double
data rate RAM) memory operating on a 400 MHz Front Side Bus. The chipset of the
PowerEdge 2650 is the ServerWorks GC-LE, which accommodates up to six
registered DDR 200 (PC1600) DIMMs with a 2-way interleaved memory
architecture. Each of the two PCI-X controllers on the 2650 has its own dedicated
1.6 GB/s full-duplex connection to the North Bridge to accommodate the peak traffic
generated by the PCI-X busses it controls.

Benchmark      High Performance Linpack (HPL) and NPB2.3-Class B
Compiler       Intel compilers 6.0 & ATLAS math library
Middleware     MPICH for GM 1.2..8
OS             Linux 2.4.18-3smp
Protocol       GM
Interconnect   Myrinet 2000
Platform       Dell PowerEdge 2650 32-node cluster

Figure 1. Architectural stack of the test environment. The benchmarks were
compiled with the Intel C or Fortran compilers. The OS is the RedHat 7.3
distribution with kernel version 2.4.18-3smp.
The operating system installed on the cluster is RedHat 7.3 with kernel version
2.4.18-3smp¹. The benchmark programs were compiled with the Intel C and
FORTRAN compilers and the ATLAS (Automatically Tuned Linear Algebra
Software) math library. Figure 1 shows the architectural stack of our test
environment.

¹ The Linux information collector of the Vtune™ performance analysis tool 6.1
used for this study supports Linux kernel version 2.4.18-3.
2.2 The Performance Analysis Tool
We use Vtune™, a performance analysis tool implemented by Intel. A Windows
desktop is utilized in our test environment for the Vtune Performance Analyzer to
display performance data in graphical formats, as well as to collect statistical data on
system behavior. Through a "data collector", a Linux agent, the analyzer is able to
gather the targeted Linux system's information remotely, on the fly.
The collecting method we selected is called “sampling”, which is a non-intrusive,
instruction-address collector mechanism [3]. During sampling, the performance
analyzer monitors all the software executing on the system including the OS kernel
and the benchmark program. The monitoring areas that the analyzer emphasizes
are specified by the user. For our study, we focus on the following system
information in particular.

- Cycles per Instruction (CPI) Retired – The lower the CPI, the faster the
  program executed. For the P4 Xeon processor, a CPI of 0.75 or less is
  considered "good". This event count is used to understand the performance
  of each node and to verify the performance results of our benchmarks (see
  the worked example after this list).
- Floating-point Computation Instructions – Shows all the floating-point
  instructions that have been retired.
- 2nd-Level Cache Read Misses % – Second-level cache read misses reduce
  performance because the processor must then access main memory.
- Streaming SIMD Extensions 2 (SSE2) – The floating-point SIMD
  instructions allow computations to be performed on packed double-
  precision floating-point values (two double-precision values per XMM
  register). Our benchmark programs were compiled with the SSE2 option,
  which allows the code to take advantage of the SSE2 feature. This event
  count indicates whether the program threads are fully utilizing this
  resource.
- x87 Instructions Retired – This event count increments for each x87
  floating-point micro-op specified through the event mask for detection. All
  of our benchmarks, except IS, are floating-point intensive. This event count
  provides a fair understanding of the utilization of the floating-point
  execution units.
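To make the CPI figures concrete, here is a back-of-the-envelope calculation (ours,
not from the paper): at the 2.4 GHz clock of the test nodes, a CPI of 0.75
corresponds to retiring 2.4e9 / 0.75 = 3.2 billion instructions per second on one
physical processor. The sketch below applies the same arithmetic to the CPI values
reported for the cluster runs in Section 3.

    /* Back-of-the-envelope sketch (ours): instructions retired per second
     * implied by a given CPI at a fixed clock rate. */
    #include <stdio.h>

    int main(void) {
        double clock_hz = 2.4e9;                  /* Xeon clock of the test nodes */
        double cpi[] = {0.42, 0.46, 0.59, 0.75};  /* values cited in this paper */
        for (int i = 0; i < 4; i++)
            printf("CPI %.2f -> %.2f billion instructions/s\n",
                   cpi[i], clock_hz / cpi[i] / 1e9);
        return 0;
    }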
3. Benchmarking Results and Analysis
The MPI programs we used for the experiments are the High-Performance Linpack
(HPL) and the NAS Parallel Benchmark (NPB), benchmarks commonly used in the
High Performance Computing arena.
3.1 Benchmarking with the High Performance Linpack (HPL)
HPL uses a number of linear algebra routines to measure the time it takes to solve
dense linear equations in double precision (64 bits) arithmetic using the Gaussian
elimination method [4]. The measurement obtained from Linpack is in the number of
floating-point operations per second (FLOPS). Linpack mainly exercises the
floating-point calculation capability of the system. However, the communication
latency of the system for running Linpack also plays a significant role on the overall
performance; when using dual processors compute nodes interconnected with high-
speed networking, such as Myrinet, the actual performance of a cluster may reach
almost 60% of its theoretical peak performance, and the percentage could be less
than 30% when using slower interconnect like Fast Ethernet [5]. When running
Linpack, the more the memory used of the system or the larger the problem size
specified for executing the program, the better the performance of the system.
However, as a rule of thumb, the problem size or the memory usage should not
exceed 80% of the total memory in the system for avoiding the swapping situation,
which will decrease the performance significantly.
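This rule of thumb can be turned into a quick sizing calculation (our sketch; the 80%
figure is the paper's, the formula is the standard HPL sizing heuristic): an N x N
double-precision matrix occupies 8*N^2 bytes, so the largest advisable N is roughly
sqrt(0.8 * total_memory / 8).

    /* Sketch (ours) of the HPL sizing rule of thumb described above. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double mem_per_node = 2.0 * 1024 * 1024 * 1024;  /* 2 GB, as in the test cluster */
        int nodes = 32;
        double usable = 0.8 * mem_per_node * nodes;  /* stay under 80% to avoid swapping */
        long n = (long)sqrt(usable / 8.0);           /* 8 bytes per matrix element */
        printf("suggested maximum HPL problem size: N ~ %ld\n", n);
        return 0;
    }

For the 32-node cluster with 2 GB per node this gives N of roughly 83,000; the
largest problem size actually run in Figure 3 (56,000) stays comfortably below that
bound.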
Figure 2. The Linpack performance results on a single node (Gflops for problem
sizes from 1000x1000 to 14000x14000), comparing 1x2 processes without HT,
1x2 processes with HT, and 1x4 processes with HT. The worst case here is
spawning two processes when Hyper-Threading (HT) is on.
To understand the impact of Hyper-Threading on a single compute node, we first
conducted a series of HPL runs from small problem sizes to large. The results shown
in Figure 2 indicate that only when the problem size, and hence the memory used, is
sufficiently large (2000x2000 or more) do we see a modest performance
improvement (around 5%) in the Hyper-Threading configurations. This is due to the
start-up overhead of Linpack, which grows when more processes are spawned and is
not masked when running a very small problem size. Since Linpack was compiled
with the highly optimized ATLAS library, the floating-point functional units,
including the SSE2 units, were almost fully utilized during execution. This leaves
very little room for improving Linpack performance by switching on
Hyper-Threading to increase the utilization of the CPUs' resources.

Also note in Figure 2 that for the runs where the number of processes is less than the
number of logical processors with Hyper-Threading enabled, the performance is
considerably worse. This observation is more apparent in the 16-node runs: Figure 3
shows that the 16x2 result with Hyper-Threading enabled has only a small
performance improvement over the 16x1 runs with HT disabled.
For the cluster runs, the average CPIs shown in the Vtune analyzer are 0.42, 0.46,
and 0.59 for running 4 processes on each node with Hyper-Threading enabled,
running 2 processes on each node without Hyper-Threading, and running 2 processes
on each node with Hyper-Threading, respectively. These statistical sampling data
accord with the actual performance results: when Hyper-Threading is enabled,
running 4 processes on each node increases performance by around 5%, while
running 2 processes on each node reduces performance by around 25% or more.
Figure 3. The HPL Linpack performance results on the 16-node dual-Xeon
cluster (GFLOPS for problem sizes from 2000 to 56000), comparing 16x4
processes with HT on, 16x2 processes without HT, 16x2 processes with HT on,
and 16x1 processes without HT. The small performance difference between
using 2 logical processors and using 1 physical processor indicates the
limitation of HT support in Linux.
From the observation of the "Instructions Retired" rate in the Vtune analyzer data, in
the two-process runs with Hyper-Threading enabled, three of the logical processors
were utilized substantially while the fourth was not used. In addition, the "2nd level
cache misses %" showed inconsistent rates for the two physical processors (83% and
75%). This means the Linux OS had been reallocating CPU resources to
load-balance the two threads, but the OS scheduler, without knowing the association
between the physical processors and the logical processors, may schedule the two
threads on two logical processors that belong to the same physical processor. This is
a limitation of Hyper-Threading support in the current Linux kernel. The issue has
been raised in the Linux community and is expected to be resolved in future
releases [9].
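Later kernels expose a way to work around this by pinning each process to a chosen
logical CPU. The sketch below is ours and assumes a kernel and glibc that provide
the sched_setaffinity(2) interface (which postdates the stock 2.4.18 kernel used in
this study), as well as the common enumeration in which logical CPUs 0 and 1 share
the first physical package; the enumeration varies by system.

    /* Sketch (ours): pin a process to one logical CPU per physical package,
     * assuming sched_setaffinity(2) is available (post-2.4.18 kernels) and
     * that logical CPUs 0,1 share physical processor 0 while 2,3 share
     * physical processor 1. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int local_rank = (argc > 1) ? atoi(argv[1]) : 0;  /* e.g., MPI rank on this node */
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(local_rank * 2, &mask);  /* rank 0 -> cpu 0, rank 1 -> cpu 2 */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("process pinned to logical cpu %d\n", local_rank * 2);
        return 0;
    }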
Although our preliminary study indicates that Linpack-type applications can
benefit from Hyper-Threading, we have seen mixed results when running the NAS
parallel benchmark suite.
Figure 4. The EP (Class B) benchmark results with Hyper-Threading enabled
and disabled (Mop/s for the 1x2/1x4 through 32x2/32x4 node x process
configurations). The Hyper-Threading enabled runs show a 40% improvement
regardless of the number of nodes.
3.2 Benchmarking with the NAS Parallel Benchmark (NPB)
The NAS benchmark suite comprises five kernels and three pseudo-applications and
is designed to gauge parallel computing performance. Each of the programs solves a
specific numerical problem [6], and the performance results are measured in
Millions of Operations per Second (Mop/s). Since each program represents a
specific type of CFD application, the benchmark results reveal the performance
characteristics of the system under test from various aspects. In this paper, we used
three of the eight programs for our experiments: EP, FT, and IS.
In the Embarrassingly Parallel (EP) benchmark, two-dimensional statistics are
accumulated from a large number of Gaussian pseudo-random numbers, which are
generated according to a particular scheme that is well suited for parallel
computation. This problem is typical of many Monte Carlo applications. Since it
requires almost no communication, in some sense this benchmark provides an
estimate of the upper achievable limit for arithmetic performance on a particular
system.
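The EP pattern can be sketched as follows (our illustration; the real benchmark
generates Gaussian pairs with a specific scheme described in [6], whereas this toy
version estimates pi): each process performs its share of random trials entirely
independently, and the only communication in the whole run is one small final
reduction.

    /* EP-style sketch (ours): independent Monte Carlo work per process,
     * with a single tiny reduction at the end. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int rank, size;
        long n = 1000000, hits = 0, total = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        srand(rank + 1);  /* a per-process stream; illustrative only */
        for (long i = 0; i < n; i++) {
            double x = rand() / (double)RAND_MAX;
            double y = rand() / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;  /* count points inside the quarter circle */
        }

        /* Almost no communication: one 8-byte reduction for the whole run. */
        MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi ~ %f\n", 4.0 * total / (double)(n * size));
        MPI_Finalize();
        return 0;
    }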
Since EP is a computation-bound program and requires almost no communication
during the runs, with Hyper-Threading enabled it can effectively utilize the CPUs'
resources without being concerned with communication overhead, which improves
performance significantly. This holds regardless of the number of nodes or CPUs in
the cluster. Figure 4 shows that EP performance improved linearly from one node to
32 nodes, as well as the constant performance gains of the Hyper-Threading enabled
runs.

Figure 5. The IS (Class B) benchmark results with Hyper-Threading enabled
and disabled (Mop/s for the 1x2/1x4 through 32x2/32x4 node x process
configurations). The percentage of performance difference becomes smaller as
the node count grows.
Integer Sort (IS) tests a sorting operation that is important in particle method codes.
This type of application is similar to particle-in-cell applications in physics, wherein
particles are assigned to cells and may drift out; the sorting operation is used to
reassign particles to the appropriate cells. This benchmark tests both integer
computation speed and communication performance. The problem is unique in that
no floating-point arithmetic is involved, but significant data communication is
required.
With Hyper-Threading enabled, doubling the IS processes running on each node
from 2 to 4 creates much more communication traffic among processes and more
memory contention inside the nodes, yet the CPU floating-point execution units
remain underutilized. In both cases, the "x87 Instructions Retired" rate observed
from the Vtune Performance Analyzer is around 2.3%, indicating that there is
almost no floating-point calculation. Hence, performance cannot be improved by
increasing the utilization of those resources. Figure 5 shows the IS results
comparison from 1 node to 32 nodes.
It can also be observed that the IS performance dissimilarity between
Hyper-Threading disabled and enabled gets smaller as the node count grows. As
shown in Figure 5, when the node count is 32, the performance becomes better for
128 (or 32x4) processes running IS with Hyper-Threading enabled. The reason is
that the proportion of communication going through the interconnect network
becomes larger than that going through the shared memory, which comparatively
relieves the memory contention and communication traffic caused by having four
logical processors running four processes in each node. Therefore, we expect to see
performance improvement for clusters larger than 32 nodes running IS with the
same configuration.

Note that this phenomenon might not apply to other cluster configurations. For
example, using Fast Ethernet as the interconnect decreases the communication
capability dramatically, so the shared-memory communication capability of the
cluster becomes relatively higher. In such a case, increasing the node count will
obstruct the performance of IS instead of facilitating it [5].

In the 3-D FFT PDE (FT) benchmark, a 3-D partial differential equation is solved
using FFTs. This program performs the essence of many spectral methods and is a
good test of long-distance communication performance. FT requires intensive
floating-point operations and message passing among processes. FT is also a
cache-friendly benchmark; a large cache facilitates its performance [7][10].

Figure 6. The FT (Class B) benchmark results with Hyper-Threading enabled
and disabled (Mop/s for the 1x2/1x4 through 32x2/32x4 node x process
configurations). Performance degraded by 50% for all configurations when
Hyper-Threading was used.
From the Vtune performance data, the level 2 cache miss rate increased from 68%
for the non-Hyper-Threading runs to 76% for the Hyper-Threading runs, while the
"SSE2" and the "x87 instructions" rates were both up to 99.9%. Moreover, the
communication bandwidth among processes required by FT creates memory
contention and communication bottlenecks, which lead to a constant 50%
performance degradation for the Hyper-Threading enabled runs. The results shown
in Figure 6 indicate that for applications like FT, Hyper-Threading will not provide
any gain but rather will degrade the cluster performance at any node count.

4. Conclusions

Hyper-Threading could improve the performance of some MPI applications running
on a cluster, but not all. Depending on the cluster configuration and, more
importantly, the nature of the application running on the cluster, the performance
gain can vary or even be negative. By using a performance analysis tool, we were
able to understand which areas contribute to the performance gains, and which areas
contribute to the overheads that lead to performance degradation. Based on our
analysis, the following observations are made for applying Hyper-Threading on a
Linux cluster.

- Computationally intensive applications with fine-tuned floating-point
  operations have less chance of improved performance from
  Hyper-Threading, because the CPU resources could already be highly
  utilized.
- Cache-friendly applications might suffer with Hyper-Threading enabled,
  because logical processors share the caches; the processes running on the
  logical processors might compete for cache access, which might result in
  performance degradation.
- Communication-bound or I/O-bound parallel applications may benefit from
  Hyper-Threading, if the communication and the computation can be
  performed in an interleaved fashion between processes. However, the
  additional I/O traffic might create communication bottlenecks and reduce
  the overall performance.
- The current version of the Linux OS's support for Hyper-Threading is
  limited, which could cause significant performance degradation if
  Hyper-Threading is not applied properly.

Hyper-Threading technology introduces yet another element of designing or
operating a balanced cluster system. The challenge of incorporating the technology
in an HPC environment is not just for users to decide whether or how it should be
applied, but also for the OS, compiler, and performance tool developers to make the
technology more applicable and useful.
References
[1] D. T. Marr et al., "Hyper-Threading Technology Architecture and
Microarchitecture", Intel Technology Journal, Vol. 6, Issue 01, February 2002.
[2] W. Magro, P. Petersen, S. Shah, "Hyper-Threading Technology: Impact on
Compute-Intensive Workloads", Intel Technology Journal, Vol. 6, Issue 01,
February 2002.
[3] Intel Vtune Performance Analyzer 6.1,
http://www.intel.com/software/products/vtune/vtune61/index.htm
[4] "HPL - A Portable Implementation of the High-Performance Linpack
Benchmark for Distributed-Memory Computers", Netlib Repository at UTK
and ORNL, http://www.netlib.org/benchmark/hpl.
[5] J. Hsieh, T. Leng, V. Mashayekhi, and R. Rooholamini, "Architectural and
Performance Evaluation of GigaNet and Myrinet Interconnects on Clusters of
Small-Scale SMP Servers", in the Proceedings of SuperComputing '00,
Dallas, TX, November 2000.
[6] NAS Parallel Benchmark suite, http://www.nas.nasa.gov/Software/NPB/
[7] J. Hsieh, T. Leng, V. Mashayekhi, and R. Rooholamini, "Impact of Level 2
Cache and Memory Subsystem on the Scalability of Clusters of Small-Scale
SMP Servers", in the Proceedings of the International Conference on Cluster
Computing, Cluster'00, November 2000.
[8] M. Kravetz, H. Franke, S. Nagar, R. Ravindran, "Enhancing Linux Scheduler
Scalability", in the 2001 Ottawa Linux Symposium, July 2001.
[9] The Linux Kernel Archive, ChangeLog-2.4.19, http://www.kernel.org/
[10] F. C. Wong, R. P. Martin, R. H. Arpaci-Dusseau, and D. E. Culler,
"Architectural Requirements and Scalability of the NAS Parallel Benchmarks",
in the Proceedings of SuperComputing '99, Portland, Oregon, November 1999.