EVALUATING CPU AND MEMORY AFFINITY FOR
NUMERICAL SCIENTIFIC MULTITHREADED
BENCHMARKS ON MULTI-CORES
Christiane P. Ribeiro, Márcio Castro, Vania Marangozova-Martin,
Jean-François Méhaut
Nanosim team, Laboratoire d'Informatique de Grenoble (LIG), CNRS, INRIA, CEA, University of Grenoble. ZIRST 51, avenue Jean Kuntzmann, 38330 Montbonnot Saint Martin, France.
Henrique C. Freitas, Carlos A. P. S. Martins
Department of Computer Science – Pontifical Catholic University of Minas Gerais (PUC Minas). Av. Dom José Gaspar, 500, Belo Horizonte, MG, Brazil.
ABSTRACT
Modern multi-core platforms feature complex topologies with different cache levels and hierarchical
memory subsystems. Consequently, thread and data placement become crucial to achieve good
performance. In this context, CPU and memory affinity appear as a promising approach to match the
application characteristics to the underlying architecture. In this paper, we evaluate CPU and memory
affinity strategies for numerical scientific multithreaded benchmarks on multi-core platforms. We use
and analyze hardware performance event counters in order to have a better understanding of such impact.
Indeed, the results obtained on different multi-core platforms and Linux kernels show that important
performance improvements (up to 70%) can be obtained when applying affinity strategies that fit both
the application and the platform characteristics.
KEYWORDS
Performance, Affinity, NAS Benchmarks, Multi-core Platforms.
1. INTRODUCTION
Modern multi-core platforms are designed with a hierarchical memory topology to reduce the
communication latency and to increase bandwidth. Depending on the architecture design
decisions, the shared main memory can be either composed of a single memory bank (UMA -
Uniform Memory Access) or distributed in several memory banks (NUMA - Non-Uniform
Memory Access) [Asanovic, 2006]. In addition to the main memory topology, such platforms
also feature multiple levels of cache, which can be arranged in many different ways. Usually,
cores on the same chip share cache whereas cores on different chips do not.
In this paper we argue that the memory architecture has a great impact on applications’
performance and that it should be explicitly taken into account. Indeed, when dealing with
parallel applications whose major performance metrics include the speed of execution and the
speedup [Foster, 1995], non-uniform memory accesses may break load balancing and thus
slow down the application. On the other hand, knowledge and intelligent usage of the cache
management strategies may prevent data from moving around and thus accelerate the
computations.
To respond to the variability of architectural characteristics and ensure good application
performance, it is necessary to efficiently manage thread and data locality. In this context
emerges the concept of affinity, which can be divided into two types: CPU affinity and
memory affinity. CPU affinity forces a thread to run on a specific core or a subset of cores. The
idea is to take advantage of the fact that the data accessed by the thread may remain in the
processor cache. However, in order to do so, the default behavior of the operating system (OS)
scheduler has to be changed [Mei, 2010] [Castro, 2012].
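To make the notion of hard affinity concrete, the following minimal C sketch pins the calling thread to a single core on Linux using sched_setaffinity(); it is only an illustration of the mechanism, not the tooling used in this paper, and the core number is arbitrary:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);          /* start with an empty CPU set    */
    CPU_SET(0, &mask);        /* allow execution on core 0 only */

    /* pid 0 means "the calling thread/process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* from now on, the scheduler will not migrate this thread */
    return 0;
}

The numactl tool used later in this paper requests the same kind of binding from the command line, without modifying the application.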
Memory affinity [Terboven, 2008] is ensured when data is efficiently distributed over the machine memory. Such distribution can reduce either the number of remote accesses (latency optimization), which may be very significant on NUMA platforms, or the memory contention (bandwidth optimization), which can be present on both UMA and NUMA platforms.
In this context, some important questions arise: how do different affinity strategies impact
scientific parallel applications? Are CPU-bound applications affected by different CPU
affinity strategies? How can memory affinity strategies impact the performance of a memory-
bound application?
In this work, we focus on evaluating the impact of CPU and memory affinity on UMA and
NUMA multi-core platforms. Our methodology is based on the following major aspects:
Choice and control of the used multi-core hardware architectures. We experiment with
two different platforms, respectively one UMA and one NUMA platform. We know all the
parameters characterizing their hardware architecture, namely the type and the number of the
cores, their interconnection and their cache and memory organization. Moreover, we control
all possible operating system parameters.
Use of representative parallel applications. We are interested in evaluating numerical
scientific multithreaded benchmarks, which exhibit significant memory and processing power
usage, data-sharing and different memory access patterns. This is the reason why we have
chosen to work with the NAS Parallel Benchmarks (NPB) [Haoqiang, 1999]. NPB is a
benchmark derived from computational fluid dynamics (CFD) codes. It is composed of a set
of applications and kernels that are examples of memory-bound and CPU-bound programs.
Rigorous performance evaluation and analysis. All experiments have been executed
multiple times and the average values have been used in order to minimize the measurement
error. During experiments, the platforms have been reserved and dedicated to the running
application, i.e., there has been no outside interference. In order to evaluate the impact of affinity strategies on NPB, we have studied performance in a top-down manner. We start with the standard performance metrics for parallel applications, i.e., speedup and scalability. In order to further understand the obtained results, we have observed thread migrations as well as hardware performance counters. For the latter, we have typically observed the cache accesses
and cache misses.
Experimentation with different affinity strategies. We have experimented with different
CPU and memory affinity strategies. For CPU affinity, we have tried two strategies in which
threads are forced to run on a specified set of CPUs/cores. The first strategy maps threads to the cores of a single node, thus allowing cache sharing. The second one prevents cache sharing by
distributing the threads on different nodes. As for memory affinity, we replaced the Linux
default data placement strategy with two alternative ones. The idea is to allocate data on
specified banks of memory aiming at optimizing the latency or the bandwidth. Our
experiments show that affinity strategies have a significant impact on applications’
performance and that a well-adapted usage can lead to performance improvements of up to 70%.
This paper is an extension of previous research work [Ribeiro, 2011]. We added new results and findings related to the impact of CPU and memory affinity on parallel applications on a newer version of the Linux kernel. We highlight that, although the Linux affinity management is now more aware of multi-core machines, there is still a need to carefully place processes, threads and memory data to achieve better performance.
This paper is organized as follows. In Section 2, we present the platforms and benchmarks
used in this work. We report the overall performance of NPBs in Section 3. The investigation
of the CPU and memory affinity is discussed in Section 4. Section 5 presents the performance
of NPBs when affinity strategies are applied. In Section 6, we present the same evaluation of
affinity strategies on a newer version of the Linux kernel. Related works are discussed in
Section 7. Finally, concluding remarks and future work are presented in Section 8.
2. EXPERIMENTAL ENVIRONMENT
In order to conduct our experiments, we have selected two representative multi-core
platforms. Intel UMA: a multiprocessor based on four six-core Intel Xeon X7460 processors. Each group of two cores shares an L2 cache (3MB) and each group of six cores shares an L3 cache (16MB). Intel NUMA: a Dell PowerEdge R910 equipped with four eight-core Intel Xeon X7560 processors. Each core has private L1 (32KB) and L2 (256KB) caches and all cores on the same socket share an L3 cache (24MB).
Table 1 summarizes the hardware characteristics of these machines. Memory bandwidth
(obtained from Stream - Triad operation [McCalpin, 1995]) and NUMA factor (obtained from
BenchIT [Molka, 2009]) are also reported in this table. The NUMA factor is obtained through
the division of remote read latency by local read latency. NUMA factors are shown in
intervals, meaning the minimum and maximum penalties to access a remote DRAM in
comparison to a local DRAM. Both machines run the Linux operating system, kernel versions 2.6.32 and 3.2.0, with GCC (GNU C Compiler) 4.4.4. In order to control CPU and memory affinity, we have used the numactl tool, version 2.0.4 [Kleen, 2005].
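Expressed as a formula, the NUMA factor reported in Table 1 is simply the ratio of the measured read latencies:

\[ \text{NUMA factor} = \frac{\text{remote DRAM read latency}}{\text{local DRAM read latency}} \]

so the interval [1.36-3.6] for the Intel NUMA means that a remote read costs between 1.36 and 3.6 times a local one, depending on the pair of nodes involved.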
Table 1. Overview of the multi-core platforms.

Platform      #cores  #sockets  #nodes  Clock (GHz)  LLC (MB)  DRAM (GB)  Bandwidth (GB/s)  NUMA factor
UMA Intel       24       4        -        2.66         16         64           6.39             -
NUMA Intel      32       4        4        2.27         24         64          35.54         [1.36-3.6]
We have selected the NAS Parallel Benchmarks (NPB) [Haoqiang, 1999] to evaluate the
performance impact of affinity on multi-core machines. NPB is a benchmark derived from
computational fluid dynamics (CFD) codes. It is composed of a set of applications and kernels
[Haoqiang, 1999] that are examples of memory-bound and CPU-bound programs. In this work
we use the OMNI compiler group implementation of NPB version 2.3. We consider the
following benchmarks: FFT, MG, LU, CG, BT, EP and SP. For all selected applications, we use
three classes, which define the size of the problem: A (small), B (medium) and C (large). The
IS and UA applications, which are also part of the NPB benchmark, were not used in this
paper. UA was not implemented in this version whereas IS presented some problems during
the execution.
3. OVERALL PERFORMANCE OF NAS PARALLEL
BENCHMARKS
In this section, we present the performance of the original version of NPB on each machine
presented in Section 2. We use the speedup as the performance metric to compare all results.
We vary the number of threads considering the available cores in each platform, using one
thread per core. All applications were executed using the problem classes A (small), B
(medium) and C (large). We executed each experiment several times, obtaining a maximum
standard deviation of 5%.
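Here, speedup is used in its usual sense, i.e., the ratio between the single-thread execution time and the execution time with p threads:

\[ S(p) = \frac{T_1}{T_p} \]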
Figure 1. NAS parallel benchmarks: scalability results on Linux 2.6.32.
Figure 1 shows speedups obtained for the NAS Parallel Benchmarks on the selected
machines. EP and LU have presented great scalability due to the high ratio between
computation and communication. Furthermore, in the LU benchmark, data allocation has been
optimized for NUMA machines. A parallel initialization of all data was performed by the team
of threads in order to ensure data locality and to avoid remote accesses. On the contrary, BT,
CG, FFT, SP and MG have presented poor scalability. Indeed, on the Intel NUMA the result is
mainly related to the machine characteristics (NUMA effects) and to the way the benchmarks
have been developed, with a NUMA-unaware data allocation and initialization.
We have also observed that the memory access patterns are irregular on some of these
benchmarks (e.g., CG and MG). Indeed, some of them use sparse matrices and indirect matrix accesses, which means that a number of threads probably compete for the same memory bank. In such cases, memory contention arises because the interconnection network becomes overloaded, reducing the overall performance. On the Intel UMA, the poor scalability is mainly related to its hierarchical cache memory subsystem, in which some data accesses can generate expensive inter-processor communications, such as the long-distance communication of FFT and CG and the non-contiguous communication of SP and LU. We can see that the speedups of most applications are significantly distant from the ideal ones. We have found that the main scalability issues were due to the way threads access data, cache sharing issues and the way data are placed over the NUMA nodes. In order to gain a better understanding of the achieved performance, we have made a more detailed analysis of two benchmarks with very distinct characteristics, EP and MG. The former is a CPU-bound application whereas the latter is a memory-bound application.
4. IMPACT OF AFFINITY: MG AND EP CASE STUDIES
In this section, we evaluate the impact of affinity strategies on the performance of NPB
applications on the selected machines with Linux 2.6.32. Specifically, we have used MG and
EP as our case studies to perform such analysis. We have considered the following metrics:
execution time, speedup and hardware counters.
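One common way to collect such hardware counters is the PAPI library; the short C sketch below counts total cycles and last-level cache misses around a region of interest. It is only an illustrative sketch assuming PAPI preset events are available on the machine, and not necessarily the counter interface used in these experiments.

#include <papi.h>
#include <stdio.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long values[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total CPU cycles        */
    PAPI_add_event(eventset, PAPI_L3_TCM);    /* last-level cache misses */

    PAPI_start(eventset);
    /* ... region of interest: one benchmark iteration, for instance ... */
    PAPI_stop(eventset, values);

    printf("cycles = %lld, L3 misses = %lld\n", values[0], values[1]);
    return 0;
}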
4.1 CPU Affinity
There are two ways of dealing with CPU affinity in most operating systems. The first one,
named soft affinity, relies on the traditional OS scheduler in which processes/threads remain
on the same CPU as long as possible. However, in some situations the OS scheduler may
migrate processes/threads to another processor, even when it is not necessary, impacting the
overall performance of the system. The second way of dealing with CPU affinity is called hard
affinity, which delegates process/thread placement to the user. In this case, the user may
define on which processor/core each process/thread must run. In this paper, we mean by CPU
affinity the latter definition (hard affinity).
In order to observe how the Linux scheduler behaves, we have performed some
experiments with the two selected benchmarks (EP and MG) on both Intel machines. In those
experiments, we traced the locality of all threads at the beginning of every iteration. After
analyzing the traces obtained from EP, we concluded that the Linux scheduler assigned each
thread to a core and did not migrate them. This occurs because EP is CPU-bound, so all cores
are always performing computations and little time is spent accessing the memory.
We have conducted the same experiments with MG but we noticed a difference in the
behavior: the locality of threads constantly changed during the execution on both machines. To
demonstrate such behavior, we have picked the trace information obtained from a single
thread during the execution with 24 threads on the Intel UMA (Figure 2a) and Intel NUMA
(Figure 2b) platforms. As can be seen, the Linux scheduler has considerably varied the
locality of the thread. On the Intel NUMA machine, we observe thread migrations between
different NUMA nodes whereas on the Intel UMA, there were many migrations between
processors and sockets. All other threads have also presented analogous behaviors.
We now analyze the impact of applying different CPU affinity strategies in comparison to
the OS scheduler (soft affinity). To study the influence of cache sharing on the Intel UMA, we
have compared two CPU affinity strategies: intra-socket (threads are bound to sibling cores,
sharing cache) and inter-socket (threads are bound to cores on different sockets and do not
share cache). We have used four threads in these experiments, since with more than 4 threads
there would be no interesting non-sharing case to compare. In order to implement the CPU
affinity strategies, we have used the numactl tool, considering the machine topology, with the following parameters: (i) --physcpubind=0,12,4,16 and --physcpubind=0,4,8,12 for the intra-socket/intra-node strategy on the Intel UMA and Intel NUMA, respectively, and (ii) --physcpubind=0,1,2,3 for the inter-socket/inter-node strategy on both machines.
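For reference, the same kind of binding can also be obtained from inside an OpenMP program by having each thread pin itself to one entry of the chosen core list. The sketch below is a hypothetical equivalent of the --physcpubind invocations above (shown here with the inter-socket list 0,1,2,3), not the exact setup used in the experiments:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <omp.h>

/* core list taken from the inter-socket strategy above */
static const int cores[] = { 0, 1, 2, 3 };

static void pin_to(int core)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
}

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        pin_to(cores[omp_get_thread_num()]);  /* each thread pins itself */
        /* ... per-thread benchmark work would follow here ...           */
    }
    return 0;
}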
Figure 2. Thread scheduling of Linux 2.6.32 on MG.
Figure 3 presents some event counters obtained during the execution of both benchmarks
on the Intel UMA. The value shown by a bar corresponds to the normalization of an event counter obtained with a CPU affinity strategy (hard affinity) by the corresponding event counter obtained with the OS scheduler (soft affinity). Thus, bars higher than 1 mean that the CPU affinity has increased
the number of events whereas bars lower than 1 indicate the contrary.
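In other words, for a given event counter E, the plotted value is

\[ \text{normalized value} = \frac{E_{\text{hard affinity}}}{E_{\text{soft affinity (Linux)}}} \]

and the same normalization with respect to the Linux execution is used for the memory affinity results later in the paper.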
Figure 3. Intra-socket vs. Inter-socket thread placements on Intel UMA (Linux 2.6.32).
Figure 3a shows the results obtained from MG. We can notice that both strategies have considerably reduced the number of CPU cycles, since, unlike the OS scheduler, they do not spend cycles dealing with thread migrations. However, by applying the intra-socket strategy,
we can observe a considerable growth in the number of cache misses. This is due to the fact
that the total memory allocated by MG does not fit on a single L3 cache, which is shared by
the four threads. Thus, the amount of data moving between the main memory and the L3 cache
is increased. A different behavior is observed with EP in Figure 3b. Since it is a CPU-bound application, there is no important impact on the cache hierarchy for the analyzed event counters. Thus, the EP performance obtained with the intra- and inter-socket strategies is very similar to that obtained with Linux.
An analogous analysis of some performance event counters on the Intel NUMA is
presented in Figure 4. Since it is a NUMA machine, we are interested in the following event
counters: last level cache misses (L3_Miss), number of accesses on local and remote DRAM
(L_DRAM and R_DRAM), number of accesses on local cache and remote cache (L_cache
and R_cache). The intra-node strategy assures that all four threads are placed on the same
node whereas the inter-node strategy places one thread per node.
Figure 4. Intra-node vs. Inter-node thread placements on Intel NUMA (Linux 2.6.32).
Figure 4a presents the event counters obtained from MG. The intra-node strategy does not generate any access to remote DRAM: all memory accesses are done on the local DRAM. Consequently, memory contention is increased due to the fact that all data allocated by MG is placed on the same node. On the contrary, the inter-node strategy has presented results very similar to those of the Linux scheduler, reducing the memory contention and improving L3 cache usage by distributing the data among the nodes.
Event counters obtained from EP are shown in Figure 4b. Both strategies have reported a similar number of CPU cycles when compared to the Linux scheduler. Since EP is not memory-bound, placing threads on different nodes does not impact the memory subsystem performance. However, we can observe that L3 cache misses and remote cache responses have been smaller with the intra-node strategy. By placing threads on a single node, we avoid distributing data over several nodes. Additionally, placing all threads on the same node allows data to be prefetched into the cache memories.
4.2 Memory Affinity
Operating systems must ensure memory affinity for applications in order to optimize memory allocation and placement on NUMA multi-core machines. Memory affinity is guaranteed when a compromise between threads and data is achieved, reducing latency costs or increasing bandwidth for memory accesses [Ribeiro, 2009].
In the case of Linux, first-touch is the strategy used to guarantee memory affinity. This policy places a memory page on the node of the thread that first accesses it [Joseph, 2006]. Therefore, applications
with a regular memory access pattern can benefit from this strategy. In the case of irregular applications, however, first-touch will probably result in a high number of remote accesses and memory contention.
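A common way to make first-touch work for regular OpenMP codes is to initialize the data with the same parallel loop schedule later used for computation, so that each page is first touched, and therefore placed, on the node of the thread that will use it. The sketch below only illustrates this idea; it is not code from the benchmarks themselves, although the NPB-OMP LU kernel applies the same principle, as noted in Section 3:

#include <stdlib.h>

#define N (1 << 24)

int main(void)
{
    double *a = malloc(N * sizeof(double));

    /* parallel, first-touch-aware initialization: with the first-touch
       policy, each page ends up on the NUMA node of the thread that
       touches it first                                                 */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* computation phase: the same static schedule reuses the same
       thread-to-data mapping, so most accesses stay local              */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}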
On Linux, it is possible to change this default memory affinity strategy by using the
NUMA API [Kleen, 2005]. In this work, we used the numactl tool to modify data placement for both the EP and MG benchmarks on the Intel NUMA machine. We have studied two strategies that we named bind and interleave. The first strategy places the memory of an application on a restricted set of memory banks. The interleave strategy spreads data over the memory banks of a NUMA machine. We used the following parameters on numactl: (i) --membind=0,1,2,3 for the bind strategy and (ii) --interleave=all for the interleave strategy. The difference between these strategies is that the former optimizes latency while the latter optimizes bandwidth. With both strategies, we have used all cores of the machine and we have pinned each thread to a core (inter-node CPU affinity) to avoid any impact of thread scheduling.
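Comparable placements can also be requested programmatically through the libnuma interface that backs numactl; the following sketch shows the two allocation variants, with node numbers chosen only for illustration:

#include <numa.h>

int main(void)
{
    if (numa_available() < 0)       /* not a NUMA machine */
        return 1;

    size_t size = 1 << 20;

    /* bind-like placement: all pages on one memory bank (node 0)          */
    double *bound  = numa_alloc_onnode(size, 0);

    /* interleave-like placement: pages spread round-robin over all nodes  */
    double *spread = numa_alloc_interleaved(size);

    /* ... use the arrays ... */

    numa_free(bound, size);
    numa_free(spread, size);
    return 0;
}

The numactl flags quoted above apply the same policies to a whole process without changing its code.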
We measured some hardware counters on the Intel NUMA machine to study the impact of the different memory affinity strategies on the EP and MG benchmarks. The selected counters are: total CPU cycles, L3 cache misses (L3_misses), number of accesses to local and remote caches (L_cache and R_cache) and number of accesses to local and remote DRAM (L_DRAM and R_DRAM). The value shown by a bar corresponds to the normalization of an event counter obtained with a memory affinity strategy by the corresponding event counter obtained with the OS strategy. Thus, bars higher than 1 mean that the strategy has increased the number of events whereas bars lower than 1 indicate the contrary.
Figure 5 shows the performance event counters for the MG and EP benchmarks on the Intel NUMA, compared to the Linux execution. We can observe that bind and interleave have presented a similar number of L3 cache misses and accesses to DRAM. However, it can be noticed that the interleave strategy has significantly reduced the total number of CPU cycles on MG. On the other hand, it has increased the number of L3 cache misses on EP.
Figure 5. Bind vs. Interleave data placements on Intel NUMA (Linux 2.6.32).
In the MG benchmark, computation is performed over memory zones with irregular accesses. Therefore, spreading data over all available DRAMs allows many more memory pages of a zone to be accessed by threads in the same time interval. Due to this, the QPI interconnection links are used more intensively to access data over the different steps of the MG benchmark. In the bind strategy, the application data is split into contiguous memory blocks, forcing threads to access the same memory bank (DRAM) during the MG steps. In the case of EP, we can notice that the bind and interleave strategies do not impact its overall performance. Considering the L3 cache, the
interleave strategy has generated more misses because the memory pages of the application are spread over all memory banks. Consequently, the first accesses generate some cache misses to bring data to the cores.
5. AFFINITY STRATEGIES ON NAS PARALLEL
BENCHMARKS
In this section, we present the impact of the CPU and memory affinity strategies on NPB.
Figure 6 reports the speedup comparison between the best affinity strategy and the default
Linux affinity management. The best affinity was chosen considering the application and
platform characteristics.
Figure 6. NAS Parallel Benchmarks affinity results (Linux 2.6.32).
We can observe that controlling the affinity has generally yielded better results than Linux. One can also notice that the benefits of managing affinity are much lower on the Intel UMA than on the Intel NUMA. Indeed, on the Intel NUMA, the affinity strategy has an impact on both the cache and the main memory performance. Therefore, greater performance gains are expected on the NUMA machine.
On the Intel UMA, the best strategy was the inter-socket CPU affinity for all selected
benchmarks. However, such a strategy has not presented any improvement on FFT, CG, SP and BT. For these benchmarks, Linux has obtained better speedups since it migrates threads to reduce long-distance and non-contiguous memory accesses. Considering the MG and LU benchmarks, the inter-socket strategy has resulted in better cache sharing, improving their performance. On EP, we have observed that different CPU affinity strategies generate similar
performances. Since this benchmark performs independent work, neither CPU nor memory affinity has an impact on its performance.
On the contrary, we have found that there was a relation between the number of threads
and the best affinity strategy on the Intel NUMA. For a small number of threads (up to 8), the
best strategy was the intra-node (CPU affinity) combined with the bind memory policy (memory affinity). With more than 8 threads, the best strategy was the inter-node (CPU affinity) combined with the interleave memory policy (memory affinity). All affinity strategies have decreased the performance of LU on the Intel NUMA. Since this benchmark has a heterogeneous communication pattern, threads perform many distant accesses during the whole execution. Consequently, when a CPU affinity strategy is applied, threads do not migrate and the number of distant accesses is increased. The best performance gains were obtained with FFT, CG, SP and MG due to their characteristics: indirect accesses, sparse matrices and centralized data initialization. On those benchmarks, affinity allowed a better data distribution among the machine nodes and better cache sharing. Considering BT, we have observed that some threads do not share data because BT does not scale with a high number of threads. In that case, only small performance gains are expected for this benchmark when affinity is applied.
6. OS EVOLUTION AND VALIDITY OF THE AFFINITY
APPROACH
In the previous sections, all experiments were performed with Linux kernel version 2.6.32.
This section presents results with Linux kernel version 3.2.0. In this version, better support for multi-core machines with UMA and NUMA designs has been included in the kernel, for instance, better scheduling and memory allocation for drivers [Linux kernel, 2012]. In order to
validate our approach, we re-ran all the previously discussed experiments on the same
machines using the kernel version 3.2.0.
6.1 CPU Affinity
Concerning CPU affinity, the first observation that we can make is that there are fewer or no thread migrations on Linux 3.2.0 for both the EP and MG benchmarks when executed on the selected machines. This is due to changes in the scheduler of the operating system, which now keeps threads on the same cores as long as possible.
Figure 7 presents the hardware counters for the UMA platform with Linux 3.2.0 when CPU affinity is applied to both the EP and MG benchmarks. In general, the intra-socket strategy has reduced performance compared to the Linux default behavior. This had already been observed with version 2.6.32 of the operating system. However, in this new version the impact of this strategy is much more pronounced and performance is reduced for both benchmarks. Concerning the inter-socket strategy, it has presented results similar to Linux. We conclude that the Linux behavior for these benchmarks is similar to the inter-socket strategy (CPU affinity).
Figure 7. Intra-socket vs. Inter-socket thread placements on Intel UMA (Linux 3.2.0).
Figure 8 depicts the hardware counters for the NUMA platform with Linux 3.2.0 when CPU affinity is applied. Although the cache misses are reduced for the intra-node strategy on the newer version of the kernel, the CPU cycles are similar to the ones obtained with version 2.6.32. Therefore, similar performances are achieved.
Figure 8. Intra-node vs. Inter-node thread placements on Intel NUMA (Linux 3.2.0).
6.2 Memory Affinity
In Figure 9, we can observe that the impact of using memory affinity on the benchmarks is still important for performance. Concerning the MG benchmark, the performance is improved when the interleave memory affinity is applied. However, this improvement is smaller than the one observed with the previous version of the kernel. In the case of EP, the results are similar to the ones obtained with version 2.6.32 of the operating system, since this benchmark is not memory-bound.
Figure 9. Bind vs. Interleave data placements on Intel NUMA (Linux 3.2.0).
6.3 Overall Performance with Thread and Memory Affinities
We present in this section the impact of the CPU and memory affinity strategies on NPB on the newer version of the Linux operating system. Figure 10 reports the speedup comparison between the best affinity strategy and the default Linux affinity management. As for version 2.6.32, the best affinity was chosen considering the application and platform characteristics.
Figure 10. NAS Parallel Benchmarks affinity results (Linux 3.2.0).
Overall, we can observe that applying CPU and memory affinity has improved the
performance of the selected benchmarks. However, the benefits of managing affinity are much
lower on the Intel UMA than on the Intel NUMA. This is due to the asymmetric latencies of NUMA machines. As presented in Table 1, such asymmetry can generate latencies up to 3.6 times higher for some memory accesses. Therefore, managing affinities on NUMA machines can improve performance by up to 70% for the CG and MG benchmarks.
Similarly to the results presented with Linux 2.6.32, on the Intel UMA the best strategy on average was the inter-socket CPU affinity for all selected benchmarks. On the Intel NUMA machine, the best strategy depends on the number of threads used to run the benchmark, as in the results presented for Linux 2.6.32.
Another interesting result is the difference in performance improvements observed with Linux 3.2.0. We can observe that, on both machines, the speedups of the benchmarks are now better. Considering the LU benchmark, applying affinity on this operating system version does not slow down the performance, but keeps it close to that obtained with Linux. Furthermore, for the BT and SP benchmarks, speedups are slightly better with both Linux and the affinity strategies.
These results allow us to conclude that, even though the Linux operating system has been changed to better adapt to current multi-core machine architectures, there is still room for performance improvements related to CPU and memory affinity. Therefore, specialized strategies that reflect the communication pattern of parallel applications can provide better performance, exploiting the hardware power of multi-core machines.
7. RELATED WORK
The complexity of multi-core multiprocessor systems has demanded a better understanding of memory accesses, of the influence of data and thread placement over the machine and of the impact of hierarchical topologies. Due to this, research groups have studied the performance and the
behavior of several workloads over multi-core machines. In [DeFlumere, 2009], the authors
have evaluated the performance of two multi-core platforms with scientific applications,
focusing their analysis on the performance of memory and communication sub-systems. In
[Zhang, 2010], the authors have studied the influence of cache sharing on different
applications of the PARSEC benchmark suite. However, these works have only considered
UMA multi-core platforms and they have focused only on CPU affinity. In [Alam, 2006], the
authors have investigated the impact of multi-cores and processor affinity on hybrid scientific
workloads based on the Message Passing Interface (MPI). However, we use OpenMP, focusing on the cache-sharing problem, while they focus on the interconnection network issues present in MPI systems. Secondly, we study the impact on different multi-core platforms (UMA and NUMA) with a significant number of cores per socket, while they have used three platforms based on the same AMD Opteron processor with two cores per socket. Finally, we do a deeper study of
the selected benchmarks through the analysis of performance event counters.
The work of [Tam, 2007] uses hardware counters in order to dynamically map the threads
of parallel applications. The scheduling is done by estimating the hardware counters of the
Power5 processor (stall cycles, cache misses, etc.) in order to deduce the sharing pattern. Even though performance has been increased by up to 7%, this approach provides less reliable information for mapping the threads and data.
A work close to ours is the one of [Broquedis, 2009] in which the authors consider
dynamic task and data placement for OpenMP applications. A NUMA-aware runtime for
OpenMP, named ForestGOMP, is developed as an extension of the GNU OpenMP library. It
relies on the hwloc framework [Broquedis, 2010] and on the Marcel threading library
[Danjean, 2003]. This runtime uses hwloc to extract the target machine topology and then pin
kernel threads on the machine cores. In order to provide more performance for OpenMP applications, the Marcel library is used to create user-level threads within parallel sections and associate them with the kernel threads. The proposal does not require profiling steps. However,
contrary to our mechanism, ForestGOMP requires some modifications to the source code to
provide information about the program behavior, such as how the data is distributed, which
variables to consider, among others.
8. CONCLUSION
Throughout this paper, we showed that multi-core machines with UMA and NUMA
characteristics can present problems related to bandwidth, latency and performance scalability.
In order to comprehend the impact of these problems on multi-core machines, we have
selected Numerical Scientific multithreaded workloads as our case study. These workloads
exhibit significant memory and processing power usage, data-sharing between threads and
different memory access patterns.
We have conducted a series of experiments with different CPU and memory affinity strategies in order to observe how the performance behaves according to the strategy used on such machines. The experimental results of this paper highlight the importance of the memory subsystem on the performance of such workloads on machines with a large number of cores. On different multi-core architectures (UMA vs. NUMA), affinity has presented a significant influence on the performance of benchmarks that are memory-bound. The performance improvements are also present on the newer version of the Linux operating system, which has some modifications to support multi-core machines. Our future work will investigate the same affinity strategies on workloads developed with hybrid programming models (OpenMP/MPI), while considering process mapping and the influence of the memory and the interconnection.
ACKNOWLEDGEMENT
This paper was supported by FAPEMIG, INRIA and CAPES (grant number 4874-06-4).
REFERENCES
Alam, S. R. et al, 2006. Characterization of Scientific Workloads on Systems with Multi-core Processors.
Proceedings of the IEEE International Symposium on Workload Characterization. San Jose, USA,
pp. 225-236.
Asanovic, K. et al, 2006. The Landscape of Parallel Computing Research: A view from Berkeley.
Technical report. University of California, Berkeley.
Broquedis F. et al, 2009. Dynamic Task and Data Placement over NUMA Architectures: an OpenMP
Runtime Perspective. Proceeding of the 5th International Workshop on OpenMP. Dresden, Germany,
pp. 79–92.
Broquedis F. et al, 2010. libtopology: A Generic Framework for Managing Hardware Affinities in HPC
Applications. Proceedings of the Euromicro Conference on Parallel, Distributed, and Network-
Based Processing. Pisa, Italy, pp. 180–186.
Castro, M. et al 2012. Dynamic Thread Mapping Based on Machine Learning for Transactional Memory
Applications. Proceedings of the International European Conference on Parallel and Distributed
Computing. Rhodes Island, Greece, pp. 465-476.
Danjean V. and Namyst R. 2003. Controlling Kernel Scheduling from User Space: An Approach to
Enhancing Applications Reactivity to I/O Events. Proceedings of the High Performance Computing.
Hyderabad, India, pp. 490–499.
DeFlumere, A. M. and Alam, S. R., 2009. Exploring Multi-core Limitations Through Comparison of
Contemporary Systems. Proceedings of the Richard Tapia Celebration of Diversity in Computing
Conference. Portland, Oregon, pp. 75-80.
Foster I., 1995. Designing and Building Parallel Programs. Addison Wesley, United Kingdom.
Haoqiang, J., Frumkin, M. and Yan, J., 1999. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical Report 99-011. NASA Ames Research Center.
Joseph, A. et al, 2006. Exploring Thread and Memory Placement on NUMA Architectures: Solaris and
Linux, UltraSPARC/FirePlane and Opteron/HyperTransport. Proceedings of the High Performance
Computing. Bangalore, India, pp. 338-352.
Kleen, A, 2005. A NUMA API for Linux. Technical Report Novell-4621437.
Linux kernel, 2012. Kernel Coverage. URL: http://lwn.net/Kernel/
McCalpin, John D., 1995. Memory Bandwidth and Machine Balance in Current High Performance
Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA), No.
12, pp. 19-25.
Mei, C. et al, 2010. Optimizing a Parallel Runtime System for Multicore Clusters: A Case Study.
Proceedings of the TeraGrid Conference. New York, USA, pp. 1-8.
Molka, D. et al, 2009. Memory Performance and Cache Coherency Effects on an Intel Nehalem
Multiprocessor System. Proceedings of the International Conference on Parallel Architectures and
Compilation Techniques. Washington, USA, pp. 261-270.
Ribeiro, C. P. et al, 2011. Investigating the Impact of CPU and Memory Affinity on Multi-core
Platforms: A Case Study of Numerical Scientific Multithreaded Applications. Proceedings of the
IADIS International Conference on Applied Computing. Rio de Janeiro, Brazil, pp. 299-306.
Ribeiro, C. P. et al, 2009. Memory Affinity for Hierarchical Shared Memory Multiprocessors.
Proceedings of the International Symposium on Computer Architecture and High Performance
Computing. São Paulo, Brazil, pp.59-66.
Tam D. et al, 2007. Thread Clustering: Sharing-Aware Scheduling on SMP-CMP-SMT Multiprocessors.
Proceedings of the European Conference on Computer Systems. Lisbon, Portugal, pp. 47-58.
Terboven, C. et al. 2008, Data and Thread Affinity in OpenMP Programs. Proceedings of the Workshop
on Memory Access on Future Processors. New York, USA, pp. 377–384.
Zhang, E. Z. et al, 2010. Does Cache Sharing on Modern CMP Matter to the Performance of
Contemporary Multithreaded Programs? Proceedings of the ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming. New York, USA, pp. 203-212.
Zheng W. et al, 2009. Mapping Parallelism to Multi-cores: A Machine Learning Based Approach.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
New York, USA, pp. 75–84.
... There are two means of tackling thread affinity [19] in most of operating systems. The first one, which is called soft affinity, relies on the traditional OS scheduler where [5]. ...
... 19: Runtime parallelism variation from the simple and probabilistic models for genome on NUMA. ...
... throughputThe number of commits in one unit of time.19 transaction Is a finite sequence of machine instructions, executed by a single process. ...
Thesis
Parallel programs need to manage the trade-off between the time spent in synchronisation and computation. The trade-off is significantly affected by the number of active threads. High parallelism may decrease computing time while increase synchronisation cost. Furthermore, thread placement on different cores may impact on program performance, as the data access time can vary from one core to another due to intricacies of its underlying memory architecture. Therefore, the performance of a program can be improved by adjusting its parallelism degree and the mapping of its threads to physical cores. Alas, there is no universal rule to decide them for a program from an offline view, especially for a program with online behaviour variation. Moreover, offline tuning is less precise. This thesis presents work on dynamical management of parallelism and thread placement. It addresses multithread issues via Software Transactional Memory (STM). STM has emerged as a promising technique, which bypasses locks, to tackle synchronisation through transactions. Autonomic computing offers designers a framework of methods and techniques to build autonomic systems with well-mastered behaviours. Its key idea is to implement feedback control loops to design safe, efficient and predictable controllers, which enable monitoring and adjusting controlled systems dynamically while keeping overhead low. This dissertation proposes feedback control loops to automate management of threads at runtime and diminish program execution time.
... CPU affinity provides a mechanism to designate which CPUs a particular process can utilize, providing a potential solution to minimize processing jitter [58,59]. By binding RT tasks-such as those within vBBU-to a specific group of CPUs, CPU affinity ensures that the Kernel Scheduler restricts CPU-time allocation to these assigned CPUs [60]. ...
Article
Full-text available
Cloud-based Radio Access Network (Cloud-RAN) leverages virtualization to enable the coexistence of multiple virtual Base Band Units (vBBUs) with collocated workloads on a single edge computer, aiming for economic and operational efficiency. However, this coexistence can cause performance degradation in vBBUs due to resource contention. In this paper, we conduct an empirical analysis of vBBU performance on a Linux RT-Kernel, highlighting the impact of resource sharing with user-space tasks and Kernel threads. Furthermore, we evaluate CPU management strategies such as CPU affinity and CPU isolation as potential solutions to these performance challenges. Our results highlight that the implementation of CPU affinity can significantly reduce throughput variability by up to 40%, decrease vBBU’s NACK ratios, and reduce vBBU scheduling latency within the Linux RT-Kernel. Collectively, these findings underscore the potential of CPU management strategies to enhance vBBU performance in Cloud-RAN environments, enabling more efficient and stable network operations. The paper concludes with a discussion on the efficient realization of Cloud-RAN, elucidating the benefits of implementing proposed CPU affinity allocations. The demonstrated enhancements, including reduced scheduling latency and improved end-to-end throughput, affirm the practicality and efficacy of the proposed strategies for optimizing Cloud-RAN deployments.
... This is feasible because a customer specifies the number of CPU cores before a VM is created. This idea is different from CPU affinity [43] but it is close to the technique proposed in [44]. The reason for having a fixed core for each VM is to achieve per-core DVFS [45] for each VM. ...
... They identify that task mapping is an important factor on performance degradation, being the memory bandwidth per core the primary source of performance drop when increasing the number of cores per node that participate in the computation. Something similar concludes [16], showing a high sensitivity of the attained scientific kernels performance to the multi-core machines. In [17] it is detected that the tested minikernels exhibit high sensitivity to the cluster architecture. ...
Chapter
Full-text available
This investigation summarizes a set of executions completed on the supercomputers Stampede at TACC (USA), Helios at IFERC (Japan), and Eagle at PSNC (Poland), with the molecular dynamics solver LAMMPS, compiled for CPUs. A communication-intensive benchmark based on long-distance interactions tackled by the Fast Fourier Transform operator has been selected to test its sensitivity to rather different patterns of tasks location, hence to identify the best way to accomplish further simulations for this family of problems. Weak-scaling tests show that the attained execution time of LAMMPS is closely linked to the cluster topology and this is revealed by the varying time-execution observed in scale up to thousands of MPI tasks involved in the tests. It is noticeable that two clusters exhibit time saving (up to 61% within the parallelization range) when the MPI-task mapping follows a concentration pattern over as few nodes as possible. Besides this result is useful from the user’s standpoint, it may also help to improve the clusters throughput by, for instance, adding live-migration decisions in the scheduling policies in those cases of communication-intensive behaviour detected in characterization tests. Also, it opens a similar output for a more efficient usage of the cluster from the energy consumption point of view.
... They identify that task mapping is an important factor on performance degradation, being the memory bandwidth per core the primary source of performance drop when increasing the number of cores per node that participate in the computation. Something similar concludes [9], showing a high sensitivity of the attained NAS kernels performance to the multi-core machines. In [15] it is detected that NPB exhibit high sensitivity to the cluster architecture. ...
Article
Full-text available
In this work the Numerical Aerodynamic Simulation (NAS) benchmarks have been executed in a systematic way on two clusters of rather different architectures and CPUs, to identify dependencies between MPI tasks mapping and the speedup or resource occupation. To this respect, series of experiments with the NAS kernels have been designed to take into account the context complexity when running scientific applications on HPC environments (CPU, I/O or memory-bound, execution time, degree of parallelism, dedicated computational resources, strong- and weak-scaling behaviour, to cite some). This context includes scheduling decisions, which have a great influence on the performance of the applications, making difficult to achieve an optimal exploitation with cost-effective strategies of the HPC resources. An analysis on how task grouping strategies under various cluster setups drive the execution time of jobs and the infrastructure throughput is provided. As a result, criteria for cluster setup arise linked to maximize performance of individual jobs, total cluster throughput or achieving better scheduling. To this respect, a criterion for execution decisions is suggested. This work is expected to be of interest on the design of scheduling policies and useful to HPC administrators.
... They identify that task mapping is an important factor on performance degradation, being the memory bandwidth per core the primary source of performance drop when increasing the number of cores per node that participate in the computation. Something similar concludes [9], showing a high sensitivity of the attained NAS kernels performance to the multi-core machines. In [10] it is detected that NPB exhibit high sensitivity to the cluster architecture. ...
Article
A scalable system has increasing performance with increasing system size. Coordination among units can introduce overheads with an impact on system performance. The coordination costs can lead to sublinear improvement or even diminishing performance with increasing system size. However, there are also systems that implement efficient coordination and exploit collaboration of units to attain superlinear improvement. Modeling the scalability dynamics is key to understanding and engineering efficient systems. Known laws of scalability are minimalistic phenomenological models that explain a rich variety of system behaviors through concise equations. While useful to gain general insights, the phenomenological nature of these models may limit the understanding of the underlying dynamics, as they are detached from first principles that could explain coordination overheads or synergies among units. Through a decentralized system approach, we propose a general model based on generic interactions between units that is able to describe, as specific cases, any general pattern of scalability included by previously reported laws. The proposed general model of scalability has the advantage of being built on first principles, or at least on a microscopic description of interaction between units, and therefore has the potential to contribute to a better understanding of system behavior and scalability.
Article
Едно от най-големите предизвикателства на информатиката е да създава правилно работещи компютърни системи. За да се гарантира коректността на една система, по време на дизайн могат де се прилагат формални методи за моделиране и валидация. Този подход е за съжаление труден и скъп за приложение при мнозинството компютърни системи. Алтернативният подход е да се наблюдава и анализира поведението на системата по време на изпълнение след нейното създаване. В този доклад представям научната си работа по въпроса за наблюдение на копютърните системи. Предлагам един общ поглед на три основни страни на проблема: как трябва да се наблюдават компютърните системи, как се използват наблюденията при недетерминистични системи и как се работи по отворен, гъвкав и възпроизводим начин с наблюдения.
Article
Multicore processors are now a mainstream approach to deliver higher performance to parallel applications. In order to develop efficient parallel applications for those platforms, developers must take care of several aspects, ranging from the architectural to the application level. In this context, Transactional Memory (TM) appears as a programmer friendly alternative to traditional lock-based concurrency for those platforms. It allows programmers to write parallel code as transactions, which are guaranteed to execute atomically and in isolation regardless of eventual data races. At runtime, transactions are executed speculatively and conflicts are solved by re-executing conflicting transactions. Although TM intends to simplify concurrent programming, the best performance can only be obtained if the underlying runtime system matches the application and platform characteristics. The contributions of this thesis concern the analysis and improvement of the performance of TM applications based on Software Transactional Memory (STM) on multicore platforms. Firstly, we show that the TM model makes the performance analysis of TM applications a daunting task. To tackle this problem, we propose a generic and portable tracing mechanism that gathers specific TM events, allowing us to better understand the performances obtained. The traced data can be used, for instance, to discover if the TM application presents points of contention or if the contention is spread out over the whole execution. Our tracing mechanism can be used with different TM applications and STM systems without any changes in their original source codes. Secondly, we address the performance improvement of TM applications on multicores. We point out that thread mapping is very important for TM applications and it can considerably improve the global performances achieved. To deal with the large diversity of TM applications, STM systems and multicore platforms, we propose an approach based on Machine Learning to automatically predict suitable thread mapping strategies for TM applications. During a prior learning phase, we profile several TM applications running on different STM systems to construct a predictor. We then use the predictor to perform static or dynamic thread mapping in a state-of-the-art STM system, making it transparent to the users. Finally, we perform an experimental evaluation and we show that the static approach is fairly accurate and can improve the performance of a set of TM applications by up to 18%. Concerning the dynamic approach, we show that it can detect different phase changes during the execution of TM applications composed of diverse workloads, predicting thread mappings adapted for each phase. On those applications, we achieve performance improvements of up to 31% in comparison to the best static strategy.
Conference Paper
Full-text available
Modern multi-core platforms feature complex topologies with different cache levels and hierarchical memory subsystems. Consequently, managing thread and data placement efficiently becomes crucial to improve the performance of applications. In this context, CPU and memory affinity appear as alternatives to match the application characteristics to the underlying architecture. In this paper, we investigate the impact of CPU and memory affinity strategies on multi-core platforms using numerical scientific multithreaded benchmarks. We perform a deeper study through the analysis of performance event counters in order to have a better understanding of such an impact. Indeed, the results show that important performance improvements (up to 70%) can be obtained when applying affinity strategies that fit both application and platform characteristics.
Conference Paper
Full-text available
Thread mapping is an appealing approach to efficiently exploit the potential of modern chip-multiprocessors. However, efficient thread mapping relies upon matching the behavior of an application with system characteristics. In particular, Software Transactional Memory (STM) introduces another dimension due to its runtime system support. In this work, we propose a dynamic thread mapping approach to automatically infer a suitable thread mapping strategy for transactional memory applications composed of multiple execution phases with potentially different transactional behavior in each phase. At runtime, it profiles the application at specific periods and consults a decision tree generated by a Machine Learning algorithm to decide if the current thread mapping strategy should be switched to a more adequate one. We implemented this approach in a state-of-the-art STM system, making it transparent to the user. Our results show that the proposed dynamic approach presents performance improvements up to 31% compared to the best static solution.
Technical Report
Full-text available
The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work from 2 or 8 processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following: The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar. Instead of traditional benchmarks, use 13 "Dwarfs" to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.) "Autotuners" should play a larger role than conventional compilers in translating parallel programs. To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications. To be successful, programming models should be independent of the number of processors. To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism. Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters. Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines. To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost. Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.
Article
The slogan of last year's International Workshop on OpenMP was "A Practical Programming Model for the Multi-Core Era", yet OpenMP is still fully agnostic of the hardware architecture. As a consequence, the programmer is left alone with poor performance if threads and data happen to live apart. In this work we examine the programmer's possibilities to improve data and thread affinity in OpenMP programs for several toy applications and present how to apply the lessons learned to larger application codes. We filled a gap by implementing explicit data migration on Linux, providing a next-touch mechanism.
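The next-touch migration mechanism this entry describes is specific to the authors' Linux implementation. A much simpler pattern that achieves related data/thread affinity on first-touch Linux kernels, shown as a minimal sketch below, is to initialize data in parallel with the same loop schedule as the compute loop, so each page is physically allocated on the NUMA node of the thread that later uses it. The array size and loop body are arbitrary illustrations.

/* First-touch placement in OpenMP: initialize with the same static
 * schedule as the compute loop, so pages land near their users.
 * Compile with: gcc -O2 -fopenmp first_touch.c */
#include <stdlib.h>
#include <omp.h>

#define N (1L << 24)

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);

    /* Parallel first-touch initialization: pages of a[] and b[] are backed
     * on the node of the initializing thread. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = (double)i; }

    /* Compute loop with the same schedule, so threads mostly access
     * locally placed pages. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) a[i] += 2.0 * b[i];

    free(a); free(b);
    return 0;
}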
Conference Paper
Modern shared-memory multiprocessor systems commonly have non-uniform memory access (NUMA) with asymmetric memory bandwidth and latency characteristics. Operating systems now provide application programming interfaces that allow the user to perform specific thread and memory placement. To date, however, there have been relatively few detailed assessments of the importance of memory/thread placement for complex applications. This paper outlines a framework for performing memory and thread placement experiments on Solaris and Linux. Thread binding and location-specific memory allocation, together with its verification, are discussed and contrasted. Using the framework, the performance characteristics of serial versions of lmbench, STREAM and various BLAS libraries (ATLAS, GOTO, ACML on Opteron/Linux and Sunperf on Opteron and UltraSPARC/Solaris) are measured on two different hardware platforms (UltraSPARC/FirePlane and Opteron/HyperTransport). A simple model describing performance as a function of memory distribution is proposed and assessed for both the Opteron and the UltraSPARC.
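On Linux, the kind of thread binding and location-specific allocation such experiments rely on is typically expressed with sched_setaffinity() and libnuma. The sketch below is not the authors' framework, only an illustration of those primitives; the CPU and node ids (0) and the 64 MiB buffer size are arbitrary choices for the example.

/* Bind the calling thread to a core and place a buffer on a chosen
 * NUMA node. Compile with: gcc -O2 numa_place.c -lnuma */
#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    /* Bind the calling thread to logical CPU 0. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);

    /* Allocate 64 MiB on NUMA node 0 and touch one byte per page. */
    size_t size = 64UL << 20;
    char *buf = numa_alloc_onnode(size, 0);
    for (size_t i = 0; i < size; i += 4096)
        buf[i] = 1;

    numa_free(buf, size);
    return 0;
}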
Conference Paper
Most modern chip multiprocessors (CMPs) feature a shared cache on chip. For multithreaded applications, this sharing reduces communication latency among co-running threads, but it also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of the shared cache rather than a systematic measurement of its influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors; the influence of CMP cache sharing on contemporary multithreaded applications remains only preliminarily understood. In this work, we conduct a systematic measurement of this influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, and considering a number of potentially important factors at the program, OS, and architecture levels. The measurement shows some surprising results. Contrary to the commonly perceived importance of cache sharing, neither positive nor negative effects from cache sharing are significant for most of the program executions, regardless of the type of parallelism, input data sets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch between the current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to a 36% performance increase when the threads are placed on cores appropriately.
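One concrete instance of "placing threads on cores appropriately" is pinning two threads that communicate through a shared buffer onto cores that share a last-level cache. The sketch below illustrates only the placement mechanism, not the paper's program transformations; the CPU ids 0 and 1 are an assumption about the machine topology (real sibling cores can be identified with a tool such as hwloc).

/* Pin a producer and a consumer to two cores assumed to share a cache. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define N (1L << 20)
static double shared_data[N];
static double sink;                           /* keeps the consumer's result */

static void pin_self(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *producer(void *arg) {
    (void)arg;
    pin_self(0);                              /* core sharing a cache with... */
    for (long i = 0; i < N; i++) shared_data[i] = (double)i;
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    pin_self(1);                              /* ...this core */
    double sum = 0.0;
    for (long i = 0; i < N; i++) sum += shared_data[i];
    sink = sum;
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);                    /* simple ordering for the sketch */
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(c, NULL);
    return 0;
}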
Conference Paper
Multi-core processors are planned for virtually all next-generation HPC systems. In a preliminary evaluation of AMD Opteron Dual-Core processor systems, we investigated the scaling behavior of a set of micro-benchmarks, kernels, and applications. In addition, we evaluated a number of processor affinity techniques for managing memory placement on these multi-core systems. We discovered that an appropriate selection of MPI task and memory placement schemes can result in over 25% performance improvement for key scientific calculations. We collected detailed performance data for several large-scale scientific applications. Analyses of the application performance results confirmed our micro-benchmark and scaling results.
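One common MPI task placement scheme on multi-core nodes, sketched below, binds each task to the core matching its node-local rank, so that memory allocated afterwards is placed on the local node by the first-touch policy. This round-robin-over-cores policy is an illustrative choice, not necessarily the scheme evaluated in that study.

/* Bind each MPI task to the logical CPU equal to its node-local rank.
 * Compile with an MPI-3 compiler wrapper, e.g. mpicc bind_rank.c */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Tasks on the same node form one shared-memory communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    /* Bind this task to the core whose id equals its local rank. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(local_rank, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);

    /* ... application work: memory allocated after binding is placed on
     * the local node by the first-touch policy ... */

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}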
Article
Clusters of multicore nodes have become the most popular option for new HPC systems due to their scalability and performance/cost ratio. The complexity of programming multicore systems underscores the need for powerful and efficient runtime systems that manage resources such as threads and communication sub-systems on behalf of applications. In this paper, we study several multicore performance issues on clusters with Intel, AMD and IBM processors in the context of the Charm++ runtime system, and we then present optimization techniques that overcome these issues. The techniques presented are general enough to apply to other runtime systems as well. We demonstrate the benefits of these optimizations through both synthetic benchmarks and production-quality applications, including NAMD and ChaNGa, on several popular multicore platforms, improving the performance of NAMD and ChaNGa by about 20% and 10%, respectively.
Conference Paper
The efficient mapping of program parallelism to multi-core processors is highly dependent on the underlying architecture. This paper proposes a portable and automatic compiler-based approach to mapping such parallelism using machine learning. It develops two predictors: a data sensitive and a data insensitive predictor to select the best mapping for parallel programs. They predict the number of threads and the scheduling policy for any given program using a model learnt off-line. By using low-cost profiling runs, they predict the mapping for a new unseen program across multiple input data sets. We evaluate our approach by selecting parallelism mapping configurations for OpenMP programs on two representative but different multi-core platforms (the Intel Xeon and the Cell processors). Performance of our technique is stable across programs and architectures. On average, it delivers above 96% performance of the maximum available on both platforms. It achieve, on average, a 37% (up to 17.5 times) performance improvement over the OpenMP runtime default scheme on the Cell platform. Compared to two recent prediction models, our predictors achieve better performance with a significant lower profiling cost.