Main Memory in HPC: Do We Need More, or Could We Live with Less?
DARKO ZIVANOVIC, Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya
MILAN PAVLOVIC, Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya
MILAN RADULOVIC, Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya
HYUNSUNG SHIN, Samsung Electronics Co., Ltd., Memory Division
JONGPIL SON, Samsung Electronics Co., Ltd., Memory Division
SALLY A. MCKEE, Chalmers University of Technology
PAUL M. CARPENTER, Barcelona Supercomputing Center (BSC)
PETAR RADOJKOVIĆ, Barcelona Supercomputing Center (BSC)
EDUARD AYGUADÉ, Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya
An important aspect of high-performance computing (HPC) system design is the choice of main memory ca-
pacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market.
Compared with conventional DIMMs, 3D memory chiplets provide better performance and energy efficiency
but lower memory capacities. Therefore the adoption of 3D-stacked memories in the HPC domain depends
on whether we can find use cases that require much less memory than is available now.
This study analyzes the memory capacity requirements of important HPC benchmarks and applications.
We find that the High Performance Conjugate Gradients benchmark could be an important success story
for 3D-stacked memories in HPC, but High-performance Linpack is likely to be constrained by 3D memory
capacity. The study also emphasizes that the analysis of memory footprints of production HPC applications
is complex and that it requires an understanding of application scalability and target category, i.e., whether
the users target capability or capacity computing. The results show that most of the HPC applications
under study have per-core memory footprints in the range of hundreds of megabytes, but we also detect
applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications
and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first
step towards adoption of this novel technology in the HPC domain.
CCS Concepts: • Computer systems organization → Distributed architectures; • Hardware → Analysis and design of emerging devices and systems; Memory and dense storage;
Additional Key Words and Phrases: Memory capacity requirements, High-performance computing, Production HPC applications, HPL, HPCG.
ACM Reference Format:
Darko Zivanovic, Milan Pavlovic, Milan Radulovic, Hyunsung Shin, Jongpil Son, Sally A. McKee, Paul M. Carpenter, Petar Radojković, and Eduard Ayguadé. 2016. Main Memory in HPC: Do We Need More, or Could We Live with Less? ACM Trans. Embedd. Comput. Syst. V, N, Article 000 (2016), 25 pages.
DOI: http://dx.doi.org/10.1145/3023362
Authors' addresses: D. Zivanovic, M. Pavlovic, M. Radulovic, P. M. Carpenter, P. Radojković, E. Ayguadé, Barcelona Supercomputing Center, Jordi Girona 1-3, K2M 102, 08034, Barcelona, Spain; H. Shin, J. Son, Samsung Electronics Co., Ltd., Memory Division, 1-1, Samsungjeonja-ro, Hwaseong-si, Gyeonggi-do 445-701 Korea; S. A. McKee, Chalmers University of Technology, Rännvägen 6, 4th floor, E/D/IT (D&IT), 41296 Göteborg, Sweden;
M. Pavlovic is currently at ASML, Eindhoven, Netherlands.
Correspondence email: darko.zivanovic@bsc.es
1. INTRODUCTION
Memory systems are important contributors to the deployment and operational costs
of large-scale HPC clusters [Kogge et al. 2008; Stevens et al. 2010; Sodani 2011], making memory provisioning one of the most important aspects of HPC system design.¹
In spite of this, most available analysis guiding memory provisioning is surprisingly
ad hoc. Usually, large HPC systems follow a rule of thumb that couples 2–3 GB of
main memory per x86 core or 1 GB per Blue Gene PowerPC core (see Figure 1). It
seems that this rule of thumb is based on experience with previous HPC clusters and
on undocumented knowledge of the principal system integrators, and it is uncertain
whether it matches the memory requirements of production HPC applications.
Fig. 1. Per-core memory capacity of HPC systems leading the TOP500 list (June 2015). Systems with exactly 2, 3, and 4 GB of memory per core are included in bars [2,3) GB, [3,4) GB, and [4,5) GB, respectively. Today's HPC systems are dominated by x86 architectures coupled with 2–3 GB of main memory per core. The next most prevalent systems are Blue Gene platforms based on PowerPC cores with 1 GB of memory per core, included in the [1,2) GB bar.
Even though there are various reports and projections that roughly estimate the
memory requirements of existing HPC applications [Atkins et al. 2003; NERSC
2012; 2013; 2014a; 2014b; 2015a; 2015b], there are no or very few studies that
thoroughly analyze and quantify the memory footprints of HPC workloads across
multiple domains. In this paper we try to bridge this gap and to examine whether
current memory design strategies meet the memory requirements of important HPC
benchmarks and applications.
In this study, we theoretically analyze and confirm with experimental measure-
ments the memory capacity requirements of High-Performance Linpack (HPL) and
High Performance Conjugate Gradients (HPCG), the former being the benchmark
used to rank the supercomputers on the TOP500 list. Our measurements show that
in current systems achieving good HPL scores requires at least 2 GB of main memory
per core, which matches the main memory sizing trends of the large HPC clusters
that dominate the TOP500 list. The analysis also shows that, as the total number of
cores is increased, more memory per core will be needed to achieve good performance,
between 7.6 GB and 16.1 GB in a million-core cluster. In contrast, HPCG memory
requirements are fundamentally different. To converge to the optimal performance,
the benchmark requires roughly 0.5 GB of memory per core, and this will not change
as the cluster size increases.
We also study the memory footprints of the Unified European Application Bench-
mark Suite (UEABS), large-scale scientific workloads carefully selected to provide
good coverage of production HPC applications running on Tier-0 and Tier-1 HPC
systems in Europe [PRACE 2013]. We observe a bimodal distribution in memory
¹ In our system, the MareNostrum supercomputer [Barcelona Supercomputing Center 2013], main memory accounts for more than 10% of server cost and 10–15% of server energy consumption.
requirements, finding that memory requirements depend on application scalability
and the targeted HPC category, i.e., whether the workloads represent capability or
capacity computing. In HPC, capability computing refers to using large-scale HPC
installations to solve a single, highly complex problem in the shortest possible time,
while capacity computing refers to optimizing system efficiency to solve as many
mid-size or smaller problems as possible at the same time at the lowest possible cost.
Based on our findings, we recommend guidelines for selecting an appropriate level of
parallelism when designing experiments to quantify memory capacity requirements
of production HPC applications. Most of the UEABS applications have per-core
memory footprints in the range of hundreds of megabytes — an order of magnitude
less than the main memory available in state-of-the-art HPC systems; but we also
detect applications and use cases that still require gigabytes of main memory. We
also demonstrate that even within the same application, different processes can have
memory footprints that vary by an order of magnitude.
To the best of our knowledge, this is the first study to detect and analyze the dependency between available memory capacity and HPL and HPCG performance.
Also, for the first time, we explored the complexity of memory footprint analysis for
production HPC applications, and showed how memory footprint depends on applica-
tion scalability and target HPC category. We hope that this study will motivate the
community to question the current trends for memory system sizing in HPC clusters,
and will lead to further analysis of memory capacity requirements of HPC systems.
This analysis becomes increasingly important as 3D-stacked memories are hitting
the market. Replacing conventional DIMMs with new 3D memory chiplets located on
the silicon interposer could be the next breakthrough in memory system design. It
would provide significantly higher memory bandwidth and lower latency, leading to
higher performance and energy-efficiency. On the down side, however, it is unlikely
that (expensive) 3D memory chiplets alone would provide the same memory capacities
as DIMM-based memory systems [Sodani et al. 2016]. Therefore the adoption of 3D-
stacked memories in the HPC domain depends on whether we can find use cases that
require much less memory than is available now.
Academia and industry are also exploring hybrid memory systems that combine 3D-
stacked DRAM with standard DIMMs [Dong et al. 2010; Chou et al. 2014; Sim et al.
2014; Meswani et al. 2015; Sodani et al. 2016]. The general idea behind these hybrid systems is to bring together the best of both worlds — the bandwidth, latency and energy efficiency of 3D-stacked DRAM and the capacity of DIMMs. In these systems,
however, good performance requires efficient data allocation and migration between
different memory segments. Data management requires profound application profil-
ing, and up to now, no automatic algorithms — whether in the hardware, compiler
or runtime environment — can provide out-of-the-box performance for legacy codes.
Instead, efficient use of the advanced memory organization is still the responsibility
of the programmer, which has significant impact on code development cost [Newburn
2015; Cantalupo et al. 2015; Jeffers et al. 2016].
Therefore, in the context of hybrid memory systems, it is still important to find use
cases with (small) memory footprints that fit into the 3D-stacked memory. With good
out-of-the-box performance, these use cases would be the first success stories for 3D
memory systems. Our study indeed identified the HPC applications and use cases with
memory footprints that could be provided by 3D-stacked memory chiplets, making a
first step towards adoption of this novel technology in the HPC domain.
2. EXPERIMENTAL SETUP
We analyze the memory footprints of HPC applications running on a large-scale clus-
ter. We first describe our hardware platform and applications, and then we explain
how we gather our data.
2.1. Hardware platform
We execute all experiments on the MareNostrum supercomputer [Barcelona Super-
computing Center 2013], one of six Tier-0 HPC systems in the Partnership for Ad-
vanced Computing in Europe (PRACE) [PRACE 2016]. MareNostrum contains 3056
compute nodes connected with InfiniBand. Each node contains two Intel Sandy Bridge-
EP E5-2670 sockets, each comprising eight 2.6 GHz cores. As in most HPC systems,
hyperthreading is disabled. The interconnect is InfiniBand FDR-10 (40 Gb/s), with a
non-blocking two-level fat-tree topology offering full bisection bandwidth, built from
36-port leaf switches and 648-port core switches.
The processors connect to main memory through four channels, each with a single
DDR3-1600 DIMM. Regular MareNostrum compute nodes include 32 GB of DRAM
memory (8 DIMMs ×4 GB), i.e., 2 GB per core. To study application memory footprints
in systems with higher capacity, we execute some experiments on large-memory nodes
containing 128 GB of DRAM (8 DIMMs ×16 GB), i.e., 8 GB per core.
2.2. Applications
We study memory capacity requirements for two widely used HPC benchmarks, High-
Performance Linpack (HPL) and High-Performance Conjugate Gradients (HPCG). We
also study the Unified European Application Benchmark Suite (UEABS) [PRACE
2013], the set of production applications and datasets designed for benchmarking the
European PRACE HPC systems for procurement and comparison purposes [PRACE
2016]. All UEABS applications are parallelized using MPI, and are regularly run on
hundreds or thousands of cores. For all applications, we run MPI-only versions, and
always execute one MPI process per core. So, in the rest of the paper, per-process and per-core memory footprints have equivalent meanings.
2.3. Methodology
In this study, we measure the memory footprints of HPC applications running
with various numbers of processes. We obtain footprint information from the
/proc/[pid]/status system files, which together log the memory usage of all running
processes. The memory footprint of a process corresponds to the amount of physical
memory it uses, i.e., the resident set size, or VmRSS. We use the Extrae and Limpio
instrumentation tools [Barcelona Supercomputing Center 2014; Pavlovic et al. 2015]
to read and log VmRSS values at equidistant time intervals chosen to provide at
least 1000 samples per process. We track, in each experiment, the maximums, means,
and standard deviations of all memory footprint measurements. For production HPC
applications, unless specifically stated that we analyze the master process, we report
footprints of worker processes. For HPL and HPCG, we report memory footprints for
all the processes.
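As an illustration of this methodology, the following sketch shows how a VmRSS sampler can be implemented (a minimal stand-alone example, not the actual Extrae/Limpio instrumentation; the sampling interval and sample count are placeholders):

```python
import re
import time

def read_vmrss_mb(pid):
    """Return the resident set size (VmRSS) of a process in MB,
    as reported by /proc/[pid]/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(re.search(r"\d+", line).group()) / 1024.0  # kB -> MB
    return 0.0

def sample_footprint(pid, interval_s=1.0, num_samples=1000):
    """Periodically sample VmRSS and return the maximum, mean, and
    standard deviation of the measured footprint in MB."""
    samples = []
    for _ in range(num_samples):
        samples.append(read_vmrss_mb(pid))
        time.sleep(interval_s)
    mean = sum(samples) / len(samples)
    std = (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5
    return max(samples), mean, std
```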
3. HIGH-PERFORMANCE LINPACK
For more than 20 years, the High-Performance Linpack (HPL) benchmark has been the most widely recognized and discussed metric for ranking HPC systems [Strohmaier et al.
2015]. HPL measures the sustained floating-point rate (GFLOP/s) for solving a dense
system of linear equations using double-precision floating-point arithmetic. The linear
system is randomly generated, with a user-specified size, so that the user can scale the
problem size to achieve the best performance on a given system. The documentation
Fig. 2. HPL performance also depends on the available memory capacity. When increasing per-core memory, HPL performance first increases and then reaches the saturation point — the sustained floating-point rate (GFLOP/s) of the system. As we increase the number of processes, the saturation point moves towards larger per-core memory capacities.
recommends setting a problem size that uses approximately 80% of the available
memory [Petitet et al. 2012]. We determine how reducing the memory capacity would
affect performance by appropriately decreasing the problem size. Specifically, we set
the problem size in order to use 80% of the target main memory capacity and verify at
runtime that the memory footprint fits inside the target memory capacity.
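Since the HPL matrix is a single N x N array of double-precision values (8N² bytes distributed across all processes), the problem size for a given memory budget can be derived as in the sketch below; the 80% factor follows the HPL documentation, while the block size of 192 and the example configuration are illustrative choices, not values prescribed by the benchmark:

```python
import math

def hpl_problem_size(num_cores, mem_per_core_bytes, fraction=0.8, block_size=192):
    """Choose the HPL matrix dimension N so that the N x N double-precision
    matrix (8 * N^2 bytes) uses roughly `fraction` of the total memory.
    N is rounded down to a multiple of the block size NB."""
    total_bytes = num_cores * mem_per_core_bytes
    n = int(math.sqrt(fraction * total_bytes / 8))
    return (n // block_size) * block_size

# Example: 1024 cores with 512 MB of memory per core
print(hpl_problem_size(1024, 512 * 1024**2))
```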
3.1. Measured memory requirements
Figure 2 shows relative HPL performance in FLOP/s (Y-axis) for various amounts of
available memory per core (X-axis) and different numbers of processes (different lines
on the chart). We plot performance relative to the results with 32 MB of main memory,
the smallest amount of memory used in any of these experiments. The experiments
are performed for a range of 16–16,384 processes, and in each experiment the number
of processes equals the number of cores.
As we increase the available main memory, HPL performance first increases, then
reaches the saturation point and remains approximately constant. We observe this
trend for any number of HPL processes. Also, we detect that stable HPL performance
(after the saturation point) is directly proportional to the number of processes used
in the experiment. For instance, increasing the number of processes from 16 to 64,
256, and 1024 increases the steady-state performance by roughly 4×, 16×, and 64×, respectively.² We also detect that, as we increase the number of processes, the saturation point moves towards larger per-core memory capacities; for instance, 16, 256, and
4096 processes reach their saturation points at 256 MB, 768 MB, and 1280 MB, re-
spectively. This means that for larger HPC clusters, more memory per core is required
to achieve maximum HPL performance. When HPL comprises 16,384 processes (exe-
cuted on one third of the MareNostrum supercomputer), the saturation point reaches
1.5 GB of memory per core. These results indicate that on large HPC systems with
tens or hundreds of thousands of high-end x86 cores, reaching the HPL performance
saturation point, or at least the point of diminishing returns, requires approximately
2 GB of memory per core. This matches the main memory sizing trends of the large
HPC systems that dominate the TOP500 list.
² This trend is not visible in Figure 2 because, for each number of processes, the relative performance is normalized to the 32 MB results.
3.2. Analysis
In order to confirm and better understand our measurements, we analyze how HPL
performance depends on single-core floating-point rate, available memory capacity, in-
terconnect bandwidth and latency, and how this dependency changes as we vary the
number of processes used in the HPL runs. In this section, we summarize the analysis
focusing on the conclusions that have an impact on the memory system sizing; detailed
step-by-step explanation and mathematical formulas are presented in the Appendix.
Our analysis confirms the HPL performance trends presented in Figure 2. For small
input datasets, HPL has relatively poor overall performance because of high interprocess communication overheads. As the input dataset increases, the computation factor becomes dominant, the communication overhead diminishes, and the HPL performance
converges to the saturation point. For large per-process memory capacities (large
input datasets), the HPL execution time is dominated by the dense matrix-matrix multiplication computational routines that require significant processing power. In current HPC systems with gigabytes of main memory, HPL is clearly a CPU-bound
application typically able to achieve close to the theoretical peak floating-point rate.
Our theoretical analysis indeed shows that as the memory capacity is increased, the
HPL performance analytically converges to a steady value proportional to the number
of processes used in the HPL run, but the performance optimum is theoretically
reached for infinite main memory.
We also analytically quantify the HPL performance loss due to a finite main memory
capacity and determine the capacity required to reach close-to-optimal HPL perfor-
mance, e.g., within 10%, 5% and 1% of the optimal. We fit the HPC system parameter
constants in the formulas to our experimental data (see Fig. 2) and estimate main
memory capacity that will lead to good HPL scores in larger systems. Since we fit the
constants to our hardware platform, this analysis shows the trend of the main memory
needed to achieve good HPL performance for different system sizes, rather than firm
values. The outcome of this analysis is presented in Figure 3.
Figure 3 plots the amount of memory per core (Y-axis) needed to achieve 90%, 95%
and 99% of the ideal (infinite memory) HPL performance. We present the results for a
wide range of system sizes, from 16 up to over 1,000,000 cores (X-axis). Both axes plot
the values in a logarithmic scale. In scaling the interconnect, we assume that latency
remains constant and bisection bandwidth scales with the number of cores.³ This is
a conservative assumption because interconnect latency may increase with the num-
ber of cores, which would further increase the per-core memory requirement for large
systems. From Figure 3, we see that increasing the system size causes more memory
per core to be needed to achieve good HPL performance (recall that both axes of Fig-
ure 3 have a logarithmic scale). For systems with 100,000 cores, 2.6 GB and 5.5 GB
of per-core memory would be needed to reach 90% and 95% of the ideal performance,
respectively. This approximately matches the memory sizing of the HPC systems dom-
inating the current TOP500 list. For an HPC system with 1,000,000 cores, however,
7.6 GB and 16.1 GB of per-core memory would be needed to reach 90% and 95% of
ideal performance, respectively. To increase efficiency to 99% of the ideal performance,
a system with 100,000 cores would require 31.4 GB and a system with 1,000,000 cores
would require 88.6 GB per core, a huge amount.
To put our estimation into perspective, in Figure 3 we plot three systems: the 16-core
system used in the first study that analyzed HPL memory capacity requirements [Don-
garra et al. 2003], our largest experiment with 16,384 cores on MareNostrum super-
computer (year 2016), and a future potential 1,000,000-core system. These points
³ In the model developed in the Appendix, α and β are constants independent of the number of cores.
Fig. 3. Per-core memory needed to get 90%, 95% and 99% of the ideal HPL performance, for different sizes of HPC systems. Three systems, with 250 MB/core (Dongarra et al. [2003], 16 cores), 2 GB/core (MareNostrum [2016], 16k cores) and 16.1 GB/core (a future potential million-core system), are compared with our estimation curves. Moving from 95% to 99% of ideal HPL performance requires a huge step in the amount of per-core memory (axes are plotted on logarithmic scales).
validate the model over more than a decade of system scaling and illustrate that larger systems will need more memory per core to achieve good HPL performance.
4. HIGH-PERFORMANCE CONJUGATE GRADIENTS
The High-Performance Conjugate Gradients (HPCG) benchmark [Dongarra et al.
2016] has been introduced as a complement to HPL and the TOP500 rankings, since
the community questions whether HPL is a good proxy for production applications.
HPCG is based on an iterative sparse-matrix conjugate gradient kernel with double-
precision floating-point values, and is representative of HPC applications governed by
differential equations. Such applications tend to have much greater demands on the
memory system, in terms of bandwidth and latency, and they access data using irreg-
ular patterns [Dongarra and Heroux 2013]. Similarly to HPL, the user can scale the
problem size to achieve the best performance on a given system. Therefore, to deter-
mine the capacity requirements, we analyze HPCG performance for a range of 16–8192
processes as a function of the problem size, i.e., main memory used in the experiment.
Then, we analyze whether the trends detected in the real-system measurements match
the expected tendencies based on algorithm complexity and data access patterns.
4.1. Measured memory requirements
In Figure 4, we show the relation between relative HPCG performance (Y-axis) and
memory footprint (X-axis). We executed HPCG for various memory footprints, by
changing the problem sizes from 24-24-24 up to 120-120-120 with an additive step of
8, always keeping the three dimensions identical, and reported average per-process
memory footprints. For each number of processes, the performance was plotted
relative to the results with the smallest problem size used in the experiments. We
analyzed this trend for different numbers of processes that are plotted with different
lines of the chart. Recall that, for each experiment, the number of processes equals
the number of cores.
For a small number of processes and small input datasets, the HPCG workload may (partially) fit into on-CPU caches, which leads to performance that significantly exceeds the stable performance observed with larger input datasets. In our experiments, we detected this trend for 16 and 32 processes. This trend has been detected and analyzed by previous studies [Marjanović et al. 2014], and it is considered a non-representative use of HPCG [Dongarra et al. 2014; 2016]. Therefore, we neither plot nor analyze these results.
Fig. 4. HPCG performance also depends on the available memory capacity. Performance increases until it reaches the saturation point, where it is constrained by the sustained memory bandwidth. The saturation point remains constant across a wide range of HPCG processes, at around 512 MB of main memory per process.
As we increase the HPCG problem size, i.e., per-process memory footprint, HPCG
performance rapidly increases and then reaches a stable value directly proportional
to the number of processes used in the experiment.⁴ Unlike HPL, we detect that the saturation point remains constant, at roughly 512 MB of main memory per process, for
a large range of HPCG processes. In the following section, we analyze this in detail.
Finally, we also detect that for memory footprints of around 1.5 GB, HPCG perfor-
mance decreases. The performance drop is caused by swapping, and we did not detect
it when the experiments were repeated on large-memory nodes comprising 8 GB of
main memory per core. The HPCG performance drop due to memory swapping has not been reported in the past, and we suggest that HPCG developers take it into account when providing guidance on dataset sizing.
4.2. Analysis
Although the HPCG benchmark was released only a couple of years ago, several
studies analyze its behavior and performance bottlenecks, and even estimate its
performance on future exascale HPC systems [Marjanović et al. 2014; Park et al.
2014]. When running HPCG, the user sets the per-process problem size N. For a given
problem size N, the number of floating-point operations and memory accesses are
both proportional to N.
#FLOPs per process ≈ O(N)    (1)

The execution time of HPCG depends on the problem size N and the number of processes n:

T ≈ O(N) + O(N^{2/3}) + O(log n)    (2)

The first factor, O(N), refers to the computational complexity, while the factors O(N^{2/3}) and O(log n) refer to point-to-point and collective communication, respectively.
For small memory capacities, below 256 MB per process in Figure 4, the HPCG performance is affected by the interprocess communication, i.e., the factors O(N^{2/3}) and O(log n) in Equation 2. However, as the input dataset increases, the factor O(N) becomes dominant and the communication overhead diminishes; the HPCG performance rapidly converges to the saturation point determined by the memory bandwidth.
For large per-process memory capacities (large values of N), the HPCG execution
time is dominated by the computational routines, which mainly perform sparse matrix-
vector multiplication [Dongarra and Heroux 2013] and require modest CPU power but
⁴ As for HPL, this trend is not visible in Figure 4 because, for each number of processes, the performance is normalized to the result with the smallest input dataset.
significant memory bandwidth. For each floating-point operation (FLOP), HPCG re-
quires a transfer of at least 4 bytes from main memory, i.e., the HPCG byte-per-FLOP ratio is higher than 4. In state-of-the-art HPC systems, the byte-per-FLOP ratio is below 1,⁵ meaning that in current systems memory bandwidth is the main performance
bottleneck. As we increase the system size, the total available memory bandwidth and
therefore HPCG performance increase proportionally. Because of this, HPCG perfor-
mance is proportional to the number of processes, as detected in Section 4.1.
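To make this machine-balance argument concrete, the sketch below reproduces the calculation from footnote 5; the values are specific to our dual-socket Sandy Bridge-EP nodes and would differ on other platforms:

```python
# Byte-per-FLOP ratio of one MareNostrum node (see footnote 5):
# 8 DDR3-1600 channels x 8 bytes each, versus 16 cores each able to retire
# 8 double-precision FLOPs per cycle at 3 GHz.
mem_bandwidth_gb_s = 8 * 8 * 1.6       # GB/s of peak memory bandwidth
peak_gflop_s = 16 * 8 * 3.0            # GFLOP/s of peak compute
machine_balance = mem_bandwidth_gb_s / peak_gflop_s
print(f"{machine_balance:.2f} bytes/FLOP")  # ~0.27, far below the >4 bytes/FLOP HPCG needs
```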
Finally, we analyze whether the HPCG performance saturation point moves as we
increase the number of processes. As the number of processes increases, the factor
O(log n) might cause the HPCG saturation point to move towards larger per-process memory capacities, i.e., towards the right in Figure 4. Marjanović et al. [2014]
analyze in detail the impact of collective communication on HPCG performance, and
conclude that for any plausible input dataset sizes and numbers of processes this im-
pact is negligible. The authors also analyze the communication overheads in future
systems, and estimate that even in a million-core HPC system, the communication
overhead stays below 1.2%.
5. PRODUCTION HPC APPLICATIONS
In this section, we analyze how the number of application processes affects the memory
footprints of production HPC applications, taking into account application scalability,
the targeted HPC category, and the size of the input dataset.
We study 10 of the 12 applications from the Unified European Application Bench-
mark Suite (UEABS) [PRACE 2013], which has been designed to represent production
applications running on large-scale Tier-1 and Tier-0 HPC systems in Europe.⁶ Parallelized using the Message Passing Interface (MPI), these applications are regularly
executed on hundreds to thousands of cores. Most of the applications come with two
input datasets. Smaller datasets (Test Case A) are deemed suitable for Tier-1 systems
up to about 1000 cores, and larger datasets (Test Case B) target Tier-0 systems up to
about 10,000 cores. For BQCD, GADGET and NEMO, a single dataset (Test Case A)
is provided that is suitable for both system sizes. Table I summarizes the applications
and input datasets. For each application, we show the area of science that it targets,
briefly describe the input dataset, and indicate the number of processes used in the
experiments. As for HPL and HPCG, in all experiments we execute one application
process per CPU core. The number of processes starts from 16 (a single MareNostrum
node) and it increases by powers-of-two. Some of the applications have memory ca-
pacity requirements that exceed the available memory on a single node, which limits
the lowest number of processes we use in the experiments, e.g., BQCD in Test Case A
cannot be executed with less than 64 processes (four nodes). The largest number of
processes we use is 8192, except for Quantum Espresso (QE), which reports errors
when executing on 4096 or 8192 cores. Note that SPECFEM3D always runs with the
specified numbers of cores: 864 in Test Case A and 11,616 in Test Case B.
5.1. Memory footprint vs. Number of processes
The memory footprints of an HPC application executed with a given input dataset can
vary significantly for different numbers of application processes. In general, the more
processes used for the computation, the smaller the portion of the input data handled
⁵ Our node comprises two 8-core Sandy Bridge sockets, and each core can execute up to 8 double-precision FLOPs per cycle. Each socket has four 64-bit wide, 1.6 GHz memory channels. Therefore, byte/FLOP = (8 × 8 bytes × 1.6 GHz) / (16 × 8 FLOP × 3 GHz) = 0.27.
⁶ We could not finalize the Code Saturne and GPAW installations. These errors have been reported to the application developers.
Table I. Scientific HPC applications used in the study

| Application | Area of science | Test Case A: smaller input dataset (problem size; process range) | Test Case B: larger input dataset (problem size; process range) |
|---|---|---|---|
| ALYA | Computational mechanics | 27 million element mesh; 16–1k | 552.9 million element mesh; 256–8k |
| BQCD (a) | Particle physics | 32²×64² lattice; 64–8k | N/A |
| CP2K | Computational chemistry | Energy calculation of 1024 waters; 128–1k | 216 LiH system with Hartree-Fock exchange (b) |
| GADGET | Astronomy and cosmology | 135 million particles; 512–8k | N/A |
| GENE | Plasma physics | Ion-scale turbulence in Asdex-Upgrade; 64–1k | Ion-scale turbulence in Jet; 2k–8k |
| GROMACS | Computational chemistry | 150,000 atoms; 16–1k | 3.3 million atoms; 16–8k |
| NAMD | Computational chemistry | 2×2×2 replication of the STM Virus; 16–1k | 4×4×4 replication of the STM Virus; 64–8k |
| NEMO | Ocean modeling | 1/12° global configuration; 4322×3059 grid; 512–8k | N/A |
| QE | Computational chemistry | 112 atoms; 21 iterations; 16–1k | 1532 atoms; two iterations; 1k–2k |
| SPECFEM3D | Computational geophysics | 6×12×768 mesh of the earth; 864 | 6×24×1760 mesh of the earth; 11,616 |

(a) Quantum Chromo-Dynamics (QCD) is a set of five kernels. We study Kernel A, also called Berlin Quantum Chromo-Dynamics (BQCD), which is commonly used in QCD simulations.
(b) CP2K cannot run Test Case B on our platform. The errors have been reported to the application developers.
by each process. On the other hand, distributing the computation over a larger num-
ber of processes also means more replication of data on the boundaries of adjacent data
segments, or in per-process private data segments, external libraries, and communica-
tion buffers [Koop et al. 2007]. Although it may seem obvious that memory footprints
of HPC applications are tightly-coupled with the number of application processes, pre-
vious studies of memory footprints ignore this relationship (see Section 7).
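To make this intuition concrete, a rough first-order model of the per-process footprint can combine an evenly partitioned share of the problem data, replicated boundary (halo) data that scales with the surface of each partition, and a roughly constant per-process overhead; all constants below are illustrative, not fitted to our measurements:

```python
def per_process_footprint_mb(dataset_mb, num_procs, overhead_mb=60.0, halo_coeff=1.0):
    """First-order model of the per-process memory footprint:
    an even share of the problem data, plus halo data that scales with the
    surface of each partition, plus a constant per-process overhead
    (MPI and other external libraries). Illustrative constants only."""
    partition = dataset_mb / num_procs
    halo = halo_coeff * partition ** (2.0 / 3.0)  # surface-to-volume scaling
    return partition + halo + overhead_mb

# Example: a 25 GB problem distributed over increasing process counts
for p in (16, 64, 256, 1024):
    print(p, round(per_process_footprint_mb(25_000, p)))
```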
Figure 5(a) illustrates the relationship between the per-process (per-core) memory
footprint and the number of application processes for NAMD running Test Case A. This
is a strong scaling case, i.e, we keep the same input dataset (Test Case A) and change
the number of processes. This corresponds to a real-life use of production applications
in which users have to choose the number of processes that will be used to solve an al-
ready defined problem with a fixed input size. For each process we track the maximum
and mean memory footprints, and we plot the average values and the standard devi-
ations among all processes. When NAMD runs as 16 processes, the mean per-process
memory footprint is 1656 MB. As we increase the number of processes, the footprint
drops significantly. When the application runs using 1024 processes, the per-process
memory footprint is only 258 MB, a difference of 6.4×. Maximum memory footprints
exhibit the same trends (see Figure 5(a)), and thus in the rest of the paper we discuss
only mean footprint values.
Figure 5(b) summarizes memory footprint results for all UEABS workloads. Except
GROMACS, which has a small overall memory footprint,⁷ the general trend is the
⁷ The GROMACS developers explain that the application requires only around 100 MB in total for Test
Case A (divided among all processes). The dominant part of the per-process GROMACS memory footprints
comes from the MPI library and other external libraries, and it remains constant as the number of applica-
tion processes increases.
Fig. 5. Per-process memory footprints shrink as the number of processes increases. (a) NAMD, Test Case A: mean and maximum memory footprints exhibit the same trend. (b) Reduction factor of the mean per-process footprint for the UEABS applications, Test Case A (the range of processes is indicated below each benchmark).
same for the remaining applications. We detect memory footprint reductions from 3.3× for CP2K up to 17× for ALYA.
5.1.1. Discussion. Our analysis emphasizes that the memory footprints of HPC appli-
cations are tightly-coupled with the number of application processes. State-of-the-art
parallel benchmark suites, however, do not strictly define the number of processes to
use in experiments. UEABS recommends experiments with up to 10,000 processes, but
the minimum number of processes is not specified. Similarly, other parallel benchmark
suites either provide loose recommendations about the number of processes (SPEC
OMP2012 [SPEC 2015b], SPEC MPI2007 [SPEC 2015a], SPLASH-2 [Woo et al. 1995])
or do not discuss this issue at all (NAS [Wong et al. 1999], PARSEC [Bienia et al.
2008], HPC Challenge [Luszczek and Dongarra 2005], Berkeley dwarfs [Asanovic et al.
2006]). Therefore, when analyzing memory capacity requirements, it is essential that
the users themselves determine a number of processes that is representative of real
production use. This, in turn, requires knowledge of the HPC category that the user is
targeting, together with an understanding of the scalability of the applications under study,
as we discuss in the following sections.
5.2. Selecting the number of processes
5.2.1. HPC categories. High-performance computing is broadly divided into two cate-
gories [ETP4HPC 2013]. Capability computing refers to using a large-scale HPC in-
stallation to solve a single problem in the shortest possible time, for example simulat-
ing a human brain on a Tier-0 HPC system. Capacity computing refers to optimizing
system efficiency to solve as many mid-size or smaller problems as possible at the same
time at the lowest possible cost, for example when small or medium enterprises use
rented (on-demand) HPC resources to simulate numerous design choices for their prod-
ucts. Analyzing pricing policies for renting HPC resources is beyond the scope of this
study; in the rest of this paper, we therefore approximate the cost of a given experiment
as proportional to the CPU-hours, i.e., the number of cores used in the experiment (#cores) multiplied by the execution time: cost_CPU-hours = #cores × exe_time.
Although capability computing targets application runs with the lowest execution
time, excessive application scaling may deliver diminishing returns in performance im-
provement while linearly increasing CPU-hours. This is an unacceptable scenario that
leads to inefficient resource utilization. Similarly, although capacity computing targets
low-cost HPC computation, excessive slowdown of application runs may have unac-
ceptable impact, e.g., on the productivity of engineers waiting for simulation results.
5.2.2. Application scalability. It is important to understand that CPU-hours and execu-
tion time are dependent metrics, and that in the production runs, users must ana-
lyze the trade-offs between them. In Figure 6, we analyze this relationship for NAMD
Fig. 6. Trade-offs between normalized execution time and experiment cost (CPU-hours) for applications with good and limited scalability: (a) NAMD (Test Case A), good scalability; (b) CP2K (Test Case A), limited scalability.
and CP2K, respectively. The upper graphs of Figures 6(a) and 6(b) show normalized
speed-up and CPU-hours for each experiment, and the lower graphs show applica-
tion parallel efficiency. All statistics are computed relative to the experiments with the
fewest processes, 16 for NAMD and 128 for CP2K.⁸ Parallel efficiency (a number be-
tween 0 and 1) quantifies how effectively the resources are utilized, and it is the main
metric for analyzing application scalability. A parallel efficiency of 1 means that the
application speed-up is directly proportional to the number of processes. Low parallel
efficiency means that significantly increasing processing resources only delivers low or
moderate speed-ups.
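These metrics can be computed directly from measured runtimes, as in the minimal sketch below; the runtime values in the example are illustrative placeholders rather than our measurements:

```python
def scaling_metrics(base_procs, base_time_s, procs, time_s):
    """Speed-up, CPU-hours (normalized to the baseline run), and parallel
    efficiency relative to the run with the fewest processes, as in Figure 6."""
    speedup = base_time_s / time_s
    norm_cpu_hours = (procs * time_s) / (base_procs * base_time_s)
    parallel_efficiency = speedup / (procs / base_procs)
    return speedup, norm_cpu_hours, parallel_efficiency

# Example: baseline with 16 processes, then the same problem on 256 processes
print(scaling_metrics(base_procs=16, base_time_s=3600.0, procs=256, time_s=247.0))
```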
NAMD is an example of an application with good scalability (Figure 6(a)). Increasing
the number of processes causes significant speed-ups with negligible increments in
CPU-hours. When we change the number of processes from 16 to 256, 512, and 1024,
we measure speed-ups of 14.60×, 27.79×, and 39.14×at cost increments of 10%, 15%,
and 64%, respectively. When used in capability computing, NAMD should be executed
with a large number of processes (we use 1024 processes in the experiments presented
in Figure 6(a)). Although CPU-hours is the primary metric in capacity computing, it is
reasonable to expect that a user would accept small increases in CPU-hours if they lead
to high (e.g. 42-fold) improvements in execution time. Therefore, in capacity computing
as well, experiments with a large number of processes are most representative of real-
life production use of NAMD. We observe similar trends for ALYA, BQCD, GENE, and
Quantum Espresso. All of them show good scalability and significant speed-ups with
negligible CPU-hours increments. In both capability and capacity computing, these
applications should be executed with many processes.
CP2K is an example of an application with limited scalability (Figure 6(b)). When we
change the number of processes from 128 to 256, 512, and 1024, we observe speed-ups
of 1.77×, 2.71×, and 3.98×, while the CPU-hours increase 1.13×, 1.48×, and 2.01×,
respectively. Results for CP2K show clear trade-offs between cost and speed-up. When
used in capability computing, CP2K should be executed with a large number of pro-
cesses, 512 and 1024 in the experiments presented in Figure 6(b). When targeting
capacity computing, users should try to reduce CPU-hours. Thus, CP2K should be par-
⁸ Recall that CP2K cannot be executed with 16, 32 and 64 processes because its memory requirements exceed the available memory.
Fig. 7. The representative number of application processes is determined by application scalability and the targeted HPC category (a large number of processes in all cases except capacity computing with limited scalability, which calls for a small number of processes).
titioned into smaller numbers of processes. We observe similar behavior for GADGET,
GROMACS, and NEMO.
5.2.3. Summary. In this section, we showed how memory footprints of HPC applica-
tions depend on the number of application processes, and we provided guidelines for
selecting the number of processes to be representative of production application use.
Figure 7 summarizes this analysis. The figure shows how the representative number
of application processes depends on application scalability and how this may change
for different HPC categories. Applications with good scalability should be executed
with large numbers of processes, regardless of the targeted HPC category. This leads
to significant speed-ups with only a small increase in experimentation cost. For ap-
plications with limited scalability, increasing the number of processes reveals a clear
trade-off between execution time and CPU-hours. For experiments that target capabil-
ity computing, these applications should be executed with a large number of processes
providing low execution time at the expense of the CPU-hours. On the other hand,
when targeting capacity computing, applications should be partitioned into a small
number of processes, sacrificing execution time to improve experimentation cost and
overall system throughput [Zivanovic et al. 2016].
5.3. Memory requirements of production HPC applications
In this section, we analyze per-process memory capacity requirements of the produc-
tion applications. In all experiments, the applications were run with Test Case A in-
puts and up to 8192 processes. Test Case A can be run for all UEABS applications,
and it supports a wider range of processes compared to Test Case B (which can be run for only six out of ten applications; see Section 2.2). We analyze Test Case B in more detail
in Section 5.4. Recall that our applications roughly scale up to 1000–10,000 processes
when running these input datasets.
The results are summarized in Figure 8. The left side of Figure 8 shows results for
the applications with good scalability — ALYA, BQCD, GENE, NAMD, and Quantum
Espresso (QE). These applications should be executed with a large number of processes
regardless of the targeted HPC category. The average per-process memory footprints for these applications range from 57 MB for ALYA to 258 MB for NAMD. The footprints for BQCD, GENE, and QE are 116 MB, 137 MB, and 197 MB, respectively.⁹
⁹ Although Quantum Espresso processing Test Case A should scale up to 1024 processes [PRACE 2013],
we observed very good scalability up to 256 processes but poor scalability (slowdowns) for 512 and 1024
processes. We therefore report results for 256 processes for this application.
Fig. 8. Memory footprints of production HPC applications depend on application scalability and the targeted HPC category. Only the applications with limited scalability that target capacity computing require gigabytes of main memory per process.
The right side of Figure 8 shows the per-process memory footprints of the appli-
cations with limited scalability. In the capability computing experiments, we execute
these applications on a large number of cores: 1024 for CP2K, and 8192 for GADGET
and NEMO. The per-process memory footprints are again fairly small — 336 MB,
154 MB, and 203 MB, for CP2K, GADGET, and NEMO, respectively. Since the proces-
sor under study has eight cores, and we allocate one process per core, the per-socket
memory footprint in these experiments ranges between 0.4 GB (ALYA, 8×57 MB) and
2.6 GB (CP2K, 8×336 MB). The first-generation 3D memory devices already provide
such memory capacities, see Section 6.1.
In capacity computing, we execute the applications with limited scalability on a
small number of cores, and this number is dictated by the memory capacity of the
compute nodes — scaling the parallelism down further would cause the per-process
memory footprints to exceed the available memory. To understand how the memory
footprints increase in systems with higher memory capacities, we run experiments on
large-memory nodes containing 128 GB of main memory, i.e., 8 GB per core. When we
partition CP2K, GADGET, and NEMO to 16, 256, and 128 processes, their per-process
footprints are 5.8 GB, 2.2 GB, and 4.8 GB, respectively (rightmost part of Figure 8). On
standard nodes with 2 GB of memory per core, we could partition these applications to
128 processes for CP2K, and 512 processes for GADGET and NEMO. In this scenario,
we measured the per-process memory footprints of 1.1 GB, 1.2 GB, and 1.3 GB, for
CP2K, GADGET, and NEMO, respectively.
Figure 8 omits results for GROMACS and SPECFEM3D. The per-process footprint of
GROMACS is very low, between 60 MB and 70 MB, and it decreases only slightly from
16 to 1024 processes (see Section 5.1). SPECFEM3D requires exactly 864 processes,
and thus we cannot analyze how its memory footprint changes with the number of pro-
cesses. When executed with 864 processes, the average memory footprint is 2.53 GB.
These results show that different production HPC applications — or even a single
application used in different HPC categories — can have significantly different mem-
ory capacity requirements. Applications that scale well and those that target capability
computing have low per-process memory footprints. These applications require from
57 MB to 258 MB of memory, which means that they heavily under-utilize the memory
capacity of our HPC platform: their average memory usage is below 10% of the full
capacity. Only the applications with limited scalability that target capacity computing
require gigabytes of main memory per process.
5.3.1. Master process memory requirements. Many HPC applications are written using a
master–worker process model. In these applications, the problem is decomposed into
Fig. 9. As the number of processes increases, the memory footprints of worker processes decrease as expected, but the memory footprint of the master process increases: ALYA, Test Case A.
data segments that can be processed independently by different worker processes. The
master process (usually the first process) assigns work to each worker process, and
collects the intermediate and final results of the computation.
Figure 9 plots the memory footprints of the ALYA master and worker processes as
the number of processes increases from 16 to 1024. Memory footprints of worker pro-
cesses drop as we increase the number of processes, following the trend described in
Section 5.1. The master process, however, exhibits the opposite trend, and its memory
footprint increases with the number of processes. This is a general trend in master–
worker applications. For the UEABS applications, we detect that the master process may have a significantly higher memory footprint than the workers, up to 36.6× for NEMO (master 7.2 GB, worker 0.2 GB) and 62.5× for BQCD (master 7.1 GB, worker 0.1 GB).
Both application developers and computer architects should pay attention to this phenomenon, which is still not well highlighted or quantified by the community.
5.4. Towards weak scaling analysis
The analysis presented so far keeps the input dataset size constant and varies the number of application processes; this is the strong scaling case of production HPC applications. In addition to this, it is also important to perform a weak scaling analysis,
i.e., to analyze memory capacity requirements when both the number of processes and
input dataset size are increased — similar to the study performed for HPL and HPCG
benchmarks. Since the problem inputs are specified by the benchmark suite, such
analysis requires either that the benchmark suite support a user-defined problem size (as for HPL and HPCG) or that it provide a set of inputs specifically intended
for weak scaling analysis. We are not aware of such a real application benchmark
suite. UEABS for instance has just two problem sizes, Test Case A and Test Case B,
and in many cases the problems being solved are fundamentally different, making
them unsuitable for weak scaling analysis. For example, in case of ALYA, Test Case A
is a model of the respiratory system whereas Test Case B is a mesh of generic ele-
ments [Bull 2013]. As an intermediate step, we analyze the two input datasets for the
NAMD benchmark distributed with the UEABS suite, which is one of the few benchmarks where the two datasets are comparable [Bull 2013], and observe the changes in the
application memory footprint and scalability when increasing the input dataset size.
In order to analyze how the per-process memory footprint changes with dataset size,
in Figure 10(a) we plot the NAMD memory footprint results for both the smaller
and larger input datasets. The Test Case A curve starts at 16 processes, a sin-
gle MareNostrum node. Test Case B exceeds the memory capacity of one or two
MareNostrum nodes (16 and 32 processes), so the curve starts from 64 processes. We
show the Test Case A results on up to 1024 processes, and for Test Case B on up to 8192
processes, as recommended by UEABS documentation [PRACE 2013]. For both input
Fig. 10. Increasing the input dataset changes the memory footprint and scalability of the NAMD application; we detect the same trend for all UEABS applications. (a) When the input dataset increases, the memory footprint curve shifts towards larger numbers of processes. (b) Increasing the input dataset lowers the scalability of NAMD.
datasets, the NAMD footprint is around 1600 MB on a small number of processes, and
it drops rapidly as we increase the process count. Both memory footprint curves follow
the same tendency, with the Test Case B results being shifted towards larger numbers
of processes. We detect the same trend for all UEABS applications.
Next, we analyze the impact of dataset size on application scalability. Figure 10(b)
plots parallel efficiency of NAMD with both input datasets. Parallel efficiency of NAMD
with Test Case A reduces to 0.61 when the process count increases from 16 to 1024
(64×). With Test Case B, when increasing from 64 to 2048 processes (32×), the par-
allel efficiency drops to 0.39.¹⁰ For all the applications under study, we find that it
is harder to achieve good scalability for larger numbers of processes, even if the in-
put dataset size increases. This is not surprising, because increasing the number of
processes causes more communication and synchronization overheads, and increases
the penalty of sequential code segments. The simple increase in dataset size does not
address all of these problems.
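To make the efficiency figures concrete, the following small sketch (in C, with hypothetical execution times, since only the resulting efficiencies are reported here) computes parallel efficiency as the speedup over the smallest run divided by the increase in process count; the placeholder times are chosen so that the example reproduces the 0.61 value for Test Case A.

    #include <stdio.h>

    /* Parallel efficiency of a run with n processes, relative to a baseline
     * run with n_ref processes: E(n) = (t_ref * n_ref) / (t_n * n).
     * E = 1.0 corresponds to perfect (linear) scaling from the baseline. */
    static double parallel_efficiency(double t_ref, int n_ref, double t_n, int n)
    {
        return (t_ref * (double)n_ref) / (t_n * (double)n);
    }

    int main(void)
    {
        /* Hypothetical execution times (seconds); the real NAMD timings are
         * not listed here, only the resulting efficiencies. */
        double t_16 = 1000.0, t_1024 = 25.6;
        printf("Efficiency at 1024 processes: %.2f\n",
               parallel_efficiency(t_16, 16, t_1024, 1024));
        return 0;
    }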
To summarize, our results show that when the input dataset increases, the memory
footprint as a function of the number of processes keeps the same trend, but the curve
is shifted toward larger numbers of processes (Figure 10(a)). Application scalability,
however, decreases, in some cases significantly. Therefore, increasing the input dataset
requires repeating the analysis of the trade-offs between execution time and CPU-
hours (Section 5.2) in order to determine the representative number of processes for a
production run of the application.
6. IMPLICATIONS
The current trend in HPC system design is to increase the number of memory channels
per CPU and the number of I/Os in each DDR generation. This approach is limited
by package size, it is expensive, and it increases memory system power consumption.
A potentially promising solution to these problems is 3D-stacked DRAM. In this
section, we summarize the pros and cons of 3D-stacked
DRAM, and briefly overview the currently-available products based on this technology:
Hybrid Memory Cube (HMC), High-Bandwidth Memory (HBM), and multi-channel
DRAM (MCDRAM) incorporated into the Knights Landing processors. We also discuss
the opportunities and challenges of these solutions in the context of high-performance
computing, and outline our expectations on how these devices may change the design
of next-generation memory systems.
6.1. Background: 3D-stacked DRAM
3D stacking increases package density, with memory chiplets placed on a silicon in-
terposer instead of a printed circuit board. Stacked DRAM dies are connected using
through-silicon vias (TSVs), which shorten the interconnection paths and reduce con-
nectivity impedance and channel latency. Hence, data can be moved at a higher rate
with a lower energy per bit. On the downside, 3D-stacked DRAMs will likely reduce
main memory capacity, at least in early generations, due to a significantly higher
cost per bit than conventional DIMMs.
Hybrid Memory Cube (HMC) [HMC Consortium 2014] is connected to the CPU
with a high-speed serial interface that provides up to 480 GB/s per device. Announced
production runs of HMC components are limited to 2 GB and 4 GB devices, while
the standards specify capacities of up to 8 GB. Memory capacity can be increased by
integrating multiple HMC devices into the package, but doing so is non-trivial. Each
HMC device can be directly connected to up to four other devices (CPUs or HMCs)
via four independent serial links. This enables chaining of devices to increase memory
capacity, but multi-hop routing may increase access latency and variability. The impact
of such variability on HPC workloads requires further analysis and justification.
High-Bandwidth Memory (HBM) [JEDEC 2013] is connected to the host CPU or
GPU with a wide 1024-bit parallel interface that delivers up to 256 GB/s. Similarly
to HMC, the standard specifies up to 8 GB devices and integrating multiple devices
in the package is challenging. Only a single HBM device can be connected to each
interface (channel), so using multiple HBM devices requires a large silicon interposer
with multiple 1024-bit wide interfaces, increasing cost.
Intel Knights Landing (KNL) processors [Sodani et al. 2016] are the first CPUs
that bring in 3D-stacked DRAM in addition to the traditional DDR DIMMs. KNLs
comprise up to 72 cores supported by two levels of main memory. At the first level, 3D
multi-channel DRAM (MCDRAM) is connected to the CPU through an on-package
interposer and it offers a capacity of up to 16 GB (0.2 GB per core) with 400 GB/s of
peak theoretical bandwidth. In addition to MCDRAM, KNL can be connected to up to
384 GB (5.4 GB per core) of standard DDR4 memory. MCDRAM and DDR can be
organized in three modes: cache, flat, and hybrid. In the cache mode, MCDRAM behaves
as an additional (L3) level of the cache hierarchy. In the flat mode, MCDRAM and
DDR are two distinct memory nodes with different capacities, latencies, and bandwidths
that can be addressed through different APIs. If the input dataset does not fit into
the MCDRAM, it is the programmer's responsibility to perform the data partitioning,
allocation, and migration needed to use this memory organization efficiently.
Finally, the hybrid mode combines the cache and flat modes: the MCDRAM is partitioned
into two segments, one used as the L3 cache for the DDR, while the other is directly
addressable and used as an MCDRAM memory node, as in the flat mode.
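In the flat mode, explicit MCDRAM allocation can be done with data-management libraries such as memkind [Intel 2016b], discussed further in Section 6.2.2. The following sketch is our own illustration, not code from the KNL documentation: it uses the library's hbwmalloc interface to place a bandwidth-critical buffer in MCDRAM when high-bandwidth memory is available, and falls back to DDR otherwise.

    #include <stdlib.h>
    #include <hbwmalloc.h>   /* hbwmalloc interface of the memkind library [Intel 2016b] */

    /* Place a bandwidth-critical buffer in MCDRAM when high-bandwidth memory
     * is exposed (KNL flat mode), falling back to a regular DDR allocation
     * otherwise. The caller must remember which allocator was used, since
     * hbw_malloc'ed memory must be released with hbw_free and malloc'ed
     * memory with free. */
    static double *alloc_bandwidth_critical(size_t n_elements, int *in_mcdram)
    {
        if (hbw_check_available() == 0) {          /* 0 means MCDRAM is available */
            *in_mcdram = 1;
            return (double *)hbw_malloc(n_elements * sizeof(double));
        }
        *in_mcdram = 0;
        return (double *)malloc(n_elements * sizeof(double));
    }

Even this small example hints at the software cost discussed in Section 6.2.2: allocation, deallocation, and error handling all become aware of the two-level main memory.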
6.2. 3D-stacked DRAM in HPC memory systems: Opportunities and challenges
Understanding application memory capacity requirements is essential for the design
of HPC memory systems based on 3D-stacked DRAM. The main question to be
answered is whether 3D memory chiplets can on their own provide the capacity
required by HPC applications.
6.2.1. HPCG vs. HPL. An important driving force for 3D-stacked DRAM could be the
HPCG benchmark. With performance directly proportional to main memory band-
width, and memory footprints below 1 GB per process even when targeting million-
core systems, HPCG could be the first success story for 3D-stacked DRAM in HPC.
Regarding KNL, hundreds of MBs of MCDRAM per core may be sufficient for outstanding
HPCG performance, especially on small clusters (see Figure 4). Therefore,
HPCG could be a good example of an important benchmark that works out-of-the-box
and performs well on KNL.
On the other hand, one of the main show-stoppers for 3D-stacked memory could be
HPL. In contrast to HPCG, high memory bandwidth provides no benefits for HPL,
while the limited capacity of 3D-stacked memory can lead to significant performance
loss, especially in large-scale systems. As shown in Section 3.2, a million-core system
would require 16.1 GB per core to achieve 95% of the potential performance (Figure 3).
Although the HPC community is questioning whether HPL is representative of modern
production HPC applications [Murphy et al. 2006; 2010], and is actively looking for al-
ternative benchmarks (HPCG being one of them), high HPL scores are still important
objectives in the design of large HPC clusters. Looking forward, it will be interesting
to see whether KNL-based TOP500 systems will use the 3D-stacked MCDRAM in the
HPL runs, i.e. whether the developers of optimized HPL and corresponding linear alge-
bra libraries will find a way to benefit from hybrid MCDRAM + DDR memory systems.
6.2.2. Production HPC applications. In Section 5.3 and Figure 8 we saw that the mem-
ory capacity requirements of production HPC applications had a bimodal distribution.
Most of our HPC applications and use cases require only hundreds of megabytes of
main memory. This capacity can be provided by 3D memory chiplets located on the
silicon interposer (e.g. KNL MCDRAM), with no need for conventional DIMMs on the
printed circuit board (PCB). Such a memory will provide significantly higher mem-
ory bandwidth and lower latency, which, in turn, will lead to higher system perfor-
mance and energy-efficiency. Since 3D-stacked memory chiplets could directly replace
DIMMs, the main memory would still comprise a single level with uniform latency.
For HPC applications that require gigabytes of main memory, sufficient memory
capacity can be provided using hybrid 3D memories plus on-PCB DIMMs, similar to
KNL systems. The main memory would therefore consist of two levels of hierarchy with
different latencies, bandwidths, and capacities. For computer architects, this opens
design options to optimize the capacities, organizations, and interconnections of the
3D memory chiplets and the DIMMs.
Although hybrid memory systems support functional portability, i.e. execution of
legacy codes, there is a clear tradeoff between the achieved performance and the
effort invested in code profiling and development. For example, in the KNL cache
mode, large-footprint applications can be executed with no changes in the source
code, but this approach could lead to significant performance loss. Although a large
cache may intuitively suggest higher performance, in KNL this may not be the case.
Since MCDRAM and DDR4 have separate data paths (they use separate memory
controllers), an MCDRAM miss requires two consecutive accesses, first to the MCDRAM
and then to the DDR, leading to a high cache-miss penalty and potentially low overall
performance.
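As a back-of-the-envelope illustration of this effect, the average access latency under the serialized-miss assumption described above can be written as the MCDRAM latency plus the miss rate times the DDR latency; the sketch below uses purely illustrative parameter names, not measured KNL latencies.

    /* Average memory access latency in KNL cache mode, assuming that an
     * MCDRAM miss pays the full MCDRAM access plus a full DDR access
     * (serialized data paths, as described above). All inputs are
     * illustrative placeholders, not measured KNL values. */
    static double cache_mode_avg_latency_ns(double t_mcdram_ns,
                                            double t_ddr_ns,
                                            double miss_rate)
    {
        /* Hits cost t_mcdram_ns; misses cost t_mcdram_ns + t_ddr_ns. */
        return t_mcdram_ns + miss_rate * t_ddr_ns;
    }

Under this model, a high miss rate pushes the average latency towards the sum of the two latencies, i.e., worse than accessing the DDR directly.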
Good performance on hybrid memory systems depends on advanced data allocation,
migration, and prefetching policies [Chou et al. 2014; Meswani et al. 2015]. Optimal
data management in these systems, such as in the KNL flat memory mode, requires
in-depth application profiling and significantly increases code development cost. In
order to reduce this effort and increase adoption of the KNL architecture, Intel released
various profiling tools [Intel 2016a] and data management libraries and APIs
[Intel 2016b] that simplify efficient programming of systems with hybrid main memory.
It will be interesting to see whether the KNL, as the first system that combines
3D-stacked and DDR main memory, will be adopted by users, and whether the increased
cost of software development and maintenance will be justified by the performance gains.
6.2.3. Message from application developers. New HPC systems should be designed tak-
ing into consideration the requirements from future users and application developers.
With regard to memory capacity, certain applications, in domains such as theoretical
physics and inorganic chemistry, have a constant need for more memory. Based on
Figure 8, however, we see that there are many applications that have low memory re-
quirements. Users of such applications that could in fact live with less DRAM usually
remain quiet, since for them the additional DRAM under discussion will not degrade
performance. Moreover, these users generally do not have to pay the costs associated
with extra memory per core, in terms of capital cost and power consumption. This
leads to general-purpose HPC clusters with 2 GB per high-end x86 core, with some
large-memory nodes having 4 GB to 8 GB per core, i.e., more than 100 GB per node.
With the introduction of 3D-stacked memory, this dynamic changes. Users may wish
to “trade” DIMM capacity, which they do not need, for 3D-stacked DRAM, which pro-
vides higher bandwidth and lower latency. For applications with relatively low memory
capacity requirements, 3D-stacked DRAM is likely to lead to significantly better over-
all performance [Radulovic et al. 2015]. It is therefore essential that application de-
velopers understand the performance–capacity–cost tradeoffs between DIMM-based,
3D-stacked and hybrid DRAM solutions, in order to clearly express their preferences
to the HPC hosting centres. Whereas user demand has already led to large-memory
nodes, messages from the users and developers of small-memory applications may lead
to specialization in the other direction, i.e., small-memory nodes with high-bandwidth,
low-latency 3D-stacked memory.
7. RELATED WORK
Dongarra et al. [2003] present the Linpack benchmark suite, the TOP500 list, and the
HPL code. The authors execute HPL on a small 4×4 cluster of Pentium III 500 MHz
CPUs and analyze the benchmark performance for various interconnects and input
dataset sizes of up to around 250 MB per core. The results show that increasing the
HPL input dataset size can lead to significant performance improvements. Our work
extends this study in various directions. We detect the point of diminishing returns
when increasing the input dataset size, and analyze how this changes with the num-
ber of processes used in the HPL run, i.e., with the size of the HPC system. We also
estimate the amounts of physical memory required for the close-to-optimal HPL per-
formance on future large-scale HPC clusters.
The HPL benchmark has been extensively used in the past. In general, HPL stud-
ies analyze how to tune arithmetic libraries, OS kernel and network parameters to
improve HPL performance on a given system. The studies use the maximum input
dataset that fits into the physical memory while preventing swapping, as suggested by
the HPL developers, and do not analyze the impact of changes in the physical memory
capacity on the HPL performance.
Marjanović et al. [2014] analyze the HPCG benchmark and predict the HPCG per-
formance on a given architecture based on the memory bandwidth and the highest
network latency between compute units. They conclude that for modern systems with
a decent network, highly accurate prediction can be done based only on the memory
bandwidth. On the node level, they show that small problem sizes that fit in the CPU
caches can yield HPCG performance that exceeds the stable values and are therefore
not representative. However, they do not analyze how HPCG performance depends on
the problem size for larger numbers of processes, as we do in this study.
Although memory provisioning for large-scale HPC clusters is an important task, to
the best of our knowledge only three prior studies analyze memory footprints of HPC
applications [Biswas et al. 2011; Perks et al. 2011; Pavlovic et al. 2011]. However, these
studies do not analyze the relationship to the number of processes, which is very impor-
tant as we show in this study. Biswas et al. [2011] and Perks et al. [2011] investigate
different techniques to reduce memory footprints in order to improve the performance
of HPC workloads. Biswas et al. [2011] leverage the data similarity often exhibited in
MPI applications. They identify identical memory blocks across MPI tasks on a single
node and use a novel memory allocation library to merge them. The authors evaluate
their proposal on a range of MPI applications (SPEC MPI2007, NAS, ASC Sequoia
benchmarks, and two production applications), and show memory footprint reduction
of 32% on average. Perks et al. [2011] investigate the impact of compiler choice on the
memory usage of distributed MPI codes. The authors compare memory usage of four
versions of simple MPI benchmarks compiled with GNU, Intel, PGI, and Sun compil-
ers. Their results show that compiler choice can make a difference of up to 32% in
memory usage. Pavlovic et al. [2011] characterize memory behavior for four scientific
applications to estimate the memory system requirements of future HPC systems with
hundreds or thousands of cores per node. The authors estimate memory footprints of
HPC applications comprising thousands of processes by using linear regression based
on results of a few experiments with a small number of processes. Even though the
authors target systems running applications with thousands of processes, the study
does not analyze application scalability, nor does it evaluate whether the input sets
used in the study are large enough to take advantage of such parallelism.
As the first 3D-stacked DRAM devices are hitting the market, various studies
analyze how to incorporate these devices into the memory hierarchy. It is generally
accepted that 3D-stacked DRAM is unlikely to fulfill the memory capacity require-
ments of server and HPC applications, so the community is exploring hybrid systems
in which 3D-stacked DRAM is complemented by standard DIMMs [Dong et al. 2010;
Chou et al. 2014; Sim et al. 2014; Meswani et al. 2015]. The essence of these studies
is the development of techniques for advanced data migration between 3D-stacked
DRAM and DIMMs. An important requirement of this work is to avoid excessive code
development costs and to improve the performance of legacy codes. Therefore, all the
studies keep a unified view of the main memory at the application level; the data
management policies are implemented in hardware through complex data-path
enhancements [Chou et al. 2014; Sim et al. 2014] or through an interaction between
hardware and the operating system [Dong et al. 2010; Meswani et al. 2015]. Overall,
all studies agree that managing hybrid memory systems that combine 3D-stacked DRAM
and DIMMs is a difficult task and that simple approaches, such as using 3D memory as
an additional level of cache, may lead to significant performance loss.
8. FUTURE WORK
A simple but important question “How much memory do we need in HPC?” has not
been discussed thoroughly by the academic and computer architecture community. We
hope that our study will trigger further research and discussion on this topic. The first
topic for future work could be an analysis of weak scaling of production HPC applica-
tions and its impact on HPC memory footprints. Weak scaling analysis is especially
important to anticipate future HPC problems with significantly larger input datasets.
This analysis, however, would require HPC production application benchmark suites
that allow the problem size to be tuned in a similar way to HPL and HPCG, or at least
provide a collection of comparable input sets with varying problem size.
Also, the community, or at least the large research centers that purchase HPC clusters,
should question the current trends in memory system sizing. A first step could be to
study histograms of memory usage on existing HPC clusters and to understand whether
users are taking advantage of most of the installed memory.
It seems that decisions for new machines have traditionally been based on experiences
with previous HPC clusters and on undocumented knowledge of the principal system
integrators. This is a conservative approach that is unlikely to lead to a breakthrough
in memory system design.
Finally, the principal HPC system designers and integrators should publicly docu-
ment and justify their memory sizing decisions. For example, based on our results, one
could question whether the memory capacities of the largest HPC clusters are provisioned
with too much emphasis on the TOP500 list. Since smaller HPC systems are
influenced by the system architectures of the larger ones, this could mean that the
HPL scores are responsible for most of the HPC memory system sizing decisions. It
is about time that HPC system designers try to convince us that this is not true, and
explain the real rationale behind the 2–3 GB per-core rule.
9. CONCLUSIONS
This study analyzed memory capacity requirements of important HPC benchmarks
and applications. This analysis becomes increasingly important as 3D-stacked memo-
ries are hitting the market. These novel memories provide significantly higher mem-
ory bandwidth and lower latency, leading to higher performance and better energy-
efficiency. However, the adoption of 3D memories in the HPC domain requires use
cases needing much less memory capacity than currently provisioned. With good out-
of-the-box performance, these use cases would be the first success stories for these
memory systems, and could be an important driving force for their further adoption.
We detected that HPCG could be an important success story for 3D-stacked mem-
ories in HPC. With low memory footprints and performance directly proportional to
the available memory bandwidth, this benchmark is a perfect fit for memory systems
based on 3D chiplets. HPL, however, could be one of the main show-stoppers, because
reaching good performance requires memory capacities that are unlikely to be
provided by 3D chiplets.
The study also emphasizes that the analysis of memory footprints of production HPC
applications requires an understanding of their scalability and target category, i.e.,
whether the workloads represent capability or capacity computing. The results show
that most of the HPC applications under study have per-core memory footprints in the
range of hundreds of megabytes, an order of magnitude less than the main memory
available in state-of-the-art HPC systems, but we also detect applications and use
cases that still require gigabytes of main memory.
Overall, the study indeed identified the HPC applications and use cases with
memory footprints that could be provided by 3D-stacked memory chiplets, making
the first step towards adoption of this novel technology in the HPC domain. Also, it
showed that the simple question “How much memory do we need in HPC?” may not
have a simple answer. We hope that this will motivate the community to question the
trends for memory system sizing in current HPC clusters, and will lead to further
analysis targeting future ones.
APPENDIX: HPL ANALYTICAL ANALYSIS
The Appendix complements the HPL analysis in Section 3.2; it presents step-by-step
mathematical formulas that analyze HPL performance as a function of per-core mem-
ory capacity and the number of processes. The analysis indeed shows that the HPL
performance analytically converges to a steady value proportional to the floating-point
rate (GFLOP/s) of the system, but the performance optimum is theoretically reached
for infinite main memory. We also calculate the per-core memory capacity needed to
achieve steady values of HPL performance, and how this amount of memory changes
when increasing the size of HPC systems (number of cores).
Number of FLOPs. HPL solves a dense linear system of N unknowns using LU
factorization [Petitet et al. 2012]. For a given problem size N, the benchmark performs
the following number of double-precision floating-point operations (#FLOPs) [Dongarra et al. 2003]:

    \#FLOPs = \frac{2}{3}N^3 + 2N^2 + O(N) \qquad (3)

Since N \gg 1, and for a number of processes n:

    \#FLOPs_{\text{per process}} \approx \frac{2N^3}{3n} \qquad (4)
Execution time. HPL execution time on a specific system depends on various system parameters:
γ3: the time that a single processing unit (e.g., a CPU core) requires to perform one floating-point operation when performing matrix-matrix operations.
α: the time to prepare a message for transmission between processes.
β: the time per unit of message length, so that L × β is the time taken by a message of length L to traverse the network to its destination.
The execution time also depends on the way the data (matrices) are partitioned and
distributed among the processes. The coefficient matrix is first logically partitioned
into blocks, each of dimension NB × NB, and these blocks are cyclically distributed onto
the process grid. In all our experiments, the factor NB is kept constant. Finally, the data
is distributed onto a two-dimensional grid of processes, P × Q, where the total number
of processes is n = P × Q. When possible, it is suggested to keep the same values for P
and Q, i.e., P = Q = √n [Petitet et al. 2012].
An approximation of the HPL execution time T that illustrates the cost of the dominant
factors is [Petitet et al. 2012]:

    T = \frac{2\gamma_3 N^3}{3PQ} + \frac{\beta N^2 (3P+Q)}{2PQ} + \frac{\alpha N \left((N_B+1)\log P + P\right)}{N_B} \qquad (5)
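For reference, Equation (5) can be transcribed directly into code. The helper below is our own sketch (names are ours); we use a base-2 logarithm for the broadcast term, a choice that only rescales the α contribution if the platform model defines the logarithm differently.

    #include <math.h>

    /* Estimated HPL execution time from Equation (5).
     *   N      - problem size
     *   P, Q   - process grid dimensions (n = P*Q processes)
     *   NB     - block size
     *   gamma3 - time per FLOP in matrix-matrix operations
     *   alpha  - message preparation time
     *   beta   - time per unit of message length */
    static double hpl_time_estimate(double N, double P, double Q, double NB,
                                    double gamma3, double alpha, double beta)
    {
        double t_compute   = 2.0 * gamma3 * N * N * N / (3.0 * P * Q);
        double t_bandwidth = beta * N * N * (3.0 * P + Q) / (2.0 * P * Q);
        double t_latency   = alpha * N * ((NB + 1.0) * log2(P) + P) / NB;
        return t_compute + t_bandwidth + t_latency;
    }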
HPL performance. HPL performance is the number of FLOPs divided by the execution
time, and is expressed in FLOP/s. In order to simplify our mathematical formulas, we
observe the execution time per FLOP, which is the reciprocal of HPL performance.
If we assume P = Q = √n and analyze the HPL execution time per FLOP by dividing
(5) by (4), we have:

    T_{\text{per FLOP}} = \gamma_3 + \frac{3\beta\sqrt{n}}{N} + \frac{3\alpha n \left(\frac{1}{2}(N_B+1)\log n + \sqrt{n}\right)}{2 N_B N^2} \qquad (6)
Equation 6 describes the HPL execution time per FLOP as a function of the problem
size N. The per-core memory capacity m depends on the problem size N and the number
of processes n: m = 10N²/n. (The problem size should be set so that the coefficient matrix
occupies 80% of the available memory [Petitet et al. 2012]; the N × N coefficient matrix
requires 8N² bytes, so the per-core memory capacity is (100/80) × 8N²/n = 10N²/n.)
Therefore, the problem size that generates a per-process memory footprint m when HPL
is executed on n processes can be computed as N = √(mn/10). Substituting this into
Equation 6, we determine the dependency between the time per FLOP (T_per FLOP) and
the per-core memory capacity m:

    T_{\text{per FLOP}} = \gamma_3 + 3\beta\sqrt{\frac{10}{m}} + \frac{15\alpha \left(\frac{1}{2}(N_B+1)\log n + \sqrt{n}\right)}{N_B m} \qquad (7)
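The two relations above, m = 10N²/n and N = √(mn/10), are straightforward to apply; the helper below (our own naming) converts between per-core memory and HPL problem size. For instance, the 16.1 GB per core quoted in Section 6.2.1 for a million-core system corresponds to a problem size N of roughly 40 million.

    #include <math.h>

    /* Conversions between HPL problem size N and per-core memory footprint m
     * (in bytes), following m = 10*N*N/n: the N x N coefficient matrix takes
     * 8*N*N bytes and is sized to occupy 80% of the available memory. */
    static double hpl_problem_size(double mem_bytes_per_core, double n_processes)
    {
        return sqrt(mem_bytes_per_core * n_processes / 10.0);
    }

    static double hpl_mem_bytes_per_core(double N, double n_processes)
    {
        return 10.0 * N * N / n_processes;
    }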
We observe the time per FLOP as a function of 1/√m. For smaller values of per-process
memory, the second and third terms in Equation 7 increase the execution time per
FLOP, which means that communication overheads lower the HPL performance. For
infinite memory, T_per FLOP = γ3. This means that, as the per-core memory increases,
the HPL execution time per FLOP, and therefore the HPL performance, indeed analytically
converges to a steady value. The steady performance value is proportional to the number
of processes because the number of FLOPs is also proportional to the number of cores
(processes) used in the HPL run.
Reaching the steady execution time per FLOP. Next, we calculate per-core
memory needed to achieve steady values of execution time per FLOP. Also, we ana-
lyze how this amount of memory changes when increasing the size of HPC systems
(number of cores). Dividing (7) by γ3 to normalize relative to the lowest execution time
with infinite memory gives the relative execution time per FLOP:

    T_{\text{rel per FLOP}} = 1 + k_1 \frac{1}{\sqrt{m}} + \left(k_2\sqrt{n} + k_3\log n\right)\frac{1}{m} \qquad (8)
The constants k1, k2, k3, and γ3 depend on the hardware platform; for our platform we
fit them to the results from Figure 2, first fitting k2 and k3 and then k1 and γ3 using
linear regression, with a maximum error of 13% against the experiments. Then, we
analyze how much memory per core is needed to get close to the ideal result with
infinite memory, i.e., to reach a relative execution time per FLOP of ε. We obtain this
by solving a quadratic equation in 1/√m:

    \left(k_2\sqrt{n} + k_3\log n\right)\left(\frac{1}{\sqrt{m}}\right)^2 + k_1\frac{1}{\sqrt{m}} + (1-\varepsilon) = 0 \qquad (9)
The results for the ε values that correspond to 90%, 95%, and 99% of the ideal (infinite-
memory) HPL performance are explained in detail in Section 3.2. By taking the derivative
of the solution to Equation 9, we find that, for any fixed target ε (i.e., any fixed overhead
relative to the infinite-memory case), increasing the number of cores n always increases
the required memory per core m. This shows that, as the total number of cores is increased,
more memory per core is needed to achieve a good execution time per FLOP, and therefore
good HPL performance. This trend is confirmed by the experimental results in Section 3.1
and by the historical systems in Figure 3.
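To make this sizing rule concrete, Equation (9) can be solved numerically for m. The sketch below (our own helper) returns the positive root, with ε given as the target relative execution time per FLOP (e.g., 100/95 for 95% of the ideal performance); the fitted constants k1, k2, and k3 are platform specific and are not reproduced here, and the units of m and the base of the logarithm must match those used during fitting.

    #include <math.h>

    /* Per-core memory needed to reach a relative execution time per FLOP of
     * epsilon (> 1), i.e. the positive root of Equation (9) in x = 1/sqrt(m).
     * k1, k2, k3 come from the platform-specific fit; the returned memory is
     * in the same units used when the constants were fitted. */
    static double required_mem_per_core(double k1, double k2, double k3,
                                        double n_cores, double epsilon)
    {
        double a = k2 * sqrt(n_cores) + k3 * log2(n_cores); /* x^2 coefficient */
        double b = k1;                                      /* x coefficient   */
        double c = 1.0 - epsilon;                           /* < 0 for eps > 1 */
        double x = (-b + sqrt(b * b - 4.0 * a * c)) / (2.0 * a);
        return 1.0 / (x * x);
    }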
Writing k1, k2, and k3 in terms of α and β and then taking the derivatives with respect to
α and β shows that increasing the interconnect latency or reducing the interconnect bisection
bandwidth also increases the memory per core required for any fixed ε.
ACKNOWLEDGMENT
This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC,
Spanish Government through Severo Ochoa programme (SEV-2015-0493), by the Spanish Ministry of Sci-
ence and Technology through TIN2015-65316-P project and by the Generalitat de Catalunya (contracts
2014-SGR-1051 and 2014-SGR-1272). This work has also received funding from the European Union’s Hori-
zon 2020 research and innovation programme under ExaNoDe project (grant agreement No671578). Darko
Zivanovic holds the Severo Ochoa grant (SVP-2014-068501) of the Ministry of Economy and Competitive-
ness of Spain. The authors thank Harald Servat from BSC and Vladimir Marjanović from the High
Performance Computing Center Stuttgart for their technical support.
REFERENCES
Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt
Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Kather-
ine A. Yelick. 2006. The Landscape of Parallel Computing Research: A View from Berkeley. Technical
Report UCB/EECS-2006-183. EECS Department, University of California, Berkeley.
Daniel E. Atkins, Kelvin K. Droegemeier, Stuart I. Feldman, Stuart I. Feldman, Michael L. Klein, David G.
Messerschmitt, Paul Messina, Jeremiah P. Ostriker, and Margaret H. Wright. 2003. Revolutionizing
Science and Engineering Through Cyberinfrastructure. Report of the National Science Foundation Blue-
Ribbon Advisory Panel on Cyberinfrastructure. National Science Foundation.
Barcelona Supercomputing Center. 2013. MareNostrum III System Architecture. Technical Report.
Barcelona Supercomputing Center 2014. Extrae User guide manual for version 2.5.1. Barcelona Supercom-
puting Center.
Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite:
Characterization and Architectural Implications. In Proc. of the International Conference on Parallel
Architectures and Compilation Techniques (PACT). 72–81.
Susmit Biswas, Bronis R. de Supinski, Martin Schulz, Diana Franklin, Timothy Sherwood, and Frederic T.
Chong. 2011. Exploiting Data Similarity to Reduce Memory Footprints. In Proc. of the IEEE Interna-
tional Parallel & Distributed Processing Symposium (IPDPS). 152–163.
Mark Bull. 2013. PRACE-2IP: D7.4 Unified European Applications Benchmark Suite Final. (2013).
Chris Cantalupo, Karthik Raman, and Ruchira Sasanka. 2015. MCDRAM on 2nd Generation Intel Xeon
Phi Processor (code-named Knights Landing): Analysis Methods and Tools. International Conference
for High Performance Computing, Networking, Storage and Analysis (SC). (Nov. 2015). Tutorial.
Chiachen Chou, Aamer Jaleel, and Moinuddin K. Qureshi. 2014. CAMEO: A Two-Level Memory Organi-
zation with Capacity of Main Memory and Flexibility of Hardware-Managed Cache. In Proc. of the
International Symposium on Microarchitecture (MICRO). 1–12.
Xiangyu Dong, Yuan Xie, Naveen Muralimanohar, and Norman P. Jouppi. 2010. Simple but Effective Het-
erogeneous Main Memory with On-Chip Memory Controller Support. In Proc. of the International Con-
ference for High Performance Computing, Networking, Storage and Analysis (SC). 1–11.
Jack Dongarra, Michael Heroux, and Piotr Luszczek. 2016. The HPCG Benchmark. http://www.hpcg-
benchmark.org/. (2016).
Jack J. Dongarra and Michael A. Heroux. 2013. Toward a New Metric for Ranking High Performance Com-
puting Systems. Sandia Report SAND2013-4744. Sandia National Laboratories.
Jack J. Dongarra, Piotr Luszczek, and Michael A. Heroux. 2014. HPCG: One Year Later. In International
Supercomputing Conference (ISC).
Jack J. Dongarra, Piotr Luszczek, and Antoine Petitet. 2003. The LINPACK Benchmark: Past, Present and
Future. Concurrency and Computation: Practice and Experience 15, 9 (2003), 803–820.
ETP4HPC. 2013. ETP4HPC Strategic Research Agenda Achieving HPC leadership in Europe. (June 2013).
Hybrid Memory Cube Consortium. 2014. Hybrid Memory Cube Specification 2.0.
www.hybridmemorycube.org/specification-v2-download-form/. (Nov. 2014).
Intel. 2016a. Intel VTune Amplifier 2016. https://software.intel.com/en-us/intel-vtune-amplifier-xe/. (2016).
Intel. 2016b. The memkind library. http://memkind.github.io/memkind/. (2016).
JEDEC Solid State Technology Association. 2013. High Bandwidth Memory (HBM) DRAM.
www.jedec.org/standards-documents/docs/jesd235. (Oct. 2013).
James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Pro-
gramming: Knights Landing Edition (2nd ed.). Morgan Kaufmann.
Peter Kogge, Keren Bergman, Shekhar Borkar, Dan Campbell, William Carlson, William Dally, Monty
Denneau, Paul Franzon, William Harrod, Kerry Hill, Jon Hiller, Sherman Karp, Stephen Keckler,
Dean Klein, Robert Lucas, Mark Richards, Al Scarpelli, Steven Scott, Allan Snavely, Thomas Sterling,
R. Stanley Williams, and Katherine Yelick. 2008. ExaScale Computing Study: Technology Challenges
in Achieving Exascale Systems. (Sept. 2008).
Matthew J. Koop, Terry Jones, and Dhabaleswar K. Panda. 2007. Reducing Connection Memory Require-
ments of MPI for InfiniBand Clusters: A Message Coalescing Approach. In Proc. of the IEEE Interna-
tional Symposium on Cluster Computing and the Grid (CCGRID). 495–504.
Piotr Luszczek and Jack J. Dongarra. 2005. Introduction to the HPC Challenge Benchmark Suite. ICL Tech-
nical Report ICL-UT-05-01. University of Tennessee.
Vladimir Marjanović, José Garcia, and Colin W. Glass. 2014. Performance Modeling of the HPCG Bench-
mark. In High Performance Computing Systems. Performance Modeling, Benchmarking, and Simula-
tion. Springer International Publishing, 172–192.
Mitesh R. Meswani, Sergey Blagodurov, David Roberts, John Slice, Mike Ignatowski, and Gabriel H. Loh.
2015. Heterogeneous Memory Architectures: A HW/SW Approach for Mixing Die-stacked and Off-
package Memories. In IEEE International Symposium on High Performance Computer Architecture
(HPCA). 126–136.
Richard Murphy, Jonathan Berry, William McLendon, Bruce Hendrickson, Douglas Gregor, and Andrew
Lumsdaine. 2006. DFS: A Simple to Write Yet Difficult to Execute Benchmark. In IEEE International
Symposium on Workload Characterization (IISWC). 175–177.
Richard Murphy, Kyle Wheeler, Brian Barrett, and James Ang. 2010. Introducing the Graph 500. Cray
User’s Group (CUG). (May 2010).
NERSC. 2012. Large Scale Computing and Storage Requirements for High Energy Physics: Target 2017.
Report of the NERSC Requirements Review. Lawrence Berkeley National Laboratory.
NERSC. 2013. Large Scale Computing and Storage Requirements for Biological and Environmental Science:
Target 2017. Report of the NERSC Requirements Review LBNL-6256E. Lawrence Berkeley National
Laboratory.
NERSC. 2014a. High Performance Computing and Storage Requirements for Basic Energy Sciences: Target
2017. Report of the HPC Requirements Review LBNL-6978E. Lawrence Berkeley National Laboratory.
NERSC. 2014b. Large Scale Computing and Storage Requirements for Fusion Energy Sciences: Target 2017.
Report of the NERSC Requirements Review LBNL-6631E. Lawrence Berkeley National Laboratory.
NERSC. 2015a. High Performance Computing and Storage Requirements for Nuclear Physics: Target 2017.
Report of the NERSC Requirements Review LBNL-6926E. Lawrence Berkeley National Laboratory.
NERSC. 2015b. Large Scale Computing and Storage Requirements for Advanced Scientific Computing Re-
search: Target 2017. Report of the NERSC Requirements Review LBNL-6978E. Lawrence Berkeley
National Laboratory.
Chris J. Newburn. 2015. Code for the future: Knights Landing and beyond. International Supercomputing
Conference (ISC). (July 2015). IXPUG Workshop.
Jongsoo Park, Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Alexander Heinecke, Dhiraj D. Kalamkar,
Xing Liu, Md. Mosotofa Ali Patwary, Yutong Lu, and Pradeep Dubey. 2014. Efficient Shared-memory
Implementation of High-performance Conjugate Gradient Benchmark and Its Application to Unstruc-
tured Matrices. In Proc. of the International Conference for High Performance Computing, Networking,
Storage and Analysis (SC). 945–955.
Milan Pavlovic, Yoav Etsion, and Alex Ramirez. 2011. On the Memory System Requirements of Future
Scientific Applications: Four Case-studies. In Proc. of the IEEE International Symposium on Workload
Characterization (IISWC). 159–170.
Milan Pavlovic, Milan Radulovic, Alex Ramirez, and Petar Radojkovic. 2015. Limpio - LIghtweight
MPI instrumentatiOn. In Proc. of the Int. Conference on Program Comprehension (ICPC).
https://www.bsc.es/computer-sciences/computer-architecture/memory-systems/limpio, 303–306.
O. Perks, S.D. Hammond, S. J. Pennycook, and S. A. Jarvis. 2011. Should We Worry About Memory Loss?
SIGMETRICS Performance Evaluation Review 38, 4 (March 2011), 69–74.
Antoine Petitet, Clint Whaley, Jack Dongarra, Andy Cleary, and Piotr Luszczek. 2012. HPL - A Portable
Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers.
http://www.netlib.org/benchmark/hpl/. (Oct. 2012).
PRACE. 2013. Unified European Applications Benchmark Suite. www.prace-ri.eu/ueabs/. (2013).
PRACE. 2016. Prace Research Infrastructure. http://www.prace-ri.eu. (2016).
Milan Radulovic, Darko Zivanovic, Daniel Ruiz, Bronis R. de Supinski, Sally A. McKee, Petar Radojković,
and Eduard Ayguadé. 2015. Another Trip to the Wall: How Much Will Stacked DRAM Benefit HPC?. In
Proc. of the International Symposium on Memory Systems (MEMSYS). 31–36.
Jaewoong Sim, Alaa R. Alameldeen, Zeshan Chishti, Chris Wilkerson, and Hyesoon Kim. 2014. Transparent
Hardware Management of Stacked DRAM As Part of Memory. In Proc. of the International Symposium
on Microarchitecture (MICRO). 13–24.
Avinash Sodani. 2011. Race to Exascale: Opportunities and Challenges. International Symposium on Mi-
croarchitecture (MICRO). (Dec. 2011). Keynote.
Avinash Sodani, Roger Gramunt, Jesus Corbal, Ho-Seop Kim, Krishna Vinod, Sundaram Chinthamani,
Steven Hutsell, Rajat Agarwal, and Yen-Chen Liu. 2016. Knights Landing: Second-Generation Intel
Xeon Phi Product. IEEE Micro 36, 2 (March 2016), 34–46.
SPEC. 2015a. SPEC MPI2007. http://www.spec.org/mpi2007/. (2015).
SPEC. 2015b. SPEC OMP2012. https://www.spec.org/omp2012/. (2015).
Rick Stevens, Andy White, Pete Beckman, Ray Bair-ANL, Jim Hack, Jeff Nichols, Al GeistORNL, Horst
Simon, Kathy Yelick, John Shalf-LBNL, Steve Ashby, Moe Khaleel-PNNL, Michel McCoy, Mark Seager,
Brent Gorda-LLNL, John Morrison, Cheryl Wampler-LANL, James Peery, Sudip Dosanjh, Jim Ang-
SNL, Jim Davenport, Tom Schlagel, BNL, Fred Johnson, and Paul Messina. 2010. A Decadal DOE
Plan for Providing Exascale Applications and Technologies for DOE Mission Needs. Presentation at
Advanced Simulation and Computing Principal Investigators Meeting. (March 2010).
Erich Strohmaier, Jack Dongarra, Horst Simon, Martin Meuer, and Hans Meuer. 2015. TOP500 List.
http://www.top500.org/. (June 2015).
Frederick C. Wong, Richard P. Martin, Remzi H. Arpaci-Dusseau, and David E. Culler. 1999. Architectural
Requirements and Scalability of the NAS Parallel Benchmarks. In Proc. of the ACM/IEEE Conference
on Supercomputing (SC).
Steven Cameron Woo, Moriyoshi Ohara, and Evan Torrie. 1995. The SPLASH-2 Programs: Characterization
and Methodological Considerations. In Proc. of the International Symposium on Computer Architecture
(ISCA). 24–36.
Darko Zivanovic, Milan Radulovic, Germán Llort, David Zaragoza, Janko Strassburg, Paul M. Carpenter,
Petar Radojković, and Eduard Ayguadé. 2016. Large-Memory Nodes for Energy Efficient High-
Performance Computing. In International Symposium on Memory Systems (MEMSYS).
ACM Transactions on Embedded Computing Systems, Vol. V, No. N, Article 000, Publication date: 2016.
... These two levels of precision reduce the effects of the computational errors but only for the so-called fused operations, which are sequences of high precision operations done on the internal registers of the FPU. The main limiting factors in modern hardware architectures are memory size and bandwidth [5], [6]. To control them, the precision of data stored in the main memory becomes that of the memory format. ...
... We rely on main memory 8 for hosting the data used by the kernel (source data, in IEEE format, and working set). Passing through the system cache 4 , they are processed in the coprocessor scratchpad 6 . ...
... Provided that the compiler does an excellent job at static scheduling, these accumulations may be interleaved in the RF as long as space is available. The compiler must maximize the amount of computation in the RF 6 to minimize the precision losses and maximize the performance. However, the spilled variables can be stored in the processor L1 cache (if appropriately sized) with the UNUM format (with up to 256 bits of mantissa). ...
Thesis
Most of the Floating-Point (FP) hardware units support the formats and the operations specified in the IEEE 754 standard. These formats have fixed bit-length. They are defined on 16, 32, 64, and 128 bits. However, some applications, such as linear system solvers and computational geometry, benefit from different formats which can express FP numbers on different sizes and different tradeoffs among the exponent and the mantissa fields. The class of Variable Precision (VP) formats meets these requirements. This research proposes a VP FP computing system based on three computation layers. The external layer supports legacy IEEE formats for input and output variables. The internal layer uses variable-length internal registers for inner loop multiply-add. Finally, an intermediate layer supports loads and stores of intermediate results to cache memory without losing precision, with a dynamically adjustable VP format. The VP unit exploits the UNUM type I FP format and proposes solutions to address some of its pitfalls, such as the variable latency of the internal operation and the variable memory footprint of the intermediate variables. Unlike IEEE 754, in UNUM type I the size of a number is stored within its representation. The unit implements a fully pipelined architecture, and it supports up to 512 bits of precision, internally and in memory, for both interval and scalar computing. The user can configure the storage format and the internal computing precision at 8-bit and 64-bit granularity This system is integrated as a RISC-V coprocessor. The system has been prototyped on an FPGA (Field-Programmable Gate Array) platform and also synthesized for a 28nm FDSOI process technology. The respective working frequencies of FPGA and ASIC implementations are 50MHz and 600MHz. Synthesis results show that the estimated chip area is 1.5mm2, and the estimated power consumption is 95mW. The experiments emulated in an FPGA environment show that the latency and the computation accuracy of this system scale linearly with the memory format length set by the user. In cases where legacy IEEE-754 formats do not converge, this architecture can achieve up to 130 decimal digits of precision, increasing the chances of obtaining output data with an accuracy similar to that of the input data. This high accuracy opens the possibility to use direct methods, which are more sensitive to computational error, instead of iterative methods, which always converge. However, their latency is ten times higher than the direct ones. Compared to low precision FP formats, in iterative methods, the usage of high precision VP formats helps to drastically reduce the number of iterations required by the iterative algorithm to converge, reducing the application latency of up to 50%. Compared with the MPFR software library, the proposed unit achieves speedups between 3.5x and 18x, with comparable accuracy.
... However, this simplicity can lead to system designers and engineers over-or underestimating system performance as technology advances. Recently, it has become common to use multicore processors and/or accelerators, such as GPGPU with multiple layers of communication networks in supercomputers and HPC [17][18][19][20]. These communication layers include PCI Express (PCIe) and NVLink within nodes and system interconnection networks. ...
... The task is distributed on P × Q processors and performed in parallel. The operations and performance are thoroughly analyzed and described in [1][2][3]7,17]. ...
... Over the past decade, significant breakthroughs in memory technology have been achieved, resulting in high-bandwidth memory (HBM). HBM consists of 3D (or 2.5D) stacked DRAM technology that provides terabit-per-second bandwidths and large bit widths per package [17,18,20]. HBM can be integrated with conventional processors and/or accelerators, though it requires some interconnection silicon components, such as interposers. ...
Article
High-performance Linpack (HPL) is among the most popular benchmarks for evaluating the capabilities of computing systems and has been used as a standard to compare the performance of computing systems since the early 1980s. In the initial system-design stage, it is critical to estimate the capabilities of a system quickly and accurately. However, the original HPL mathematical model based on a single core and single communication layer yields varying accuracy for modern processors and accelerators comprising large numbers of cores. To reduce the performance-estimation gap between the HPL model and an actual system, we propose a mathematical model for multi-communication layered HPL. The effectiveness of the proposed model is evaluated by applying it to a GPU cluster and well-known systems. The results reveal performance differences of 1.1% on a single GPU. The GPU cluster and well-known large system show 5.5% and 4.1% differences on average, respectively. Compared to the original HPL model, the proposed multi-communication layered HPL model provides performance estimates within a few seconds and a smaller error range from the processor/accelerator level to the large system level.
... In Table 1 of [55], the total percentage of the small memory jobs is much higher than that of the large memory jobs. Third, the results of a previous study [56] show that 10 of the 13 production HPC applications under investigation have process sizes in the range of hundreds of megabytes. In Figure 8 of [56], only three of the applications have a process size larger than 1 gigabyte. ...
... Third, the results of a previous study [56] show that 10 of the 13 production HPC applications under investigation have process sizes in the range of hundreds of megabytes. In Figure 8 of [56], only three of the applications have a process size larger than 1 gigabyte. Therefore, we can discuss the effects of preemption delay on the SDSC and SX-ACE workloads using this range. ...
Article
Full-text available
Dedicated infrastructures are commonly used for urgent computations. However, using dedicated resources is not always affordable due to budget constraints. As a result, utilizing shared infrastructures becomes an alternative solution for urgent computations. Since the infrastructures are meant to serve many users, the urgent jobs may arrive when regular jobs are using the necessary resources. In such a case, it is necessary to preempt the regular jobs so that urgent jobs can be executed immediately. Most conventional methods for job scheduling have focused on reducing the response times and waiting times of all jobs. However, these methods can delay urgent jobs and hinder them from being completed within a stipulated deadline. Furthermore, in heterogeneous systems with coprocessors, preemption becomes more difficult because coprocessors rely on several system software functionalities provided by the host processor. In this paper, we propose a parallel job scheduling method to effectively use shared heterogeneous systems for urgent computations. Our method employs an in-memory process swapping mechanism to preempt jobs running on the coprocessor devices. The results of our simulations show that our method can achieve a significant reduction in the response time and slowdown of regular jobs without substantial delays of urgent jobs.
... Recently, data-intensive applications in various areas (e.g., big data, machine learning, and IoT) have processed a huge number of datasets, and this tendency toward larger data size is becoming more prevalent [3,6,21,24]. Scientific applications also show the requirements for large amounts of memory for intermediate data and use loops to repeat similar operations on all or part of the intermediate data [1,26,28]. Thus, the overall system performance of such applications is directly affected by the amount of main memory. ...
Article
Full-text available
The amount of data in modern computing workloads is growing rapidly. Meanwhile, the capacity of main memory is growing slowly; thus, memory management of operating systems plays an increasingly important role in application performance. Recent scientific applications process large amounts of data as well. They tend to manage intermediate data in anonymous pages and repeat core operations on the data using loops. However, LRU variants have difficulty handling loop access patterns in scientific applications, which are commonly used as a page replacement policy in the operating system. In this article, we propose a new page replacement scheme, called adaptive page replacement (APR) for looping access patterns in scientific applications. APR can detect various looping access patterns and handle them appropriately online by exploiting the information already available in the virtual memory subsystem of OS. We evaluate APR by trace-driven simulation with traces extracted from 12 workloads in the SPLASH-2x benchmark. Throughout our experimental results, we demonstrate that APR outperforms prior schemes including CLOCK.
... It could possibly be from the larger amount of memory each image used, but further investigation would be needed. This would fall in line with what has been noted before, that increasing memory per core will increase performance before reaching saturation [38]. ...
Article
Full-text available
This work added semi-Lagrangian convected air particles to the Intermediate Complexity Atmospheric Research (ICAR) model. The ICAR model is a simplified atmospheric model using quasi-dynamical downscaling to gain performance over more traditional atmospheric models. The ICAR model uses Fortran coarrays to split the domain amongst images and handle the halo region communication of the image’s boundary regions. The newly implemented convected air particles use trilinear interpolation to compute initial properties from the Eulerian domain and calculate humidity and buoyancy forces as the model runs. This paper investigated the performance cost and scaling attributes of executing unsaturated and saturated air particles versus the original particle-less model. An in-depth analysis was done on the communication patterns and performance of the semi-Lagrangian air particles, as well as the performance cost of a variety of initial conditions such as wind speed and saturation mixing ratios. This study found that given a linear increase in the number of particles communicated, there is an initial decrease in performance, but that it then levels out, indicating that over the runtime of the model, there is an initial cost of particle communication, but that the computational benefits quickly offset it. The study provided insight into the number of processors required to amortize the additional computational cost of the air particles.
Article
The expected halt of traditional technology scaling is motivating increased heterogeneity in high performance computing (HPC) systems with the emergence of numerous specialized accelerators. As heterogeneity increases, so does the risk of underutilizing expensive hardware resources if we preserve today’s rigid node configuration and reservation strategies. This has sparked interest in resource disaggregation to enable finer-grain allocation of hardware resources to applications. However, there is currently no data-driven study of what range of disaggregation is appropriate in HPC. To that end, we perform a detailed analysis of key metrics sampled in NERSC’s Cori, a production HPC system that executes a diverse open-science HPC workload. In addition, we profile a variety of deep learning applications to represent an emerging workload. We show that for a rack (cabinet) configuration and applications similar to Cori, a central processing unit (CPU) with intra-rack disaggregation has a 99.5% probability to find all resources it requires inside its rack. In addition, ideal intra-rack resource disaggregation in Cori could reduce memory and NIC resources by 5.36% to 69.01% and still satisfy the worst-case average rack utilization.
Conference Paper
Full-text available
Energy consumption is by far the most important contributor to HPC cluster operational costs, and it accounts for a significant share of the total cost of ownership. Advanced energy-saving techniques in HPC components have received significant research and development effort, but a simple measure that can dramatically reduce energy consumption is often overlooked. We show that, in capacity computing, where many small to medium-sized jobs have to be solved at the lowest cost, a practical energy-saving approach is to scale-in the application on large-memory nodes. We evaluate scaling-in; i.e. decreasing the number of application processes and compute nodes (servers) to solve a fixed-sized problem, using a set of HPC applications running in a production system. Using standard-memory nodes, we obtain average energy savings of 36%, already a huge figure. We show that the main source of these energy savings is a decrease in the node-hours (node_hours = #nodes x exe_time), which is a consequence of the more efficient use of hardware resources. Scaling-in is limited by the per-node memory capacity. We therefore consider using large-memory nodes to enable a greater degree of scaling-in. We show that the additional energy savings, of up to 52%, mean that in many cases the investment in upgrading the hardware would be recovered in a typical system lifetime of less than five years.
Article
Full-text available
A new sparse high performance conjugate gradient benchmark (HPCG) has been recently released to address challenges in the design of sparse linear solvers for the next generation extreme-scale computing systems. Key computation, data access, and communication pattern in HPCG represent building blocks commonly found in today's HPC applications. While it is a well known challenge to efficiently parallelize Gauss-Seidel smoother, the most time-consuming kernel in HPCG, our algorithmic and architecture-aware optimizations deliver 95% and 68% of the achievable bandwidth on Xeon and Xeon Phi, respectively. Based on available parallelism, our Xeon Phi shared-memory implementation of Gauss-Seidel smoother selectively applies block multi-color reordering. Combined with MPI parallelization, our implementation balances parallelism, data access locality, CG convergence rate, and communication overhead. Our implementation achieved 580 TFLOPS (82% parallelization efficiency) on Tianhe-2 system, ranking first on the most recent HPCG list in July 2014. In addition, we demonstrate that our optimizations not only benefit HPCG original dataset, which is based on structured 3D grid, but also a wide range of unstructured matrices.
Conference Paper
The TOP 500 list is the most widely regarded ranking of modern supercomputers, based on Gflop/s measured for High Performance LINPACK (HPL). Ranking the most powerful supercomputers is important: Hardware producers hone their products towards maximum benchmark performance, while nations fund huge installations, aiming at a place on the pedestal. However, the relevance of HPL for real-world applications is declining rapidly, as the available compute cycles are heavily overrated. While relevant comparisons foster healthy competition, skewed comparisons foster developments aimed at distorted goals. Thus, in recent years, discussions on introducing a new benchmark, better aligned with real-world applications and therefore the needs of real users, have increased, culminating in a highly regarded candidate: High Performance Conjugate Gradients (HPCG). In this paper we present an in-depth analysis of this new benchmark. Furthermore, we present a model, capable of predicting the performance of HPCG on a given architecture, based solely on two inputs: the effective bandwidth between the main memory and the CPU and the highest occuring network latency between two compute units. Finally, we argue that within the scope of modern supercomputers with a decent network, only the first input is required for a highly accurate prediction, effectively reducing the information content of HPCG results to that of a stream benchmark executed on one single node. We conclude with a series of suggestions to move HPCG closer to its intended goal: a new benchmark for modern supercomputers, capable of capturing a well-balanced mixture of relevant hardware properties.
Chapter
This chapter introduces Knights Landing, a many-core processor that delivers massive thread and data parallelism with high memory bandwidth. Knights Landing is the second generation of Intel® Xeon Phi™ products and uses a many-core architecture that both benefits from, and relies on, parallel programming. Key new innovations such as MCDRAM, cluster modes, and memory modes are explained at a high level.
Conference Paper
First defined two decades ago, the memory wall remains a fundamental limitation to system performance. Recent innovations in 3D-stacking technology enable DRAM devices with much higher bandwidths than traditional DIMMs. The first such products will soon hit the market, and some of the publicity claims that they will break through the memory wall. Here we summarize our analysis and expectations of how such 3D-stacked DRAMs will affect the memory wall for a set of representative HPC applications. We conclude that although 3D-stacked DRAM is a major technological innovation, it cannot eliminate the memory wall.
Article
This article describes the architecture of Knights Landing, the second-generation Intel Xeon Phi product family, which targets high-performance computing and other highly parallel workloads. It provides a significant increase in scalar and vector performance and a big boost in memory bandwidth compared to the prior generation, called Knights Corner. Knights Landing is a self-booting, standard CPU that is completely binary compatible with prior Intel Xeon processors and is capable of running all legacy workloads unmodified. Its innovations include a core optimized for power efficiency, a 512-bit vector instruction set, a memory architecture comprising two types of memory for high bandwidth and large capacity, a high-bandwidth on-die interconnect, and an integrated on-package network fabric. These features enable the Knights Landing processor to provide significant performance improvement for computationally intensive and bandwidth-bound workloads while still providing good performance on unoptimized legacy workloads, without requiring any special way of programming other than the standard CPU programming model.
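As a concrete, hedged example of how software can target the two memory types mentioned above when the high-bandwidth MCDRAM is exposed in flat mode, the sketch below uses the memkind library's hbwmalloc interface; the article itself does not prescribe this API, and the allocation size here is arbitrary.

/* Allocating from high-bandwidth MCDRAM (flat mode) via memkind's
 * hbwmalloc interface, falling back to regular malloc if it is absent.
 * Illustrative sketch only; link with -lmemkind. */
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>

int main(void)
{
    size_t n = 1 << 20;               /* arbitrary size: 1 Mi doubles */
    int use_hbw = (hbw_check_available() == 0);

    /* Place bandwidth-critical data in MCDRAM when available,
     * otherwise in the regular DDR capacity tier. */
    double *buf = use_hbw ? hbw_malloc(n * sizeof(double))
                          : malloc(n * sizeof(double));
    if (!buf)
        return 1;

    printf("Buffer allocated in %s\n", use_hbw ? "MCDRAM" : "DDR");
    buf[0] = 1.0;                     /* touch the buffer */

    if (use_hbw)
        hbw_free(buf);
    else
        free(buf);
    return 0;
}

In cache mode, by contrast, MCDRAM is transparent to software and no such explicit allocation call is needed.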
Article
Recent technology advancements allow for the integration of large memory structures on-die or as die-stacked DRAM. Such structures provide higher bandwidth and faster access times than off-chip memory. Prior work has investigated using the large integrated memory as a cache, or as part of a heterogeneous memory system managed by the OS. Using this memory as a cache would waste a large fraction of the total memory space, especially in systems where the stacked memory could be as large as the off-chip memory. An OS-managed heterogeneous memory system, on the other hand, requires costly usage-monitoring hardware to migrate frequently used pages, and is often unable to capture pages that are highly utilized for short periods of time. This paper proposes a practical, low-cost architectural solution that efficiently uses large fast memory as Part-of-Memory (PoM) seamlessly, without OS involvement. Our PoM architecture effectively manages two different types of memory (slow and fast), combined to create a single physical address space. To achieve this, PoM implements the ability to dynamically remap regions of memory based on their access patterns and expected performance benefits. Our proposed PoM architecture improves performance by 18.4% over static mapping and by 10.5% over an ideal OS-based dynamic remapping policy.
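The remapping idea can be pictured with a small software analogy: a table tracks access counts per slow-memory segment and promotes a segment into a fast-memory slot once its count crosses a threshold. This is a purely illustrative sketch of the concept, not the hardware structures proposed in the paper; all names, sizes, and thresholds are invented.

/* Software analogy of a PoM-style remap table: promote a slow-memory
 * segment to a fast-memory slot when its access count exceeds a threshold.
 * Purely illustrative; all names, sizes, and thresholds are invented. */
#include <stdio.h>

#define NUM_SEGMENTS      8   /* slow-memory segments tracked (invented) */
#define NUM_FAST_SLOTS    2   /* fast-memory slots available (invented)  */
#define PROMOTE_THRESHOLD 4

static int access_count[NUM_SEGMENTS];
static int fast_slot_of[NUM_SEGMENTS];   /* -1 = still mapped to slow memory */
static int slots_used;

static void record_access(int segment)
{
    if (fast_slot_of[segment] >= 0)
        return;                          /* already remapped to fast memory  */
    if (++access_count[segment] >= PROMOTE_THRESHOLD && slots_used < NUM_FAST_SLOTS) {
        fast_slot_of[segment] = slots_used++;   /* "remap" the hot segment   */
        printf("segment %d promoted to fast slot %d\n",
               segment, fast_slot_of[segment]);
    }
}

int main(void)
{
    for (int s = 0; s < NUM_SEGMENTS; ++s)
        fast_slot_of[s] = -1;

    /* Invented access trace: segment 3 is hot, segment 5 is cold. */
    int trace[] = {3, 5, 3, 3, 3, 5, 3};
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; ++i)
        record_access(trace[i]);
    return 0;
}

In the actual proposal this bookkeeping is done in hardware at fine granularity, which is what lets it react to short-lived hot regions that an OS-based policy would miss.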
Article
Die-stacked DRAM is a technology that will soon be integrated into high-performance systems. Recent studies have focused on hardware caching techniques to make use of the stacked memory, but these approaches require complex changes to the processor and cannot leverage the stacked memory to increase the system's overall memory capacity. In this work, we explore the challenges of exposing the stacked DRAM as part of the system's physical address space. This non-uniform memory access (NUMA)-style approach greatly simplifies the hardware and increases the physical memory capacity of the system, but pushes the burden of managing the heterogeneous memory architecture (HMA) onto the software layers. We first explore simple (and somewhat impractical) schemes to manage the HMA, and then refine the mechanisms to address a variety of hardware and software implementation challenges. In the end, we present an HMA approach with low hardware and software impact that can dynamically tune itself to different application scenarios, achieving performance even better than the (impractical-to-implement) baseline approaches.
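To make the software-managed flavour of such an HMA concrete, the hedged sketch below migrates a single page between two NUMA nodes with libnuma's move_pages(); it stands in for the OS-level mechanisms discussed above and assumes that the off-chip and stacked memories are exposed as NUMA nodes 0 and 1, which is an assumption on our part rather than part of the paper.

/* Hedged sketch: migrating one page between NUMA nodes with libnuma,
 * as a stand-in for software-managed heterogeneous memory. Assumes the
 * fast (stacked) memory appears as NUMA node 1; link with -lnuma. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <numa.h>
#include <numaif.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    long page_size = sysconf(_SC_PAGESIZE);
    void *buf = numa_alloc_onnode(page_size, 0);  /* start on the "slow" node 0 */
    if (!buf)
        return 1;
    ((char *)buf)[0] = 1;                         /* touch so the page exists   */

    void *pages[1]  = { buf };
    int   nodes[1]  = { 1 };                      /* assumed "fast" node        */
    int   status[1] = { -1 };

    /* Ask the kernel to migrate the page to node 1. */
    if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) == 0)
        printf("page now on node %d\n", status[0]);
    else
        perror("move_pages");

    numa_free(buf, page_size);
    return 0;
}

An OS-managed HMA would issue such migrations automatically, driven by page-access monitoring, which is the policy space the paper explores and refines.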