Runtime Support for Distributed Dynamic Locality
Tobias Fuchs and Karl Fürlinger
Ludwig-Maximilians-Universität (LMU) München, Computer Science Department
Oettingenstr. 67, 80538 Munich, Germany
Abstract. Single node hardware design is shifting to a heterogeneous nature and
many of today’s largest HPC systems are clusters that combine heterogeneous
compute device architectures. The need for new programming abstractions in the
advancements to the Exascale era has been widely recognized and variants of
the Partitioned Global Address Space (PGAS) programming model are discussed
as a promising approach in this respect. In this work, we present a graph-based
approach to provide runtime support for dynamic, distributed hardware local-
ity, specifically considering heterogeneous systems and asymmetric, deep mem-
ory hierarchies. Our reference implementation dyloc leverages hwloc to provide
high-level operations on logical hardware topology based on user-specified pred-
icates such as filter- and group transformations and locality-aware partitioning.
To facilitate integration in existing applications, we discuss adapters to maintain
compatibility with the established hwloc API.
1 Introduction
The cost of accessing data in Exascale systems is expected to be the dominant fac-
tor in terms of execution time and energy consumption [11]. To minimize data move-
ment, programming systems must therefore shift from a compute-centric to a more
data-centric focus.
The Partitioned Global Address Space (PGAS) model is particularly suitable for
programming abstractions for data locality [3] but differentiates only between local and
remote data access in its conventional form. This two-level abstraction lacks the expres-
siveness to model locality of increasingly deep and heterogeneous machine hierarchies.
To facilitate plasticity, the capability of software to adapt to the underlying hardware
architecture and available resources, programmers must be provided with fine-grained
control of data placement in the hardware topology. The 2014 PADAL report [11] sum-
marizes a wish list on programming environment features to facilitate this task. This
work is motivated by two wish list items in particular:

- Flexible, memory-agnostic mappings of abstract processes to given physical architectures
- Concise interfaces for hardware models that adjust the level of detail to the requested accuracy
This work introduces an abstraction of dynamic distributed locality with specific
support for deep asymmetric memory hierarchies of heterogeneous systems which typ-
ically do not exhibit an unambiguous tree structure. In this context, dynamic locality
refers to the capability to create logical representations of physical hardware com-
ponents from run-time specified, imperative and declarative constraints. Application-
specific predicates can be applied as distance- and affinity metrics to define measures
of locality. Our approach employs a graph-based internal representation of hierarchical
locality domains. Its interface allows requesting lightweight views which represent the
complex locality graph as a well-defined, consolidated hierarchy.
The remainder of this paper is structured as follows: After a brief review of related
work, we illustrate the need for dynamic hardware locality support using requirements
identified in the DASH library. Section 4 introduces the concept of a graph-based
locality topology and general considerations for its implementation, then outlines
fundamental operations on locality hierarchies and selected semantic details. To
substantiate our conceptual findings, we introduce our reference implementation
dyloc and explain how it achieves interoperability with hwloc
in Sec. 5. Finally, the benefit of the presented techniques is evaluated in a use case on
SuperMIC, a representative heterogeneous Ivy Bridge / Xeon Phi system.
2 Related Work
Hierarchical locality is incorporated in numerous approaches to facilitate programmability
of the memory hierarchy. Most dynamic schemes are restricted to two levels of the
machine hierarchy.
In X10, memory and execution space is composed of places, and tasks execute at
specific places. Remote data can only be accessed by spawning a task at the target place.
Chapel has a similar concept of locales.
The task model implemented in Sequoia [1] does not consider hardware capacity
for task decomposition and communication is limited to parameter exchange between
adjacent parent and child tasks.
Hierarchical Place Trees (HPT) [12] extend the models of Sequoia and X10 and in-
crease flexibility of task communication and instantiation. Some fundamental concepts
of HPT like hierarchical array views have been adopted in DASH. The HPT program-
ming model is substantially task-parallel, however, and based on task queues assigned
to places. HPTs model only static intra-node locality collected at startup.
All abstractions of hierarchical locality in related work model the machine hierar-
chy as a tree structure, including the de-facto standard hwloc. However, shortcomings of
trees for modeling modern heterogeneous architectures are known [8] while hierarchi-
cal graphs have been shown to be more practicable to represent locality and hardware
capacity in task models [9].
Notably, the authors of hwloc explain that graph data structures are used in the net-
work topology component netloc as a tree-based model was too strict and inconvenient
[7]. We believe that this reasoning also applies to node-level hardware. Regarding cur-
rent trends in HPC hardware configurations, we observed that interdependent character-
istics of horizontal and vertical locality in heterogeneous systems cannot be sufficiently
and unambiguously represented in a single, conventional tree. This is already evident
for recent architectures with cores connected in grid- and ring bus topologies.
More importantly, heterogeneous hosts require communication schemes and virtual
process topologies that are specific to the hardware configuration and the algorithm
scenario. This involves concepts of vertical and horizontal locality that are not based
on latency and throughput as distance measures. For example, in a typical accelerator-
offloading algorithm with a final reduction phase, processes first consider physical
distance and horizontal locality. For communication in the reduction phase, distance is
measured based on PCI interface affinity to optimize for vertical locality.
Still, formal considerations cannot disprove the practical benefit of tree data struc-
tures as a commonly understood mental model for algorithms and application devel-
opment. We therefore came to the conclusion that two models of hardware locality are
required: an internal physical model representing the machine architecture in a detailed,
immutable graph and logical views resulting from projections of the physical model to
a simplified tree structure.
3 Background and Motivation
The concepts discussed in the following sections evolved from specific requirements of
DASH, a C++ template library for distributed containers and algorithms in Partitioned
Global Address Space. While the concepts and methods presented in this work do not
depend on a specific programming model, terminology and basic assumptions regard-
ing domain decomposition and process topology have been inherited from DASH. In
this section, these are briefly discussed as motivating use cases for dynamic hardware
locality support.

Fig. 1. Team hierarchy created from two balanced splits: numbers in boxes indicate unit ranks
relative to the current team, with corresponding global ranks above
Virtual Process Topology: DASH Teams In the DASH execution model, individual
computation entities are called units, a generic name chosen because terms such as
process or thread have a specific connotation that might be misleading for some run-
time system concepts. In the MPI-based implementation of the DASH runtime, a unit
corresponds to an MPI rank.
Units are organized in hierarchical teams to represent the logical structure of algo-
rithms and machines in a program [10]. On initialization of the DASH runtime, all units
are assigned to the predefined team instance ALL. New teams can only be created by
specifying a subset of a parent team in a split operation. Splitting a team creates an
additional level in the team hierarchy [6].
In the basic variant of the team split operation, units are evenly partitioned in a spec-
ified number of child teams of balanced size. A balanced split does not respect hardware
locality but has low complexity and no communication overhead. It is therefore prefer-
able for teams in flat memory hierarchy segments. On systems with asymmetric or deep
memory hierarchies, it is highly desirable to split a team such that locality of units
within every child team is optimized. A locality-aware split at node level could group
units by affinity to the same NUMA domain, for example.
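The grouping step of such a locality-aware split can be sketched in a few lines. The following is a hypothetical illustration, not the DASH API: unit ranks are bucketed by the locality domain they are pinned to, yielding one child team per NUMA domain. The unit-to-domain mapping would be obtained from hwloc or dyloc at runtime; here it is passed in explicitly.

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch: partition unit ranks into child teams by NUMA domain affinity.
std::vector<std::vector<int>>
locality_split(const std::map<int, std::string>& unit_domain) {
  std::map<std::string, std::vector<int>> groups;
  for (const auto& kv : unit_domain)
    groups[kv.second].push_back(kv.first);    // bucket units per domain
  std::vector<std::vector<int>> teams;
  for (auto& g : groups)
    teams.push_back(std::move(g.second));     // one child team per domain
  return teams;
}
```

Unlike the balanced split, this variant requires a locality query per unit, but every resulting child team spans a single memory domain.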
Organizing units by locality requires means to query their affinity in the hardware
topology. Resolving NUMA domains from given process IDs can be reliably realized
using hwloc. When collaboration schemes are to be optimized for a specific communi-
cation bus, especially with grid- and ring topologies, concepts of affinity and distance
soon depend on higher-order predicates and differ from the textbook intuition of mem-
ory hierarchies.
This does not refer to experimental, exotic architecture designs but already applies
to systems actively used at the time of this writing. Figure 2 shows the physical struc-
ture of a SuperMIC system at host level and its common logical interpretation. Note
that core affinity to PCI interconnect can be obtained, for example by traversing hwloc
topology data, but is typically not exploited in applications due to the lack of a locality
information system that allows expressing high-level, declarative views.

Fig. 2. Hardware locality of a single SuperMIC compute node, with the host-level physical
architecture to the left and the corresponding logical locality domains, including two MIC
coprocessors, to the right.
Adaptive Unit-Level Parallelism Node-level work loads of nearly all distributed al-
gorithms can be optimized using unit-level parallelization like multithreading or SIMD
operations. The available parallelization techniques and their suitable configuration de-
pend on the unit’s placement in the process- and hardware topology. As this can only be
determined during execution, this again requires runtime support for dynamic hardware
locality that allows querying available capacities of locality domains – such as cache
sizes, bus capacity, and the number of available cores – depending on the current team
configuration.
Domain Decomposition: DASH Patterns The Pattern concept in DASH [5] allows
user-specified data distributions similar to Chapel’s domain maps [2]. As only specific
combinations of algorithms and data distribution schemes maintain data locality, hard-
ware topology and algorithm design are tightly coupled. Benefits of topology-aware
selection of algorithms and patterns for multidimensional arrays have been shown in
previous work [4].
4 Locality Domain Hierarchies
An hwloc distance matrix allows expressing a single valid representation of hardware
locality for non-hierarchical topologies. However, it is restricted to latency and through-
put as distance measures. A distance matrix can express the effects of grouping and
view operations but does not support high-level queries and has to be recalculated for
every modification of the topology view. In this section, we present the Locality Domain
Hierarchy (LDH) model, which extends the hwloc topology model by additional
properties and operations to represent the locality topology as a dynamic graph.
In more formal terms, we model hardware locality as a directed, acyclic, multi-indexed
multigraph. In this graph, nodes represent Locality Domains, which refer to any physical
or logical component of a distributed system with memory or computation capacities,
corresponding to places in X10 or Chapel's locales. Edges in the graph are directed and
denote one of the following relationships:
Containment indicating that the target domain is logically or physically contained in
the source domain
Alias source and target domains are only logically separated and refer to the same
physical domain; this is relevant when searching for a shortest path, for example
Leader the source domain is restricted to communication with the target domain
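A minimal sketch of this representation (not the dyloc data structures themselves) tags each directed edge with one of the three relationships; traversals then select the edge kinds relevant to their query, e.g. following only Containment edges when enumerating the containment hierarchy:

```cpp
#include <string>
#include <utility>
#include <vector>

// Sketch: locality domain graph with typed, directed edges.
enum class Edge { Containment, Alias, Leader };

struct Domain {
  std::string tag;                                  // e.g. ".0.1"
  std::vector<std::pair<Edge, const Domain*>> edges;
};

// Collect the tags of all domains transitively contained in d,
// following Containment edges only (Alias and Leader edges do not
// contribute to the containment hierarchy).
void contained(const Domain& d, std::vector<std::string>& out) {
  for (const auto& e : d.edges)
    if (e.first == Edge::Containment) {
      out.push_back(e.second->tag);
      contained(*e.second, out);
    }
}
```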
Figure 3 outlines components of the locality domain concept in a simplified ex-
ample. A locality hierarchy is specific to a team and only contains domains that are
populated by the team’s units. At initialization, the runtime initializes the default team
ALL as root of the team hierarchy with all units and associates the team with the global
locality graph containing all domains of the machine topology.
Leaf nodes in the locality hierarchy are units, the lowest addressable domain cate-
gory. A single unit has affinity to a specific physical core but may utilize multiple cores
or shared memory segments exclusively. Domain capacities such as cores and shared
memory are equally shared by the domain's units if not specified otherwise. In the
example illustrated in Fig. 3, two units assigned to a NUMA domain of 12 cores each
utilize 6 cores.

Fig. 3. Domain nodes in a locality hierarchy with domain attributes: dynamically accumulated
capacities and invariant capabilities
When a team is split, its locality graph is partitioned among child teams such that
a single partition is coherent and only contains domains with at least one leaf occupied
by a unit in the child team. This greatly simplifies implementation of locality-aware
algorithms as any visible locality domain is guaranteed to be accessible by some unit in
the current team configuration.
4.1 Domain Attributes and Properties
The topological characteristics of a domain’s corresponding physical component are
expressed as three correlated yet independent attributes:
scope category of physical or logical component represented by the domain object such
as "socket" or "L3D cache"
level number of logical indirections between the locality domain and the hierarchy
root; not necessarily related to distance
domain_tag the domain’s hierarchical path from the root domain, consisting of relative
subdomain offsets separated by a dot character
Domain tags serve as unique identifiers and allow locating domains without searching
the hierarchy. For any set of domains, the longest common prefix of their domain
tags identifies their lowest common ancestor, for example. Apart from these attributes,
a domain is associated with two property maps:
Capabilities invariant hardware locality properties that do not depend on the locality
graph’s structure, like the number of threads per core, cache sizes, or SIMD width.
Capacities derivative properties that might become invalid when the graph structure is
modified, like L3 cache size available per unit
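The LCA lookup via domain tags can be sketched directly: split each tag into its dot-separated subdomain offsets, keep the longest common component prefix, and re-join it. This is an illustration of the tag scheme described above, not dyloc code; comparing whole components (rather than characters) avoids false matches between offsets such as "1" and "10".

```cpp
#include <string>
#include <vector>

// Split a domain tag such as ".0.1" into its subdomain offsets.
// The root domain has tag "." and no offsets.
static std::vector<std::string> components(const std::string& tag) {
  std::vector<std::string> c;
  for (size_t start = 1; start <= tag.size();) {
    size_t dot = tag.find('.', start);
    if (dot == std::string::npos) dot = tag.size();
    if (dot > start) c.push_back(tag.substr(start, dot - start));
    start = dot + 1;
  }
  return c;
}

// Lowest common ancestor = longest common component prefix of the tags.
std::string lca_tag(const std::vector<std::string>& tags) {
  if (tags.empty()) return ".";
  std::vector<std::string> common = components(tags.front());
  for (const auto& t : tags) {
    std::vector<std::string> c = components(t);
    size_t n = 0;
    while (n < common.size() && n < c.size() && common[n] == c[n]) ++n;
    common.resize(n);
  }
  std::string out;
  for (const auto& c : common) out += "." + c;
  return out.empty() ? "." : out;
}
```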
Dynamic locality support requires means to specify transformations on the physical
topology graph as views. Views realize a projection but must not actually modify the
original graph data. Invariant properties are therefore stored separately and assigned
to domains by reference only. A view only contains a shallow copy of the graph data
structure and only the capacities of domains included in the view.
4.2 Operations on Locality Domains
A specific domain node can be queried by its unique domain tag or by one of its units.
Conceptually, the locality hierarchy model is a directed, multi-relational graph, so any
operation expressed in path algebra for multi-relational graphs is conceptually feasible
and highly expressive, but overly complex. For the use cases we have identified in
applications so far, it is sufficient to provide the operations with the semantics listed in
Fig. 4, apart from unsurprising operations for node traversal and lookup by identifier.
These can be applied to any domain in a locality hierarchy, including its root domain
to include the entire hierarchy.
domain_group(d, domain_tags[])   -> d'    separate subdomains into a group domain
domain_select(d, domain_tags[])  -> d'    remove all but the specified subdomains
domain_exclude(d, domain_tags[]) -> d'    remove the specified subdomains
domain_copy(d)                   -> d'    create a copy of domain d
domain_at(d, u)                  -> t     get tag of the domain assigned to unit u
domain_find_if(d, pred)          -> t[]   get tags of domains satisfying a predicate

Fig. 4. Fundamental operations of the Locality Domain concept on a locality domain hierarchy d.
Modifying operations return the result of their operation as a locality domain view d'.
Operations for selection and exclusion are applied to subdomains recursively. The
runtime interface can define complex high-level functions based on combinations of
these fundamental operations. To restrict a virtual topology to a single NUMA affinity,
for example:
numa_tags := domain_find_if(topo, (d | d.scope = NUMA))
numa_topo := domain_select(topo, numa_tags[1])
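The recursive selection semantics can be sketched on a plain domain tree. This is an illustration of the domain_select behavior described above, not the dyloc implementation: a subdomain is kept if it is an ancestor, a descendant, or an exact match of one of the selected tags, and the filter recurses into kept subtrees.

```cpp
#include <string>
#include <vector>

struct Dom {
  std::string tag;
  std::vector<Dom> children;
};

// ancestor-or-self test on tags at '.' boundaries: is a a prefix-domain of b?
static bool is_prefix(const std::string& a, const std::string& b) {
  return b.size() >= a.size() && b.compare(0, a.size(), a) == 0 &&
         (b.size() == a.size() || b[a.size()] == '.');
}

// Keep only subtrees on the path to (or below) a selected domain.
Dom domain_select(const Dom& d, const std::vector<std::string>& sel) {
  Dom view{d.tag, {}};
  for (const auto& c : d.children) {
    bool keep = false;
    for (const auto& s : sel)
      keep = keep || is_prefix(c.tag, s) || is_prefix(s, c.tag);
    if (keep) view.children.push_back(domain_select(c, sel));
  }
  return view;
}
```

domain_exclude follows the same pattern with the keep condition negated for descendants of the excluded tags.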
The domain_group operation combines an arbitrary set of domains in a logical group.
This is useful in various situations, especially when specific units are assigned to special
roles, often depending on a phase in an algorithm. For example, Intel suggests the
leader role communication pattern¹ for applications running MPI processes on Xeon
Phi accelerator modules, where communication between MPI ranks on host and accel-
erator is restricted in the reduction phase to a single, dedicated process on either side.

¹ xeon-phi-coprocessor-system-software-developers-guide.pdf
As groups are virtual, their level is identical to the original LCA of the grouped
domains and their communication cost is 0. Like any other modification of a locality
graph's structure, adding domain groups does not affect measures of distance or commu-
nication cost, as a logical rearrangement has, of course, no effect on physical connectivity.
Figure 5 illustrates the steps of the domain grouping algorithm.
Fig. 5. Simplified illustration of the domain grouping algorithm. Domains 100 and 110 in NUMA
scope are separated into a group. To preserve the original topology structure, the group includes
their parent domains up to the lowest common ancestor with domain 121 as alias of domain 11.
4.3 Specifying Distance and Affinity Metrics
Any bidirectional connection between a domain and its adjacent subdomains in the
locality hierarchy model represents a physical bus exhibiting characteristic communi-
cation overhead, such as a cache crossbar or a network interconnect. Therefore, a cost
function cost(d) can be specified for any domain d to specify the communication cost of
the medium connecting its immediate subdomains. This allows defining a measure of
locality for a pair of domains (d_a, d_b) as the cumulative cost of the shortest-path con-
nection, restricted to domains below their lowest common ancestor (LCA). A domain
has minimal distance 0 to itself.
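One plausible formalization of this measure on domain tags (a sketch under our own reading, not dyloc code): climbing from an endpoint towards the LCA crosses the bus of every intermediate ancestor, and the LCA's own bus is crossed once to connect the two branches. Here cost maps a domain tag to the communication cost of the medium connecting its subdomains.

```cpp
#include <map>
#include <string>

// Parent tag: drop the last ".k" offset; the parent of ".0" is the root ".".
static std::string parent(const std::string& tag) {
  size_t dot = tag.find_last_of('.');
  return dot == 0 ? "." : tag.substr(0, dot);
}

// LCA of two tags: trim the deeper tag until both chains meet.
static std::string lca(std::string a, std::string b) {
  while (a != b) {
    if (a.size() >= b.size()) a = parent(a);
    else                      b = parent(b);
  }
  return a;
}

// Cumulative bus cost of the ancestors of t strictly between t and stop.
static int climb_cost(std::string t, const std::string& stop,
                      const std::map<std::string, int>& cost) {
  int d = 0;
  while (t != stop && t != ".") {
    t = parent(t);
    if (t != stop) d += cost.at(t);
  }
  return d;
}

int distance(const std::string& a, const std::string& b,
             const std::map<std::string, int>& cost) {
  if (a == b) return 0;                  // minimal distance to itself
  std::string l = lca(a, b);
  return climb_cost(a, l, cost) + climb_cost(b, l, cost) + cost.at(l);
}
```

With cost(".") = 100 for the node interconnect and cost(".0") = 10 for a NUMA crossbar, two units under the same NUMA domain are at distance 10, while units in different NUMA domains are at distance 120.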
As outlined in Sec. 2, heterogeneous hosts require communication schemes and virtual
process topologies that are specific to the hardware configuration and algorithm scenario:
in the accelerator-offload example, processes first consider physical distance and
horizontal locality, while distance in the reduction phase is measured by PCI interface
affinity to optimize for vertical locality.
5 The dyloc Library
Initial concepts of the dyloc library have been implemented for locality discovery in the
DASH runtime. There, hardware locality information from hwloc, PAPI, libnuma, and
LIKWID was combined into a unified data structure that allowed querying locality
information by process ID or affinity.
Fig. 6. Using dyloc as an intermediate step in locality discovery.
This query interface proved to be useful for static load balancing on heterogeneous
systems like SuperMIC and was recently made available as the standalone library
dyloc². Figure 7 outlines the structure of its dependencies and interfaces, with APIs
provided for C and C++.
Fig. 7. Dependencies and interfaces of the dyloc/dylocxx library
The Boost Graph Library³ offers an ideal abstraction for high-level operations on
locality domain graphs. These are exposed in the C++ developer API and may be mod-
ified by user-specified extensions. The boost graph concepts specify separate storage of
node properties and the graph structure. This satisfies the requirements of the domain
topology data structure as introduced in Sec. 4 where domain capabilities are indepen-
dent from the topology structure. As a consequence, consolidated views on a locality
graph do not require deep copies of domain nodes. Only their accumulative capacities
have to be recalculated.
We consider compatibility to existing concepts in the hwloc API a critical require-
ment and therefore ensured, to the best of our knowledge and understanding, that con-
figurations of dyloc’s graph-based locality model can be projected to a well-defined
hierarchy and exported to hwloc data structures.
A possible scenario is illustrated in Fig. 6. Topology data provided by hwloc for
separate nodes are combined into a unified dyloc locality graph that supports high-level
operations. Queries and transformations on the graph return a light-weight view that
can be converted to a hwloc topology and then used in applications instead of topology
objects obtained from hwloc directly.
6 Proof of Concept: Work balancing min_element on SuperMIC
The SuperMIC system⁴ consists of 32 compute nodes with identical hardware configu-
ration of two NUMA domains, each containing an Ivy Bridge host processor (8 cores)
and a Xeon Phi "Knights Corner" coprocessor (Intel MIC 5110P) as illustrated in
Fig. 2. This system configuration is an example of both increased depth of the machine
hierarchy and heterogeneous node-level architecture.
    TeamLocality       tloc(dash::Team::All());
    LocBalancedPattern pattern(NELEM, tloc);
    dash::Array<T>     array(pattern);

    GlobIt min_element(GlobIt first, GlobIt last) {
      auto uloc     = UnitLocality(myid());
      auto nthreads = uloc.num_threads();
      #pragma omp parallel for num_threads(nthreads)
      { /* ... find local result ... */ }
      dash::barrier();
      // broadcast local result:
      auto leader = uloc.at_scope(scope::MODULE)
                        .unit_ids()[0];
      if (leader == myid())
        ...
    }
Listing 1.1. Pseudo code of the modified min_element algorithm
To substantiate how asymmetric, heterogeneous system configurations introduce a
new dimension to otherwise trivial algorithms, we briefly discuss the implementation
of the min_element algorithm in DASH. Its original variant is implemented as follows:
domain decomposition divides the element range into contiguous blocks of identical
size. All units then run a thread-parallel scan on their local block for a local minimum
and enter a collective barrier once it has been found. Once all units completed their
local work load, local results are reduced to the global minimum. For portable work
load balancing on heterogeneous systems, the employed domain decomposition must
dynamically adapt to the unit’s available locality domain capacities and capabilities:
Capacities: total memory capacity on MIC modules is 8 GB for 60 cores, significantly
less than 64 GB for 32 cores on host level
Capabilities: MIC cores have a base clock frequency of 1.1 GHz and 4 SMT threads,
versus 2.8 GHz and 2 SMT threads at host level
Fig. 8. Trace of process states (local work, barrier wait) in min_element on a SuperMIC node,
without load balancing (symmetric decomposition) and with it, for 16 host processes (1 thread/core)
and 2x6 MIC processes (4 threads/core), exposing the effect of load balancing based on dynamic
hardware locality
Listing 1.1 contains the abbreviated modified implementation of the min_element
scenario utilizing the runtime support proposed in this work. The full implementation
is available in the DASH source distribution⁵.
7 Conclusion and Future Work
Even with the improvements to the min_element algorithm explained in Sec. 6, the
implementation is not yet fully portable: the load factor adjusting for the differing
elements/ms throughput has been determined by auto-tuning. In future work, we will extend the
locality hierarchy model by means to register progress in local work loads to allow self-
adaptation of algorithms depending on load imbalance measured for specified sections.
Acknowledgements This work was partially supported by the German Research Founda-
tion (DFG) through the German Priority Programme 1648 Software for Exascale Com-
puting (SPPEXA) and by the German Federal Ministry of Education and Research
(BMBF) through the MEPHISTO project, grant agreement 01IH16006B.
References
1. Michael Bauer, John Clark, Eric Schkufza, and Alex Aiken. Programming the Memory
Hierarchy Revisited: Supporting Irregular Parallelism in Sequoia. ACM SIGPLAN Notices,
46(8):13–24, 2011.
2. Bradford L Chamberlain, Steven J Deitz, David Iten, and Sung-Eun Choi. User-defined
distributions and layouts in Chapel: Philosophy and framework. In Proceedings of the 2nd
USENIX conference on Hot topics in parallelism, pages 12–12. USENIX Association, 2010.
3. Georges Da Costa, Thomas Fahringer, Juan Antonio Rico Gallego, Ivan Grasso, Atanas
Hristov, Helen Karatza, Alexey Lastovetsky, Fabrizio Marozzo, Dana Petcu, Georgios
Stavrinides, et al. Exascale Machines Require New Programming Paradigms and Runtimes.
Supercomputing frontiers and innovations, 2(2):6–27, 2015.
4. Tobias Fuchs and Karl Fürlinger. A Multi-Dimensional Distributed Array Abstraction for
PGAS. In Proceedings of the 18th IEEE International Conference on High Performance
Computing and Communications (HPCC 2016), pages 1061–1068, Sydney, Australia, De-
cember 2016.
5. Tobias Fuchs and Karl Fürlinger. Expressing and Exploiting Multidimensional Locality in
DASH. In Proceedings of the SPPEXA Symposium 2016, Lecture Notes in Computational
Science and Engineering, Garching, Germany, January 2016. to appear.
6. Karl Fürlinger, Tobias Fuchs, and Roger Kowalewski. DASH: A C++ PGAS library for
distributed data structures and parallel algorithms. In Proceedings of the 18th IEEE Inter-
national Conference on High Performance Computing and Communications (HPCC 2016),
pages 983–990, Sydney, Australia, December 2016.
7. Brice Goglin. Managing the Topology of Heterogeneous Cluster Nodes with Hardware Lo-
cality (hwloc). In International Conference on High Performance Computing & Simulation
(HPCS 2014), Bologna, Italy, July 2014. IEEE.
8. Brice Goglin. Exposing the Locality of Heterogeneous Memory Architectures to HPC Ap-
plications. In 1st ACM International Symposium on Memory Systems (MEMSYS16). ACM, 2016.
9. Mohammadtaghi Hajiaghayi, Theodore Johnson, Mohammad Reza Khani, and Barna Saha.
Hierarchical Graph Partitioning. In Proceedings of the 26th ACM symposium on Parallelism
in algorithms and architectures, pages 51–60. ACM, 2014.
10. Amir Ashraf Kamil and Katherine A. Yelick. Hierarchical Additions to the SPMD Pro-
gramming Model. Technical Report UCB/EECS-2012-20, EECS Department, University of
California, Berkeley, Feb 2012.
11. Adrian Tate, Amir Kamil, Anshu Dubey, Armin Größlinger, Brad Chamberlain, Brice
Goglin, Carter Edwards, Chris J Newburn, David Padua, Didem Unat, et al. Programming
abstractions for data locality. Research report, PADAL Workshop 2014, April 28–29, Swiss
National Supercomputing Center (CSCS), Lugano, Switzerland, November 2014.
12. Yonghong Yan, Jisheng Zhao, Yi Guo, and Vivek Sarkar. Hierarchical place trees: A portable
abstraction for task parallelism and data movement. In International Workshop on Languages
and Compilers for Parallel Computing, pages 172–187. Springer, 2009.
... Another recommendation is the combination of the concepts with more information about the hardware. Fuchs and Fürlinger [FF18] presented a graph-based approach to provide logical hardware topologies and high-level operations for dynamic, distributed hardware locality during run-time. This representation of hardware can be used to uniformly and algorithmspecifically distribute the data and processing workload on the given hardware. ...
Full-text available
The Partitioned Global Address Space (PGAS) programming model represents distributed memory as global shared memory space. The DASH library implements C++ standard library concepts based on PGAS and provides additional containers and algorithms that are common in High-Performance Computing (HPC) applications. This thesis examines especially sparse matrices and whether state-of-the-art conceptual approaches to these data storage can be abstracted to improve the adaptation of domain-specific computational intent to the underlying hardware. Therefore, an intensive systematic analysis of state-of-the-art distribution and storage of sparse data is done. Based on these findings given property classification system of general dense data distributions is extended by sparse properties. Afterward, a universal vocabulary of a domain decomposition concept abstraction is developed. This four-layer concept introduces the following steps: (i) Global Canonical Domain, (ii) Formatting, (iii) Decomposition, and (iv) Hardware. In combination with the proposed abstraction, an index set algebra utilizing the resulting sparse data distribution system is presented. This concludes in a newly created Position Concept that is capable of referencingan element in the data domain as a multivariate position. The key feature is the abilityto reference this element with arbitrary representations of the element’s physical memorylocation.
Conference Paper
Full-text available
DASH is a realization of the PGAS (partitioned global address space) model in the form of a C++ template library without the need for a custom PGAS (pre-)compiler. We present the DASH NArray concept, a multidimensional array abstraction designed as an underlying container for stenciland dense numerical applications. After introducing fundamental programming concepts used in DASH, we explain how these have been extended by multidimensional capabilities in the NArray abstraction. Focusing on matrix-matrix multiplication in a case study, we then discuss an implementation of the SUMMA algorithm for dense matrix multiplication to demonstrate how the DASH NArray facilitates portable efficiency and simplifies the design of efficient algorithms due to its explicit support for locality-based operations. Finally, we evaluate the performance of the SUMMA algorithm based on the NArray abstraction against established implementations of DGEMM and PDGEMM. In combination with mechanisms for automatic optimization of logical process topology and domain decomposition, our implementation yields highly competitive results without manual tuning, significantly outperforming Intel MKL and PLASMA in node-level use cases as well as ScaLAPACK in highly distributed scenarios.
The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, but we are rapidly moving into an era in which computing is cheap and massively parallel while data movement dominates energy and performance costs. To prepare for exascale systems (the next generation of high-performance computing systems), the scientific computing community needs to refactor its applications to align with the emerging data-centric paradigm. Applications must evolve to express information about data locality. Unfortunately, current programming environments offer few ways to do so: they ignore the incurred cost of communication and simply rely on hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume that all processing elements are equidistant from each other. To take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data-centric and allow describing how to decompose and lay out data in memory. Fortunately, many concepts are emerging, such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries for managing data locality. There is an opportunity to identify commonalities among these strategies so that the best of these concepts can be combined into a comprehensive approach to expressing and managing data locality in exascale programming systems.
These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, spanning techniques that range from template libraries all the way to completely new languages.
Modern computer systems feature multiple homogeneous or heterogeneous computing units with deep memory hierarchies, and expect a high degree of thread-level parallelism from the software. Exploitation of data locality is critical to achieving scalable parallelism, but adds a significant dimension of complexity to performance optimization of parallel programs. This is especially true for programming models where locality is implicit and opaque to programmers. In this paper, we introduce the hierarchical place tree (HPT) model as a portable abstraction for task parallelism and data movement. The HPT model supports co-allocation of data and computation at multiple levels of a memory hierarchy. It can be viewed as a generalization of concepts from the Sequoia and X10 programming models, resulting in capabilities that are not supported by either. Compared to Sequoia, HPT supports three kinds of data movement in a memory hierarchy rather than just explicit data transfer between adjacent levels, as well as dynamic task scheduling rather than static task assignment. Compared to X10, HPT provides a hierarchical notion of places for both computation and data mapping. We describe our work-in-progress on implementing the HPT model in the Habanero-Java (HJ) compiler and runtime system. Preliminary results on general-purpose multicore processors and GPU accelerators indicate that the HPT model can be a promising portable abstraction for future multicore processors.
High-performance computing requires a deep knowledge of the hardware platform to fully exploit its computing power. The performance of data transfer between cores and memory is becoming critical. Therefore locality is a major area of optimization on the road to exascale. Indeed, tasks and data have to be carefully distributed on the computing and memory resources. We discuss the current way to expose processor and memory locality information in the Linux kernel and in user-space libraries such as the hwloc software project. The current de facto standard structural modeling of the platform as a tree is not perfect, but it offers a good compromise between precision and convenience for HPC runtimes. We present an in-depth study of the software view of the upcoming Intel Knights Landing processor. Its memory locality cannot be properly exposed to user-space applications without a significant rework of the current software stack. We propose an extension of the current hierarchical platform model in hwloc. It correctly exposes new heterogeneous architectures with high-bandwidth or non-volatile memories to applications, while still being convenient for affinity-aware HPC runtimes.
We present DASH, a C++ template library that offers distributed data structures and parallel algorithms and implements a compiler-free PGAS (partitioned global address space) approach. DASH offers many productivity and performance features such as global-view data structures, efficient support for the owner-computes model, flexible multidimensional data distribution schemes and inter-operability with STL (standard template library) algorithms. DASH also features a flexible representation of the parallel target machine and allows the exploitation of several hierarchically organized levels of locality through a concept of Teams. We evaluate DASH on a number of benchmark applications and we port a scientific proxy application using the MPI two-sided model to DASH. We find that DASH offers excellent productivity and performance and demonstrate scalability up to 9800 cores.
We describe two novel constructs for programming parallel machines with multi-level memory hierarchies: call-up, which allows a child task to invoke computation on its parent, and spawn, which spawns a dynamically determined number of parallel children until some termination condition in the parent is met. Together we show that these constructs allow applications with irregular parallelism to be programmed in a straightforward manner, and furthermore these constructs complement and can be combined with constructs for expressing regular parallelism. We have implemented spawn and call-up in Sequoia and we present an experimental evaluation on a number of irregular applications.
Bradford L Chamberlain, Steven J Deitz, David Iten, and Sung-Eun Choi. User-defined distributions and layouts in Chapel: Philosophy and framework. In Proceedings of the 2nd USENIX conference on Hot topics in parallelism, pages 12-12. USENIX Association, 2010.
Tobias Fuchs and Karl Fürlinger. Expressing and Exploiting Multidimensional Locality in DASH. In Proceedings of the SPPEXA Symposium 2016, Lecture Notes in Computational Science and Engineering, Garching, Germany, January 2016. To appear.
Brice Goglin. Managing the Topology of Heterogeneous Cluster Nodes with Hardware Locality (hwloc). In International Conference on High Performance Computing & Simulation (HPCS 2014), Bologna, Italy, July 2014. IEEE.