Expressing and Exploiting Multi-Dimensional
Locality in DASH
Tobias Fuchs and Karl Fürlinger
Abstract DASH is a realization of the PGAS (partitioned global address space)
programming model in the form of a C++ template library. It provides a multi-
dimensional array abstraction which is typically used as an underlying container
for stencil- and dense matrix operations. Efficiency of operations on a distributed
multi-dimensional array highly depends on the distribution of its elements to pro-
cesses and the communication strategy used to propagate values between them. Lo-
cality can only be improved by employing an optimal distribution that is specific
to the implementation of the algorithm, run-time parameters such as node topology,
and numerous additional aspects. Application developers do not know these impli-
cations which also might change in future releases of DASH. In the following, we
identify fundamental properties of distribution patterns that are prevalent in existing
HPC applications. We describe a classification scheme of multi-dimensional distri-
butions based on these properties and demonstrate how distribution patterns can be
optimized for locality and communication avoidance automatically and, to a great
extent, at compile time.
Tobias Fuchs
MNM-Team, Ludwig-Maximilians-Universität (LMU) München, Computer Science Department,
Oettingenstr. 67, 80538 Munich, Germany
e-mail: tobias.fuchs@nm.ifi.lmu.de
Karl Fürlinger
MNM-Team, Ludwig-Maximilians-Universität (LMU) München, Computer Science Department,
Oettingenstr. 67, 80538 Munich, Germany
e-mail: karl.fuerlinger@nm.ifi.lmu.de
1 Introduction
For Exascale systems the cost of accessing data is expected to be the dominant
factor in terms of execution time as well as energy consumption [3]. To minimize
data movement, applications have to consider initial placement and optimize both
vertical data movement in the memory hierarchy and horizontal data movement
between processing units. Programming systems for Exascale must therefore shift
from a compute-centric to a more data-centric focus and give application developers
fine-grained control over data locality.
On an algorithmic level, many scientific applications are naturally expressed in
multi-dimensional domains that arise from discretization of time and space. How-
ever, few programming systems support developers in expressing and exploiting
data locality in multiple dimensions beyond the most simple one-dimensional dis-
tributions. In this paper we present a framework that enables HPC application devel-
opers to express constraints on data distribution that are suitable to exploit locality
in multi-dimensional arrays.
The DASH library [10] provides numerous variants of data distribution schemes.
Their implementations are encapsulated in well-defined concept definitions and are
therefore semantically interchangeable. However, no single distribution scheme is
suited for every usage scenario. In operations on shared multi-dimensional contain-
ers, locality can only be maintained by choosing an optimal distribution. This choice
depends on:
- the algorithm executed on the shared container, in particular its communication pattern and memory access scheme,
- run-time parameters such as the extents of the shared container, the number of processes and their network topology,
- numerous additional aspects such as CPU architecture and memory topology.
The responsibility to specify a data distribution that achieves high locality and com-
munication avoidance lies with the application developers. These, however, are not
aware of implementation-specific implications: a specific distribution might be bal-
anced, but blocks might not fit into a cache line, inadvertently impairing hardware
locality.
As a solution, we present a mechanism to find a concrete distribution variant
among all available candidate implementations that satisfies a set of properties. In
effect, programmers do not need to specify a distribution type and its configuration
explicitly. They can rely on the decision of the DASH library and focus only on
aspects of data distribution that are relevant in the scenario at hand.
For this, we first identify and categorize fundamental properties of distribution
schemes that are prevalent in algorithms in related work and existing HPC appli-
cations. With DASH as a reference implementation, we demonstrate how data dis-
tributions can then be optimized automatically and, to a great extent, at
compile time.
From a software engineering perspective, we explain how our methodology fol-
lows best practices known from established C++ libraries and thus ensures that user
applications are not only robust against, but even benefit from future changes in
DASH.
The remainder of this paper is structured as follows: The following section in-
troduces fundamental concepts of PGAS and locality in the context of DASH. A
classification of data distribution properties is presented in Sec. 3. In Sec. 4, we
show how this system of properties can be used to exploit locality in DASH in different scenarios. In Sec. 5, using SUMMA as an example, the presented methods are evaluated for performance as well as flexibility against established implementations from Intel MKL and ScaLAPACK. Publications and tools related to this work
are discussed in Sec. 6. Finally, Sec. 7 gives a conclusion and an outlook on fu-
ture work where the DASH library’s pattern traits framework is extended to sparse,
irregular, and hierarchical distributions.
2 Background
This section gives a brief introduction to the Partitioned Global Address Space
approach considering locality and data distribution. We then present concepts in
the DASH library used to express process topology, data distribution and iteration
spaces. The following sections build upon these concepts and present new mecha-
nisms to exploit locality automatically using generic programming techniques.
2.1 PGAS and Multi-Dimensional Locality
Conceptually, the Partitioned Global Address Space (PGAS) paradigm unifies mem-
ory of individual, networked nodes into a virtual global memory space. In effect,
PGAS languages create a shared namespace for local and remote variables. This,
however, does not affect physical ownership. A single variable is only located in a
specific node’s memory and local access is more efficient than remote access from
other nodes. This is expected to matter more and more even within single (NUMA)
nodes in the near future [3]. As locality directly affects performance and scalability,
programmers need full control over data placement. Then, however, they are facing
overwhelmingly complex implications of data distribution on locality.
This complexity increases exponentially with the number of data dimensions.
Calculating a rectangular intersection might be manageable for up to three dimen-
sions, but locality is hard to maintain in higher dimensions, especially for irregular
distributions.
2.2 DASH Concepts
Expressing data locality in a Partitioned Global Address Space language builds upon
fundamental concepts of process topology and data distribution. In the following, we
describe these concepts as they are used in the context of DASH.
2.2.1 Topology: Teams and Units
In DASH terminology, a unit refers to any logical component in a distributed mem-
ory topology that supports processing and storage. Conventional PGAS approaches
offer only the differentiation between local and global data and distinguish between
private, shared-local, and shared-remote memory. DASH extends this model with a more fine-grained differentiation that corresponds to hierarchical machine models: units are organized in hierarchical teams. For example, a team at the top level
could group processing nodes into individual teams, each again consisting of units
referencing single CPU cores.
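To make the hierarchy tangible, the following plain C++ sketch models a two-level team hierarchy whose leaves reference unit ids. This is an illustration of the concept only, not the dash::Team interface; all type and member names are assumptions made for this example.

#include <cstdio>
#include <vector>

// Illustrative model only, not the dash::Team API: a two-level team hierarchy
// in which the top-level team groups node teams, and each node team owns the
// units (e.g. CPU cores) placed on that node.
struct NodeTeam {
  int              node_id;
  std::vector<int> units;   // unit ids referencing single cores
};

struct TopLevelTeam {
  std::vector<NodeTeam> node_teams;
};

int main() {
  // Top-level team: 2 node teams with 4 units each.
  TopLevelTeam all{ { NodeTeam{0, {0, 1, 2, 3}},
                      NodeTeam{1, {4, 5, 6, 7}} } };
  for (const NodeTeam& node : all.node_teams) {
    std::printf("node team %d owns units:", node.node_id);
    for (int u : node.units) std::printf(" %d", u);
    std::printf("\n");
  }
  return 0;
}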
2.2.2 Data Distribution: Patterns
Data distributions in general implement a two-level mapping:
1. from index to process (node- or process mapping)
2. from process to local memory offset (local order or layout)
Index sets separate the logical index space as seen by the user from physical
layout in memory space. This distinction and the mapping between index domains
is usually transparent to the programmer.
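As a minimal illustration of this two-level mapping, the following sketch shows a hypothetical one-dimensional block-cyclic distribution (not a DASH pattern implementation) that maps a global index first to a unit and then to a local offset:

#include <cstdio>

// Hypothetical 1-D block-cyclic distribution:
// global index -> (unit, local offset).
struct BlockCyclic1D {
  int blocksize;
  int num_units;

  // Level 1: index -> process (unit) mapping.
  int unit_at(int gidx) const {
    return (gidx / blocksize) % num_units;
  }
  // Level 2: index -> local memory offset within that unit.
  int local_at(int gidx) const {
    int block       = gidx / blocksize;
    int local_block = block / num_units;   // number of preceding local blocks
    return local_block * blocksize + gidx % blocksize;
  }
};

int main() {
  BlockCyclic1D dist{3, 4};  // block size 3, 4 units
  for (int g = 0; g < 15; ++g)
    std::printf("g=%2d -> unit %d, local %d\n",
                g, dist.unit_at(g), dist.local_at(g));
}

For block size 3 and 4 units, global indices 0-2 map to unit 0 at local offsets 0-2, indices 3-5 to unit 1, and index 12 wraps around to unit 0 at local offset 3.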
Fig. 1 Example of partitioning, mapping, and layout in the distribution of a dense, two-dimensional array

Process mapping can also be understood as distribution; the arrangement in local memory is also referred to as layout, e.g. in Chapel [4].
In DASH, data decomposition is based on index mappings provided by different
implementations of the DASH Pattern concept. Listing 1 shows the instantiation of a
rectangular pattern, specifying the Cartesian index domain and partitioning scheme.
Patterns partition a global index set into blocks that are then mapped to units. Con-
sequently, indices are mapped to processes indirectly in two stages: from index to
block (partitioning) and from block to unit (mapping). Figure 1 illustrates a pat-
tern’s index mapping as sequential steps in the distribution of a two-dimensional
array. While the name and the illustrated example might suggest otherwise, blocks
are not necessarily rectangular.
In summary, the DASH pattern concept defines semantics in the following categories:

Distribution: Well-defined distribution of indices to units, depending on properties in the subordinate categories:
  Partitioning: Grouping of indices into blocks
  Mapping: Distribution of blocks to units in a team
Layout: Arrangement of blocks and block elements in local memory
Indexing: Operations related to index sets for iterating data elements in global and local scope
Layout semantics specify the arrangement of values in local memory and, in effect,
their order. Indexing semantics also include index set operations like slicing and
intersection but do not affect physical data distribution.
We define the distribution semantics of a pattern type by the following set of operations:

  local(i_G) ↦ (u, i_L)          global index i_G to unit u and local offset i_L
  global(u, i_L) ↦ i_G           unit u and local offset i_L to global index i_G
  unit(i_G) ↦ u                  global index i_G to unit u
  local_block(i_G) ↦ (u, i_LB)   global index i_G to unit u and local block index i_LB
  global_block(i_G) ↦ i_GB       global index i_G to global block index i_GB

with the n-dimensional indices i_G, i_L as coordinates in the global / local Cartesian element space and i_GB, i_LB as coordinates in the global / local Cartesian block space. Instead of a Cartesian point, an index may also be declared as a point's offset in linearized memory order.
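Read as C++, these operations could be realized roughly as follows for a two-dimensional pattern. The sketch below implements a trivial block-row distribution; names and signatures are illustrative assumptions, not the actual member functions of the DASH Pattern concept.

#include <array>
#include <cstdio>
#include <utility>

// Illustrative sketch only: a 2-D pattern with a minimal, regular block-row
// partitioning (one block of rows per unit, canonical row-major layout).
using Index2D = std::array<long, 2>;   // Cartesian index (row, column)

struct BlockRowPattern {
  long rows, cols, num_units;
  long rows_per_unit() const { return rows / num_units; } // assumes exact division

  // unit(i_G) -> u
  int unit(Index2D g) const { return static_cast<int>(g[0] / rows_per_unit()); }
  // local(i_G) -> (u, i_L)
  std::pair<int, Index2D> local(Index2D g) const {
    return { unit(g), Index2D{{ g[0] % rows_per_unit(), g[1] }} };
  }
  // global(u, i_L) -> i_G
  Index2D global(int u, Index2D l) const {
    return Index2D{{ u * rows_per_unit() + l[0], l[1] }};
  }
  // global_block(i_G) -> i_GB (blocks span all columns, so the column index is 0)
  Index2D global_block(Index2D g) const {
    return Index2D{{ g[0] / rows_per_unit(), 0 }};
  }
  // local_block(i_G) -> (u, i_LB): each unit owns exactly one block here.
  std::pair<int, Index2D> local_block(Index2D g) const {
    return { unit(g), Index2D{{0, 0}} };
  }
};

int main() {
  BlockRowPattern pat{8, 8, 4};          // 8x8 elements, 4 units
  Index2D g{{5, 2}};
  auto ul = pat.local(g);
  std::printf("global (5,2) -> unit %d, local (%ld,%ld)\n",
              ul.first, ul.second[0], ul.second[1]);
  Index2D back = pat.global(ul.first, ul.second);
  std::printf("round trip -> global (%ld,%ld)\n", back[0], back[1]);
}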
// Brief notation:
TilePattern<2> pattern(global_extent_x, global_extent_y,
                       TILED(tilesize_x), TILED(tilesize_y));
// Equivalent full notation:
TilePattern<2, dash::default_index_t, ROW_MAJOR>
  pattern(DistributionSpec<2>(
            TILED(tilesize_x), TILED(tilesize_y)),
          SizeSpec<2, dash::default_size_t>(
            global_extent_x, global_extent_y),
          TeamSpec<1>(
            Team::All()));
Listing 1 Explicit instantiation of DASH patterns
DASH containers use patterns to provide uniform notations based on view proxy
types to express index domain mappings. User-defined data distribution schemes
can be easily incorporated in DASH applications as containers and algorithms ac-
cept any type that implements the Pattern concept.
Listing 2 illustrates the intuitive usage of user-defined pattern types and the
local and block view accessors that are part of the DASH Container concept.
View proxy objects use a referenced container’s pattern to map between its index do-
mains but do not contain any elements themselves. They can be arbitrarily chained
to refine an index space in consecutive steps, as in the last line of Listing 2: the expression a.local.block(1) addresses the second block in the array's local memory space.
In effect, patterns specify local iteration order similar to the partitioning of itera-
tion spaces e.g. in RAJA [11]. Proxy types implement all methods of their delegate
container type and thus also provide begin and end iterators that specify the iter-
ation space within the view’s mapped domain. DASH iterators provide an intuitive
notation of ranges in virtual global memory that are well-defined with respect to
distance and iteration order, even in multi-dimensional and irregular index domains.
CustomPattern pattern;
dash::Array<double> a(size, pattern);
double g_first = a[0];            // First value in global memory,
                                  // corresponds to a.begin()
double l_first = a.local[0];      // First value in local memory,
                                  // corresponds to a.local.begin()
dash::copy(a.block(0).begin(),        // Copy first block in
           a.block(0).end(),          // global memory to second
           a.local.block(1).begin()); // block in local memory
Listing 2 Views on DASH containers
3 Classification of Pattern Properties
While terms like blocked, cyclic, and block-cyclic are commonly understood, the terminology of distribution types is inconsistent in related work, or varies in semantics. For convenience, distributions are typically restricted to specific constraints that are not applicable in the general case.
Instead of a strict taxonomy enumerating the full spectrum of all imaginable distribution semantics, a systematic description of pattern properties is more practicable for abstracting semantics from concrete implementations. The classification presented in this section allows distribution patterns to be specified by categorized, unordered sets of properties. It is, of course, incomplete, but can be easily extended.
We identify properties that can be fulfilled by data distributions and then group these
properties into orthogonal categories which correspond to the separate functional
aspects of the pattern concept described in Subsection 2.2.2: partitioning, unit map-
ping, and memory layout. This categorization also complies with the terminology
and conceptual findings in related work [16].
DASH pattern semantics are specified by a configuration of properties in these dimensions:

    Global × Partitioning × Mapping × Layout

where the Partitioning and Mapping categories together constitute the Distribution category.
Details on a selection of single properties in all categories are discussed in the re-
mainder of this section.
3.1 Partitioning Properties
Partitioning refers to the methodology used to subdivide a global index set into dis-
joint blocks in an arbitrary number of logical dimensions. If not specified otherwise
by other constraints, indices are mapped into rectangular blocks. A partitioning is
regular if it only creates blocks with identical extents and balanced if all block have
identical size.
rectangular: Block extents are constant in every single dimension, e.g. every row has identical size.
minimal: Minimal number of blocks in every dimension, i.e. at most one block for every unit.
regular: All blocks have identical extents.
balanced: All blocks have identical size (number of elements).
multi-dimensional: Data is partitioned in at least two dimensions.
cache-aligned: Block sizes are a multiple of the cache line size.
Note that with the classification, these properties are mostly independent: rectangular partitionings may produce blocks with varying extents, balanced partitionings
are not necessarily rectangular, and so on. For example, partitioning a matrix into
triangular blocks could satisfy the regular and balanced partitioning properties. The
fine-grained nature of property definitions allows many possible combinations that
form an expressive and concise vocabulary to express pattern semantics.
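The following sketch illustrates how some of these properties can be checked for a hypothetical two-dimensional rectangular partitioning; the helper type and its members are assumptions made for illustration and are not part of DASH.

#include <cstdio>

// Illustrative only: classifies a 2-D rectangular partitioning by a few of the
// partitioning properties above (hypothetical helper, not DASH API).
struct Partitioning2D {
  long ext_x, ext_y;      // global extents
  long block_x, block_y;  // requested block extents

  // regular: every block has identical extents, which for a rectangular
  // partitioning holds exactly when the block extents divide the global ones.
  bool is_regular() const {
    return (ext_x % block_x == 0) && (ext_y % block_y == 0);
  }
  // balanced: every block contains the same number of elements; for
  // rectangular blocks this coincides with regularity.
  bool is_balanced() const { return is_regular(); }
  // minimal (w.r.t. a unit grid): at most one block per unit in every dimension.
  bool is_minimal(long units_x, long units_y) const {
    long nblocks_x = (ext_x + block_x - 1) / block_x;
    long nblocks_y = (ext_y + block_y - 1) / block_y;
    return nblocks_x <= units_x && nblocks_y <= units_y;
  }
};

int main() {
  Partitioning2D p{1024, 1024, 128, 100};
  std::printf("regular: %d balanced: %d minimal on 8x8 units: %d\n",
              (int)p.is_regular(), (int)p.is_balanced(),
              (int)p.is_minimal(8, 8));
}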
3.2 Mapping Properties
Well-defined mapping properties exist that have been formulated to define multipar-
titionings, a family of distribution schemes supporting parallelization of line sweep
computations over multi-dimensional arrays.
The first and most restrictive multipartitioning has been defined based on the
diagonal property [15]. In a multipartitioning, each process owns exactly one tile in
each hyperplane of a partitioning so that all processors are active in every step of a
line-sweep computation along any array dimension as illustrated in Figure 2.
Fig. 2 Combinations of mapping properties. Numbers in blocks indicate the unit rank owning the
block
General multipartitionings are a more flexible variant that allows more than one block to be assigned to a process in a partitioned hyperplane. The generalized definition
subdivides the original diagonal property into the balanced and neighbor mapping
properties [6] described below. This definition is more relaxed but still preserves the
benefits for line-sweep parallelization.
balanced: The number of assigned blocks is identical for every unit.
neighbor: A block's adjacent blocks in any one direction along a dimension are all owned by some other processor.
shifted: Units are mapped to blocks in diagonal chains in at least one hyperplane.
diagonal: Units are mapped to blocks in diagonal chains in all hyperplanes.
cyclic: Blocks are assigned to units as if dealt from a deck of cards in every hyperplane, starting from the first unit.
multiple: At least two blocks are mapped to every unit.
The constraints defined for multipartitionings are overly strict for some algorithms
and can be further relaxed to a subset of its properties. For example, a pipelined
optimization of the SUMMA algorithm requires a diagonal shift mapping [14, 18]
that satisfies the diagonal property but is not required to be balanced. Therefore, the
diagonal property in our classification does not imply a balanced mapping, deviating
from its original definition.
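As an illustration of the diagonal and cyclic mapping properties, the following sketch deals the blocks of a two-dimensional block grid to units along skewed diagonals. It is a simplified stand-in for such a mapping, not the mapping used by any particular DASH pattern.

#include <cstdio>

// Illustrative only (not the DASH implementation): a block-to-unit mapping
// with the diagonal and cyclic properties. Blocks are dealt to P units along
// skewed diagonals: unit(i, j) = (i + j) mod P. Along every row and every
// column, neighboring blocks belong to different units, and each unit's
// blocks form diagonal chains.
int diagonal_unit(int block_i, int block_j, int num_units) {
  return (block_i + block_j) % num_units;
}

int main() {
  const int nblocks_i = 4, nblocks_j = 6, num_units = 4;
  for (int i = 0; i < nblocks_i; ++i) {
    for (int j = 0; j < nblocks_j; ++j)
      std::printf("%2d ", diagonal_unit(i, j, num_units));
    std::printf("\n");
  }
  return 0;
}

For 4 units and a 4 x 6 block grid, every row and every column of blocks is owned by a cyclic chain of distinct units, which is what keeps all units active in a line sweep along either dimension.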
3.3 Layout Properties
Layout properties describe how values are arranged in a unit’s physical memory
and, consequently, their order of traversal. Perhaps the most crucial property is stor-
age order which is either row- or column major. If not specified, DASH assumes
row-major order as known from C. The list of properties can also be extended to
give hints to allocation routines on the physical memory topology of units such as
NUMA or CUDA.
row-major: Row-major storage order, used by default.
col-major: Column-major storage order.
blocked: Elements are contiguous in local memory within a single block.
canonical: All local indices are mapped to a single logical index domain.
linear: Local element order corresponds to a logical linearization within single blocks (tiled) or within the entire local memory (canonical).
While patterns assign indices to units in logical blocks, they do not necessarily pre-
serve the block structure in local index sets. After process mapping, a pattern’s lay-
out scheme may arrange indices mapped to a unit in an arbitrary way in physical
memory. In canonical layouts, the local storage order corresponds to the logical
global iteration order. Blocked layouts preserve the block structure locally such that
values within a block are contiguous in memory, but in arbitrary order. The addi-
tional linear property also preserves the logical linearized order of elements within
single blocks. For example, the Morton order memory layout shown in Figure 3 satisfies the blocked property, as elements within a block are contiguous in memory, but does not satisfy the linear property.
Fig. 3 Morton order memory layout of block elements
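For reference, a Morton (Z-order) index can be computed by interleaving the bits of the block-local coordinates. The following standalone sketch (not DASH code) prints the traversal order of a 4 x 4 block.

#include <cstdint>
#include <cstdio>

// Morton (Z-order) index of a 2-D coordinate, obtained by interleaving the
// bits of x and y. Elements of a block stored in this order are contiguous
// per block (blocked property) but not in a row- or column-major
// linearization (no linear property).
uint32_t interleave_bits(uint16_t v) {
  uint32_t x = v;
  x = (x | (x << 8)) & 0x00FF00FF;
  x = (x | (x << 4)) & 0x0F0F0F0F;
  x = (x | (x << 2)) & 0x33333333;
  x = (x | (x << 1)) & 0x55555555;
  return x;
}

uint32_t morton_index(uint16_t x, uint16_t y) {
  return interleave_bits(x) | (interleave_bits(y) << 1);
}

int main() {
  // Traversal order of a 4x4 block in Morton layout:
  for (uint16_t y = 0; y < 4; ++y) {
    for (uint16_t x = 0; x < 4; ++x)
      std::printf("%2u ", morton_index(x, y));
    std::printf("\n");
  }
}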
3.4 Global Properties
The Global category is usually only needed to give hints on characteristics of the
distributed value domain such as the sparse property to indicate the distribution of
sparse data.
dense: The distributed data domain is dense.
sparse: The distributed data domain is sparse.
balanced: The same number of values is mapped to every unit after partitioning and mapping.
It also contains properties that emerge from a combination of the independent partitioning and mapping properties and cannot be derived from either category separately.
The global balanced distribution property, for example, guarantees the same number
of local elements at every unit. This is trivially fulfilled for balanced partitioning and
balanced mapping, where the same number of blocks b of identical size s is mapped to every unit. However, it could also be achieved by a combination of unbalanced partitioning and unbalanced mapping, e.g. when assigning b blocks of size s to some units and b/2 blocks of size 2s to others, both yielding b·s local elements.
4 Exploiting Locality with Pattern Traits
The classification system presented in the last section allows distribution pattern semantics to be described using properties instead of a taxonomy of types that are associated with concrete implementations. In the following, we introduce pattern traits,
a collection of mechanisms in DASH that utilize distribution properties to exploit
data locality automatically.
As a technical prerequisite for these mechanisms, every pattern type is annotated
with tag type definitions that declare which properties are satisfied by its implemen-
tation. This enables meta-programming based on the patterns’ distribution proper-
ties, as type definitions are evaluated at compile time. Using tags to annotate type invariants is a common method in generic C++ programming and is prevalent in the STL and the Boost library (see http://www.boost.org/community/generic_programming.html).
template <dim_t NDim, ...>
class ThePattern {
public:
  typedef mapping_properties<
            mapping_tag::diagonal,
            mapping_tag::cyclic >
    mapping_tags;
  ...
};
Listing 3 Property tags in a pattern type definition.
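The kind of compile-time query this enables can be sketched with a small amount of standard C++11 metaprogramming. The trait below is a simplified stand-in (including stand-in definitions of the property list and tags) and not the mechanism actually used in DASH.

#include <type_traits>

// Sketch: a property list is a variadic list of tag types; has_property
// checks tag membership at compile time.
template <typename... Tags>
struct mapping_properties { };

struct mapping_tag {
  struct diagonal { };
  struct cyclic   { };
  struct balanced { };
};

template <typename Tag, typename Props>
struct has_property : std::false_type { };

template <typename Tag, typename Head, typename... Rest>
struct has_property<Tag, mapping_properties<Head, Rest...>>
  : std::conditional<std::is_same<Tag, Head>::value,
                     std::true_type,
                     has_property<Tag, mapping_properties<Rest...>>>::type { };

using tags = mapping_properties<mapping_tag::diagonal, mapping_tag::cyclic>;
static_assert(has_property<mapping_tag::diagonal, tags>::value,
              "diagonal mapping required");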
4.1 Deducing Distribution Patterns from Constraints
In a common use case, programmers intend to allocate data in distributed global
memory with a specific algorithm in mind. They would then have to decide on a specific distribution type, carefully evaluating all available options for
optimal data locality in the algorithm’s memory access pattern.
To alleviate this process, DASH can automatically create a concrete pattern instance that satisfies a set of constraints. The function make_pattern returns a pattern instance from a given set of properties and run-time parameters. The actual type of the returned pattern instance is resolved at compile time and never explicitly appears in client code, which relies on automatic type deduction instead.
static const dash::dim_t NumDataDim = 2;
static const dash::dim_t NumTeamDim = 2;
// Topology of processes, here: 16x8 process grid
TeamSpec<NumTeamDim> teamspec(16, 8);
// Cartesian extents of container:
SizeSpec<NumDataDim> sizespec(extent_x, extent_y);
// Create instance of pattern type deduced from
// constraints at compile time:
auto pattern =
  dash::make_pattern<
    partitioning_properties<
      partitioning_tag::balanced >,
    mapping_properties<
      mapping_tag::balanced, mapping_tag::diagonal >,
    layout_properties<
      layout_tag::blocked >
  >(sizespec, teamspec);
Listing 4 Deduction of an Optimal Distribution
To achieve compile-time deduction of its return type, make_pattern employs the
Generic Abstract Factory design pattern [2]. Unlike an Abstract Factory, which returns a polymorphic object specializing a known base type, a Generic Abstract Factory returns an arbitrary type, which gives more flexibility and at the same time no longer requires inheritance.
Fig. 4 Type deduction and pattern instantiation in dash::make_pattern
Pattern constraints are passed as template parameters grouped by property cate-
gories as shown in Listing 4. Data extents and unit topology are passed as run-time
arguments. Their respective dimensionality (NumDataDim,NumTeamDim), how-
ever, can be deduced from the argument types at compile time. Figure 4 illustrates
the logical model of this process involving two stages: a type generator that resolves
a pattern type from given constraints and argument types at compile time and an
object generator that instantiates the resolved type depending on constraints and
run-time parameters.
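The type-generator stage can be pictured as a compile-time selection among candidate pattern types. The following sketch uses std::conditional and two hypothetical pattern types; it deliberately ignores the richer constraint matching that make_pattern performs and is not its actual implementation.

#include <type_traits>

// Hypothetical candidate pattern types standing in for DASH pattern
// implementations.
struct BlockedPattern { /* canonical layout, simple index arithmetic */ };
struct TiledPattern2D { /* blocked layout, diagonal-capable mapping  */ };

// Constraint tags (sketch).
struct layout_blocked   { };
struct layout_canonical { };

// Type generator: resolve a pattern type from a layout constraint at compile
// time; the object generator would then instantiate the resolved type with
// run-time extents and team topology.
template <typename LayoutConstraint>
struct pattern_type_generator {
  typedef typename std::conditional<
            std::is_same<LayoutConstraint, layout_blocked>::value,
            TiledPattern2D,
            BlockedPattern
          >::type type;
};

int main() {
  pattern_type_generator<layout_blocked>::type pattern; // TiledPattern2D
  (void)pattern;
  return 0;
}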
Every property that is not specified as a constraint is a degree of freedom in type
selection. Evaluations of the GUPS benchmark show that arithmetic for dereferenc-
ing global indices is a significant performance bottleneck apart from locality effects.
Therefore, when more than one pattern type satisfies the constraints, the implemen-
tation with the least complex index calculation is preferred.
The automatic deduction is also designed to prevent inefficient configurations.
For example, pattern types that pre-generate block coordinates to simplify index
calculation are inefficient and memory-intensive for a large number of blocks. They
are therefore disregarded if the blocking factor in any dimension is small.
4.2 Deducing Distribution Patterns for a Specific Use Case
Even with the ability to create distribution patterns from constraints, developers still have to know which constraints to choose for a specific algorithm. Therefore, we offer
shorthands for constraints of every algorithm provided in DASH that can be passed
to make_pattern instead of single property constraints.
dash::TeamSpec<2> teamspec(16, 8);
dash::SizeSpec<2> sizespec(1024, 1024);
// Create pattern instance optimized for SUMMA:
auto pattern = dash::make_pattern<
                 dash::summa_pattern_traits
               >(sizespec, teamspec);
// Create matrix instances using the pattern:
dash::Matrix<2, int> matrix_a(sizespec, pattern);
dash::Matrix<2, int> matrix_b(sizespec, pattern);
...
auto matrix_c = dash::summa(matrix_a, matrix_b);
Listing 5 Deduction of a matching distribution pattern for a given use-case
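Such a shorthand can be thought of as a bundle of the constraint lists an algorithm relies on. The following is a hypothetical sketch of this idea, not the actual definition of dash::summa_pattern_traits; stand-in declarations of the property templates and tags are included to keep the example self-contained.

namespace sketch {
  // Stand-ins for the DASH property-list templates and tags:
  template <typename... Tags> struct partitioning_properties { };
  template <typename... Tags> struct mapping_properties      { };
  template <typename... Tags> struct layout_properties       { };
  struct partitioning_tag { struct balanced { }; };
  struct mapping_tag      { struct balanced { }; struct diagonal { }; };
  struct layout_tag       { struct blocked  { }; };

  // Algorithm-specific shorthand: users pass one name to make_pattern
  // instead of three separate constraint lists.
  struct summa_pattern_traits {
    typedef partitioning_properties<partitioning_tag::balanced>  partitioning;
    typedef mapping_properties<mapping_tag::balanced,
                               mapping_tag::diagonal>            mapping;
    typedef layout_properties<layout_tag::blocked>               layout;
  };
}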
4.3 Checking Distribution Constraints
An implementer of an algorithm on shared containers might want to ensure that
their distribution fits the algorithm’s communication strategy and memory access
scheme.
The traits type pattern_constraints allows querying constraint attributes
of a concrete pattern type at compile time. If the pattern type satisfies all requested
properties, the attribute satisfied is expanded to true. Listing 6 shows its usage
in a static assertion that would yield a compilation error if the object pattern
implements an invalid distribution scheme.
// Compile time constraints check:
static_assert(
  dash::pattern_constraints<
    decltype(pattern),
    partitioning_properties< ... >,
    mapping_properties< ... >,
    layout_properties< ... >
  >::satisfied::value
);
// Run time constraints check:
if (dash::check_pattern_constraints<
      partitioning_properties< ... >,
      mapping_properties< ... >,
      indexing_properties< ... >
    >(pattern)) {
  // Object 'pattern' satisfies constraints
}
Listing 6 Checking distribution constraints at compile time and run time
Some constraints depend on parameters that are unknown at compile time, such as
data extents or unit topology in the current team.
The function check_pattern_constraints allows checking a given pat-
tern object against a set of constraints at run time. Apart from error handling, it can
also be used to implement alternative paths for different distribution schemes.
4.4 Deducing Suitable Algorithm Variants
When combining different applications in a workflow or working with legacy code, container data might be preallocated. As any redistribution is usually expensive, the data distribution scheme is fixed and a matching algorithm variant has to be found.
We previously explained how to resolve a distribution scheme that is the best match for a specific, known algorithm implementation. Pattern traits and the generic programming techniques available in C++11 also allow the inverse problem to be solved: finding an algorithm variant that is suited for a given distribution. For this, DASH
provides adapter functions that switch between an algorithm’s implementation vari-
ants depending on the distribution type of its arguments. In Listing 7, three matrices
are declared using an instance of dash::TilePattern that corresponds to the
known distribution of their preallocated data. During compilation, dash::multiply
expands to an implementation of matrix-matrix multiplication that best matches the
distribution properties of its arguments, like dash::summa in this case.
typedef dash::TilePattern<2, ROW_MAJOR>    TiledPattern;
typedef dash::Matrix<2, int, TiledPattern> TiledMatrix;
TiledPattern pattern(global_extent_x, global_extent_y,
                     TILE(tilesize_x), TILE(tilesize_y));
TiledMatrix At(pattern);
TiledMatrix Bt(pattern);
TiledMatrix Ct(pattern);
...
// Use adapter to resolve algorithm suited for TiledPattern:
dash::multiply(At, Bt, Ct); // --> dash::summa(At, Bt, Ct);
Listing 7 Deduction of an algorithm variant for a given distribution
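The dispatch itself can be realized with standard C++11 techniques such as SFINAE or tag dispatching on pattern traits. The following sketch illustrates the idea with stub types and a made-up blocked_layout trait; it is not the actual dash::multiply implementation.

#include <iostream>
#include <type_traits>

// Hypothetical stubs: patterns expose a compile-time trait, matrices expose
// their pattern type.
struct TilePatternStub  { static const bool blocked_layout = true;  };
struct BlockPatternStub { static const bool blocked_layout = false; };

template <typename Pattern>
struct MatrixStub { typedef Pattern pattern_type; };

// Variant chosen when the matrices' pattern provides a blocked layout:
template <typename Matrix>
typename std::enable_if<Matrix::pattern_type::blocked_layout>::type
multiply(const Matrix&, const Matrix&, Matrix&) {
  std::cout << "dispatch to SUMMA-style blocked variant\n";
}

// Fallback variant for canonical layouts:
template <typename Matrix>
typename std::enable_if<!Matrix::pattern_type::blocked_layout>::type
multiply(const Matrix&, const Matrix&, Matrix&) {
  std::cout << "dispatch to canonical-layout variant\n";
}

int main() {
  MatrixStub<TilePatternStub> A, B, C;
  multiply(A, B, C); // resolves to the SUMMA-style variant at compile time
}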
5 Performance Evaluation
We choose dense matrix-matrix multiplication (DGEMM) as a use case for evaluation, as it is a concise example that demonstrates how slight changes in domain decomposition can drastically affect performance even in highly optimized implementations.
In principle, the matrix-matrix multiplication implemented in DASH realizes a
conventional blocked matrix multiplication similar to a variant of the SUMMA al-
gorithm presented in [14]. For the calculation C = A × B, the matrices A, B and C are distributed using a blocked partitioning. Following the owner-computes principle, every unit then computes the multiplication result

    C_{ij} = A_{ik} \times B_{kj} = \sum_{k=0}^{K-1} A_{ik} B_{kj}

for all sub-matrices in C that are local to the unit.
Figure 5 illustrates the first two multiplication steps for a square matrix for sim-
plicity, but the SUMMA algorithm also allows rectangular matrices and unbalanced
partitioning.
Fig. 5 Domain decomposition and first two block matrix multiplications in the SUMMA imple-
mentation. Numbers in blocks indicate the unit mapped to the block.
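The computation each unit performs can be sketched as a sequential blocked multiplication over contiguous blocks. Communication, the DASH containers, and the owner test are omitted, so this is only a schematic rendering of the kernel, not the distributed dash::summa implementation.

#include <cstdio>
#include <vector>

// Accumulate one block multiplication C_ij += A_ik * B_kj on contiguous
// bs x bs blocks (row-major).
void block_multiply_add(const double* A_ik, const double* B_kj,
                        double* C_ij, int bs) {
  for (int i = 0; i < bs; ++i)
    for (int k = 0; k < bs; ++k)
      for (int j = 0; j < bs; ++j)
        C_ij[i * bs + j] += A_ik[i * bs + k] * B_kj[k * bs + j];
}

// Sequential sketch of the blocked multiplication C = A x B over a K x K block
// grid; in the distributed algorithm a unit only executes the (i, j) iterations
// of blocks it owns and fetches remote A_ik / B_kj blocks beforehand.
void blocked_summa_sketch(const std::vector<const double*>& A_blocks,
                          const std::vector<const double*>& B_blocks,
                          const std::vector<double*>&       C_blocks,
                          int K, int bs) {
  for (int i = 0; i < K; ++i)
    for (int j = 0; j < K; ++j)
      for (int k = 0; k < K; ++k)
        block_multiply_add(A_blocks[i * K + k], B_blocks[k * K + j],
                           C_blocks[i * K + j], bs);
}

int main() {
  const int K = 2, bs = 2;
  std::vector<std::vector<double>> A(K * K, std::vector<double>(bs * bs, 1.0));
  std::vector<std::vector<double>> B(K * K, std::vector<double>(bs * bs, 1.0));
  std::vector<std::vector<double>> C(K * K, std::vector<double>(bs * bs, 0.0));
  std::vector<const double*> A_blocks, B_blocks;
  std::vector<double*>       C_blocks;
  for (int b = 0; b < K * K; ++b) {
    A_blocks.push_back(A[b].data());
    B_blocks.push_back(B[b].data());
    C_blocks.push_back(C[b].data());
  }
  blocked_summa_sketch(A_blocks, B_blocks, C_blocks, K, bs);
  // Each element of C is now K * bs = 4 for all-ones inputs.
  std::printf("C[0][0] = %g\n", C[0][0]);
}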
We compare strong scaling capabilities on a single processing node against
DGEMM provided by multi-threaded Intel MKL and PLASMA [1]. Performance
of distributed matrix multiplication is evaluated against ScaLAPACK [7] for an in-
creasing number of processing nodes.
Ideal tile sizes for PLASMA and ScaLAPACK had to be obtained in a large series
of tests for every variation of number of cores and matrix size. As PLASMA does
not optimize for NUMA systems, we also tried different configurations of numactl
as suggested in the official documentation of PLASMA.
For the DASH implementation, data distribution is resolved automatically using
the make_pattern mechanism as described in Subsec. 4.2.
5.1 Experimental Setup
To substantiate the transferability of the presented results, we execute benchmarks on the supercomputing systems SuperMUC and Cori, which differ in hardware specifications and application environments.
SuperMUC phase 2 (https://www.lrz.de/services/compute/supermuc/systemdescription/) incorporates an InfiniBand fat-tree interconnect with 28 cores per processing node. We evaluated performance for both Intel MPI and IBM MPI.
Cori phase 1 (http://www.nersc.gov/users/computational-systems/cori/cori-phase-i/) is a Cray system with 32 cores per node in an Aries dragonfly interconnect. As an installation of PLASMA is not available, we evaluate the performance of DASH and Intel MKL.
5.2 Results
We only consider the best results from MKL, PLASMA and ScaLAPACK to provide
a fair comparison to the best of our abilities.
In summary, the DASH implementation consistently outperformed the tested
variants of DGEMM and PDGEMM in distributed- and shared-memory scenarios
in all configurations.
More important than performance in single scenarios, overall analysis of results
in single-node scenarios confirms that DASH in general achieved predictable scal-
ability using automatic data distributions. This is most evident when comparing re-
sults on Cori presented in Fig. 7: the DASH implementation maintained consistent
scalability while performance of Intel MKL dropped when the number of processes
was not a power of two, a good example of a system-dependent implication that is
commonly unknown to programmers.
[Figure: strong scaling of DGEMM on a single SuperMUC node for Intel MPI and IBM MPI; panels for 3360 x 3360, 10080 x 10080, and 33600 x 33600 matrices; TFLOP/s over number of cores; implementations DASH, MKL, PLASMA.]
Fig. 6 Strong scaling of matrix multiplication on single node on SuperMUC phase 2, Intel MPI and IBM MPI, with 4 to 28 cores for increasing matrix size
[Figure: strong scaling of DGEMM on a single Cori node, Cray MPICH; panels for 3360 x 3360, 10080 x 10080, and 33600 x 33600 matrices; TFLOP/s over number of cores; implementations DASH, MKL.]
Fig. 7 Strong scaling of matrix multiplication on single node on Cori phase 1, Cray MPICH, with 4 to 32 cores for increasing matrix size
6 Related Work
Various aspects of data decomposition have been examined in related work that
influenced the design of pattern traits in DASH.
The Kokkos framework [9] is specifically designed for portable multi-dimensional
locality. It implements compile-time deduction of data layout depending on mem-
ory architecture and also specifies distribution traits roughly resembling some of the
property categories introduced in this work. However, Kokkos targets intra-node lo-
cality, focusing on CUDA and OpenMP backends, and does not define concepts for process mapping. It is therefore not applicable to the PGAS language model, where an explicit distinction between local and remote ownership is required.

[Figure: strong scaling analysis of DGEMM, multi-node, on SuperMUC; panels for Intel MPI and IBM MPI; TFLOP/s over number of nodes (x28 cores), 4 to 128 nodes; implementations DASH, ScaLAPACK.]
Fig. 8 Strong scaling of dash::summa and PDGEMM (ScaLAPACK) on SuperMUC phase 2 for IBM MPI and Intel MPI for matrix size 57344 × 57344
UPC++ implements a PGAS language model and, similar to the array concept in
DASH, offers local views for distributed arrays for rectangular index domains [12].
However, UPC++ does not provide a general view concept or an abstraction of distribution properties as described in this work.
Chapel's Domain Maps is an elegant framework that allows user-defined mappings to be specified and incorporated [4] and also supports irregular domains. The fundamental concepts of domain decomposition in DASH are comparable to DMaps in Chapel, with dense and strided regions as previously defined in ZPL [5]. Chapel
does not provide automatic deduction of distribution schemes, however, and no clas-
sification of distribution properties is defined.
Finally, the benefits of hierarchical data decomposition are investigated in recent
research such as TiDA, which employs hierarchical tiling as a general abstraction
for data locality [17]. The Hitmap library achieves automatic deduction of data de-
composition for hierarchical, regular tiles [8] at compile time.
7 Conclusion and Future Work
We constructed a general categorization of distribution schemes based on well-
defined properties. In a broad spectrum of different real-world scenarios, we then
discussed how mechanisms in DASH utilize this property system to exploit data
locality automatically.
In this, we demonstrated the expressiveness of generic programming techniques
in modern C++ and their benefits for constrained optimization.
Automatic deduction greatly simplifies the incorporation of new pattern types, such that new distribution schemes can be employed in experiments with minimal effort. In addition, a system of well-defined properties forms a concise and
precise vocabulary to express semantics of data distribution, significantly improv-
ing testability of data placement.
We will continue our work on flexible data layout mappings and explore concepts
to further support hierarchical locality. We are presently in the process of separating
the functional aspects of DASH patterns (partitioning, mapping and layout) into
separate policy types to simplify pattern type generators. In addition, the pattern
traits framework will be extended by soft constraints to express preferable but non-
mandatory properties.
The next steps will be to implement various irregular and sparse distributions
that can be easily combined with view specifiers in DASH to support the existing
unified sparse matrix storage format provided by SELL-C-σ [13]. We also intend to
incorporate hierarchical tiling schemes as proposed in TiDA [17]. The next release
of DASH including these features will be available in the second quarter of 2016.
Acknowledgements We gratefully acknowledge funding by the German Research Foundation
(DFG) through the German Priority Programme 1648 Software for Exascale Computing (SPPEXA).
References
1. Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou,
Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging
architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series,
volume 180, page 012037. IOP Publishing, 2009.
2. Andrei Alexandrescu. Modern C++ design: generic programming and design patterns ap-
plied. Addison-Wesley, 2001.
3. J. A. Ang, R. F. Barrett, R. E. Benner, D. Burke, C. Chan, J. Cook, D. Donofrio, S. D. Ham-
mond, K. S. Hemmert, S. M. Kelly, H. Le, V. J. Leung, D. R. Resnick, A. F. Rodrigues, J. Shalf,
D. Stark, D. Unat, and N. J. Wright. Abstract machine models and proxy architectures for ex-
ascale computing. In Proceedings of the 1st International Workshop on Hardware-Software
Co-Design for High Performance Computing, Co-HPC ’14, pages 25–32, Piscataway, NJ,
USA, 2014. IEEE Press.
4. Bradford L Chamberlain, Sung-Eun Choi, Steven J Deitz, David Iten, and Vassily Litvinov.
Authoring user-defined domain maps in Chapel. In CUG 2011, 2011.
5. Bradford L Chamberlain, Sung-Eun Choi, E Christopher Lewis, Calvin Lin, Lawrence Snyder,
and W Derrick Weathersby. ZPL: A machine independent programming language for parallel
computers. Software Engineering, IEEE Transactions on, 26(3):197–211, 2000.
6. Daniel G Chavarría-Miranda, Alain Darte, Robert Fowler, and John M Mellor-Crummey. Gen-
eralized multipartitioning for multi-dimensional arrays. In Proceedings of the 16th Interna-
tional Parallel and Distributed Processing Symposium, page 164. IEEE Computer Society,
2002.
7. Jaeyoung Choi, James Demmel, Inderjiit Dhillon, Jack Dongarra, Susan Ostrouchov, Antoine
Petitet, Ken Stanley, David Walker, and R Clinton Whaley. ScaLAPACK: A portable linear
algebra library for distributed memory computers - Design issues and performance. In Applied
Parallel Computing Computations in Physics, Chemistry and Engineering Science, pages 95–
106. Springer, 1995.
8. Carlos de Blas Cartón, Arturo Gonzalez-Escribano, and Diego R Llanos. Effortless and ef-
ficient distributed data-partitioning in linear algebra. In High Performance Computing and
Communications (HPCC), 2010 12th IEEE International Conference on, pages 89–97. IEEE,
2010.
9. H Carter Edwards, Daniel Sunderland, Vicki Porter, Chris Amsler, and Sam Mish. Manycore
performance-Portability: Kokkos Multidimensional Array Library. Scientific Programming,
20(2):89–114, 2012.
10. Karl Fürlinger, Colin Glass, Andreas Knüpfer, Jie Tao, Denis Hünich, Kamran Idrees, Matthias Maiterth, Yousri Mhedeb, and Huan Zhou. DASH: Data structures and algorithms
with support for hierarchical locality. In Euro-Par 2014 Workshops (Porto, Portugal), 2014.
11. RD Hornung and JA Keasler. The RAJA Portability Layer: Overview and Status. Technical
report, Lawrence Livermore National Laboratory (LLNL), Livermore, CA, 2014.
12. Amir Kamil, Yili Zheng, and Katherine Yelick. A local-view array library for partitioned
global address space C++ programs. In Proceedings of ACM SIGPLAN International Work-
shop on Libraries, Languages, and Compilers for Array Programming, page 26. ACM, 2014.
13. Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R Bishop. A unified
sparse matrix data format for efficient general sparse matrix-vector multiplication on modern
processors with wide SIMD units. SIAM Journal on Scientific Computing, 36(5):C401–C423,
2014.
14. Manojkumar Krishnan and Jarek Nieplocha. SRUMMA: a matrix multiplication algorithm
suitable for clusters and scalable shared memory systems. In Parallel and Distributed Pro-
cessing Symposium, 2004. Proceedings. 18th International, page 70. IEEE, 2004.
15. Naomi H Naik, Vijay K Naik, and Michel Nicoules. Parallelization of a class of implicit finite
difference schemes in computational fluid dynamics. International Journal of High Speed
Computing, 5(01):1–50, 1993.
16. Adrian Tate, Amir Kamil, Anshu Dubey, Armin Größlinger, Brad Chamberlain, Brice Goglin,
Carter Edwards, Chris J Newburn, David Padua, Didem Unat, et al. Programming abstractions
for data locality. Research report, PADAL Workshop 2014, April 28–29, Swiss National
Supercomputing Center (CSCS), Lugano, Switzerland, November 2014.
17. Didem Unat, Cy Chan, Weiqun Zhang, John Bell, and John Shalf. Tiling as a durable ab-
straction for parallelism and data locality. In Workshop on Domain-Specific Languages and
High-Level Frameworks for High Performance Computing, 2013.
18. Robert A Van De Geijn and Jerrell Watts. SUMMA: Scalable universal matrix multiplication
algorithm. Concurrency: Practice and Experience, 9(4):255–274, 1997.
... The following properties are based on the work of Fuchs and Fürlinger [FF16b]. ...
... Combining all these findings, we propose a four-layer abstraction for multidimensional views on PGAS data domains. This concept is based on the concepts of Fuchs and Fürlinger [FF16b] and Unat et al. [Una+17] presented in section 2.7. ...
... It creates rectangular (optimized) chunks, which can be easily further processed with dense matrices techniques. This is possible, as the dense pattern properties provided by Fuchs and Fürlinger [FF16b] can be effortlessly applied to these rectangles. fig. 3. 13. ...
Thesis
Full-text available
The Partitioned Global Address Space (PGAS) programming model represents distributed memory as global shared memory space. The DASH library implements C++ standard library concepts based on PGAS and provides additional containers and algorithms that are common in High-Performance Computing (HPC) applications. This thesis examines especially sparse matrices and whether state-of-the-art conceptual approaches to these data storage can be abstracted to improve the adaptation of domain-specific computational intent to the underlying hardware. Therefore, an intensive systematic analysis of state-of-the-art distribution and storage of sparse data is done. Based on these findings given property classification system of general dense data distributions is extended by sparse properties. Afterward, a universal vocabulary of a domain decomposition concept abstraction is developed. This four-layer concept introduces the following steps: (i) Global Canonical Domain, (ii) Formatting, (iii) Decomposition, and (iv) Hardware. In combination with the proposed abstraction, an index set algebra utilizing the resulting sparse data distribution system is presented. This concludes in a newly created Position Concept that is capable of referencingan element in the data domain as a multivariate position. The key feature is the abilityto reference this element with arbitrary representations of the element’s physical memorylocation.
... This thesis follows [FF16] in the use of terminology when referring to DASH features: ...
... The first stencil ST1 is aligned to a BN while ST2 is aligned to a NB layout and ST3 is aligned to a BB layout. Additional ST1 should benefit at most from the cache on the nodes, because of the default Row Major order in a DASH array [FF16] and should show the best performance in total throughput. It is assumed that the performance of each stencil can be improved by choosing a layout which reduces the network communication which equals to increasing the data locality. ...
... This could be an effect of the overall better R H/V ratio with square numbers of units [4, 16, 21, 25, 64, 100, 121], shown in the first plot of Figure 5 The fastest execution time achieved for 128 units is 43.1s by ST1, 45.46s by ST2, and 104.34s by ST3. The cache could be a reason for the overall good performance of ST1 because DASH arrays are stored by default in Row Major order [FF16]. ST1 can benefit from data already loaded into the cache lines, while ST2 produces more cache misses, because every next value will be in another cache line and this line could be missing in the cache. ...
Thesis
Full-text available
Distributing a multi-channel remote sensing data processing with potentially large stencils is a difficult challenge. The goal of this master thesis was to evaluate and investigate the performance impacts of such a processing on a distributed system and if it is possible to improve the total execution time by exploiting data locality or memory alignments. The thesis also gives a brief overview of the actual state of the art in remote sensing distributed data processing and points out why distributed computing will become more important for it in the future. For the experimental part of this thesis an application to process huge arrays on a distributed system was implemented with DASH, a C++ Template Library for Distributed Data Structures with Support for Hierarchical Locality for High Performance Computing and Data-Driven Science. On the basis of the first results an optimization model was developed which has the goal to reduce network traffic while initializing a distributed data structure and executing computations on it with potentially large stencils. Furthermore, a software to estimate the memory layouts with the least network communication cost for a given multi-channel remote sensing data processing workflow was implemented. The results of this optimization were executed and evaluated afterwards. The results show that it is possible to improve the initialization speed of a large image by considering the brick locality by 25%. The optimization model also generate valid decisions for the initialization of the PGAS memory layouts. However, for a real implementation the optimization model has to be modified to reflect implementation-dependent sources of overhead. This thesis presented some approaches towards solving challenges of the distributed computing world that can be used for real-world remote sensing imaging applications and contributed towards solving the challenges of the modern Big Data world for future scientific data exploitation.
... The approach is based on a proposal by Cray to implement Coarray semantics in the C++ language [1]. In our implementation, the main functionality of the interface is implemented using existing DASH containers and closely follows the DASH global memory space concepts [2] [3]. In contrast to Fortran, C-like languages utilize square brackets for array accesses. ...
... At the heart of the DASH template library is a set of distributed data structures, including one-and multidimensional arrays as well as lists and unordered maps. The distribution of data among the participating processes (called units in DASH) can be controlled by user-defined data distribution patterns [3]. ...
Conference Paper
Fortran Coarrays are a well known data structure in High Performance Computing (HPC) applications. There have been various attempts to port the concept to other programming languages that have a wider user base outside of scientific computing. While a popular implementation of the partitioned global address space (PGAS) model is Unified Parallel C (UPC), there is currently no portable implementation of Coarrays for C++. In this paper a portable version is presented, which is closely based on the Coarray C++ implementation of the Cray Compiling Environment. In this work we focus on a common subset of all proposed features by Cray. Our implementation utilizes the distributed data structures provided by the DASH library, demonstrating their universal applicability. Finally, a performance evaluation shows that our proposed Coarray abstraction adds negligible overhead and even outperforms native Coarray Fortran.
... MPI ranks). As a basis we use the pattern concept as described in previous work [8]. The mapping of data onto global memory location is a three-step process: first, each element in a container is assigned to a location in the global index space. ...
Article
Full-text available
The Partitioned Global Address Space (PGAS) programming model brings intuitive shared memory semantics to distributed memory systems. Even with an abstract and unifying virtual global address space it is, however, challenging to use the full potential of different systems. Without explicit support by the implementation node-local operations have to be optimized manually for each architecture. A goal of this work is to offer a user-friendly programming model that provides portable performance across systems. In this paper we present an approach to integrate node-level programming abstractions with the PGAS programming model. We describe the hierarchical data distribution with local patterns and our implementation, MEPHISTO, in C++ using two existing projects. The evaluation of MEPHISTO shows that our approach achieves portable performance while requiring only minimal changes to port it from a CPU-based system to a GPU-based one using a CUDA or HIP back-end.
... Thus a fully 1. The figure is inspired by Fuchs and Fuerlinger [21]. Index Space 0" 1" 2" 3" 4" 5" 6" 7" 8" 9" 10" 11" 12" 13" 14" 15" 16" 17" 18" 19" 20" 21" 22" 23" Data Decomposition 0" 1" 2" 3" 4" 5" 6" 7" 8" 9" 10" 11" 12" 13" 14" 15" 16" 17" 18" 19" 20" 21" 22" 23" Data Layout 0" 1" 6" 7" 16" 17" …" 2" 3" 8" 9" 14" 15" …" 4" 5" 10" 11" 12" 13" …" Iteration Space 0" 1" 2" 3" 4" 5" 6" 7" 8" 9" 10" 11" 12" 13" 14" Data Placement 0" 1" 2" 3" 4" 5" 6" 7" 8" 9" 10" 11" 12" 13" 14" 15" 16" 17" 18" 19" 20" 21" 22" 23" Traversal Order 0" 1" 2" 3" 4" 5" 6" 7" 8" 9" 10" 11" 12" 13" 14" 15" 16" 17" 18" 19" 20" 21" 22" 23" Fig. 1. illustration of concepts that are important for data locality for a dense two dimensional array. ...
Article
Full-text available
The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.
... Algorithms can be specialized for pattern properties and use them to specify constraints for compile-time optimization. The underlying mechanisms are beyond the scope of this paper and are described in detail in [14]. ...
Conference Paper
Full-text available
DASH is a realization of the PGAS (partitioned global address space) model in the form of a C++ template library without the need for a custom PGAS (pre-)compiler. We present the DASH NArray concept, a multidimensional array abstraction designed as an underlying container for stenciland dense numerical applications. After introducing fundamental programming concepts used in DASH, we explain how these have been extended by multidimensional capabilities in the NArray abstraction. Focusing on matrix-matrix multiplication in a case study, we then discuss an implementation of the SUMMA algorithm for dense matrix multiplication to demonstrate how the DASH NArray facilitates portable efficiency and simplifies the design of efficient algorithms due to its explicit support for locality-based operations. Finally, we evaluate the performance of the SUMMA algorithm based on the NArray abstraction against established implementations of DGEMM and PDGEMM. In combination with mechanisms for automatic optimization of logical process topology and domain decomposition, our implementation yields highly competitive results without manual tuning, significantly outperforming Intel MKL and PLASMA in node-level use cases as well as ScaLAPACK in highly distributed scenarios.
Chapter
Single node hardware design is shifting to a heterogeneous nature and many of today’s largest HPC systems are clusters that combine heterogeneous compute device architectures. The need for new programming abstractions in the advancements to the Exascale era has been widely recognized and variants of the Partitioned Global Address Space (PGAS) programming model are discussed as a promising approach in this respect. In this work, we present a graph-based approach to provide runtime support for dynamic, distributed hardware locality, specifically considering heterogeneous systems and asymmetric, deep memory hierarchies. Our reference implementation dyloc leverages hwloc to provide high-level operations on logical hardware topology based on user-specified predicates such as filter- and group transformations and locality-aware partitioning. To facilitate integration in existing applications, we discuss adapters to maintain compatibility with the established hwloc API.
Conference Paper
The Partitioned Global Address Space (PGAS) programming model has become a viable alternative to traditional message passing using MPI. The DASH project provides a PGAS abstraction entirely based on C++11. The underlying DASH RunTime, DART, provides communication and management functionality transparently to the user. In order to facilitate incremental transitions of existing MPI-parallel codes, the development of DART has focused on creating a PGAS runtime based on the MPI-3 RMA standard. From an MPI-RMA user perspective, this paper outlines our recent experiences in the development of DART and presents insights into issues that we faced and how we attempted to solve them, including issues surrounding memory allocation and memory consistency as well as communication latencies. We implemented a set of benchmarks for global memory allocation latency in the framework of the OSU micro-benchmark suite and present results for allocation and communication latency measurements of different global memory allocation strategies under three different MPI implementations.
Conference Paper
The high energy cost of data movement compared to computation gives paramount importance to data locality management in programs. Managing data locality manually is not a trivial task and also complicates programming. Tiling is a well-known approach that provides both data locality and parallelism in an application. However, there is no standard programming construct to express tiling at the application level. We have developed a multicore programming model, TiDA, based on tiling and implemented the model as C++ and Fortran libraries. The proposed programming model has three high-level abstractions: tiles, regions and tile iterators. These abstractions in the library hide the details of data decomposition, cache locality optimizations, and memory affinity management in the application. In this paper we unveil the internals of the library and demonstrate the performance and programmability advantages of the model on five applications on multiple NUMA nodes. The library achieves up to 2.10x speedup over OpenMP in a single compute node for simple kernels, and up to 22x improvement over a single thread for a more complex combustion proxy application (SMC) on 24 cores. The MPI+TiDA implementation of geometric multigrid demonstrates a 30.9 % performance improvement over MPI+OpenMP when scaling to 3072 cores (excluding MPI communication overheads, 8.5 % otherwise).
Article
To achieve exascale computing, fundamental hardware architectures must change. This will significantly impact scientific applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. To adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency in the future. An abstract machine model is designed to expose to application developers and system software only the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. A proxy architecture is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models enable discussion between the developers of analytic models and simulators and computer hardware architects, and they allow for application performance analysis, system software development, and the exploration of hardware optimization opportunities. In this paper, we present a set of abstract machine models and show how they might be used to help software developers prepare for exascale. We then apply parameters to one of these models to demonstrate how a proxy architecture can enable a more concrete exploration of how well application codes map onto future architectures.
Conference Paper
DASH is a realization of the PGAS (partitioned global address space) model in the form of a C++ template library. Operator overloading is used to provide global-view PGAS semantics without the need for a custom PGAS (pre-)compiler. The DASH library is implemented on top of our runtime system DART, which provides an abstraction layer on top of existing one-sided communication substrates. DART contains methods to allocate memory in the global address space as well as collective and one-sided communication primitives. To support the development of applications that exploit a hierarchical organization, either on the algorithmic or on the hardware level, DASH features the notion of teams that are arranged in a hierarchy. Based on a team hierarchy, the DASH data structures support locality iterators as a generalization of the conventional local/global distinction found in many PGAS approaches.
Article
Large, complex scientific and engineering application codes have a significant investment in computational kernels to implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices, each with its own memory space, (2) data parallel kernels, and (3) multidimensional arrays. Kernel execution performance is, especially for NVIDIA® devices, extremely dependent on data access patterns. The optimal data access pattern can be different for different manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introducing device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos [Trilinos website, http://trilinos.sandia.gov/, August 2011].
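As a small illustration of the separation of data access patterns from kernels described above, the sketch below uses the present-day Kokkos Core API (an assumption; the abstract refers to the earlier Trilinos Kokkos Array interface) to declare views whose memory layout is a compile-time template parameter, so the same kernel source can be instantiated with the layout that suits the target device.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[])
{
  Kokkos::initialize(argc, argv);
  {
    const int n = 1000;
    // The layout is part of the type: LayoutLeft (column-major) tends to give
    // coalesced accesses on GPUs, LayoutRight (row-major) is cache-friendly on CPUs.
    Kokkos::View<double**, Kokkos::LayoutLeft>  a("a", n, n);
    Kokkos::View<double**, Kokkos::LayoutRight> b("b", n, n);

    // The same kernel body works for either view; the index-to-address mapping
    // is resolved at compile time from the view's layout.
    Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(const int i) {
      for (int j = 0; j < n; ++j) {
        a(i, j) = 1.0;
        b(i, j) = 2.0;
      }
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}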
Article
Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-C-sigma, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from General Purpose Graphics Processing Units (GPGPUs) and vector computer programming. We discuss the advantages of SELL-C-sigma compared to established formats like Compressed Row Storage (CRS) and ELLPACK, and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models we develop deep insight into the data transfer properties of the SELL-C-sigma spMVM kernel. SELL-C-sigma comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent ("catch-all") sparse matrix format, which achieves very high efficiency for all test matrices across all hardware platforms.
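To illustrate the storage scheme described above, the following C++ sketch converts a CRS matrix into SELL-C with sigma = 1 (i.e., without the row sorting that the sigma parameter controls): rows are grouped into chunks of C rows, each chunk is padded to its longest row, and values are stored column-major within a chunk so that SIMD lanes can process C rows in lockstep. This is an expository sketch, not the authors' reference implementation.

#include <algorithm>
#include <vector>

// SELL-C container: values and column indices laid out chunk by chunk,
// column-major inside each chunk of C rows.
struct SellC {
  int C;                        // chunk height (SIMD width)
  std::vector<int> chunk_ptr;   // start offset of each chunk in val/col
  std::vector<int> chunk_len;   // padded row length per chunk
  std::vector<double> val;
  std::vector<int> col;
};

SellC crs_to_sell_c(int nrows, const std::vector<int>& row_ptr,
                    const std::vector<int>& col_idx,
                    const std::vector<double>& values, int C)
{
  SellC s;
  s.C = C;
  int nchunks = (nrows + C - 1) / C;
  s.chunk_ptr.assign(nchunks + 1, 0);
  s.chunk_len.assign(nchunks, 0);
  // First pass: the padded width of a chunk is the length of its longest row.
  for (int c = 0; c < nchunks; ++c) {
    int width = 0;
    for (int r = c * C; r < std::min(nrows, (c + 1) * C); ++r)
      width = std::max(width, row_ptr[r + 1] - row_ptr[r]);
    s.chunk_len[c] = width;
    s.chunk_ptr[c + 1] = s.chunk_ptr[c] + width * C;
  }
  s.val.assign(s.chunk_ptr[nchunks], 0.0);
  s.col.assign(s.chunk_ptr[nchunks], 0);
  // Second pass: scatter CRS entries into the column-major chunk layout.
  for (int c = 0; c < nchunks; ++c) {
    for (int i = 0; i < C; ++i) {
      int r = c * C + i;
      if (r >= nrows) break;
      int len = row_ptr[r + 1] - row_ptr[r];
      for (int j = 0; j < len; ++j) {
        int dst = s.chunk_ptr[c] + j * C + i;  // entry j of row i within chunk c
        s.val[dst] = values[row_ptr[r] + j];
        s.col[dst] = col_idx[row_ptr[r] + j];
      }
    }
  }
  return s;
}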
Article
The emergence and continuing use of multi-core architectures and graphics processing units require changes in the existing software and sometimes even a redesign of the established algorithms in order to take advantage of the now prevailing parallelism. Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) and Matrix Algebra on GPU and Multicore Architectures (MAGMA) are two projects that aim to achieve high performance and portability across a wide range of multi-core architectures and hybrid systems, respectively. We present in this document a comparative study of PLASMA's performance against established linear algebra packages and some preliminary results of MAGMA on hybrid multi-core and GPU systems.
Conference Paper
Multipartitioning is a strategy for parallelizing computations that require solving 1D recurrences along each dimension of a multi-dimensional array. Previous techniques for multipartitioning yielded efficient parallelizations over 3D domains only when the number of processors was a perfect square. This paper considers the general problem of computing multipartitionings for d-dimensional data volumes on an arbitrary number of processors. We describe an algorithm that computes an optimal multipartitioning onto all of the processors for this general case. Finally, we describe how we extended Rice University's dHPF (data-parallel High Performance Fortran) compiler to generate code that exploits generalized multipartitioning and show that the compiler's generated code for the NAS (Numerical Aerospace Simulation) SP (Scalar Pentadiagonal) computational fluid dynamics benchmark achieves scalable high performance.
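The generalized d-dimensional algorithm is beyond a short example, but the basic idea of multipartitioning can be sketched for a 2-D domain on p processors: the domain is split into a p x p grid of tiles and tile (i, j) is assigned to processor (j - i + p) mod p, so every processor owns exactly one tile in each row and in each column of tiles, and a line sweep along either dimension keeps all processors busy. A small, hypothetical helper illustrating this mapping:

#include <cstdio>

// Owner of tile (i, j) in a diagonal 2-D multipartitioning over p processors.
int owner(int i, int j, int p) { return (j - i + p) % p; }

int main()
{
  const int p = 4;  // number of processors
  // Print the tile-to-processor map; each processor id appears exactly once per
  // row and once per column, which is the defining property of a multipartitioning.
  for (int i = 0; i < p; ++i) {
    for (int j = 0; j < p; ++j) {
      std::printf("%2d ", owner(i, j, p));
    }
    std::printf("\n");
  }
  return 0;
}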
Conference Paper
Multidimensional arrays are an important data structure in many scientific applications. Unfortunately, built-in support for such arrays is inadequate in C++, particularly in the distributed setting where bulk communication operations are required for good performance. In this paper, we present a multidimensional library for partitioned global address space (PGAS) programs, supporting the one-sided remote access and bulk operations of the PGAS model. The library is based on Titanium arrays, which have proven to provide good productivity and performance. These arrays provide a local view of data, where each rank constructs its own portion of a global data structure, matching the local view of execution common to PGAS programs and providing maximum flexibility in structuring global data. Unlike Titanium, which has its own compiler with array-specific analyses, optimizations, and code generation, we implement multidimensional arrays solely through a C++ library. The main goal of this effort is to provide a library-based implementation that can match the productivity and performance of a compiler-based approach. We implement the array library as an extension to UPC++, a C++ library for PGAS programs, and we extend Titanium arrays with specializations to improve performance. We evaluate the array library by porting four Titanium benchmarks to UPC++, demonstrating that it can achieve up to 25% better performance than Titanium without a significant increase in programmer effort.
Article
The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, but we are rapidly moving to an era in which computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor its applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately, current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on the hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant to each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data-centric and allow developers to describe how to decompose data and how to lay it out in memory. Fortunately, there are many emerging concepts such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries for managing data locality. There is an opportunity to identify commonalities in strategy that enable us to combine the best of these concepts to develop a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages to achieve this goal.
Article
This paper outlines the content and performance of ScaLAPACK, a collection of mathematical software for linear algebra computations on distributed memory computers. The importance of developing standards for computational and message passing interfaces is discussed. We present the different components and building blocks of ScaLAPACK. This paper outlines the difficulties inherent in producing correct codes for networks of heterogeneous processors. We define a theoretical model of parallel computers dedicated to linear algebra applications: the Distributed Linear Algebra Machine (DLAM). This model provides a convenient framework for developing parallel algorithms and investigating their scalability, performance and programmability. Extensive performance results on various platforms are presented and analyzed with the help of the DLAM. Finally, this paper briefly describes future directions for the ScaLAPACK library and concludes by suggesting alternative approaches to mathematical libraries, explaining how ScaLAPACK could be integrated into efficient and user-friendly distributed systems.