Asynchronous One-Sided Communications and
Synchronizations for a Clustered Manycore Processor
Julien Hascoët, Benoît Dupont de Dinechin, Pierre Guironnet de Massas and Minh Quan Ho
Kalray, Montbonnot-Saint-Martin, France
{jhascoet,benoit.dinechin,pgmassas,mqho}@kalray.eu
ABSTRACT
Clustered manycore architectures fitted with a Network-on-Chip (NoC) and scratchpad memories enable highly energy-efficient and time-predictable implementations. However, porting applications to such processors represents a programming challenge. Inspired by supercomputer one-sided communication libraries and by the OpenCL async_work_group_copy primitives, we propose a simple programming layer for communication and synchronization on clustered manycore architectures. We discuss the design and implementation of this layer on the 2nd-generation Kalray MPPA processor, where it is available from both OpenCL and POSIX C/C++ multithreaded programming models. Our measurements show that it reaches up to 94% of the theoretical hardware throughput, with a best-case round-trip latency of 2.2 µs when operating at 500 MHz.
ACM Reference Format:
Julien Hascoët, Benoît Dupont de Dinechin, Pierre Guironnet de Massas
and Minh Quan Ho. 2017. Asynchronous One-Sided Communications and
Synchronizations for a Clustered Manycore Processor. In Proceedings of
ESTIMedia’17, Seoul, Republic of Korea, October 15–20, 2017, 10 pages.
https://doi.org/10.1145/3139315.3139318
1 INTRODUCTION
Processors integrating up to hundreds of cores need to introduce locality in the memory hierarchy in order to reach their performance and efficiency targets. For GPGPUs, this is the local memory of the Compute Units (OpenCL) or the Streaming Multiprocessors (CUDA). For CPU-based manycore processors like the Adapteva Epiphany 64 [23] and the Epiphany-V [20], this is the scratchpad memory attached to each core. For the Kalray Multi-Purpose Processor Array (MPPA)®-256 [21] processor, this is the local memory shared by the 16 application cores of each compute cluster.
We describe the design and implementation of an asynchronous one-sided (AOS) communication and synchronization programming layer for the 2nd-generation Kalray MPPA® processor, which is able to efficiently exploit the network-on-chip and the cluster local or processor external memories. This layer presents the platform as a collection of execution domains, which are composed of cores and their directly addressable memory. Domains publish parts of their memory as segments, which can then be accessed by remote
cores using RDMA Put/Get operations and synchronized by remote atomic operations (e.g. fetch-and-add). In addition to these window-like segments, queue-like segments are also available, with N-to-1 atomic enqueue and local dequeue operations.
The AOS communication and synchronization operations layer is deployed in production to support application programming of the MPPA® processor in a range of low-level and high-level programming environments such as OpenCL, OpenMP, OpenVX, and static/dynamic dataflow execution models. The AOS layer is also used by Kalray's optimized application libraries and is targeted by code generators, such as those for CNN inference.
The paper is organized as follows. Section 2 motivates our approach in relation to related work, which mostly targets high-performance computing (HPC) systems. In Section 3, we present the high-level principles of the AOS programming layer. Section 4 introduces the MPPA® processor features relevant to this contribution. Section 5 presents the design, algorithms and implementation of the AOS programming layer on the Kalray MPPA® processor. Section 6 provides detailed results, performance analysis and the limitations of the implementation.
2 RELATED WORK AND MOTIVATIONS
2.1 Remote Direct Memory Access
Remote Direct Memory Access (RDMA) is an integral part of modern communication technologies and can be characterized by OS bypass, zero-copy, one-sided communications and asynchronous operations. OS bypass allows direct interaction between the application and a virtualized instance of the network hardware, without involving the operating system. Zero-copy allows a system to place transferred data directly at its final memory location, based on information included in the RDMA operations. One-sided operations allow a communication to complete without the involvement of the application thread on the remote side. Asynchronous operations decouple the initiation of a communication from its progress and subsequent completion, so that communication can be overlapped with computation.
These communication technologies mostly apply at the backplane and system levels in data centers and supercomputers:
• Inside a compute node, between cores and other bus masters, HT (HyperTransport), QPI (QuickPath Interconnect) and PCIe support load/store as well as DMA (Direct Memory Access) operations.
• Between compute nodes across a backplane or a chassis, sRIO (serial RapidIO) and DCB (Data Center Bridging) Ethernet variants such as RoCE (RDMA over Converged Ethernet) mostly support RDMA operations.
ESTIMedia’17, October 15–20, 2017, Seoul, Republic of Korea J. Hascoët, B. Dupont de Dinechin, P. Guironnet de Massas, M. Q. Ho
• At the system level, between compute racks and the storage boxes, InfiniBand or Ethernet also support RDMA operations.
• Between systems, IP networks support the BSD sockets and client/server operations.
Motivated by the success of RDMA in high-performance systems, the design objective of the asynchronous one-sided (AOS) operations programming layer is to adapt the RDMA principles to the architecture of CPU-based manycore processors.
2.2 High-Performance Interconnects
The InfiniBand technology designed by Mellanox is widely deployed in high-performance systems and datacenters. It natively supports Remote Direct Memory Access (RDMA) put/get, Send/Receive read/write and atomic operations, with a focus on high throughput with low latency. Based on the earlier VIA (Virtual Interface Architecture), the InfiniBand specification only lists Verbs, that is, functions that must exist but whose syntax is left to vendors.
After vendors created separate Verbs APIs, these coalesced into the OpenFabrics Alliance (OFA) Verbs. OFA Verbs support two-sided and one-sided operations, always asynchronous; reliable and unreliable modes, connection-oriented and connectionless; remote direct memory access, send and receive; and atomic operations on remote memory regions. To allow direct access to endpoint memory, this virtual memory must be pinned in physical memory and registered with the network interface I/O MMU. OFA Verbs offer cross-platform support across InfiniBand on IB networks, iWARP on IP networks and RoCE on Ethernet fabrics.
iWARP uses the IETF-defined RDDP (Remote Direct Data Placement) to deliver RDMA services over standard, unmodified IP networks and standard TCP/IP Ethernet services. Enhancements to the Ethernet data link layer enabled the application of advanced RDMA services over IEEE Data Center Bridging (DCB), that is, lossless Ethernet. In early 2010, this technology, now known as RDMA over Converged Ethernet (RoCE), was standardized by the IBTA (InfiniBand Trade Association). RoCE exploits these Ethernet (DCB) advances and eliminates the need to modify and optimize iWARP to run efficiently over DCB. RoCE focuses on server-to-server and server-to-storage networks, delivering the lowest latency and jitter characteristics and enabling simpler software and hardware implementations. RoCE supports the OFA Verbs interface seamlessly.
The GPUDirect specification was developed jointly by Mellanox and NVIDIA. It is composed of a new interface (API) within the Tesla GPU driver, a new interface within the Mellanox InfiniBand drivers, and a Linux kernel modification to support direct communication between drivers. GPUDirect allows RDMA-capable devices to directly access GPU device memory, so data can be transferred between two GPUs without buffering in host memory. GPUDirect Verbs provide extended memory registration functions to support GPU buffers and a GPU memory de-allocation call-back for efficient MPI implementations.
The Intel® Omni-Path technology competes with InfiniBand, with the advantage that the interfaces can be integrated into the Intel® processors themselves. It can be used through the OpenFabrics library, which implements the InfiniBand Verbs API as standardized by the OpenFabrics Alliance (OFA).
2.3 HPC Communication Systems
Today's high-performance computing (HPC) programming models are based on Single Program Multiple Data (SPMD) execution, where a single program executable is spawned on N processing nodes. There is one process per node and each process is assigned a unique rank ∈ [0, N−1]. The main HPC programming model is the Message Passing Interface (MPI), which combines SPMD execution, explicit send/receive of data buffers, and collective operations.
Whereas most HPC applications still rely on message-passing semantics using classic MPI, the underlying communication systems evolved decades ago to rely on one-sided communication semantics, starting with the Cray SHMEM library [7][9]. The rise of PGAS languages like Co-Array Fortran [19][16] and UPC, and of Global Arrays, motivated the development of one-sided communication layers, notably GASNet from Berkeley and ARMCI from PNNL. PGAS languages and GA combine SPMD execution, one-sided communications, and collective operations. The MPI standard introduced one-sided communications in MPI-2, which have been reworked and can be combined with split-phase synchronization in MPI-3.
The Cray SHMEM (SHared MEMory) library [4] was initially introduced by Cray Research for low-level programming on the Cray T3D and T3E massively parallel processors [7]. This library defines symmetric variables as those with the same size, type, and address relative to the processor local address space, and these naturally appear as a by-product of SPMD execution. Dynamic memory allocation of symmetric variables is supported with a shmalloc() operation. Static data, and heap data obtained through this symmetric allocator, are implicitly registered. Thanks to the symmetric variables, it is possible to define the interface of one-sided operations such as put and get by referring to local objects only. Besides put and get variants, the SHMEM library supports remote atomic operations and collective operations. The SHMEM library motivated the design of the F−− language [8], one of the first Partitioned Global Address Space (PGAS) languages, which evolved into Co-Array Fortran.
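As an illustration of the symmetric-variable concept, the following minimal sketch uses the standard OpenSHMEM C API [4] (not the AOS layer): the put addresses remote data purely through the local symmetric object dest.

    #include <shmem.h>

    static long dest;                 /* static data is symmetric (implicitly registered) */

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        long src = me;                /* private, local-only value */

        /* One-sided put: write our rank into 'dest' on the right neighbor,
           referring only to the local symmetric object 'dest'. */
        shmem_long_put(&dest, &src, 1, (me + 1) % npes);

        shmem_barrier_all();          /* completes all puts and synchronizes */
        /* 'dest' now holds the rank of the left neighbor. */

        shmem_finalize();
        return 0;
    }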
The Aggregate Remote Memory Copy Interface (ARMCI) [17] was designed as an improvement over Cray SHMEM and IBM LAPI (IBM SP), and is used as the base of the Global Arrays toolkit. The API is structured in three classes of operations:
• data transfer operations: put, get, accumulate;
• synchronization operations: atomic read-modify-write, lock/mutex;
• utility operations: memory allocation/deallocation, local and global fence, error handling.
The Berkeley Global Address Space Networking (GASNet) library [1] is designed as a compiler run-time library for the PGAS languages UPC and Titanium. The GASNet library is structured into a core API and an extended API. The core API includes memory registration primitives, and is otherwise based on the active message paradigm. Active message request handlers must be attached to each instance of the SPMD program by calling a collective operation gasnet_attach(). The extended API is meant primarily as a low-level compilation target, and can be implemented either with only the core API, or by leveraging higher-level primitives of the network interface cards. The extended API includes put, get, and remote memset() operations. Data transfers are non-blocking, and the synchronization barrier is split-phase.
However useful, classic HPC communication layers cannot be effectively applied to manycore processors with local memories. The first problem is that the memory locally available to a core is several GB on HPC systems, while it is tens or hundreds of KB on manycore processors. The second problem with HPC communication libraries is that they assume a symmetric memory hierarchy, where the total memory is the union of the compute node memories. Manycore processors not only have (on-chip) local memories, but also one or more external DDR memory systems. Finally, a network-on-chip interface is much less capable than a macro network interface, but has significantly lower latencies.
2.4 OpenCL Asynchronous Copies
OpenCL structures a platform into a Host connected to Compute Devices. Each Compute Device has a Global Memory, which is shared by Compute Units. Each Compute Unit has a Local Memory, a cache for the Global Memory, and Processing Elements that share the Local and the Global memories. Each Processing Element has registers and a Private Memory. Computations are dispatched from the Host to the Compute Units as Work Groups. A Work Group is composed of Work Items, which are instances of a computation kernel written in the OpenCL-C dialect. This dialect includes vector data types and requires tagging memory objects with their address space: __global, __local, __private, and __constant.
For CPU-based clustered manycore processors, the main shortcoming of OpenCL is the inability to support efficient communication between the Local Memories and synchronization between the Compute Units. However, this capability is essential to the efficient implementation of image processing, CNN inference and other algorithms where tiling is applicable. Moreover, OpenCL was originally designed for the GPGPU manycore architecture, where context switching at the cycle level is exploited to cover memory access stalls with useful computations. On a DSP or CPU-based manycore architecture, there is no such cycle-level switching. As a result, high performance requires that programmers manually build processing pipelines with the OpenCL asynchronous prefetch or copy operations between the Global Memory and the Local Memory.
Specifically, OpenCL defines the async_work_group_copy and async_work_group_strided_copy operations. These asynchronous copy operations enable data to be copied asynchronously between the device Global Memory and the Compute Unit Local Memory. The main limitation of these operations is that data must be read or written contiguously from/into the local memory (dense mode), an assumption that turns out to be overly restrictive.
For instance, in an image processing tiling decomposition, one may need to copy a 2D sub-image of 16×16 pixels to a larger local buffer, allocated at 18×18 pixels to deal with halo pixels. In this case, one must explicitly manage a local stride of two pixels between each data block, since the local buffer is sparse and data should not be written contiguously to it. This restriction of OpenCL asynchronous copy operations is even more apparent when local buffers are declared as true multi-dimensional arrays, as supported by the C99 and OpenCL-C standards. The use of true multi-dimensional arrays particularly eases 2D/3D stencil programming.
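As an illustration of this restriction, the following OpenCL-C sketch (the kernel and parameter names are ours, not from the paper) loads such a 16×16 tile into an 18×18 local buffer using the standard dense-mode primitive; since the local stride of 18 cannot be expressed, one asynchronous copy must be issued per row.

    __kernel void load_tile(__global const uchar *img, int img_stride,
                            int tile_x, int tile_y)
    {
        /* 18x18 local buffer: 16x16 tile plus a 1-pixel halo (halo loads omitted). */
        __local uchar tile[18][18];
        event_t ev[16];

        /* Dense-mode async_work_group_copy cannot express the local stride of 18,
           so each of the 16 rows is copied separately. */
        for (int row = 0; row < 16; ++row) {
            ev[row] = async_work_group_copy(
                &tile[row + 1][1],                          /* __local destination  */
                img + (tile_y + row) * img_stride + tile_x, /* __global source row  */
                16,                                         /* elements per row     */
                0);
        }
        wait_group_events(16, ev);
        barrier(CLK_LOCAL_MEM_FENCE);
        /* ... compute on tile[][] ... */
    }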
Nevertheless, the asynchronous copy operations of OpenCL have proved highly useful for exploiting the DMA engines available on FPGAs, DSPs or clustered manycore processors like the MPPA®. As the OpenCL standard allows vendor extensions for efficient usage of the target hardware, asynchronous 2D copy has been implemented in the OpenCL runtime of STHORM P2012 [13]. Other companies such as Xilinx®, Altera®, Adapteva®, AMD® and Intel® also provide similar OpenCL extensions on their processors.
3 PRINCIPLES OF THE AOS LAYER
AOS operations operate on memory segments, which are represented by opaque communication objects supporting one or several of the communication protocols presented below: Load/Store, Put/Get, Remote Queues, and Remote Atomics. Memory segment objects are created in an address space and published by cores that have direct access to that memory space. By using the global ID supplied upon segment creation, cores operating in another address space can clone the segment and obtain the object that locally represents this (remote) memory segment.
3.1 Load/Store: Memory-to-Register
Load/Store is the simplest method to transfer data. Any memory access with the correct Memory Management Unit (MMU) mapping of virtual to physical addresses allows the processor to access a memory segment through the cache hierarchy. Manycore architectures support transparent access to the global memory either through hardware caches, or via MMU-based distributed shared memory systems like TreadMarks [11].
However, on large-scale parallel systems, Load/Store through multiple levels of cache hierarchy can be a serious performance bottleneck when there is data sharing between multiple Non-Uniform Memory Access (NUMA) nodes, and it does not scale easily. In particular, reductions and inter-core or inter-node communications are complex to implement efficiently.
3.2 Two-sided: Remote Queues
Classic two-sided send/receive operations are known to have a significant overhead due to synchronizations between the sender and receiver nodes, and to the need for temporary buffers as opposed to zero-copy communication. In addition, real-life implementations present significant correctness challenges [10]. The two-sided send/receive operations are nevertheless the main communication primitives proposed by MPI [14].
As a primitive two-sided protocol, we select the remote queue operations described in [2], as they avoid the problems of classic message passing. First, they can be implemented as simple message queues, which are proven to be efficient for fine-grained control and coordination of distributed computations. Moreover, remote queues also apply to N-to-1 communication whenever atomicity of the enqueue and dequeue operations can be ensured. This N-to-1 capability is essential on massively parallel systems that require run-time orchestration of activities (e.g. the master/worker parallel pattern). Finally, remote queues support synchronization without introducing any locking mechanism.
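A minimal usage sketch of this N-to-1 pattern is given below; since this section only states the principles, the queue API names and prototypes are hypothetical, not taken from the paper.

    /* Hypothetical AOS-style remote queue API (illustrative names only). */
    typedef struct aos_queue aos_queue_t;
    extern int aos_enqueue(aos_queue_t *q, const void *msg, unsigned len); /* N-to-1 atomic enqueue */
    extern int aos_dequeue(aos_queue_t *q, void *msg, unsigned len);       /* local, non-blocking dequeue */

    typedef struct { long worker_id; long result; } job_msg_t;

    /* Worker side: report a result to the master's N-to-1 queue segment. */
    void worker_report(aos_queue_t *master_q, long id, long result)
    {
        job_msg_t msg = { .worker_id = id, .result = result };
        aos_enqueue(master_q, &msg, sizeof msg);
    }

    /* Master side: local dequeue; no lock is needed since the dequeue is
       local and the concurrent enqueues are atomic. */
    void master_collect(aos_queue_t *local_q, int nworkers)
    {
        job_msg_t msg;
        for (int i = 0; i < nworkers; ++i) {
            while (!aos_dequeue(local_q, &msg, sizeof msg))
                ;   /* spin or idle until a message arrives */
            /* ... dispatch the next job to msg.worker_id ... */
        }
    }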
3.3 One-sided Put/Get: RDMA
RDMA is a one-sided communication protocol inspired by [18], [22] and [12]. Any communication initiator registered to a memory
segment is a master on this memory. An RDMA transfer can be initiated on a target memory segment using either the put or get primitives. Like their HPC counterparts, these RDMA protocols have several advantages: they provide zero-copy memory-to-memory operation (no buffering), and they do not require any synchronization before initiating the transfers, as the initiator is a master on the targeted memory segment. When several initiators target the same memory segment at the same time, no software serialization is performed; the only point of serialization is the hardware bus width.
The RDMA operations favor high throughput over low latency, and typically assume a relaxed memory consistency model. A relaxed memory consistency model allows operations to execute asynchronously with out-of-order global completion. For comparison, Load/Store is also a one-sided operation, which makes the RDMA protocol usually easier to manipulate than classic Send/Receive, where a large overhead exists because of the strict matching of the Send and Receive operations (a synchronization is needed every time).
3.4 Remote Atomics: Active Messages
Remote atomic operations have been proven efficient for inter-node synchronization [15], [3]. A remote atomic operation consists in an initiator sending a message carrying an operation and its operands to a remote or external compute resource. This compute resource then executes the operation, possibly under certain conditions, and forwards the completion to the initiator.
Remote atomic operations are based on active messages. An active message is a low-overhead message that executes as a call without return value, using the message payload as arguments. In our implementation, the active message is executed either after passing certain user-defined conditions or immediately for low latency. Remote atomic operations are a ubiquitous mechanism for efficient parallel programming. For instance, they provide classic remote atomic instructions such as fetch-and-add, compare-and-swap and load-and-clear. When applicable, we also define posted variants of these operations (without a returned value). Remote atomics implemented with active messages allow efficient and elegant synchronization mechanisms.
4 MPPA® PROCESSOR ARCHITECTURE
4.1 Architecture Overview
The Kalray MPPA® (Multi-Purpose Processing Array) manycore architecture is designed to achieve high energy efficiency and deterministic response times for compute-intensive embedded applications [21].
Figure 1: MPPA® Processor
The MPPA® processor (Fig. 1), code-named Bostan, integrates 256 VLIW application cores and 32 VLIW management cores (288 cores in total) which can operate from 400 MHz to 600 MHz on a single chip, and delivers more than 691.2 GFLOPS single-precision for a typical power consumption of 12 W. The 288 cores of the MPPA® processor are grouped into 16 Compute Clusters (CC), and the processor implements two Input/Output Subsystems (IO) to communicate with the external world through high-speed interfaces, namely PCIe Gen3 and 10 Gbit/s Ethernet.
4.2 Computing Resources
4.2.1 Input/Output Subsystems. The IO subsystems each integrate two quad-cores based on the VLIW architecture described in Section 4.2.3, and are connected to 4 GB of external DDR3 memory and an on-chip Shared Memory (SMEM) of 4 MB. Regarding core memory accesses, both cached and uncached Load and Store operations (64 bits/cycle) are supported in the SMEM and DDR3. For the shared memory, cached and uncached atomics are available, such as Load-and-Clear, Fetch-and-Add and Compare-and-Swap (CAS). Cached atomic operations provide execution efficiency when dealing with critical algorithm paths that need mutual exclusion or atomic updates of variables. Each IO embeds 8 high-speed IO interfaces, usually called Direct Memory Access (DMA) interfaces, to communicate through PCI Express First-In-First-Out queues (FIFOs), Ethernet, DDR3 and SMEM. The software is in charge of maintaining the memory coherence between DMA reads/writes and the cores.
4.2.2 Compute Clusters. Each compute cluster embeds 17 cores: 16 Processing Elements (PE) and one Resource Manager (RM). Compute clusters integrate a multi-banked private local SMEM of 2 MB. Core memory accesses are supported only in this SMEM, and only uncached atomics are available (the same atomics as on the IO). Each compute cluster has one DMA interface for communicating with external nodes. Here also, the software is in charge of maintaining the memory coherence between DMA reads/writes and the cores.
4.2.3 Core. Each MPPA® core implements a 32-bit VLIW architecture which issues up to 5 instructions per cycle, corresponding to the following execution units: branch & control unit (BCU), ALU0, ALU1, load-store unit (LSU), and a multiply-accumulate unit (MAU) combined with a floating-point unit (FPU). Each ALU is capable of 32-bit scalar or 16-bit SIMD operations, and the two can be coupled for 64-bit operations. The MAU performs 32-bit multiplications with a 64-bit accumulator and supports 16-bit SIMD operations. Finally, the FPU supports one double-precision fused multiply-add (FMA) operation per cycle, or two single-precision operations per cycle.
4.3 Communication Resources
4.3.1 Network-on-Chip. The 18 multi-core CPUs of the MPPA® are interconnected by a full-duplex 32-bit NoC. The NoC implements wormhole switching with source routing, and supports guaranteed services through the configuration of flow injection parameters at the NoC interface: the maximum rate σ; the maximum burstiness ρ; and the minimum and maximum packet sizes (the size unit is the flit). A flit is 32 bits (4 bytes), which gives a bandwidth of 2 GB/s per link direction when operating at 500 MHz. The NoC is a direct network with a 2D torus topology. This network does not support Load/Store, but only data NoC streams and low-latency control NoC messages. Thus the software is in charge of converting virtual memory addresses into a data stream (data NoC), and of converting
this stream back into virtual addresses in the remote memory, in order to initiate any communication between multi-core CPUs.
4.3.2 Control NoC Interface. The control NoC is designed for very low-latency communication with messages of 64 bits only. It does not have access to the memories (on-chip or off-chip): the messages are mapped into the DMA interface registers. Each DMA interface implements 128 64-bit control NoC Rx mailboxes and 4 Tx resources. These mailboxes can be used for barriers and simple 64-bit messages with a notification to a list of processors (up to 17 cores in a CC). In this paper, the barrier mode is mainly used for generic inter-core low-latency synchronization and notification, for instance forcing a remote core or a pool of remote cores out of the idle state within a single clock cycle of the initiating core. A store in the peripheral space is a posted operation. The 4 Tx resources must be shared between the cores of the multi-core CPU. A NoC route and a remote control NoC Rx mailbox identification number (called a tag, in range [0, 127]) must be configured to send a 64-bit message through each control NoC Tx resource.
4.3.3 Data NoC Interface. The data NoC is designed for high throughput. It is therefore a deeply asynchronous hardware block that must be handled asynchronously by the software: all outstanding incoming and outgoing transactions must be managed asynchronously by the software for performance. Each data NoC DMA interface is composed of three elements:
• Eight micro-cores are available in each DMA interface and can run concurrently. A micro-core is a micro-programmable DMA core that needs to be programmed and configured. It has a simple set of instructions such as reads, local and remote notifications for local and remote completions, and support for the arithmetic of internal read pointers. It can execute up to 4 nested loops to describe custom memory access patterns with high throughput. This throughput is limited by the technology of the memory from which the micro-core is reading, the NoC link size (4 bytes/cycle) and the memory access patterns.
• The data NoC implements 256 Rx tags (range [0, 255]) per DMA interface, used to write incoming data NoC packets into the scratchpad memory of compute clusters or into the DDR memory of the IOs. Each Rx tag has a write window described by a base address, a size and a write pointer, which need to be configured and managed at runtime. The completion of an incoming data transfer is signaled by an end-of-transfer (Eot) command, which increments a 16-bit notification counter corresponding to the Rx tag used in the DMA interface of the MPPA® network.
• Each DMA interface implements 8 packet shapers. A packet shaper (DMA Tx) is a hardware unit that builds data NoC packets using data coming from a PE or a micro-core. The packet shaper then sends these NoC packets into the MPPA® NoC using the configured NoC route. All NoC routes and Quality-of-Service (QoS) injection parameters need to be configured by software.
4.4 Memory Architecture
4.4.1 Hierarchy. Besides the register file, the MPPA® exposes a three-level memory hierarchy. First, the L1 is the data cache, allowing transparent cached accesses to the L2. Second, the L2 is the local shared memory, an on-chip high-bandwidth and low-latency scratchpad memory. The L3 is the main global memory, implemented in DDR3 technology. The L2 memory of the MPPA can be configured either to cache the L3 memory (a software emulation of an L2 cache using the MMU, inspired by [11], as in conventional cache-based systems), or as user buffers managed explicitly by DMA transfers. The L3 can also be accessed by the IO DMA interfaces, or through the IO core L1 data cache by Load/Store. Finally, on compute clusters, L1 caches are not coherent between cores and DMA interface writes; thus, memory coherency is managed by software (full memory barriers, partial memory barriers or uncached memory accesses are used).
4.4.2 Memory Map. The hardware exposes a heterogeneous memory map of 20 address spaces (2 per IO and 1 per CC). MPPA® processors implement a distributed memory architecture, with one local memory per cluster. IO cores access their local SMEM and private DDR via Load/Store and via DMA interfaces. Compute clusters also access their local SMEM (but not the DDR) via Load/Store and via their DMA interface. The DMA interface must be used to build NoC packets and send them into the NoC in order to communicate between the 20 available address spaces.
5 DESIGN AND IMPLEMENTATION OF THE
ASYNCHRONOUS ONE-SIDED OPERATIONS
Asynchronous one-sided communications have been proven efficient for HPC workloads by overlapping communication with computation while providing fundamental ordering properties. However, enabling such a feature on a heterogeneous distributed local memory architecture like the MPPA is a challenge.
First, the runtime system needs to deal with hardware resource sharing (memories, DMA Rx tags, DMA Tx packet shapers and DMA micro-cores) and the dynamic management of the outstanding hardware resources in use (complex hardware-software interactions). This must be done for both local and remote resources in a massively parallel environment. Second, several abstraction features need to be provided to the user application, such as QoS configuration, synchronization and bindings at the creation of communication segments for any protocol, without requiring the user to be aware of the NoC topology for any of the on/off-chip memories. This abstraction is also provided to initiators that are registered to a segment. Third, as learned from [6], the abstracted one-sided protocols should not be limited by the number of physical hardware resources. In fact, these hardware resources should be translated (virtualized) into different kinds of "software" components such as memory segment management, RDMA emulation, remote queues and automatic flow control, without losing the performance of the hardware. This frees the user from managing physical hardware resources and software job FIFO congestion control, which is often a complex issue and a source of errors. Fourth, the RDMA engines must expose fundamental ordering properties regarding outstanding transactions and remote atomic operations, as well as maintain memory coherence and consistency (memory ordering) at synchronization points. Finally, the layer needs to implement an all-to-all
initiator-to-server flow-control mechanism for the remote atomic operations, to avoid data and request corruption when congestion occurs. Such constraints make our software-emulated one-sided communication engine complicated to develop, debug and validate while still reaching decent performance (close to the theoretical maximum hardware throughput).
5.1 Asynchronous Active Message and RDMA
Put & Get Engine
The active message and RDMA engine is usually on the critical path of data-intensive applications. All data transfers are managed by this engine, which runs on the PEs; therefore, it needs to be efficient, thread-safe and user-friendly (e.g. management of the maximum number of outstanding jobs with flow control). RDMA and active message transactions operate on a window, which is a one-sided memory object well described in the one-sided MPI standard. Initiating a one-sided operation consists in parsing the targeted segment parameters (Line 2 in Algorithm 1), such as the supported segment protocols (RDMA or remote atomics in our case), the NoC route and the destination DMA Rx tag, and checking that the read or write transaction is not out of bounds of the targeted remote window. The transaction is then prepared by the initiator core, which atomically takes a Slot on the targeted segment. An N-to-N flow-control mechanism has been implemented both for remote memory transactions ("inter-node") and for local memory transactions ("intra-node") (done_slot, Line 9 in Algorithm 1), which provides backpressure when the hardware and low-level software are under congestion. In steady state, the core then sends the request either onto the NoC or into the shared local memory. The completion ticket of the initiated transaction is computed and attached to an Event, which can be waited on later using Algorithm 2. Indeed, when an event is given by the user to the engine, the engine initiates the transaction and returns immediately (the request is outstanding). If no event is given to the engine, it waits until the transaction's completion.
The one-sided user API functions take as arguments a local virtual address that is read or written, a remote target segment, transaction parameters (e.g. operation type, size, stride, geometries), and an event for the completion of the initiated transaction. The RDMA functions are exposed to the user as put and get transactions, and the remote atomic operations as postadd, poke, fetchclear, fetchadd and peek. The user is also provided with a fence operation for the remote completion of the outstanding RDMA transactions of a targeted memory segment.
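A minimal usage sketch of this API is given below; the paper does not give the exact C prototypes, so all identifiers (aos_put, aos_get, aos_event_wait, aos_segment_t, aos_event_t) are illustrative.

    /* Illustrative declarations of the one-sided user API described above. */
    typedef struct aos_segment aos_segment_t;
    typedef struct aos_event   aos_event_t;
    extern int  aos_put(const void *local, aos_segment_t *seg, unsigned long remote_off,
                        unsigned long size, aos_event_t *ev);
    extern int  aos_get(void *local, aos_segment_t *seg, unsigned long remote_off,
                        unsigned long size, aos_event_t *ev);
    extern void aos_event_wait(aos_event_t *ev);

    static char tile_in[16 * 1024];    /* local SMEM buffers */
    static char tile_out[16 * 1024];

    void process_block(aos_segment_t *seg, unsigned long remote_off)
    {
        aos_event_t ev_get, ev_put;

        /* Asynchronous get: returns immediately, completion tracked by ev_get. */
        aos_get(tile_in, seg, remote_off, sizeof tile_in, &ev_get);
        aos_event_wait(&ev_get);       /* lock-free wait of Algorithm 2 */

        /* ... compute from tile_in into tile_out ... */

        /* Asynchronous put, overlapped with the caller's next iteration. */
        aos_put(tile_out, seg, remote_off, sizeof tile_out, &ev_put);
        /* ... other work ... */
        aos_event_wait(&ev_put);       /* local completion of the put */
    }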
5.2 Completion Event-based Engine
The completion event-based engine, described by Algorithm 2, is designed for very low-latency event completion. An event is a condition that has only two states (true or false) for the user. An event has a memory Address whose content is monitored and compared with a Value using a simple condition (e.g. equal, greater, less, etc.). Depending on the nature of the event's associated transaction, its state can become true by getting hardware pending events (Line 10 in Algorithm 2) and accumulating them into the content of Address. This sequence is required to prevent the DMA Rx end-of-transfer (Eot) counter from saturating. However, this sequence needs to be atomic; thus we use atomic uncached instructions and we notify all other remote
Algorithm 1 Concurrent RDMA or Active Message Initiator Engine Finite-State-Machine (FSM) (parallel code)
1: Load the target parameters of the targeted segment
2: if the transaction is not a valid remote segment transaction then
3:   Cancel transaction and return failure /* Exit */
4: end if
5: Prepare transaction (RDMA or Active Message)
6: Write memory barrier (for outstanding user stores)
7: Slot = atomic fetch-add uncached on slot[target] counter
8: /* Transaction flow-control (outstanding) */
9: while (Slot + 1) >= (done_slot[target] + FIFO_SIZE) do
10:   Idle core // or OS yield
11: end while
12: if remote NoC transaction then
13:   /* Packet shaper is locked (atomic with the taken Slot) */
14:   PE configures Tx packet shaper (route)
15:   PE pushes NoC transaction and remote notify (ordered)
16: else
17:   /* Write job in shared memory (parallel) */
18:   Write job at (Slot + 1) % FIFO_SIZE (ordered)
19:   Write memory barrier
20:   Send notification (inter-PE event)
21: end if
22: Set Event for completion ticket (Slot + 1)
23: if blocking transaction then
24:   wait for Event to occur (see Algorithm 2)
25: end if
processors of the local node when the content of Address is updated in the memory hierarchy. This broadcast notify operation (Line 15 in Algorithm 2) is done using a low-latency control NoC Rx mailbox in barrier mode, which amounts to a single posted store in the peripheral space for the processor. For both low latency and high throughput of event management, this engine does not rely on any interrupt mechanism, to avoid trashing the instruction/data caches when switching to interrupt handlers, suffering from interrupt noise and interrupt handler control multiplexing, and paying the overhead of context switching.
However, for more generic operating systems (Linux or an RTOS), this engine could use preemptive and cooperative multi-threading (Line 18 in Algorithm 2). Nevertheless, in a high-performance environment, the MPPA's operating systems are based on a simple run-to-completion multi-threading model in the compute matrix. This engine also provides a non-blocking user primitive for testing whether or not the event is true.
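The following C11-atomics sketch approximates the wait loop of Algorithm 2; the helpers for the Eot counter and the control-NoC broadcast are illustrative placeholders for the MPPA-specific uncached atomics and mailbox mechanisms, which are not reproduced here.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Illustrative placeholders for hardware-specific operations. */
    extern uint64_t read_and_clear_eot(int rx_tag);   /* atomic load-and-clear of the Eot counter */
    extern void     notify_all_pes(void);             /* control NoC Rx mailbox in barrier mode   */

    void event_wait(_Atomic uint64_t *addr, uint64_t value, int rx_tag)
    {
        for (;;) {
            /* Evaluate the user condition on the monitored address. */
            if (atomic_load_explicit(addr, memory_order_acquire) >= value)
                return;                               /* completion reached */

            uint64_t eot = read_and_clear_eot(rx_tag);
            if (eot) {
                /* Accumulate hardware completions into the monitored word,
                   then force the other PEs to re-evaluate their conditions. */
                atomic_fetch_add_explicit(addr, eot, memory_order_release);
                notify_all_pes();
            }
            /* idle the core or yield to the OS */
        }
    }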
5.3 RDMA and Active Message Job Arbiters
The RDMA and active message job arbiters are high-performance FSMs that run on each node of the MPPA's network. These FSMs serve the requests produced by Algorithm 1. They rely on round-robin arbitration and are triggered by events (no interrupts) sent by the DMA NoC interface or by inter-PE events. These arbiters process requests coming from the NoC or from other intra-node PEs. This incoming request sequence is described by Algorithm 1 for both RDMA and active message jobs. The RDMA job arbiter manages asynchronously the execution of the outstanding micro-programmable DMA cores. It selects an available DMA core, configures the NoC route, writes
Algorithm 2 Lock-Free Concurrent Wait Event FSM (parallel code)
1: Value = load Event condition value
2: Address = load Event address to check
3: while true do
4:   Test_Value = uncached load at Address to evaluate
5:   if condition between Value and Test_Value holds then
6:     Read memory barrier /* Core & DMA coherence */
7:     Return success /* Exit */
8:   end if
9:   if pending hardware DMA events then
10:     Eot = atomic load-and-clear on the end-of-transfer notification counter of the Rx tag
11:     if Eot then
12:       Atomic fetch-add uncached Eot at Address
13:       Write memory barrier
14:       /* Force PEs to re-evaluate conditions */
15:       Broadcast notify to all PEs
16:     end if
17:   end if
18:   Idle core // or OS yield possible
19: end while
the DMA core arguments, starts the DMA core and updates the completion job ticket. The active message arbiter is simpler, as it does not need to manage (asynchronous) outstanding jobs. Indeed, an active message is an operation containing a set of instructions with operands. When the operation is processed, the active message job arbiter sends the result back to the initiator (if any) and updates the completion job ticket. The most complex part of this software arbiter is that all active messages from an initiator are ordered with all outstanding RDMA writes of this initiator. Concisely, from the initiator's point of view, all outstanding incoming RDMA transactions complete before the posted remote atomic operation is processed (active message operation completion).
5.4 Memory Ordering of RDMA and Active
Message Operations
For performance, one-sided operations follow a relaxed memory consistency model. Regarding outstanding RDMA operations from an initiator's point of view, all puts are strictly ordered with respect to each other for their local completion, but not for their remote completion (completion is given by Algorithm 2). Indeed, low latency and high throughput are critical for performance, which implies low-level mechanisms with the following properties.
Get operations are ordered when reading from the same memory segment, while reads from different memory segments are not ordered. Outstanding put and get operations are not ordered on the same initiator, for efficient parallel execution. Hence, whenever a Read-After-Write (RAW) dependency occurs (a put followed by a get on the same memory segment), an RDMA fence completion must be performed before initiating the get operation. The fence operation is provided by the one-sided engine and is part of the active message operations. In the memory consistency model, the completion of the one-sided fence operation provides the visibility of all outstanding memory modifications to other processors and nodes.
Remote atomic operations are ordered when targeting the same memory segment, and not ordered when targeting different memory segments. A powerful property of outstanding RDMA transactions and outstanding remote atomic operations is that they are ordered with respect to each other when targeting the same segment. This is achieved thanks to a point-to-point software "virtual channel" between each pair of segments. When an initiator X posts several puts (RDMA transactions) and then posts a remote atomic operation to a memory segment, the posted remote atomic operation is seen in this memory segment only after the remote completion of the previously initiated puts of initiator X. Such ordering is very important for performance, as the initiator can post high-throughput RDMA data transfers along with a posted synchronization mechanism. Indeed, everything can be done asynchronously from the initiator's point of view; the initiator can therefore go back to computation immediately without losing any time. We call this concept outstanding ordering between posted remote atomic operations and RDMA transactions.
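A producer-side sketch of this pattern follows, reusing the illustrative API names introduced after Section 5.1 (they are not the actual AOS prototypes); the segment layout with COUNTER_OFFSET and DATA_OFFSET is also an assumption made for the example.

    /* Posted puts followed by a posted remote atomic on the same segment. */
    typedef struct aos_segment aos_segment_t;
    typedef struct aos_event   aos_event_t;
    extern int  aos_put(const void *local, aos_segment_t *seg, unsigned long off,
                        unsigned long size, aos_event_t *ev);
    extern void aos_postadd(aos_segment_t *seg, unsigned long off, long inc);

    #define MAX_BLOCKS     8
    #define COUNTER_OFFSET 0UL      /* "blocks ready" counter at the start of the segment */
    #define DATA_OFFSET    64UL     /* data blocks follow the counter */

    void producer(aos_segment_t *seg, const char *blocks, unsigned long bsz, int n)
    {
        aos_event_t ev[MAX_BLOCKS];

        for (int i = 0; i < n; ++i) /* asynchronous (outstanding) puts */
            aos_put(blocks + (unsigned long)i * bsz, seg,
                    DATA_OFFSET + (unsigned long)i * bsz, bsz, &ev[i]);

        /* Posted remote fetch-and-add on a counter inside the same segment: it is
           processed only after the remote completion of the puts above, so it can
           safely serve as the "data ready" signal for the consumer. */
        aos_postadd(seg, COUNTER_OFFSET, n);

        /* The initiator returns to computation immediately; ev[] may be waited on
           later before reusing the local buffers. */
    }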
5.5 Data Restructuring Support
Linear and strided copies are essential geometries in RDMA communication. A strided transfer can have an offset between each contiguous data block on the src buffer, the dst buffer, or both. An efficient RDMA API (and the underlying hardware) should be able to perform strided transfers with zero-copy, by automatically incrementing the read and write DMA offsets at low cost. We call this capability "on-the-fly data restructuring". Applications such as computer vision, deep learning, signal processing, linear algebra and numerical simulation require efficient zero-copy data restructuring. They generally rely on tiling with or without overlap, halo region forwarding, transposition patterns and 2D/3D block transfers. 2D and 3D copies are special cases of strided copy, where stride offsets exist on both the src and dst buffers. These offsets can differ from each other, as the local buffer is often smaller and accommodates a sub-partition of the remote buffer. We took our panel of transfer geometries from [5] and added a few more patterns, typically the spaced-remote-sparse-local transfer (with arbitrary remote_stride and local_stride offsets), and its induced implementation for 2D/3D data blocks.
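As a sketch of such a geometry, the 2D put below reuses the 16×16 tile held in an 18×18 local buffer from Section 2.4; the aos_put_2d prototype and its parameter order are hypothetical, not the actual AOS API.

    typedef struct aos_segment aos_segment_t;
    typedef struct aos_event   aos_event_t;
    extern int aos_put_2d(const void *local, unsigned long local_stride,
                          aos_segment_t *seg, unsigned long remote_off,
                          unsigned long remote_stride,
                          unsigned long width, unsigned long height,
                          aos_event_t *ev);

    /* Write a 16x16 tile (skipping the 1-pixel halo of the 18x18 local buffer)
       into a W-pixel-wide remote image at coordinates (x, y). */
    void put_tile_2d(aos_segment_t *img_seg, unsigned long x, unsigned long y,
                     unsigned long W, const unsigned char local_buf[18][18],
                     aos_event_t *ev)
    {
        aos_put_2d(&local_buf[1][1],   /* local base address, past the halo       */
                   18,                 /* local stride: 18-byte rows              */
                   img_seg,
                   y * W + x,          /* remote offset of the tile's first pixel */
                   W,                  /* remote stride: one image row            */
                   16, 16,             /* block width and height                  */
                   ev);
    }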
6 RESULTS, ANALYSIS & LIMITS
6.1 Benchmarking Environment
We use a multi-multicore Central Processing Unit (CPU) execution model with a low-level POSIX-like environment for benchmarking. All measurements were made on an MPPA® operating at 500 MHz with one or two DDR3 channels running at 1066 MHz. Each DDR3 bus is 64-bit wide, which leads to a theoretical maximum memory bandwidth of 8.5 GB/s (17.0 GB/s using 2 DDRs). The NoC is 32-bit wide, also operating at 500 MHz; it therefore provides up to 2.0 GB/s. Each compute cluster's SMEM has 1 NoC link providing 2.0 GB/s per link direction. However, we use a typical data NoC payload of 32 flits with a header of 2 flits, for a total typical packet size of 34 flits. This leads to a maximum effective data transfer throughput of 2 * (32/34) GB/s, which gives 1.88 GB/s (full-duplex). The throughput is defined as the memory bandwidth at the memory that the node(s) or processor(s) are reading or writing. The latency is defined as the time between the initiation and the completion of a transaction; it therefore depends on the size of the transaction.
6.2 RDMA Performance
6.2.1 Memory Throughput. Figure 2 shows both DDR and inter-cluster SMEM reads (gets) and writes (puts). The throughput is measured at the memory that the CCs are reading from and writing to. The size of the RDMA transactions (abscissa) and the number of CCs (different curves) vary, showing different saturation points for the hardware and software involved. All throughput benchmarks were done with asynchronous RDMA transactions, in order to saturate the software in charge of configuring the DMA NoC interfaces. Software flow control is thus heavily used to prevent the corruption of outstanding local or remote FIFOs.
First, for DDR memory accesses we achieve more than 50% of the maximum theoretical throughput for data transfers greater than 4 KB, and 94% for 32 KB, in all topologies. Second, it can be noticed that RDMA puts perform better than gets. This is due to remote server contention, which is the point of serialization for the configuration of the DMA interfaces. Indeed, from the software point of view, outstanding puts only rely on local flow control, whereas outstanding gets rely on remote flow control. Remote flow control is more complex, as it requires more software interactions with the DMA NoC interface (configuration and packet transmission). We show that our software implementation of RDMA reaches more than 70% of the peak hardware throughput for contiguous data NoC streams larger than 8 KB. To conclude, the RDMA engine provides the user application with efficient usage of the hardware for data stream sizes of 8 KB and above, while managing complex flow-control mechanisms. Providing this performance without requiring the user to manage flow control is a significant convenience: implementing software flow control in an application is often error-prone with respect to data races.
6.2.2 Memory Latency. As the software is in charge of configuring the DMA NoC interface to communicate, it adds latency. Tightly-coupled parallel software often leads to poor performance on massively parallel architectures. The transaction latency on such architectures is therefore critical when dealing with complex data dependency patterns that imply inter-node communications (e.g. a low-latency 6-step Fast Fourier Transform (FFT) or low-latency Convolutional Neural Network (CNN) inference). We usually implement well-known N-buffering techniques to tackle such problems by masking the memory access latencies. However, depending on the spatial and temporal memory locality, this is not always possible, and the latency of the transactions then becomes important. We model the total round-trip latency of the RDMA software/hardware engines when there is neither congestion nor user/kernel interruption. TT(B) is the time to transmit B bytes and is given by:

TT(B) = IPT + HLT + SPT + B/3.76 + CT

where IPT is the Initiator Processing Time described in Algorithm 1, SPT the Server Processing Time explained in Section 5.3, HLT the Hardware Latency Time for the NoC links/routers and micro-engine memory accesses, CT the Completion Time described in Algorithm 2, and B the number of bytes to transfer. 3.76 bytes per cycle is the effective data transfer bandwidth, considering a NoC header of 2 flits with a payload of 32 flits. Typical costs in cycles are respectively:

TT(B) = 500 + 100 + 300 + B/3.76 + 200 = 1100 + B/3.76
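As an illustration of this model (not an additional measurement), a 4 KB transfer costs about 1100 + 4096/3.76 ≈ 2190 cycles, that is roughly 4.4 µs at 500 MHz.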
Figure 3 shows the round-trip latency for different compute matrix geometries reading from or writing into the DDR(s) or one SMEM. The minimum latency is 2.2 µs. When transfer sizes exceed 10 KB, we observe a breaking point beyond which the software latency becomes negligible compared to the latency of the DMA micro-engine transfer. When there is no contention, for instance on the 1-DDR-4-CCs curve, the curve has a slope of exactly 3.76 bytes per cycle after this breaking point. Moreover, after this breaking point the latency is impacted by the bandwidth of the targeted memory (DDR(s) or the SMEM).
6.3 Network-on-Chip Scalability
A strength of NoC-based manycore processors is their ability to scale on non-interfering inter-node data transfers. Table 1 shows the internal compute matrix NoC bandwidth for different matrix geometries and stream sizes. The peak input-output throughput of the 16 CCs is given by 2 * 16 * 3.76 bytes per cycle; operating at 500 MHz, this amounts to 60.2 GB/s of peak bandwidth. Our RDMA engine reaches more than 88% of peak performance for inter-node transfers with a size of 16 KB.
Transfer Size     1 KB   4 KB   16 KB   64 KB   256 KB
1 Cluster          0.7    2.7     3.5     3.8      3.8
4 Clusters         2.7   11.0    14.2    15.2     15.4
8 Clusters         5.5   21.8    28.3    30.5     30.8
16 Clusters       10.7   42.8    56.3    60.2     60.2
Table 1: Compute Matrix NoC Bandwidth in GB/s
The communication pattern is the following: each compute cluster initiates an RDMA put to a neighbor using NoC routes that do not overlap with each other at runtime. When using 1 CC, we use the loopback feature of the NoC interface. No NoC link sharing or point of serialization occurs, apart from the sharing of the SMEMs between the DMA NoC interface reads and writes. The SMEM is a multi-bank interleaved memory of 16 banks, and each bank can sustain 8 bytes per cycle, providing a bandwidth of 64 GB/s; it is therefore not the bottleneck in our measurements.
6.4 Remote Atomics Performance
We benchmark the latency of the active message engines, as they are used for synchronization and reduction operations. Figure 4 shows the latency for different geometries, for both asynchronous (posted) and blocking calls (wait until completion). The abscissa shows the number of initiator CCs that target the 16 CCs of the MPPA® compute matrix in either spread or centralized mode (see the curve legend). Spread mode means that all initiators change their target node each time they send a request (they all target different nodes). It can be understood as a scatter mode where receivers are never overloaded (well load-balanced), and the best performance is expected. Centralized mode means that all initiators simultaneously target the same node, overloading this node with request processing. It aims at measuring the worst case of all possible active message scheduling schemes at execution.
Figure 2: RDMA Engine Throughput in GB/s (asynchronous): read (get) and write (put) bandwidth versus transfer size, for 1-DDR, 2-DDRs and 1-SMEM targets accessed by 1 to 16 CCs.
Figure 3: RDMA Engine Latency in µs (blocking): read (get) and write (put) latency versus transfer size for the same geometries; best-case round-trips of 2.3 µs (1168 cycles) for gets and 2.2 µs (1095 cycles) for puts.
Figure 4: Active Message Latency versus the number of initiator clusters, for asynchronous and blocking calls in spread and centralized modes with 1 or 16 PEs per cluster; 2.174 MIOPS per initiator (230 cycles), 0.951 MIOPS per server (526 cycles), 2.2 µs round-trip (1109 cycles).
The best-case initiator latency of a posted operation, for instance postadd, is 230 cycles (418 ns). The round-trip latency for the completion of a fetchadd operation is 1109 cycles, i.e. 2.2 µs. Many conflicts occur when all 256 PEs send requests to the same node (see the Async-Centralized-16-PEs curve with 16 clusters). In such a configuration, the N-to-N flow control generates a lot of traffic to avoid the corruption of the software job FIFOs. The implementation is able to sustain such contention, but the latency increases sharply (up to 17.5 µs in asynchronous mode).
6.5 Remote Queue Throughput
Remote queues provide elementary support for two-sided communications with small low-latency atomic messages (1-to-1 and N-to-1). Regarding the benchmark conditions, each CC has a queue into which the IO subsystem sends messages (1 outstanding message per CC from the IO subsystem's point of view). Each CC then gets this message and responds with an atomic queue message to the N-to-1 queue on the IO (a FIFO receiving atomic messages). All CCs run concurrently; therefore, we use the N-to-1 feature for data NoC packet atomicity. Table 2 shows the IOPS of one RM of the IO that receives a request from a CC and sends back a new job command to the responding CC. The benchmarks were
made using different data NoC packet sizes (in bytes), without any batching.
Packet Size         16 B   32 B   64 B   128 B   248 B
1 Cluster            675    670    600     425     350
2 to 16 Clusters     740    725    725     575     550
Table 2: Performance of the Remote Queues in Kilo IOPS
Such a communication pattern is important as it is used for offloading. We therefore show the limits of this implementation when fine-grained parallelism is required by an offloaded application whose control is done by a host processor. In our case, the IO subsystem is considered as a host processor that offloads computations onto the compute matrix (through job queue commands). For small messages, the results show that simple double-buffering (with 2 CCs) is enough to saturate the number of IOPS of one IO master core.
7 CONCLUSION & PERSPECTIVES
We present the design and illustrate the advantages of an asynchronous one-sided (AOS) communication and synchronization programming layer for the Kalray MPPA® processor. The motivation is to apply to this and related CPU-based manycore processors the established principles of the one-sided communication libraries of supercomputers, in particular the Cray SHMEM library, the PNNL ARMCI library, and the MPI-2 one-sided API subset. The main difference between these communication libraries and the proposed AOS layer is that a supercomputer has a symmetric architecture, where the compute nodes are identical and the working memory is composed of the union of the compute node local memories.
Similar to the InfiniBand low-level API, the AOS programming layer supports the Read/Write, Put/Get and Remote Atomics protocols, but is designed around the capabilities of an RDMA-capable NoC. One-sided asynchronous operations prove to be highly efficient thanks to relaxed ordering, and easier to use, as the initiator is a master on the targeted memories. Indeed, one-sided communications do not require the strict matching that send/receive operations do in the two-sided protocol. Our software implementation is capable of sustaining more than 70% of the hardware peak throughput when using the RDMA put/get engines for data transfer sizes of 8 KB and above. However, managing resource sharing, flow control, arbitration and notifications in software has limitations in terms of latency, as shown. Based on this implementation and its results, the forthcoming 3rd-generation MPPA® processor will include hardware acceleration for the key functions of these engines. This will enable reaching very low latency and close-to-peak throughput on small transactions, along with respecting the important ordering properties of the memory consistency model.
REFERENCES
[1] Dan Bonachea. 2008. GASNet Specification. Technical Report.
[2] Eric A. Brewer, Frederic T. Chong, Lok T. Liu, Shamik D. Sharma, and John D. Kubiatowicz. 1995. Remote queues: Exposing message queues for optimization and atomicity. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, 42–53.
[3] Darius Buntinas, Dhabaleswar K. Panda, and William Gropp. 2001. NIC-based atomic remote memory operations in Myrinet/GM. In Workshop on Novel Uses of System Area Networks (SAN-1).
[4] Barbara Chapman, Tony Curtis, Swaroop Pophale, Stephen Poole, Jeff Kuehn, Chuck Koelbel, and Lauren Smith. 2010. Introducing OpenSHMEM: SHMEM for the PGAS community. In Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model. ACM, 2.
[5] Silviu Ciricescu, Ray Essick, Brian Lucas, Phil May, Kent Moat, Jim Norris, Michael Schuette, and Ali Saidi. 2003. The reconfigurable streaming vector processor (RSVP™). In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 141.
[6] Benoît Dupont de Dinechin, Pierre Guironnet de Massas, Guillaume Lager, Clément Léger, Benjamin Orgogozo, Jérôme Reybert, and Thierry Strudel. 2013. A distributed run-time environment for the Kalray MPPA®-256 integrated manycore processor. Procedia Computer Science 18 (2013), 1654–1663.
[7] Karl Feind. 1995. Shared memory access (SHMEM) routines. Cray Research (1995).
[8] David Gelernter, Alexandru Nicolau, and David A. Padua. 1990. Languages and Compilers for Parallel Computing. Pitman.
[9] Alexandros V. Gerbessiotis and Seung-Yeop Lee. 2004. Remote memory access: A case for portable, efficient and library independent parallel programming. Scientific Programming 12, 3 (2004), 169–183.
[10] Sergei Gorlatch. 2004. Send-receive Considered Harmful: Myths and Realities of Message Passing. ACM Trans. Program. Lang. Syst. 26, 1 (Jan. 2004), 47–56.
[11] Peter J. Keleher, Alan L. Cox, Sandhya Dwarkadas, and Willy Zwaenepoel. 1994. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In USENIX Winter, Vol. 1994. 23–36.
[12] Jarek Nieplocha, Vinod Tipparaju, Manojkumar Krishnan, G. Santhanaraman, and Dhabaleswar K. Panda. 2003. Optimizing mechanisms for latency tolerance in remote memory access communication on clusters. In IEEE International Conference on Cluster Computing. IEEE, 138.
[13] Thierry Lepley, Pierre Paulin, and Eric Flamand. 2013. A novel compilation approach for image processing graphs on a many-core platform with explicitly managed memory. In Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems. IEEE Press, 6.
[14] Jiuxing Liu, Weihang Jiang, Pete Wyckoff, Dhabaleswar K. Panda, David Ashton, Darius Buntinas, William Gropp, and Brian Toonen. 2004. Design and Implementation of MPICH2 over InfiniBand with RDMA Support. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International. IEEE, 16.
[15] Kevin J. M. Martin, Mostafa Rizk, Martha Johanna Sepulveda, and Jean-Philippe Diguet. 2016. Notifying memories: a case-study on data-flow applications with NoC interfaces implementation. In Proceedings of the 53rd Annual Design Automation Conference. ACM, 35.
[16] John Mellor-Crummey, Laksono Adhianto, William N. Scherer III, and Guohua Jin. 2009. A New Vision for Coarray Fortran. In Proceedings of the Third Conference on Partitioned Global Address Space Programming Models (PGAS '09). Article 5, 5:1–5:9.
[17] Jarek Nieplocha and Bryan Carpenter. 1999. ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems. Parallel and Distributed Processing (1999), 533–546.
[18] Jarek Nieplocha, Vinod Tipparaju, Manojkumar Krishnan, and Dhabaleswar K. Panda. 2006. High performance remote memory access communication: The ARMCI approach. International Journal of High Performance Computing Applications 20, 2 (2006), 233–253.
[19] Robert W. Numrich and John Reid. 1998. Co-array Fortran for Parallel Programming. SIGPLAN Fortran Forum 17, 2 (Aug. 1998), 1–31. https://doi.org/10.1145/289918.289920
[20] Andreas Olofsson. 2016. Epiphany-V: A 1024 processor 64-bit RISC System-On-Chip. CoRR abs/1610.01832 (2016).
[21] Selma Saidi, Rolf Ernst, Sascha Uhrig, Henrik Theiling, and Benoît Dupont de Dinechin. 2015. The shift to multicores in real-time and safety-critical systems. In 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS 2015), Amsterdam, Netherlands, October 4-9, 2015. 220–229.
[22] Karthikeyan Vaidyanathan, Lei Chai, Wei Huang, and Dhabaleswar K. Panda. 2007. Efficient asynchronous memory copy operations on multi-core systems and I/OAT. In Cluster Computing, 2007 IEEE International Conference on. IEEE, 159–168.
[23] Anish Varghese, Bob Edwards, Gaurav Mitra, and Alistair P. Rendell. 2014. Programming the Adapteva Epiphany 64-core network-on-chip coprocessor. In Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International. IEEE, 984–992.