Asynchronous One-Sided Communications and
Synchronizations for a Clustered Manycore Processor
Julien Hascoët, Benoît Dupont de Dinechin, Pierre Guironnet de Massas and Minh Quan Ho
Kalray, Montbonnot-Saint-Martin, France
{jhascoet,benoit.dinechin,pgmassas,mqho}@kalray.eu
ABSTRACT
Clustered manycore architectures fitted with a Network-on-Chip (NoC) and scratchpad memories enable highly energy-efficient and time-predictable implementations. However, porting applications to such processors represents a programming challenge. Inspired by supercomputer one-sided communication libraries and by the OpenCL async_work_group_copy primitives, we propose a simple programming layer for communication and synchronization on clustered manycore architectures. We discuss the design and implementation of this layer on the 2nd-generation Kalray MPPA processor, where it is available from both OpenCL and POSIX C/C++ multithreaded programming models. Our measurements show that it allows reaching up to 94% of the theoretical hardware throughput with a best-case round-trip latency of 2.2 µs when operating at 500 MHz.
ACM Reference Format:
Julien Hascoët, Benoît Dupont de Dinechin, Pierre Guironnet de Massas
and Minh Quan Ho. 2017. Asynchronous One-Sided Communications and
Synchronizations for a Clustered Manycore Processor. In Proceedings of
ESTIMedia’17, Seoul, Republic of Korea, October 15–20, 2017, 10 pages.
https://doi.org/10.1145/3139315.3139318
1 INTRODUCTION
Processors integrating up to hundreds of cores need to introduce locality in the memory hierarchy in order to reach their performance and efficiency targets. For GPGPUs, this is the local memory of the Compute Units (OpenCL) or the Streaming Multiprocessors (CUDA). For CPU-based manycore processors like the Adapteva Epiphany 64 [23] and the Epiphany-V [20], this is the scratchpad memory attached to each core. For the Kalray Multi-Purpose Processor Array (MPPA)®-256 [21] processor, this is the local memory shared by the 16 application cores of each compute cluster.
We describe the design and implementation of an asynchronous one-sided (AOS) communication and synchronization programming layer for the 2nd-generation Kalray MPPA® processor, which is able to efficiently exploit the network-on-chip and the cluster local or processor external memories. This layer presents the platform as a collection of execution domains, which are composed of cores and their directly addressable memory. Domains publish parts of their memory as segments, which can then be accessed by remote
cores using RDMA Put/Get operations, and synchronized by remote atomic operations (e.g. fetch-and-add). In addition to these window-like segments, queue-like segments are also available with N-to-1 atomic enqueue and local dequeue operations.
The AOS communication and synchronization operations layer is deployed in production to support application programming of the MPPA® processor for a range of low-level and high-level programming environments such as OpenCL, OpenMP, OpenVX, and static/dynamic dataflow execution models. The AOS layer is also used by Kalray's optimized application libraries and is targeted by code generators, such as those for CNN inference.
The paper is organized as follows. Section 2 motivates our approach in relation to related work, which mostly targets high-performance computing (HPC) systems. In Section 3, we present the high-level principles of the AOS programming layer. Section 4 introduces the MPPA® processor features relevant to this contribution. Section 5 presents the design, algorithms and implementation of the AOS programming layer on the Kalray MPPA® processor. Section 6 provides detailed results, performance analysis and the limitations of the implementation.
2 RELATED WORK AND MOTIVATIONS
2.1 Remote Direct Memory Access
Remote Direct Memory Access (RDMA) is an integral part of modern communication technologies that can be characterized by: OS-bypass, zero-copy, one-sided communications and asynchronous operations. OS-bypass allows direct interaction between the application and a virtualized instance of the network hardware, without involving the operating system. Zero-copy allows a system to place transferred data directly at its final memory location, based on information included in the RDMA operations. One-sided operations allow a communication to complete without the involvement of the application thread on the remote side. Asynchronous operations are used to decouple the initiation of a communication from its progress and subsequent completion, in order to allow communication to be overlapped with computation.
These communication technologies mostly apply at the backplane and system levels in data centers and supercomputers:
- Inside a compute node, between cores and other bus masters, HT (HyperTransport), QPI (QuickPath Interconnect) and PCIe support load/store as well as DMA (Direct Memory Access) operations.
- Between compute nodes across a backplane or a chassis, sRIO (serial RapidIO) and DCB (Data Center Bridging) Ethernet variants such as RoCE (RDMA over Converged Ethernet) mostly support RDMA operations.
- At the system level, between compute racks and the storage boxes, InfiniBand or Ethernet also support RDMA operations.
- Between systems, IP networks support the BSD sockets and client/server operations.
Motivated by the success of RDMA in high-performance systems, the design objective for the asynchronous one-sided (AOS) operations programming layer is to adapt the RDMA principles to the architecture of CPU-based manycore processors.
2.2 High-Performance Interconnects
The InfiniBand technology designed by Mellanox is widely deployed in high-performance systems and datacenters. It natively supports Remote Direct Memory Access (RDMA) put-get, Send/Receive read-write and atomic operations, with a focus on high throughput with low latency. Based on the earlier VIA (Virtual Interface Architecture), the InfiniBand specification only lists Verbs, that is, functions that must exist but whose syntax is left to vendors.
After vendors created separate Verbs APIs, these coalesced into the OpenFabrics Alliance (OFA) Verbs. OFA Verbs has support for: two-sided and one-sided operations, always asynchronous; reliable and unreliable modes, connection-oriented and connection-less; remote direct memory access, send and receive; and atomic operations on remote memory regions. To allow direct access to endpoint memory, this virtual memory must be pinned in physical memory and registered into the network interface I/O MMU. OFA Verbs offer cross-platform support across InfiniBand on IB networks, iWARP on IP networks and RoCE on Ethernet fabrics.
iWARP uses the IETF-defined RDDP (Remote Direct Data Placement) to deliver RDMA services over standard, unmodified IP networks and standard TCP/IP Ethernet services. Enhancements to the Ethernet data link layer enabled the application of advanced RDMA services over IEEE Data Center Bridging (DCB), that is, lossless Ethernet. In early 2010, this technology, now known as RDMA over Converged Ethernet (RoCE), was standardized by the IBTA (InfiniBand Trade Association). RoCE utilizes advances in Ethernet (DCB) to eliminate the need to modify and optimize iWARP to run efficiently over DCB. RoCE focuses on server-to-server and server-to-storage networks, delivering the lowest latency and jitter characteristics and enabling simpler software and hardware implementations. RoCE supports the OFA Verbs interface seamlessly.
The GPUDirect specification was developed jointly by Mellanox and NVIDIA. It is composed of a new interface (API) within the Tesla GPU driver, a new interface within the Mellanox InfiniBand drivers, and a Linux kernel modification to support direct communication between drivers. GPUDirect allows RDMA-capable devices to directly access GPU device memory, so data can be transferred directly between two GPUs without buffering in host memory. GPUDirect Verbs provide extended memory registration functions to support GPU buffers and GPU memory de-allocation call-backs for efficient MPI implementations.
The Intel® Omni-Path technology competes with InfiniBand, with the advantage that the interfaces can be integrated in the Intel® processors themselves. It can be used through the OpenFabrics library, which has an implementation of the InfiniBand Verbs API as standardized by the OpenFabrics Alliance (OFA).
2.3 HPC Communication Systems
Today's high performance computing (HPC) programming models are based on Single Program Multiple Data (SPMD) execution, where a single program executable is spawned on N processing nodes. There is one process per node and each process is assigned a unique rank ∈ [0, N-1]. The main HPC programming model is the Message Passing Interface (MPI), which combines SPMD execution, explicit send/receive of data buffers, and collective operations.
Whereas most HPC applications still rely on message-passing semantics using classic MPI, the underlying communication systems evolved decades ago to rely on one-sided communication semantics, starting with the Cray SHMEM library [7][9]. The rise of PGAS languages like Co-Array Fortran [19][16], UPC, and of Global Arrays motivated the development of one-sided communication layers, notably GASNet from Berkeley and ARMCI from PNNL. PGAS languages and GA combine SPMD execution, one-sided communications, and collective operations. The MPI standard introduced one-sided communications in MPI-2, which have been reworked and can be combined with split-phase synchronization in MPI-3.
The Cray SHMEM (SHared MEMory) library [4] was initially introduced by Cray Research for low-level programming on the Cray T3D and T3E massively parallel processors [7]. This library defines symmetric variables as those with same size, type, and address relative to processor local address space, and these naturally appear as a by-product of SPMD execution. Dynamic memory allocation of symmetric variables is supported with a shmalloc() operation. Static data, and heap data obtained through this symmetric allocator, are implicitly registered. Thanks to the symmetric variables, it is possible to define the interface of one-sided operations such as put and get by referring to local objects only. Besides put and get variants, the SHMEM library supports remote atomic operations, and collective operations. The SHMEM library motivated the design of the F−− language [8], one of the first Partitioned Global Address Space (PGAS) languages, which evolved into Co-Array Fortran.
The Aggregate Remote Memory Copy Interface (ARMCI) [17] was designed as an improvement over Cray SHMEM and IBM LAPI (IBM SP), and is used as the base of the Global Arrays toolkit. The API is structured in three classes of operations:
- data transfer operations: put; get; accumulate.
- synchronization operations: atomic read-modify-write; lock/mutex.
- utility operations: memory allocation / deallocation; local and global fence; error handling.
The Berkeley Global Address Space Networking (GASNet) library [1] is designed as a compiler run-time library for the PGAS languages UPC and Titanium. The GASNet library is structured with a core API and an extended API. The core API includes memory registration primitives, and is otherwise based on the active message paradigm. Active message request handlers must be attached to each instance of the SPMD program by calling a collective operation gasnet_attach(). The extended API is meant primarily as a low-level compilation target, and can be implemented either with only the core API, or by leveraging higher-level primitives of the network interface cards. The extended API includes put, get, and remote memset() operations. Data transfers are non-blocking, and the synchronization barrier is split-phase.
However useful, classic HPC communication layers cannot be effectively applied to manycore processors with local memories. The first problem is that the memory capacity locally available to a core is several GB on HPC systems, while it is tens or hundreds of KB on manycore processors. The second problem with HPC communication libraries is that they assume a symmetric memory hierarchy, where the total memory is the union of the compute node memories. Manycore processors not only have (on-chip) local memories, but also one or more external DDR memory systems. Finally, a network-on-chip interface is much less capable than a macro network interface, but has significantly lower latencies.
2.4 OpenCL Asynchronous Copies
OpenCL structures a platform into a Host connected to Compute Devices. Each Compute Device has a Global Memory, which is shared by Compute Units. Each Compute Unit has a Local Memory, a cache for the Global Memory, and Processing Elements that share the Local and the Global memories. Each Processing Element has registers and a Private Memory. Computations are dispatched from the Host to the Compute Units as Work Groups. A Work Group is composed of Work Items, which are instances of a computation kernel written in the OpenCL-C dialect. This dialect includes vector data types and requires tagging memory objects with their address space: __global, __local, __private, and __constant.
For CPU-based clustered manycore processors, the main shortcoming of OpenCL is the inability to support efficient communication between the Local Memories and synchronization between the Compute Units. However, this capability is essential to the efficient implementation of image processing, CNN inference and other algorithms where tiling is applicable. Moreover, OpenCL was originally designed for the GPGPU manycore architecture, where context switching at the cycle level is exploited to cover memory access stalls with useful computations. On a DSP or CPU-based manycore architecture, there is no such cycle-level switching. As a result, high performance requires that programmers manually build processing pipelines with the OpenCL asynchronous prefetch or copy operations between the Global Memory and the Local Memory.
Specically, OpenCL denes the
async_work_group_copy
and
async_work_group_strided_copy
operations. These asynchro-
nous copy operations enable data to be copied asynchronously
between the device Global Memory and the Compute Unit Local
memory. The main limitation of these operations is that data must
to be read/written contiguously from/into the local memory (dense
mode), an assumption that turns out to be overly restrictive.
For instance, in image processing tiling decomposition, one may
need to copy a 2D sub-image of 16
×
16 pixels to a larger local buer,
allocated at 18
×
18 pixels to deal with halo pixels. In this case, one
must explicitly manage a local stride of two pixels between each
data block, since the local buer is sparse and data should not be
written contiguously to it. This restriction of OpenCL asynchro-
nous copy operations is even more apparent when local buers are
declared as true multi-dimensional arrays as supported by the C99
and OpenCL-C standards. The use of true multi-dimensional arrays
particularly eases 2D/3D stencil programming.
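To make this restriction concrete, the following OpenCL-C sketch fills the interior of an 18×18 local tile (reserving a one-pixel halo border) from a 16×16 region of a global image. The kernel and argument names are hypothetical; the point is that, with the dense-only async_work_group_copy, each row must be issued as a separate copy and the local stride must be handled by hand.

```c
/* Hypothetical kernel: copy a 16x16 sub-image into the interior of an
 * 18x18 __local tile. async_work_group_copy is dense-only, so the 2-pixel
 * local stride between rows is handled manually, one copy per row. */
__kernel void load_tile(__global const uchar *image, int img_width,
                        int tile_x, int tile_y)
{
    __local uchar tile[18][18];

    /* First row creates the event; the other rows are chained on it. */
    event_t evt = async_work_group_copy(&tile[1][1],
                                        image + tile_y * img_width + tile_x,
                                        16, 0);
    for (int row = 1; row < 16; ++row) {
        /* Destination starts at column 1 of row (row + 1): skip the halo. */
        evt = async_work_group_copy(&tile[row + 1][1],
                                    image + (tile_y + row) * img_width + tile_x,
                                    16,   /* 16 contiguous pixels */
                                    evt); /* share one event for all rows */
    }
    wait_group_events(1, &evt);
    barrier(CLK_LOCAL_MEM_FENCE);
    /* ... stencil computation on tile[][] would follow here ... */
}
```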
Nevertheless, the asynchronous copy operations of OpenCL have proved highly useful in order to exploit the DMA engines available on FPGAs, DSPs or clustered manycore processors like the MPPA®. As the OpenCL standard allows vendor extensions for efficient usage of the target hardware, asynchronous 2D copy has been implemented in the OpenCL runtime of STHORM P2012 [13]. Other companies such as Xilinx®, Altera®, Adapteva®, AMD® and Intel® also provide similar OpenCL extensions on their processors.
3 PRINCIPLES OF THE AOS LAYER
AOS operations operate on memory segments, which are represented by opaque communication objects supporting one or several of the communication protocols presented below: Load/Store, Put/Get, Remote Queues, and Remote Atomics. Memory segment objects are created in an address space and published by cores that have direct access to that memory space. Using a global ID supplied upon segment creation, cores operating in another address space can clone the segment and obtain the object that locally represents this (remote) memory segment.
3.1 Load/Store: Memory-to-Register
Load/Store is the simplest method to transfer data. Any memory access that has the correct Memory Management Unit (MMU) mapping between virtual and physical addresses allows the processor to access a memory segment through the cache hierarchy. Manycore architectures either support transparent access to the global memory through hardware caches, or via MMU-based distributed shared memory systems like TreadMarks [11].
However, on large-scale parallel systems, Load/Store through multiple levels of cache hierarchy can be a serious performance bottleneck when there is data sharing between multiple Non-Uniform-Memory-Access (NUMA) nodes, as it does not scale easily. In particular, reductions and inter-core/node communications are complex to implement efficiently.
3.2 Two-sided: Remote Queues
Classic two-sided send/receive operations are known to have a significant overhead due to synchronizations between the sender and receiver nodes, and to the need to use temporary buffers as opposed to zero-copy communication. In addition, real-life implementations present significant correctness challenges [10]. The two-sided send/receive operations are nevertheless the main communication primitives proposed by MPI [14].
As a primitive two-sided protocol, we select the remote queue operations described in [2], as they avoid the problems of classic message passing. First, they can be implemented as simple message queues that are proven to be efficient for fine-grained control and coordination of distributed computations. Moreover, remote queues also apply to N-to-1 communication whenever atomicity of enqueue and dequeue operations can be ensured. This N-to-1 capability is essential on massively parallel systems that require run-time orchestration of activities (e.g. the master/worker parallel pattern). Finally, remote queues enable efficient coding, as they provide synchronization without introducing any locking mechanism.
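As an illustration, the sketch below shows how such remote queues might support master/worker orchestration. The aos_queue_* names and signatures are hypothetical, since the queue API is not spelled out here; only the N-to-1 atomic enqueue and local dequeue semantics come from the text above.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical remote-queue API; names and signatures are illustrative. */
typedef struct aos_queue aos_queue_t;
int aos_queue_enqueue(aos_queue_t *remote_queue,
                      const void *msg, size_t size);   /* N-to-1, atomic  */
int aos_queue_dequeue(aos_queue_t *local_queue,
                      void *msg, size_t size);         /* local, blocking */

typedef struct {
    uint32_t worker_id;   /* which cluster finished  */
    uint32_t job_id;      /* which job it completed  */
} job_done_t;

/* Worker side: report completion without any lock or handshake. */
static void report_done(aos_queue_t *master_queue, uint32_t me, uint32_t job)
{
    job_done_t rec = { .worker_id = me, .job_id = job };
    aos_queue_enqueue(master_queue, &rec, sizeof rec);
}

/* Master side: completions from N workers arrive in any order. */
static void collect_done(aos_queue_t *my_queue, unsigned n_workers)
{
    for (unsigned i = 0; i < n_workers; ++i) {
        job_done_t rec;
        aos_queue_dequeue(my_queue, &rec, sizeof rec);
        /* ... dispatch the next job to rec.worker_id ... */
    }
}
```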
3.3 One-sided Put/Get: RDMA
RDMA is a one-sided communication protocol inspired from [18], [22] and [12]. Any communication initiator registered to a memory
segment is a master on this memory. An RDMA transfer can be initiated over a target memory segment using either the put or get primitives. Like their HPC counterparts, these RDMA protocols have several advantages: they provide zero-copy memory-to-memory operation (no buffering), and they do not require any synchronization before initiating the transfers, as the initiator is a master on the targeted memory segment. When several initiators target the same memory segment at the same time, no software serialization is performed; the only point of serialization is the hardware bus width.
The RDMA operations favor high throughput over low latency, and typically assume a relaxed memory consistency model. A relaxed memory consistency model allows operations to execute asynchronously with out-of-order global completion. Like Load/Store, RDMA is one-sided, which makes it usually easier to use than the classic Send/Receive protocol, where a large overhead exists because of the strict matching of the Send and Receive operations (which requires synchronizing all the time).
3.4 Remote Atomics: Active Messages
Remote atomic operations have been proven to be efficient for inter-node synchronization [15], [3]. A remote atomic operation consists of an initiator sending a message that carries an operation and its operands to a remote or external compute resource. This compute resource then executes the operation, possibly under certain conditions, and forwards the completion to the initiator.
Remote atomic operations are based on active messages. An active message is a low-overhead message that executes as a call without a return value, using the message payload as arguments. In our implementation, the active message is executed either after passing certain user-defined conditions or immediately for low latency. Remote atomic operations are a ubiquitous mechanism for efficient parallel programming. For instance, they provide classical remote atomic instructions such as fetch-and-add, compare-and-swap and load-and-clear. When applicable, we also define posted variants of these operations (without a returned value). Remote atomics implemented with active messages allow efficient and elegant synchronization mechanisms.
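For instance, a lock-free completion counter can be built directly on these operations. The sketch below is a hedged illustration: the aos_postadd and aos_fetchclear names mirror the operation set listed later in Section 5.1, but their signatures are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical remote-atomic subset of the AOS API (illustrative only). */
typedef struct aos_segment aos_segment_t;
int aos_postadd(aos_segment_t *seg, size_t offset,
                int64_t value);                /* posted, no reply          */
int aos_fetchclear(aos_segment_t *seg, size_t offset,
                   int64_t *old_value);        /* active-message round-trip */

/* Worker: signal arrival on a counter published in the master's segment.
 * A posted add returns as soon as the active message has been sent. */
void signal_arrival(aos_segment_t *master_seg, size_t counter_offset)
{
    aos_postadd(master_seg, counter_offset, 1);
}

/* Master: atomically read and clear its own counter to count arrivals. */
int64_t drain_arrivals(aos_segment_t *my_seg, size_t counter_offset)
{
    int64_t arrived = 0;
    aos_fetchclear(my_seg, counter_offset, &arrived);
    return arrived;
}
```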
4 MPPA® PROCESSOR ARCHITECTURE
4.1 Architecture Overview
The Kalray MPPA® (Multi-Purpose Processing Array) manycore architecture is designed to achieve high energy efficiency and deterministic response times for compute-intensive embedded applications [21].
Figure 1: MPPA® Processor
The MPPA® processor (Fig. 1), code-named Bostan, integrates 256 VLIW application cores and 32 VLIW management cores (288 cores in total) which can operate from 400 MHz to 600 MHz on a single chip, and delivers more than 691.2 GFLOPS single-precision for a typical power consumption of 12 W. The 288 cores of the MPPA® processor are grouped into 16 Compute Clusters (CC), and the processor implements two Input/Output Subsystems (IO) to communicate with the external world through high-speed interfaces, namely PCIe Gen3 and 10 Gbit/s Ethernet.
4.2 Computing Resources
4.2.1 Input/Output Subsystems. The IO subsystems each integrate two quad-cores based on the VLIW architecture described in Section 4.2.3, and are connected to 4 GB of external DDR3 memory and a 4 MB on-chip Shared Memory (SMEM). Regarding core memory accesses, cached and uncached accesses can be performed for both Load and Store operations (64 bits/cycle) in the SMEM and DDR3. For the shared memory, cached and uncached atomics are available, such as Load-and-Clear, Fetch-and-Add and Compare-and-Swap (CAS). Cached atomic operations provide execution efficiency when dealing with critical algorithm paths that need mutual exclusion or atomic updates of variables. Each IO embeds 8 high-speed IO interfaces, usually called Direct Memory Access (DMA) interfaces, to communicate through PCI Express First-In-First-Out queues (FIFOs), Ethernet, DDR3 and SMEM. The software is in charge of maintaining the memory coherence between DMA reads/writes and the cores.
4.2.2 Compute Clusters. Each compute cluster embeds 17 cores: 16 Processing Elements (PE) and one Resource Manager (RM). Compute clusters integrate a multi-banked private local SMEM of 2 MB. Core memory accesses are supported only in this SMEM, and only uncached atomics are available (the same atomics as on the IO). Each compute cluster has one DMA interface for communicating with external nodes. Here too, the software is in charge of maintaining the memory coherence between DMA reads/writes and the cores.
4.2.3 Core. Each MPPA® core implements a 32-bit VLIW architecture which issues up to 5 instructions per cycle, corresponding to the following execution units: branch & control unit (BCU), ALU0, ALU1, load-store unit (LSU), and multiply-accumulate unit (MAU) combined with a floating-point unit (FPU). Each ALU is capable of 32-bit scalar or 16-bit SIMD operations, and the two can be coupled for 64-bit operations. The MAU performs 32-bit multiplications with a 64-bit accumulator and supports 16-bit SIMD operations. Finally, the FPU supports one double-precision fused multiply-add (FMA) operation per cycle, or two single-precision operations per cycle.
4.3 Communication Resources
4.3.1 Network-on-Chip. The 18 multi-core CPUs of the MPPA® are interconnected by a 32-bit full-duplex NoC. The NoC implements wormhole switching with source routing and supports guaranteed services through the configuration of flow injection parameters at the NoC interface: the maximum rate σ, the maximum burstiness ρ, and the minimum and maximum packet sizes (the size unit is the flit). A flit is 32 bits (4 bytes), meaning a bandwidth of 2 GB/s per link direction when operating at 500 MHz. The NoC is a direct network with a 2D torus topology. This network does not support Load/Store, but only data NoC streams and low-latency control NoC messages. Thus the software is in charge of converting virtual memory addresses into a data stream (data NoC) and of converting
this stream back into virtual addresses in the remote memory, in order to initiate communications between any of the multi-core CPUs.
4.3.2 Control NoC Interface. The control NoC is made for very low-latency communication with 64-bit messages only. It does not have access to the memories (on-chip or off-chip); the messages are mapped onto DMA interface registers. Each DMA interface implements 128 64-bit control NoC Rx mailboxes and 4 Tx resources. These mailboxes can be used for barriers and simple 64-bit messages with a notification to a list of processors (up to the 17 cores of a CC). In this paper, the barrier mode is mainly used for generic inter-core low-latency synchronization and notification: for instance, forcing a remote core or a pool of remote cores out of the idle state costs the initiating core a single clock cycle, since a store in the peripheral space is a posted operation. The 4 Tx resources must be shared between the cores of the multi-core CPU. A NoC route and a remote control NoC Rx mailbox identification number (called a tag, in range [0, 127]) must be configured to send a 64-bit message through each control NoC Tx resource.
4.3.3 Data NoC Interface. The data NoC is made for high throughput. It is therefore a very asynchronous hardware block that must be handled asynchronously by the software: all outstanding incoming and outgoing transactions must be managed asynchronously by the software for performance. Each data NoC DMA interface is composed of three elements:
- Eight micro-cores are available for each DMA interface and can run concurrently. A micro-core is a micro-programmable DMA core that needs to be programmed and configured. It has a simple set of instructions such as reads, local and remote notifications for local and remote completions, and additional support for the arithmetic of internal read pointers. It can execute up to 4 nested loops to describe custom memory access patterns with high throughput. This throughput is limited by the technology of the memory from which the micro-core is reading, the NoC link size (4 bytes/cycle) and the memory access patterns.
- The data NoC implements 256 Rx tags (range [0, 255]) per DMA interface to write incoming data NoC packets into the scratchpad memory of compute clusters or into the DDR memory of the IOs. Each Rx tag has a write window described by a base address, a size and a write pointer that need to be configured and managed at runtime. The completion of an incoming data transfer is given by an end-of-transfer (Eot) command which increments a 2^16-bit notification counter corresponding to the used Rx tag in the DMA interface of the MPPA® network.
- Each DMA interface implements 8 packet-shapers. A packet-shaper (DMA Tx) is a hardware unit that builds data NoC packets using data coming from a PE or a micro-core. The packet-shaper then sends these NoC packets into the MPPA® NoC using the configured NoC route. All NoC routes and Quality-of-Service (QoS) injection parameters need to be configured by software.
4.4 Memory Architecture
4.4.1 Hierarchy. Besides the register file, the MPPA® exposes a three-level memory hierarchy. First, the L1 is the data cache allowing transparent cached accesses to the L2. Second, the L2 is called the local shared memory, which is an on-chip high-bandwidth and low-latency scratchpad memory. The L3 is the main global memory, which uses DDR3 technology. The L2 memory of the MPPA can be configured either to cache the L3 memory (software emulation of an L2 cache using the MMU, inspired from [11], like in conventional cache-based systems), or as user buffers managed explicitly by DMA transfers. The L3 can also be accessed by the IO DMA interfaces, or through the IO core L1 data cache by Load/Store. Finally, on compute clusters, L1 caches are not coherent between cores and DMA interface writes; thus, memory coherency is managed by software (full memory barriers, partial memory barriers or uncached memory accesses are used).
4.4.2 Memory Map. The hardware exposes a heterogeneous memory map of 20 address spaces (2 per IO and 1 per CC). MPPA® processors implement a distributed memory architecture, with one local memory per cluster. I/O cores access their local SMEM and private DDR via Load/Store and via DMA interfaces. Compute clusters can also access their local SMEM (but not the DDR) via Load/Store and via their DMA interface. The DMA interface must be used to build up NoC packets and send them onto the NoC in order to communicate between the 20 available address spaces.
5 DESIGN AND IMPLEMENTATION OF THE ASYNCHRONOUS ONE-SIDED OPERATIONS
Asynchronous one-sided communications have been proven to be efficient for HPC workloads by overlapping communication with computation while preserving fundamental ordering properties. However, enabling such a feature on a heterogeneous distributed local memory architecture like the MPPA is a challenge.
First, the runtime system needs to deal with hardware resource sharing (memories, DMA Rx tags, DMA Tx packet-shapers and DMA micro-cores) and with the dynamic management of the outstanding hardware resources in use (complex hardware-software interactions). This must be done for both local and remote resources in a massively parallel environment. Second, several abstraction features need to be provided to the user application, such as QoS configuration, synchronization, and bindings at the creation of communication segments for any protocol, without requiring the user to be aware of the NoC topology for all on/off-chip memories. This abstraction is also provided for initiators that are registered to a segment. Third, as learned from [6], the abstracted one-sided protocols should not be limited by the number of physical hardware resources. In fact, these hardware resources should be translated (virtualized) into different kinds of "software" components such as memory segment management, RDMA emulation, remote queues and automatic flow control, without losing the performance of the hardware. This frees the user from managing physical hardware resources and software job FIFO congestion control, which is often a complex issue and a source of errors. Fourth, the RDMA engines must expose fundamental ordering properties regarding outstanding transactions and remote atomic operations, as well as maintain memory coherence and consistency (memory ordering) at synchronization points. Finally, it needs to implement an all-to-all
initiator-to-server flow-control mechanism for the remote atomic operations, to avoid data and request corruption when congestion occurs. Such constraints make our new software-emulated one-sided communication engine very complicated to develop, debug and validate while still reaching decent performance (approaching the theoretical maximum hardware throughput).
5.1 Asynchronous Active Message and RDMA Put & Get Engine
The active message and RDMA engine is usually on the critical path of data-intensive applications. All data transfers are managed by this engine, which runs on the PEs; therefore, it needs to be efficient, thread-safe and user-friendly (e.g. management of the maximum number of outstanding jobs with flow-control). RDMA and active message transactions operate on a window, which is a one-sided memory object well described in the one-sided MPI standard. Initiating a one-sided operation consists in parsing the targeted segment parameters (Line 2 in Algorithm 1), such as the supported segment protocols (RDMA or remote atomics in our case), the NoC route and the destination DMA Rx tag, and checking that the read or write transaction is not out of bounds of the targeted remote window. The transaction is then prepared by the initiator core, which atomically takes a Slot on the targeted segment. An N-to-N flow-control mechanism has been implemented for both remote memory transactions ("inter-node") and local memory transactions ("intra-node") (done_slot, Line 9 in Algorithm 1), which provides a backpressure mechanism when the hardware and low-level software are under congestion. In steady state, the core then sends the request either onto the NoC or into the shared local memory. The completion ticket of the initiated transaction is computed and set into an Event, which can be waited on later using Algorithm 2. Indeed, when an event is given by the user to the engine, the engine initiates the transaction and returns immediately (the request is outstanding). If no event is given to the engine, it waits until the transaction's completion.
The one-sided user API functions take as arguments: a local virtual address that is read or written, a remote target segment, transaction parameters (e.g. operation type, size, stride, geometry), and an event for the completion of the initiated transaction. The RDMA functions are exposed to the user as put and get transactions, and the remote atomic operations as postadd, poke, fetchclear, fetchadd and peek. The user is also provided with a fence operation for the remote completion of the outstanding RDMA transactions of a targeted memory segment.
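A hedged sketch of what these user-level entry points could look like in C is given below. The names follow the operations just listed, but the exact types, argument order and geometry descriptor are assumptions, not the shipped Kalray API.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative AOS user API; the actual types and signatures may differ. */
typedef struct aos_segment aos_segment_t;
typedef struct aos_event   aos_event_t;

/* Transfer geometry: dense when both strides are zero (assumed encoding). */
typedef struct {
    size_t size;           /* bytes per contiguous block    */
    size_t count;          /* number of blocks              */
    size_t local_stride;   /* gap between blocks locally    */
    size_t remote_stride;  /* gap between blocks remotely   */
} aos_geometry_t;

/* RDMA: asynchronous when 'event' is non-NULL, blocking otherwise. */
int aos_put(const void *local_src, aos_segment_t *seg, size_t remote_offset,
            const aos_geometry_t *geom, aos_event_t *event);
int aos_get(void *local_dst, aos_segment_t *seg, size_t remote_offset,
            const aos_geometry_t *geom, aos_event_t *event);

/* Remote completion of all outstanding RDMA writes on 'seg'. */
int aos_fence(aos_segment_t *seg, aos_event_t *event);

/* Remote atomics: posted and fetching variants. */
int aos_postadd(aos_segment_t *seg, size_t offset, int64_t value);
int aos_fetchadd(aos_segment_t *seg, size_t offset, int64_t value,
                 int64_t *old_value, aos_event_t *event);

/* Wait for the completion of a previously initiated transaction. */
int aos_event_wait(aos_event_t *event);
```

A typical use initiates a put with an event, overlaps computation, and only waits on the event before the local buffer is reused or the remote data must be visible.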
5.2 Completion Event-based Engine
The completion event-based engine described by Algorithm 2 is made for very low-latency event completion. An event is a condition that has only two states (true or false) for the user. The event has a memory Address whose content is monitored and compared with a Value using a simple condition (e.g. equal, greater, less). Depending on the nature of the associated transaction, the event state can become true by getting pending hardware events (Line 10 in Algorithm 2) and accumulating them into the content of Address. This sequence is required to prevent the DMA Rx end-of-transfer (Eot) counter from saturating. However, this sequence needs to be atomic, so we use atomic uncached instructions and we notify all other remote
Algorithm 1 Concurrent RDMA or Active Message Initiator Engine Finite-State-Machine (FSM) (parallel code)
1: Load the target parameters of the targeted segment
2: if not a valid remote segment transaction then
3:     Cancel transaction and return failure /* Exit */
4: end if
5: Prepare transaction (RDMA or Active Message)
6: Write memory barrier (for outstanding user stores)
7: Slot = atomic fetch-add uncached on slot[target] counter
8: /* Transaction flow-control (outstanding) */
9: while (Slot + 1) >= (done_slot[target] + FIFO_SIZE) do
10:     Idle core // or OS yield
11: end while
12: if is Remote NoC Transaction then
13:     /* Packet shaper is locked (atomic with taken Slot) */
14:     PE: Configure Tx packet shaper (route)
15:     PE: Push NoC transaction and remote notify (ordered)
16: else
17:     /* Write job in shared memory (parallel) */
18:     Write job at (Slot + 1) % FIFO_SIZE (ordered)
19:     Write memory barrier
20:     Send notification (inter-PE event)
21: end if
22: Set Event for completion ticket (Slot + 1)
23: if Blocking transaction then
24:     wait for Event to occur (see Algorithm 2)
25: end if
processors of the local node when the content of Address is updated in the memory hierarchy. This broadcast notify operation (Line 15 in Algorithm 2) is done using a low-latency control NoC Rx mailbox in barrier mode, which amounts to a simple posted store in the peripheral space for the processor. For both low-latency and high-throughput event management, this engine does not rely on any interrupt mechanism, which avoids trashing the instruction/data caches when switching to interrupt handlers, suffering from interrupt noise and interrupt handler control multiplexing, and paying the overhead of context switching.
However, for more generic operating systems (Linux or an RTOS), this engine could use preemptive and cooperative multi-threading (Line 18 in Algorithm 2). Nevertheless, in high-performance environments, MPPA's operating systems are based on a simple run-to-completion multi-threading model in the compute matrix. This engine also provides a non-blocking user implementation for testing whether or not the event is true.
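A minimal C rendering of this polling loop (Algorithm 2) is sketched below. The low-level helpers wrapping the uncached load, the load-and-clear of the Eot counter, the memory barriers and the broadcast notification are hypothetical names standing in for MPPA-specific instructions; the sketch only illustrates the control flow, not the actual runtime code.

```c
#include <stdint.h>

/* Hypothetical helpers wrapping MPPA-specific instructions and registers. */
extern int64_t load_uncached(volatile int64_t *addr);
extern int64_t load_and_clear_eot_counter(unsigned rx_tag);
extern void    read_memory_barrier(void);
extern void    write_memory_barrier(void);
extern void    broadcast_notify_all_pes(void);
extern void    idle_core(void);
extern void    atomic_fetch_add_uncached(volatile int64_t *addr, int64_t v);

typedef struct {
    volatile int64_t *address;  /* monitored completion counter */
    int64_t           value;    /* condition threshold          */
    unsigned          rx_tag;   /* DMA Rx tag feeding it        */
} aos_event_t;

/* Spin until the event condition holds (a ">=" condition is assumed here). */
void aos_event_wait(aos_event_t *ev)
{
    for (;;) {
        if (load_uncached(ev->address) >= ev->value) {
            read_memory_barrier();          /* core/DMA coherence */
            return;
        }
        /* Drain pending end-of-transfer notifications before they saturate. */
        int64_t eot = load_and_clear_eot_counter(ev->rx_tag);
        if (eot != 0) {
            atomic_fetch_add_uncached(ev->address, eot);
            write_memory_barrier();
            broadcast_notify_all_pes();     /* force other PEs to re-evaluate */
        }
        idle_core();                        /* or OS yield */
    }
}
```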
5.3 RDMA and Active Message Job Arbiters
The RDMA and active message job arbiters are high-performance FSMs that run on each node of the MPPA's network. These FSMs serve requests from Algorithm 1. They rely on round-robin arbitration and are triggered on events (no interrupts) sent by the DMA NoC interface or by inter-PE events. These arbiters process requests coming from the NoC or from other intra-node PEs. This incoming request sequence is described by Algorithm 1 for both RDMA and active message jobs. The RDMA job arbiter manages asynchronously the execution of the outstanding micro-programmable DMA cores. It selects an available DMA core, configures the NoC route, writes
Algorithm 2 Lock-Free Concurrent Wait Event FSM (parallel code)
1: Value = load Event condition value
2: Address = load Event address to check
3: while true do
4:     Test_Value = uncached load at Address to evaluate
5:     if Evaluate condition Value with Test_Value then
6:         Read memory barrier /* Core & DMA coherence */
7:         Return success /* Exit */
8:     end if
9:     if Pending hardware DMA events then
10:         Eot = atomic load-and-clear on end-of-transfer notifications counter on Rx Tag
11:         if Eot then
12:             Atomic fetch-add uncached of Eot at Address
13:             Write memory barrier
14:             /* Force PEs to re-evaluate conditions */
15:             Broadcast notify to all PEs
16:         end if
17:     end if
18:     Idle core // or OS yield possible
19: end while
the DMA core arguments, starts the DMA core and updates the completion job ticket. The active message arbiter is simpler, as it does not need to manage outstanding (asynchronous) jobs. Indeed, an active message is an operation containing a set of instructions with operands. When the operation is processed, the active message job arbiter sends the result back to the initiator (if any) and updates the completion job ticket. The most complex part of this software arbiter is that all active messages from an initiator are ordered with all outstanding RDMA writes of this initiator. To be concise, from the initiator's point of view, all outstanding incoming RDMA transactions will complete before the posted remote atomic operation is processed (active message operation completion).
5.4 Memory Ordering of RDMA and Active Message Operations
For performance, one-sided operations have a relaxed memory consistency model. Regarding outstanding RDMA operations from an initiator's point of view, all puts are strictly ordered with each other for their local completion but not for their remote completion (completion is given by Algorithm 2). Indeed, low latency and high throughput are critical for performance; thus, the low-level mechanisms have the following properties.
Get operations are ordered when reading from a same memory segment, while reads from different memory segments are not ordered. Outstanding put and get operations are not ordered on the same initiator, for efficient parallel execution. Hence, whenever a Read-After-Write (RAW) dependency occurs (a put followed by a get on the same memory segment), an RDMA fence completion must be performed before initiating the get operation. The fence operation is provided by the one-sided engine and is part of the active message operations. In the memory consistency model, the completion of the one-sided fence operation provides the visibility of all outstanding memory modifications to the other processors and nodes.
Remote atomic operations are ordered when targeting a same memory segment, and not ordered when targeting different memory segments. A powerful concept with outstanding RDMA transactions and outstanding remote atomic operations is that they are ordered with each other when targeting a same segment. This is done thanks to a point-to-point software "virtual channel" between each pair of segments. When an initiator X posts several puts (RDMA transactions) and then posts a remote atomic operation to a memory segment, the posted remote atomic operation will be seen in this memory segment only after the remote completion of the previously initiated puts of initiator X. Such ordering is very important for performance, as the initiator can post high-throughput RDMA data transfers along with a posted synchronization mechanism. Indeed, everything can be done asynchronously from the initiator's point of view; the initiator can therefore go back to computation immediately without losing any time. We call this concept outstanding ordering between posted remote atomic operations and RDMA transactions.
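A hedged usage sketch of this pattern follows, reusing simplified versions of the illustrative put and postadd signatures assumed in Section 5.1: the initiator posts several puts and then a posted add on the same segment, and the consumer observes the counter increment only after all the puts have remotely completed.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative AOS calls (assumed signatures, dense-put variant). */
typedef struct aos_segment aos_segment_t;
typedef struct aos_event   aos_event_t;
int aos_put(const void *local_src, aos_segment_t *seg, size_t remote_offset,
            size_t size, aos_event_t *event);       /* async when event != NULL */
int aos_postadd(aos_segment_t *seg, size_t offset, int64_t value);

/* Producer: push n_blocks data blocks, then post a synchronization add.
 * The per-segment "virtual channel" guarantees the counter moves only after
 * the remote completion of the puts, so no fence or wait is needed here. */
void produce(aos_segment_t *seg, const char *blocks, size_t block_size,
             unsigned n_blocks, size_t counter_offset, aos_event_t **events)
{
    for (unsigned i = 0; i < n_blocks; ++i) {
        /* Asynchronous put: returns immediately, completion tracked later. */
        aos_put(blocks + (size_t)i * block_size, seg,
                (size_t)i * block_size, block_size, events[i]);
    }
    /* Posted remote add, ordered after the puts targeting this segment. */
    aos_postadd(seg, counter_offset, (int64_t)n_blocks);
    /* The producer resumes computation immediately. */
}
```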
5.5 Data Restructuring Support
Linear and strided copies are essential geometries in RDMA communication. A strided transfer can have an offset between each contiguous data block either on the src or the dst buffer, or on both. An efficient RDMA API (and the underlying hardware) should be able to perform strided transfers with zero-copy, by automatically incrementing the read and write DMA offsets at low cost. We call this capability "on-the-fly data restructuring". Applications such as computer vision, deep learning, signal processing, linear algebra and numerical simulation require efficient zero-copy data restructuring. They generally rely on tiling with or without overlap, halo region forwarding, transposition patterns and 2D/3D block transfers. 2D and 3D copies are special cases of strided copy, where stride offsets exist on both the src and dst buffers. These offsets can differ from each other, as the local buffer is often smaller and accommodates a sub-partition of the remote buffer. We inspired our panel of transfer geometries from [5] and added a few more patterns, typically the spaced-remote-sparse-local transfer (with arbitrary remote_stride and local_stride offsets) and its induced implementation for 2D/3D data blocks.
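Returning to the 16×16 tile example of Section 2.4, a single strided get expressed with the geometry descriptor assumed in the Section 5.1 sketch could fetch the whole tile in one call. The offset arithmetic is the point of the example; the API itself remains an assumption.

```c
#include <stddef.h>

/* Illustrative strided get, as assumed in the Section 5.1 sketch. */
typedef struct aos_segment aos_segment_t;
typedef struct {
    size_t size;           /* bytes per contiguous block          */
    size_t count;          /* number of blocks                    */
    size_t local_stride;   /* gap between blocks in local memory  */
    size_t remote_stride;  /* gap between blocks in remote memory */
} aos_geometry_t;
int aos_get(void *local_dst, aos_segment_t *seg, size_t remote_offset,
            const aos_geometry_t *geom, void *event);

/* Fetch a 16x16-pixel tile of a width-pixel-wide remote image (1 byte per
 * pixel) into the interior of an 18x18 local buffer with a 1-pixel halo. */
void fetch_tile(aos_segment_t *image_seg, size_t image_width,
                size_t tile_x, size_t tile_y,
                unsigned char local_tile[18][18])
{
    aos_geometry_t geom = {
        .size          = 16,                /* one 16-pixel row per block */
        .count         = 16,                /* 16 rows                    */
        .local_stride  = 18 - 16,           /* skip the 2 halo columns    */
        .remote_stride = image_width - 16,  /* jump to the next image row */
    };
    /* Blocking call (no event): local_tile is ready on return. */
    aos_get(&local_tile[1][1], image_seg,
            tile_y * image_width + tile_x, &geom, NULL);
}
```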
6 RESULTS, ANALYSIS & LIMITS
6.1 Benchmarking Environment
We use a multi-multicore Central Processing Unit (CPU) execution model with a low-level POSIX-like environment for benchmarking. All measurements were made on an MPPA® operating at 500 MHz with one or two DDR3 memories running at 1066 MHz. Each DDR3 bus is 64 bits wide, which leads to a theoretical maximum memory bandwidth of 8.5 GB/s (17.0 GB/s using 2 DDRs). The NoC is 32 bits wide and also operates at 500 MHz; therefore, it provides up to 2.0 GB/s. The compute clusters' SMEM has 1 NoC link providing 2.0 GB/s per link direction. However, we use a typical data NoC payload packet size of 32 flits with a header of 2 flits, for a total typical packet size of 34 flits. This leads to a maximum efficient data transfer throughput of 2 × (32/34) GB/s, which gives 1.88 GB/s (full-duplex). The throughput is defined as the memory bandwidth at which
the node(s) or processor(s) are reading or writing. The latency is defined as the time between the initiation and the completion of a transaction; thus, it depends on the size of the transaction.
6.2 RDMA Performance
6.2.1 Memory Throughput. Figure 2 shows both DDR(s) and inter-cluster SMEM reads (gets) and writes (puts). The throughput is measured on the memory that the CCs are reading from and writing to. The size of the RDMA transactions (abscissa) and the number of CCs (different curves) vary, showing different saturation points for the hardware and software involved. All throughput benchmarks were done with asynchronous RDMA transactions to saturate the software in charge of configuring the DMA NoC interfaces. Thus, software flow-control is heavily used to prevent the corruption of outstanding local or remote FIFOs.
First, for DDR memory accesses we achieve more than 50% of the maximum theoretical throughput for data transfers greater than 4 KB, and 94% for 32 KB, in all topologies. Second, it can be noticed that RDMA puts perform better than gets. This is due to remote server contention, which is the point of serialization for the configuration of the DMA interfaces. Indeed, from the software point of view, outstanding puts only rely on local flow-control whereas outstanding gets rely on remote flow-control. Remote flow-control is more complex as it requires more software interactions with the DMA NoC interface (configuration and packet transmission). We show that our software implementation of RDMA support reaches more than 70% of the peak hardware throughput for contiguous data NoC stream sizes of 8 KB and above. To conclude, the RDMA layer provides the user application with efficient usage of the hardware for data stream sizes of 8 KB or more, while managing complex flow-control mechanisms internally. Delivering decent performance without requiring the user to manage flow-control eases application development, as the implementation of software flow-control in an application is often a source of data-race errors.
6.2.2 Memory Latency. As the software is in charge of configuring the DMA NoC interface to communicate, it implies latency. Tightly-coupled parallel software often leads to poor performance on massively parallel architectures. Therefore, the transaction latency on such architectures is critical when dealing with complex data dependency patterns that imply inter-node communications (e.g. low-latency 6-step Fast Fourier Transform (FFT) or low-latency Convolutional Neural Network (CNN) inference). We usually implement well-known N-buffering techniques to tackle such problems by masking the memory access latencies. However, depending on the spatial and temporal memory locality, this is not always possible, and the latency of the transactions then becomes important. We model the total round-trip latency of the RDMA software/hardware engines when there is neither congestion nor user/kernel interruptions.
TT(B) is the Time to Transmit B bytes and is given by:
TT(B) = IPT + HLT + SPT + B/3.76 + CT
where IPT is the Initiator Processing Time described in Algorithm 1, SPT the Server Processing Time explained in Section 5.3, HLT the Hardware Latency Time for the NoC link/router and micro-engine memory accesses, CT the Completion Time described in Algorithm 2, and B the number of bytes to transfer. 3.76 bytes per cycle is the efficient data transfer bandwidth considering a NoC header of 2 flits with a payload of 32 flits. The typical costs in cycles are respectively:
TT(B) = 500 + 100 + 300 + B/3.76 + 200 = 1100 + B/3.76
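For reference, a small helper evaluating this model with the typical per-component costs quoted above (the constants come from the text; the function itself is only a convenience):

```c
/* Round-trip RDMA latency model: TT(B) = IPT + HLT + SPT + B/3.76 + CT.
 * With the typical costs quoted in the text this reduces to
 * 1100 + B/3.76 cycles; at 500 MHz one cycle is 2 ns. */
static double tt_cycles(double bytes)
{
    const double ipt = 500.0;   /* Initiator Processing Time (Algorithm 1)   */
    const double hlt = 100.0;   /* Hardware Latency Time (NoC, micro-engine) */
    const double spt = 300.0;   /* Server Processing Time (Section 5.3)      */
    const double ct  = 200.0;   /* Completion Time (Algorithm 2)             */
    return ipt + hlt + spt + bytes / 3.76 + ct;
}

static double tt_microseconds(double bytes, double freq_mhz)
{
    return tt_cycles(bytes) / freq_mhz;   /* cycles / (cycles per microsecond) */
}
/* Example: tt_microseconds(64, 500.0) is about 2.23 us, consistent with the
 * 2.2-2.3 us best-case round-trips reported in Figure 3. */
```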
Figure 3 shows the round-trip latency using different compute matrix geometries that are reading or writing into the DDR(s) or one SMEM. The minimum latency is 2.2 µs. When transfer sizes are greater than 10 KB we observe a rupture point: the software latency becomes negligible compared to the latency of the DMA micro-engine transfer. When there is no contention, for instance on curve 1-DDR-4-CCs, the curve has exactly a derivative of 3.76 bytes per cycle after this rupture point. Moreover, after this rupture point the latency is impacted by the bandwidth of the targeted memory (DDR(s) or the SMEM).
6.3 Network-on-Chip Scalability
A strength of NoC-based manycore processors is their ability to scale on non-interfering inter-node data transfers. Table 1 shows the internal compute matrix NoC bandwidth using different matrix geometries and stream sizes. The peak input-output throughput of the 16 CCs is given by 2 × 16 × 3.76 bytes per cycle. Operating at 500 MHz, this provides 60.2 GB/s of peak bandwidth. Our new RDMA engine can reach more than 88% of peak performance for inter-node transfers with a size of 16 KB.
Transfer Size     1 KB    4 KB   16 KB   64 KB   256 KB
Nb Cluster(s)
1                  0.7     2.7     3.5     3.8      3.8
4                  2.7    11.0    14.2    15.2     15.4
8                  5.5    21.8    28.3    30.5     30.8
16                10.7    42.8    56.3    60.2     60.2
Table 1: Compute Matrix's NoC Bandwidth in GB/s
The communication pattern is the following: each compute cluster initiates an RDMA put to a neighbor using NoC routes that do not overlap with each other at runtime. When using 1 CC, we use the loopback feature of the NoC interface. No NoC link sharing or point of serialization occurs, except for the sharing of the SMEMs between the DMA NoC interface reads and writes. The SMEM is a multi-bank interleaved memory of 16 banks; each bank can sustain 8 bytes per cycle, therefore providing a bandwidth of 64 GB/s, which is not the bottleneck in our measurements.
6.4 Remote Atomics Performance
We benchmark the latency of the active message engines as they are used for synchronization and reduction operations. Figure 4 shows the latency for different geometries for both asynchronous (posted) and blocking calls (wait until completion). On the abscissa we show the number of initiator CCs that target, either in spread or centralized mode (see curve legend), the 16 CCs of the MPPA® compute matrix. Spread mode means that all initiators change their target node each time they send a request (they all target different nodes). It can be understood as a scatter mode where there is no overloading of receivers (well load-balanced) and the best performance is expected. Centralized mode means that all initiators simultaneously target a same node, thus overloading this node and increasing request processing. It aims at measuring the worst case of all possible active message scheduling schemes at execution.
Figure 2: RDMA Engine Throughput in GB/s (Asynchronous). Panels: Read (Get) and Write (Put); axes: bandwidth (GB/s) vs. transfer size (bytes); curves: 1-DDR, 2-DDRs and 1-SMEM targets with 1, 2, 4, 8 and 16 CCs.
Figure 3: RDMA Engine Latency in µs (Blocking). Panels: Read (Get), annotated 2.3 µs round-trip (1168 cycles), and Write (Put), annotated 2.2 µs round-trip (1095 cycles); axes: time (µs) vs. transfer size (bytes); curves: 1-DDR, 2-DDRs and 1-SMEM targets with 1, 2, 4, 8 and 16 CCs.
Figure 4: Active Message Latency. Axes: time (µs) vs. number of cluster initiators; curves: Async/Blocking calls in Spread/Centralized mode with 1 or 16 PEs per cluster; annotations: 2.174 MIOPS initiator (230 cycles), 0.951 MIOPS server (526 cycles), 2.2 µs round-trip (1109 cycles).
The best-case initiator latency for a posted operation, for instance postadd, is 230 cycles (418 ns). The round-trip latency for the completion of a fetchadd operation is 1109 cycles (2.2 µs). Many conflicts occur when the 256 PEs are all sending requests to the same node (see curve Async-Centralized-16-PEs with 16 clusters). In such a configuration, the N-to-N flow-control generates a lot of traffic to avoid the corruption of the software job FIFOs. The implementation is able to sustain such contention, but the latency explodes (up to 17.5 µs in asynchronous mode).
6.5 Remote Queue Throughput
Remote queues provide elementary support for two-sided communications with small low-latency atomic messages (1-to-1 and N-to-1). Regarding the benchmark conditions, each CC has a queue to which the IO subsystem sends messages (1 outstanding message per CC from the IO subsystem point of view). Each CC then gets this message and responds with an atomic queue message to the N-to-1 queue on the IO (a FIFO receiving atomic messages). All CCs run concurrently; therefore, we use the N-to-1 feature for data NoC packet atomicity. Table 2 shows the IOPS of one RM of the IO that receives a request from a CC and sends back a new job command to the responding CC. The benchmarks were
made using different data NoC packet sizes (bytes) without any batching.
Packet Size       16 B   32 B   64 B   128 B   248 B
Nb Cluster(s)
1                  675    670    600     425     350
2 to 16            740    725    725     575     550
Table 2: Performance of the Remote Queues in Kilo IOPS
Such a communication pattern is important as it is used for offloading. We therefore show the limits of this implementation when fine-grained parallelism is required by an offloaded application whose control is done by a host processor. In our case, the IO subsystem is considered as a host processor that offloads computations onto the compute matrix (through job queue commands). For small messages, the results show that simple double-buffering (with 2 CCs) is enough to saturate the number of IOPS of one IO master core.
7 CONCLUSION & PERSPECTIVES
We present the design and illustrate the advantages of an asynchronous one-sided (AOS) communication and synchronization programming layer for the Kalray MPPA® processor. The motivation is to apply to this and related CPU-based manycore processors the established principles of the one-sided communication libraries of supercomputers, in particular: the Cray SHMEM library, the PNNL ARMCI library, and the MPI-2 one-sided API subset. The main difference between these communication libraries and the proposed AOS layer is that a supercomputer has a symmetric architecture, where the compute nodes are identical and the working memory is composed of the union of the compute node local memories. Similar to the InfiniBand low-level API, the AOS programming layer supports the Read/Write, Put/Get and Remote Atomics protocols, but it is designed around the capabilities of an RDMA-capable NoC.
One-sided asynchronous operations are proven to be highly efficient thanks to relaxed ordering, and easier to use since the initiator is a master on the targeted memories. Indeed, one-sided communications do not require strict matching, whereas the send/receive operations of the two-sided protocol do. Our software implementation is capable of sustaining more than 70% of the peak hardware throughput using the RDMA put/get engines for data transfer sizes of 8 KB and above. However, managing resource sharing, flow-control, arbitration and notifications in software has limitations in terms of latency, as shown. Based on this implementation and its results, the forthcoming 3rd-generation MPPA® processor will include hardware acceleration for the key functions of these engines. This will enable very low latency and close-to-peak throughput on small transactions, along with respecting important ordering properties for memory consistency.
REFERENCES
[1] Dan Bonachea. 2008. GASNet Specification. Technical Report.
[2]
Eric A Brewer, Frederic T Chong, Lok T Liu, Shamik D Sharma, and John D
Kubiatowicz. 1995. Remote queues: Exposing message queues for optimization
and atomicity. In Proceedings of the seventh annual ACM symposium on Parallel
algorithms and architectures. ACM, 42–53.
[3]
Darius Buntinas, Dhabaleswar K Panda, and William Gropp. 2001. NIC-based
atomic remote memory operations in Myrinet/GM. In Workshop on Novel Uses
of System Area Networks (SAN-1). Citeseer.
[4]
Barbara Chapman, Tony Curtis, Swaroop Pophale, Stephen Poole, Jeff Kuehn,
Chuck Koelbel, and Lauren Smith. 2010. Introducing OpenSHMEM: SHMEM
for the PGAS community. In Proceedings of the Fourth Conference on Partitioned
Global Address Space Programming Model. ACM, 2.
[5]
Silviu Ciricescu, Ray Essick, Brian Lucas, Phil May, Kent Moat, Jim Norris, Michael
Schuette, and Ali Saidi. 2003. The reconfigurable streaming vector processor
(RSVP™). In Proceedings of the 36th annual IEEE/ACM International Symposium
on Microarchitecture. IEEE Computer Society, 141.
[6]
Benoît Dupont de Dinechin, Pierre Guironnet de Massas, Guillaume Lager, Clé-
ment Léger, Benjamin Orgogozo, Jérôme Reybert, and Thierry Strudel. 2013. A
distributed run-time environment for the Kalray MPPA®-256 integrated manycore
processor. Procedia Computer Science 18 (2013), 1654–1663.
[7]
Karl Feind. 1995. Shared memory access (SHMEM) routines. Cray Research
(1995).
[8]
David Gelernter, Alexandru Nicolau, and David A Padua. 1990. Languages and
compilers for parallel computing. Pitman.
[9]
Alexandros V Gerbessiotis and Seung-Yeop Lee. 2004. Remote memory access:
A case for portable, efficient and library independent parallel programming.
Scientific Programming 12, 3 (2004), 169–183.
[10]
Sergei Gorlatch. 2004. Send-receive Considered Harmful: Myths and Realities of
Message Passing. ACM Trans. Program. Lang. Syst. 26, 1 (Jan. 2004), 47–56.
[11]
Peter J Keleher, Alan L Cox, Sandhya Dwarkadas, and Willy Zwaenepoel. 1994.
TreadMarks: Distributed Shared Memory on Standard Workstations and Operat-
ing Systems.. In USENIX Winter, Vol. 1994. 23–36.
[12]
Jarek Nieplocha, Vinod Tipparaju, Manojkumar Krishnan, Gopalakrishnan Santhanaraman, and Dhabaleswar K Panda. 2003.
Optimizing mechanisms for latency tolerance in remote memory access commu-
nication on clusters. In IEEE International Conference on Cluster Computing. IEEE,
138.
[13]
Thierry Lepley, Pierre Paulin, and Eric Flamand. 2013. A novel compilation
approach for image processing graphs on a many-core platform with explicitly
managed memory. In Proceedings of the 2013 International Conference on Compilers,
Architectures and Synthesis for Embedded Systems. IEEE Press, 6.
[14]
Jiuxing Liu, Weihang Jiang, Pete Wyckoff, Dhabaleswar K Panda, David Ashton,
Darius Buntinas, William Gropp, and Brian Toonen. 2004. Design and Implemen-
tation of MPICH2 over InfiniBand with RDMA Support. In Parallel and Distributed
Processing Symposium, 2004. Proceedings. 18th International. IEEE, 16.
[15]
Kevin JM Martin, Mostafa Rizk, Martha Johanna Sepulveda, and Jean-Philippe
Diguet. 2016. Notifying memories: a case-study on data-flow applications with
NoC interfaces implementation. In Proceedings of the 53rd Annual Design Au-
tomation Conference. ACM, 35.
[16]
John Mellor-Crummey, Laksono Adhianto, William N. Scherer, III, and Guohua
Jin. 2009. A New Vision for Coarray Fortran. In Proc. of the Third Conference
on Partitioned Global Address Space Programming Models (PGAS ’09). Article 5,
5:1–5:9 pages.
[17]
Jarek Nieplocha and Bryan Carpenter. 1999. ARMCI: A portable remote memory
copy library for distributed array libraries and compiler run-time systems. Parallel
and Distributed Processing (1999), 533–546.
[18]
Jarek Nieplocha, Vinod Tipparaju, Manojkumar Krishnan, and Dhabaleswar K
Panda. 2006. High performance remote memory access communication: The
ARMCI approach. International Journal of High Performance Computing Applica-
tions 20, 2 (2006), 233–253.
[19]
Robert W. Numrich and John Reid. 1998. Co-array Fortran for Parallel Program-
ming. SIGPLAN Fortran Forum 17, 2 (Aug. 1998), 1–31. https://doi.org/10.1145/
289918.289920
[20]
Andreas Olofsson. 2016. Epiphany-V: A 1024 processor 64-bit RISC System-On-
Chip. CoRR abs/1610.01832 (2016).
[21]
Selma Saidi, Rolf Ernst, Sascha Uhrig, Henrik Theiling, and Benoît Dupont de
Dinechin. 2015. The shift to multicores in real-time and safety-critical systems.
In 2015 International Conference on Hardware/Software Codesign and System Syn-
thesis, CODES+ISSS 2015, Amsterdam, Netherlands, October 4-9, 2015. 220–229.
[22]
Karthikeyan Vaidyanathan, Lei Chai, Wei Huang, and Dhabaleswar K Panda.
2007. Ecient asynchronous memory copy operations on multi-core systems
and I/OAT. In Cluster Computing, 2007 IEEE International Conference on. IEEE,
159–168.
[23]
Anish Varghese, Bob Edwards, Gaurav Mitra, and Alistair P Rendell. 2014. Pro-
gramming the adapteva epiphany 64-core network-on-chip coprocessor. In Par-
allel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE Interna-
tional. IEEE, 984–992.