
Auto-parallelisation of Sieve C++ Programs

Authors:
  • Alastair Donaldson, Colin Riley, Anton Lokhmotov, and Andrew Cook (Codeplay Software, University of Cambridge)

Abstract

We describe an approach to automatic parallelisation of programs written in Sieve C++ (Codeplay’s C++ extension), using the Sieve compiler and runtime system. In Sieve C++, the programmer encloses a performance-critical region of code in a sieve block, thereby instructing the compiler to delay side-effects until the end of the block. The Sieve system partitions code inside a sieve block into independent fragments and speculatively distributes them among multiple cores. We present implementation details and experimental results for the Sieve system on the Cell BE processor.
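To make the construct concrete, here is a minimal sketch of what a sieve block might look like, based only on the description above; the function, its parameters, and the exact Sieve C++ syntax are illustrative assumptions rather than code taken from the paper.

    // Illustrative Sieve C++ fragment (sketch only; exact syntax may differ).
    // Writes to memory defined outside the sieve block ('out') are side effects,
    // so they are delayed and applied in order at the end of the block, which is
    // what lets the compiler split the loop across cores.
    void scale(const float* in, float* out, int n, float k)
    {
        sieve {                          // side effects delayed until block end
            for (int i = 0; i < n; ++i) {
                float t = in[i] * k;     // 't' is local to the block: immediate
                out[i] = t;              // write to outside memory: delayed
            }
        }
    }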
HPPC
2007
(Hand-out) Proceedings of the
2007 Workshop on
Highly Parallel Processing
on a Chip
August 28, 2007, IRISA, Rennes, France
Organizers Martti Forsell and Jesper Larsson Träff
in conjunction with
the 13th International European Conference on
Parallel and Distributed Computing (Euro-Par)
August 28-31, 2007, IRISA, Rennes, France
Sponsored by VTT and Euro-Par
(Hand-out) Proceedings of the
2007 Workshop on
Highly Parallel Processing
on a Chip
August 28, 2007, IRISA, Rennes, France
http://www.hppc-workshop.org/
in conjunction with
the 13th International European Conference on Parallel and Distributed Computing (Euro-Par)
August 28-31, 2007, IRISA, Rennes, France.
August 2007
Handout editors: Martti Forsell and Jesper Larsson Träff
Printed in Finland and Germany
CONTENTS
Foreword 4
Organization 5
Program 6
SESSION 1 - Alternative computing paradigms for CMPs
Keynote - Towards Realizing a PRAM-On-Chip Vision - Uzi Vishkin, UMIACS, University of Maryland 7
An asynchronous many-core dataflow processor for D3AS prototype - Lorenzo Verdoscia and
Roberto Vaccaro, ICAR-CNR Italy 8
SESSION 2 - Transactional memory and caching solutions
Hardware Transactional Memory with Operating System Support - Sasa Tomic, Adrian Cristal, Osman
Unsal, and Mateo Valero, Barcelona Supercomputing Center, Universitat Politécnica de Catalunya 18
Auto-parallelisation of Sieve C++ programs - Alastair Donaldson, Colin Riley, Anton Lokhmotov, and
Andrew Cook, Codeplay Software, University of Cambridge 30
Adaptive L2 Cache for Chip Multiprocessors - Domingo Benítez, Juan Carlos Moure, Dolores Isabel
Rexachs, and Emilio Luque, University of Las Palmas, Universidad Autonoma de Barcelona 40
COMA Cache Hierarchy for Microgrids of Microthreaded Cores - Li Zhang and Chris Jesshope,
University of Amsterdam 50
SESSION 3 - Language support and CMP future directions
Parallelization of Bulk Operations for STL Dictionaries - Leonor Frias, and Johannes Singler,
Universitat Politécnica de Catalunya, Universität Karlsruhe 60
Keynote - Societies of Cores and their Computing Culture - Thomas Sterling, Louisiana State University 70
FOREWORD
Technological developments are bringing parallel computing back into the limelight after some years of absence
from the stage of mainstream computing and computer science between the early 1990s and early 2000s. The
driving forces behind this return are mainly advances in VLSI technology: increasing transistor densities along with
hot chips, leaky transistors, and slow wires make it unlikely that the increase in single processor performance can
continue the exponential growth that has been sustained over the last 30 years. To satisfy the needs for application
performance, major processor manufacturers are instead planning to double the number of processor cores per
chip every second year (thus reinforcing the original formulation of Moore's law). We are therefore on the brink of
entering a new era of highly parallel processing on a chip. However, many fundamental unresolved hardware and
software issues remain that may make the transition slower and more painful than is optimistically expected from
many sides. Among the most important such issues are convergence on an abstract architecture, programming
model, and language to easily and efficiently realize the performance potential inherent in the technological devel-
opments.
The Workshop on Highly Parallel Processing on a Chip (HPPC) aims to be a forum for discussing such fundamen-
tal issues. It is open to all aspects of existing and emerging/envisaged multi-core processors with a significant
amount of parallelism, especially to considerations on novel paradigms and models and the related architectural
and language support. To be able to relate to the parallel processing community at large, which we consider essen-
tial, the workshop has been organized in conjunction with Euro-Par, the main European (and international) con-
ference on all aspects of parallel processing.
The Call-for-papers for the HPPC workshop was launched early in the year, and at the passing of the submission
deadline we had received 20 submissions, which were generally relevant to the theme of the workshop and of good
quality. The papers were swiftly and expertly reviewed by the program committee, most of them receiving 4 qual-
ified reviews. We thank the whole of the program committee for the time and expertise they put into the review-
ing work, and for getting it all done within the rather strict time limit. The final decision on acceptance was made by the
program chairs based on the recommendations from the program committee. Being a(n extended) half-day event,
we had room for accepting only 6 of the contributions, resulting in an acceptance ratio of about 30%. The 6 accept-
ed contributions will be presented at the workshop today, together with two forward looking invited talks by Uzi
Vishkin and Thomas Sterling on realizing a PRAM-on-a-chip vision and societies of cores and their computing cul-
ture.
This handout includes the workshop versions of the HPPC papers and the abstracts of the invited talks. Final ver-
sions of the papers will be published as post proceedings in a Springer LNCS volume containing material from all
the Euro-Par workshops. We sincerely thank the Euro-Par organization for giving us the opportunity to arrange the
HPPC workshop in conjunction with the Euro-Par 2007 conference. We also warmly thank our sponsors VTT and
Euro-Par for the financial support which made it possible for us to invite Uzi Vishkin and Thom Sterling, both of
whom we also sincerely thank for accepting our invitation to come and speak.
Finally, we welcome all of our attendees to the Workshop on Highly Parallel Processing on a Chip in the beautiful
city of Rennes, France. We wish you all a productive and pleasant workshop. Based on today's experience and
feedback, we hope to find out whether there is need and interest in continuing the HPPC format next year.
HPPC organizers
Martti Forsell, VTT, Finland
Jesper Larsson Träff, NEC Europe, Germany
ORGANIZATION
Organized in conjunction with the 13th International European Conference on Parallel and Distributed Computing
WORKSHOP ORGANIZERS
Martti Forsell, VTT, Finland
Jesper Larsson Träff, C&C Research labs, NEC Europe Ltd, Germany
PROGRAM COMMITTEE
Gianfranco Bilardi, University of Padova, Italy
Taisuke Boku, University of Tsukuba, Japan
Martti Forsell, VTT, Finland
Jim Held, Intel, USA
Peter Hofstee, IBM, USA
Ben Juurlink, Technical University of Delft, The Netherlands
Darren Kerbyson, Los Alamos National Laboratory, USA
Lasse Natvig, NTNU, Norway
Kunle Olukotun, Stanford University, USA
Wolfgang Paul, Saarland University, Germany
Andrea Pietracaprina, University of Padova, Italy
Alex Ramirez, Technical University of Catalonia and Barcelona Supercomputing Center, Spain
Peter Sanders, University of Karlsruhe, Germany
Thomas Sterling, Caltech and Louisiana State University, USA
Jesper Larsson Träff, C&C Research labs, NEC Europe Ltd, Germany
Uzi Vishkin, University of Maryland, USA
SPONSORS
VTT, Finland http://www.vtt.fi
Euro-Par http://www.euro-par.org
PROGRAM
Workshop on Highly Parallel Processing on a Chip (HPPC)
August 28, 2007, IRISA, Rennes, France
http://www.hppc-workshop.org/
in conjunction with
the 13th International European Conference on Parallel and Distributed Computing (Euro-Par)
August 28-31, 2007, IRISA, Rennes, France.
TUESDAY AUGUST 28, 2007
SESSION 1 - Alternative computing paradigms for CMPs
09:00-09:15 Opening notes - Jesper Larsson Träff and Martti Forsell, NEC Europe, VTT
09:15-10:05 Keynote - Towards Realizing a PRAM-On-Chip Vision - Uzi Vishkin, UMIACS, University of
Maryland
10:05-10:30 An asynchronous many-core dataflow processor for D3AS prototype - Lorenzo Verdoscia and
Roberto Vaccaro, ICAR-CNR Italy
10:30-11:00 -- Break --
SESSION 2 - Transactional memory and caching solutions
11:00-11:25 Hardware Transactional Memory with Operating System Support - Sasa Tomic, Adrian Cristal,
Osman Unsal, and Mateo Valero, Barcelona Supercomputing Center, Universitat Politécnica de Catalunya
11:25-11:50 Auto-parallelisation of Sieve C++ programs - Alastair Donaldson, Colin Riley, Anton Lokhmotov,
and Andrew Cook, Codeplay Software, University of Cambridge
11:50-12:15 Adaptive L2 Cache for Chip Multiprocessors - Domingo Benítez, Juan Carlos Moure, Dolores Isabel
Rexachs, and Emilio Luque, University of Las Palmas, Universidad Autonoma de Barcelona
12:15-12:40 COMA Cache Hierarchy for Microgrids of Microthreaded Cores - Li Zhang and Chris Jesshope,
University of Amsterdam
12:40-14:00 -- Lunch --
SESSION 3 - Language support and CMP future directions
14:00-14:25 Parallelization of Bulk Operations for STL Dictionaries - Leonor Frias, and Johannes Singler,
Universitat Politécnica de Catalunya, Universität Karlsruhe
14:25-15:15 Keynote - Societies of Cores and their Computing Culture - Thomas Sterling, Louisiana State
University
15:15-15:30 Closing notes - Jesper Larsson Träff and Martti Forsell, NEC Europe, VTT
KEYNOTE
Towards Realizing a PRAM-On-Chip Vision
Uzi Vishkin, UMIACS, University of Maryland
Abstract: Serial computing became largely irrelevant for growth in computing performance around 2003.
Having already concluded that to maintain past performance growth rates, general-purpose computing must be
overhauled to incorporate parallel computing at all levels of a computer system--including the programming
model—all processor vendors put forward many-core roadmaps. They all expect exponential increase in the num-
ber of cores over at least a decade. This welcome development is also a cause for apprehension. The whole world
of computing is now facing the same general-purpose parallel computing challenge that eluded computer science
for so many years and the clock is ticking. It is becoming common knowledge that if you want your program to run
faster you will have to program for parallelism, but the vendors who set up the rules have not yet provided clear
and effective means (e.g., programming models and languages) for doing that. How can application software ven-
dors be expected to make a large investment in new software developments, when they know that in a few years
they are likely to have a whole new set of options for getting much better performance?! Namely, we are already
in a problematic transition stage that slows down performance growth, and may cause a recession if it lasts too
long. Unfortunately, some industry leaders are already predicting that the transition period can last a full decade.
The PRAM-On-Chip project started at UMD in 1997 foreseeing this challenge and opportunity. Building on PRAM-
-a parallel algorithmic approach that has never been seriously challenged on ease of thinking, or wealth of its
knowledge-base—a comprehensive and coherent platform for on-chip general-purpose parallel computing has
been developed and prototyped. Optimized for single-task completion time, the platform accounts for application pro-
gramming (VHDL/Verilog, OpenGL, MATLAB, etc), parallel algorithms, parallel programming, compiling, architec-
ture and deep-submicron implementation, as well as backward compatibility on serial code. The approach goes
after any type of application parallelism regardless of its amount, regularity, or grain size. Some prototyping high-
lights include: an eXplicit Multi-Threaded (XMT) architecture, a new 64-processor, 75MHz XMT (FPGA-based) com-
puter, 90nm ASIC tape-out of the key interconnection network component, a basic compiler, class tested program-
ming methodology where students are taught only parallel algorithms and pick up the rest on their own, and ~100X
speedups on applications.
The talk will overview some future plans and will argue that the PRAM-On-Chip approach is a promising candidate
for providing the processor-of-the-future. It will also posit that focusing on a small number of promising approach-
es, such as PRAM-On-Chip, and accelerating their incubation and testing stage, would be most beneficial both (i) for
the field as a whole, and (ii) for an individual researcher who is seeking improved impact.
Bio: Uzi Vishkin is a permanent member of the University of Maryland Institute for Advanced Computer Science
(UMIACS) and a Professor of Electrical and Computer Engineering since 1988. He got his DSc in Computer Science
from the Technion—Israel Institute of Technology and his MSc and BSc in Mathematics from the Hebrew
University, Jerusalem, Israel. He was Professor of Computer Science at Tel Aviv University and the Technion,
research faculty at the Courant Institute, NYU, and a post-doc at IBM T.J. Watson. He is a Fellow of the ACM and an
ISI-Thompson Highly-Cited Researcher.
URL: http://www.umiacs.umd.edu/~vishkin/XMT
An asynchronous many-core dataflow processor
for D3AS prototype
Lorenzo Verdoscia and Roberto Vaccaro
Institute for High Performance Computing and Networking (ICAR)-CNR,
Via Castellino, 111 80131 Napoli, Italy
{lorenzo.verdoscia, roberto.vaccaro}@na.icar.cnr.it
Abstract. The main aim of the Demand Data Driven Architecture Sys-
tem (D3AS) prototype is to provide a new programming
model and architecture that allow efficient programming of highly parallel
systems based on hundreds or thousands of processors on a chip for new applica-
tions. The proposed D3AS prototype is based on the data-driven control
mechanism and on a re-examination of the functional programming style in terms
of many-core technology. In this paper we first present the architectural
and programming features of D3AS, then we describe the main charac-
teristics of the CODACS demonstrator, specifically realized to validate the
basic design choices of the D3AS prototype. Finally, some experimental
results in solving a linear equation system with Jacobi and Gauss-Seidel
iterative algorithms are discussed. Results show that the D3AS approach
is feasible and promising.
1 Introduction
Nowadays the radical changes that computer architectures are undergoing re-
flect the state of information technology. Multi-core processing is now main-
stream, and the future will be massively parallel computing performed on many-
core processors. According to [5], ”successful many-core architectures and sup-
porting software technologies could reset microprocessor hardware and software
roadmaps for the next 30 years.” As Moore’s Law and nanoscale physics conspire
to force chip designers to add transistors rather than increase processor clocks,
many people see the creation of many-core architectures (hundreds to thousands
of cores per processor) as a natural evolution of multi-core. On the other hand, if
many-core is destined to be the way forward, a new parallel computing environ-
ment will need to be developed. One of the main issues is the type of hardware
building blocks to be used for many-core systems. About this topic, in the sci-
entific and industrial community there exists a reasonably definitive stand that
envisions processors with thousands of simple (i.e., RISC) processing cores. Small,
simple cores are the most efficient structures for parallel codes, providing the best
tradeoff between energy consumption, performance, and manufacturability. In-
tel’s own 80-core terascale prototype processor Polaris uses simple RISC-type
cores to achieve a teraflop (in less than 70 watts!), although the company im-
plied that commercial versions would support Intel Architecture-based cores. But
even 80 cores is an order of magnitude less than what is typical of a many-core.
However, advances in integrated circuit technology impose new challenges about
how to implement a high performance application for low power dissipation on
processors created by hundreds of (coarse- or fine-)grained cores running at 200
MHz, rather than on one traditional processor running at 20 GHz.
From the architectural point of view, this new trend raises at least two
queries: how to exploit such spatial parallelism, and how to program such systems.
The first query brings us to seriously reconsider the dataflow paradigm, given
the fine grain nature of its operations. In fact, instead of carrying out in sequence
a set of operations like a von Neumann processor does, a many-core dataflow
processor could calculate a function first connecting and configuring a number of
identical simple cores as a dataflow graph and then allowing data asynchronously
flow through them. However, despite the model simplicity, in the past technology
limits and heterogeneity of links and actor I/O conditions shifted this approach
more and more towards the von Neumann model [7]. But with the introduction
of the homogeneous High-Level Dataflow System (hHLDS) [11], we believe that
the dataflow approach is still a valid proposal to increase performance and seri-
ously attack the von Neumann model at least at processor level.
The second query brings us to seriously reconsider the functional programming
style, given its intrinsic simplicity in writing parallel programs. In fact, func-
tional languages have three key properties that make them attractive for parallel
programming [9]: they have powerful mechanisms for abstracting over both com-
putation and coordination; they eliminate unnecessary dependencies; and their
high-level coordination achieves a largely architecture-independent style of par-
allelism. Moreover, functional programming, thanks to its properties which stem
from the inherent mathematical foundation of functions, constitutes a valid com-
putational model for naturally extracting the fine-grain parallelism that programs
present and that a dataflow engine can exploit.
2 D3AS Prototype. From the language to the architecture
D3AS (Demand Data Driven Architecture System) is a project under develop-
ment at the Institute for High Performance Computing and Networking.
The main aim of this project is to realize a computing system capable of
exploiting coarse-grained functional parallelism as well as fine-grained instruction
level parallelism through direct hardware execution of static dataflow graphs.
2.1 Design remarks
In the model of computation based on the dataflow graph, a program is described
as a set of operator nodes, called actors, interconnected by a set of data-carrying
arcs, called links. Data is passed through this graph in packets called tokens.
Within this model of computation, the graph can be static or dynamic (change
during execution), links can carry several tokens or at most one at a time, the
graph can be specified graphically or textually, there are several different possible
firing rules for actors, etc. Besides, this model exposes a program's parallelism at
a very fine grain. But Whiting and Pascoe [15] observed that the very fine grain
parallelism of dataflow has proved a disadvantage in the realization of dataflow
machines. Interestingly, the same very fine grain parallelism makes the dataflow
approach attractive for FPGA custom computing machines, for example. Indeed,
static dataflow graphs, in contrast to the greedy scheduling policy embodied in
dynamic ones ("execute whenever data is available"), which is inadequate in many
circumstances [4], form a very natural model of computation on multi/many-core
chip based systems, with each chip constituted by hundreds or thousands of identical
functional units.
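The actor/link/token vocabulary used above can be made concrete with a small data structure. The sketch below is a generic illustration of a static dataflow graph with single-token links and a simple "fire when all inputs are present and all outputs are free" rule; it is not the hHLDS model itself, and the names and firing rule are assumptions.

    #include <functional>
    #include <optional>
    #include <vector>

    // Generic static dataflow graph sketch: actors connected by single-token links.
    struct Link { std::optional<double> token; };      // at most one token in flight

    struct Actor {
        std::vector<Link*> inputs;                      // incoming arcs
        std::vector<Link*> outputs;                     // outgoing arcs
        std::function<double(const std::vector<double>&)> op;

        bool ready() const {                            // firing rule: all inputs full,
            for (auto* l : inputs)                      // all outputs empty
                if (!l->token) return false;
            for (auto* l : outputs)
                if (l->token) return false;
            return true;
        }
        void fire() {                                   // consume inputs, produce outputs
            std::vector<double> args;
            for (auto* l : inputs) { args.push_back(*l->token); l->token.reset(); }
            const double r = op(args);
            for (auto* l : outputs) l->token = r;
        }
    };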
The D3AS design has been based also on the ”language first” approach [8],
and among languages, the functional style has been chosen as a computational
model. First of all with a functional programming language it is more difficult
(or even impossible) to make low-level mistakes and easier to concentrate on the
program logic [3]. But there is another main reason which binds the language
and dataflow.
Lazy and eager evaluation [1, 10, 9] are two computation methods for execut-
ing functional programs. While the computation of programs in the lazy evalu-
ation mode is driven by the need for function argument values, the execution in
the eager evaluation mode is driven by the availability of the function arguments.
The first gives rise to the demand-driven execution model, implemented by re-
duction machines; the second gives rise to the data-driven execution model,
implemented by dataflow computers. However, this distinction has become less
and less evident while architecture prototypes supporting such models were im-
plemented [14]. Therefore, a dataflow machine that implements the functional
computational model should be named demand-data driven machine. In fact, in
the data driven model the graph construction can be made in demand mode,
while in the demand driven model, the execution of active instructions, after
having reached the atomic operands, can be made in data mode.
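The demand-driven versus data-driven distinction can also be illustrated in ordinary C++: eager evaluation computes an argument as soon as it is available, while lazy evaluation wraps it in a thunk that is forced only when its value is demanded. This is a language-neutral illustration, not D3AS code.

    #include <cstdio>
    #include <functional>

    // Eager (data-driven): the argument is evaluated before the call.
    int eager_twice(int x) { return x + x; }

    // Lazy (demand-driven): the argument is a thunk, evaluated only when demanded.
    int lazy_twice(std::function<int()> x) {
        int v = x();                                      // demand the value here
        return v + v;
    }

    int main() {
        int a = eager_twice(3 * 7);                       // 3*7 computed before the call
        int b = lazy_twice([] { return 3 * 7; });         // 3*7 computed inside the call
        std::printf("%d %d\n", a, b);                     // prints "42 42"
        return 0;
    }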
Moreover, since functional languages are referentially transparent [2], programs
written in these languages can be considered static objects. This means that an
expression in a functional language depends on the meaning of its component
subexpressions and not on the history of any computation performed prior to
the evaluation of that expression.
Consequently, we believe that the most appropriate architecture devoted to
executing the related dataflow graph should support the static dataflow paradigm,
at least at the processor level.
2.2 D3AS General Architecture
In order to directly map and execute in hardware dataflow graphs created with
the hHLDS model, what we need is a Reconfigurable Hardware System (RHS)
which, according to the model, executes graphs in a completely asynchronous man-
ner. Such a system, which defines the D3AS general architecture, is organized into
three subsystems:
- Actor Realization Subsystem (ARS), constituted by N identical MultiPurpose
Functional Units (MPFUs) to create a one-to-one correspondence among
graph actors and MPFUs;
- Token flow Realization Subsystem (TRS), constituted by three sets of N
buffer registers and a cross-bar switch network, implementing graph edges,
able to connect any MPFU output to any input of another MPFU;
- Graph Mapping Subsystem (GMS), devoted to storing the following RHS con-
text information (a minimal data-structure sketch follows below):
(i) the operating code for each MPFU;
(ii) the switching information for the interconnection network.
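A minimal sketch of how this GMS context information might be represented as a data structure follows; the field names and widths are illustrative assumptions, not the actual D3AS encoding. It reflects the two-input (Token_In A/B) MPFU organization described in the text.

    #include <cstdint>
    #include <vector>

    // Illustrative RHS configuration context held by the GMS (names/sizes assumed).
    struct RHSContext {
        std::vector<std::uint8_t> mpfu_opcode;   // (i)  one operating code per MPFU
        // (ii) crossbar switching information: for each MPFU input port,
        //      the index of the MPFU output feeding it, or -1 if unused.
        std::vector<std::int32_t> switch_a;      // source for each MPFU's A input
        std::vector<std::int32_t> switch_b;      // source for each MPFU's B input
    };

    // Create a context for an RHS with n_mpfu units, all idle and disconnected.
    RHSContext make_context(std::size_t n_mpfu) {
        return RHSContext{
            std::vector<std::uint8_t>(n_mpfu, 0),
            std::vector<std::int32_t>(n_mpfu, -1),
            std::vector<std::int32_t>(n_mpfu, -1),
        };
    }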
Fig. 1. a) D3AS Endo-architecture, b) MDP Endo-architecture.
Fig. 1.a shows the D3AS general architecture. Critical parameters in the RHS
design are
- $N_{MPFU}$: the number of MPFUs constituting the ARS;
- $C_{MPFU}$: the logical and functional complexity of the MPFUs;
- $IN_{TRS}$: the type of interconnect for the TRS.
While the interconnect complexity is $O(N_{MPFU}^2)$, the $N_{MPFU}$ implementable on a
VLSI device depends on both the interconnect complexity and $C_{MPFU}$. Depend-
ing on the ratio between the number of effectively available MPFUs and graph
actors, it is possible to simultaneously map and execute on the RHS several
dataflow graphs corresponding to different applications.
The RHS/D3AS fundamental building block is a many-core chip named
Many-core Dataflow Processor (MDP), based on $n$ MPFUs, whose microar-
chitecture is shown in Fig. 1.a. The MDP Endo-architecture is shown in Fig. 1.b.
When the number of actors in a graph is greater than n, the graph is first partitioned
into fitting subgraphs and then the RHS is configured by interconnecting the appropri-
ate number of MDPs, as shown in Fig. 2.a. To preserve the globally pure dataflow
model, the 2nd-level interconnection network is also a non-blocking cross-bar
switch. In this case its complexity becomes $O((nk)^2)$. Such an interconnect can
be implemented, for example, via a Field Programmable Interconnection Chip.
Obviously, as the number of RHS MPFUs grows, different, cheaper network
solutions (dynamic, static, etc.) can be adopted. In this case the interconnec-
tions implementing dataflow graph edges among subgraphs mapped on different
MDPs are made virtual by messages routed through the adopted network. The
execution model then becomes a hybrid model we call Communicating
Dataflow Processes (CDP), and subgraphs belonging to different MDPs exchange
data tokens through messages. The Endo-architecture of a system based on this
execution model is shown in Fig. 2.b.
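As a rough worked illustration of these complexity figures, using (purely for illustration) the CODACS parameters reported later in the paper, $n = 64$ MPFUs per MDP and $k = 4$ MDPs:

$n^2 = 64^2 = 4096$ crosspoints for one MDP's crossbar, versus $(nk)^2 = 256^2 = 65536$ crosspoints for a single global non-blocking crossbar,

i.e. roughly $k^2 = 16$ times more, which is why cheaper (dynamic or static) network solutions become attractive as the number of MPFUs grows.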
Fig. 2. RHS Endo-architecture based on the a) pure dataflow and b) CDP model.
3 The demonstrator
In order to verify the correctness and validity of the basic design choices, we
realized a D3AS demonstrator named CODACS (COnfigurable DAtaflow Com-
puting System) [13]. In CODACS, whose Endo-architecture is shown in Fig. 3,
the Global/2nd-level Interconnection Network is constituted by a WK-recursive
network ($N_d = 4$, $L = 1$); so, the execution model is the hybrid CDP. In CODACS a node is
constituted by a Smart Router Subsystem and a Platform-Processor Subsystem.
Smart Router Subsystem (SRS). When a message reaches a node, the SRS,
whose block representation is shown in Fig. 3, evaluates whether it has reached its
destination.
Fig. 3. CODACS Endo-architecture.
If it did not reach its destination, the WK-recursive Message Manager (WKMM)
routes it through the appropriate output link according to the routing strategy
described in [12]. If a message reaches its destination node, the WKMM trans-
fers it to the Packet Disassembler (PD) for processing. The PD unpacks it,
evaluates its content, and transfers information to the corresponding blocks: a)
to the Graph Configuration List (GCL), that contains the graph configuration
table list assigned to the platform-processor; b) to the Destination List (DL),
that contains the list of the result destination node set (one set for each config-
uration); c) to the Input Transfer Token Environment (ITTE).
In the ITTE, data token storage and transfer happen in separate buffers. To aug-
ment the throughput, different buses are used to transfer right and left MPFU
tokens. When results are ready inside the Output Token Transfer Environment
(OTTE), they are loaded into the Packet Assembler (PA). The PA scans the des-
tination node set from the DL, associates results with destination nodes, prepares
a new message, and transfers it to the WKMM for delivery.
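The message-handling flow just described can be summarized in code form. The sketch below is a plain C++ caricature of the control flow only; all types and helper names are assumptions for the sketch, not CODACS hardware interfaces.

    #include <vector>

    // Illustrative control flow of the Smart Router Subsystem (names assumed).
    struct Parts { std::vector<int> graph_cfg, dests, tokens; };
    struct Message { int dest_node; Parts parts; };

    struct Node {
        int id = 0;
        std::vector<int> gcl, dl, itte;              // GCL, DL, ITTE storage
        std::vector<Message> outgoing;               // handed to the WKMM for routing

        void on_message(const Message& msg) {
            if (msg.dest_node != id) {               // not ours: WKMM forwards it
                outgoing.push_back(msg);
                return;
            }
            // Packet Disassembler: unpack and dispatch to the three blocks.
            gcl  = msg.parts.graph_cfg;              // a) graph configuration tables
            dl   = msg.parts.dests;                  // b) result destination set
            itte = msg.parts.tokens;                 // c) input data tokens
        }

        void on_results_ready(const std::vector<int>& results) {
            // Packet Assembler: pair results with each destination from the DL
            // and hand the new messages to the WKMM for delivery.
            for (int dest : dl)
                outgoing.push_back(Message{dest, Parts{{}, {}, results}});
        }
    };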
Platform-Processor Subsystem (PPS). This environment, constituted by 64 MP-
FUs, executes the dataflow graph assigned to that node. PPS is both a static and
dynamic configurable system. It is static because, once cells are configured, in-
ternal functions do not change; it is dynamic because, by changing the interconnect
and/or MPFU operation code, different graphs can be executed.
After receiving the graph configuration table from GCL, the Graph Configu-
rator (GC) executes two operations: sets the MPFU interconnect and assigns the
operation code to each MPFU, thus carrying out the one-to-one correspondence
(mapping) between graph nodes and computing units. Once the configuration
phase terminates, it sends a signal to the control that enables the two input
token buffers. When a graph computation ends, results are stored in the output
buffer and then transferred to the OTTE, and a new computation can take place.
If different input tokens must be processed through the same graph (e.g. matrix
inner product), the GC has to check only for input token availability. We point
out that the I/O Ensemble Buffers and Token Transfer Environments make this
environment and the smart router one local to each subsystem, allowing data load,
message transfer, and computation to overlap.
The CODACS demonstrator is constituted by a Gidel PROC20KE board [6],
whose global view is shown in Fig. 4. There are 5 Altera FPGA components: 4
implement the 4 nodes and 1 constitutes the general control and man-
agement unit; flash memory implements the Dataflow Graph Description envi-
ronment assigned to each Processing Subsystem; SDRAM implements the Input
and Output Token Transfer environments, which store initial and final token val-
ues. Intermediate token values which must flow from one Processing Subsystem to
another are transferred through the main and neighbor buses.
Fig. 4. CODACS prototype block diagram
4 Performance
To evaluate the merits of D3AS, we have utilized the CODACS demonstrator. The Jacobi
and Gauss-Seidel iterative algorithms have been used to solve the linear equation
system
$Ax = B.$ (1)
These two algorithms constitute a good example of approaching the same problem
in parallel and in sequential mode. They calculate an approximation of the exact
solution:
$x_i = \frac{1}{a_{ii}} \Big( b_i - \sum_{j \neq i} a_{ij} x_j \Big), \quad i = 1, \dots, n.$ (2)
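For reference, a minimal sequential sketch of the two update schemes behind Equation (2) is shown below: the Jacobi update uses only values from the previous iterate, so all x_i can be updated in parallel, while Gauss-Seidel reuses already-updated values within the sweep and is therefore inherently sequential. This is a generic textbook formulation, not the graph description actually run on CODACS.

    #include <vector>
    using Vec = std::vector<double>;
    using Mat = std::vector<Vec>;

    // One Jacobi sweep: every x_i uses only the previous iterate x_old,
    // so all n updates are independent and can run in parallel.
    Vec jacobi_sweep(const Mat& A, const Vec& b, const Vec& x_old) {
        const std::size_t n = b.size();
        Vec x_new(n);
        for (std::size_t i = 0; i < n; ++i) {
            double s = b[i];
            for (std::size_t j = 0; j < n; ++j)
                if (j != i) s -= A[i][j] * x_old[j];
            x_new[i] = s / A[i][i];
        }
        return x_new;
    }

    // One Gauss-Seidel sweep: x_i immediately uses the already-updated
    // x_0..x_{i-1}, so the sweep is inherently sequential.
    void gauss_seidel_sweep(const Mat& A, const Vec& b, Vec& x) {
        const std::size_t n = b.size();
        for (std::size_t i = 0; i < n; ++i) {
            double s = b[i];
            for (std::size_t j = 0; j < n; ++j)
                if (j != i) s -= A[i][j] * x[j];   // mixes new and old values
            x[i] = s / A[i][i];
        }
    }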
To execute the two algorithms, we tailored the corresponding graph description
tables according to the resources of the demonstrator. In this way, we have
obtained the configuration files for different values of n.
The execution time for one iteration is given by:
$T_{iter} = T_{com} + T_{cal}$ (3)
where $T_{iter}$ can be evaluated as a function of $n$. In fact:
$T_{com} = t_{CT} + t_{TT} = (n_{MPFU} \, nb_{MPFU} + n_t \, nb_t) \, t_b \, n_s$ (4)
$T_{cal} = t_{MPFU} \, n_o \, n_s$ (5)
where $t_{CT}$ is the configuration transfer time, $t_{TT}$ the token transfer time, $t_{MPFU}$
the MPFU execution time, $t_b$ the transfer time for a single byte, $n_o$ the number of
sequential elementary operations to update an $x_i$ value, $n_s$ the number of steps
to update all the $x_i$ values, $n_{MPFU}$ the number of clusters to be configured,
$nb_{MPFU}$ the number of bytes to configure a cluster, $n_t$ the number of tokens
transferred in a cycle, and $nb_t$ the number of bytes per token. $t_{MPFU}$, $nb_{MPFU}$, $t_b$,
and $nb_t$ depend on technology and architecture.
The total execution time in the computational engine is given by:
$T_{tot} = n_i \, T_{iter}$ (6)
where $n_i$ (the number of iterations) depends on the goodness of the initial value set.
The other parameters can be determined for each $n$ from the size of the graph descrip-
tion tables and the longest path in the DFG. For instance, it turns out that $n_s = 1$ for
the Jacobi method if $n \le 20$, whereas for the Gauss-Seidel method $n_s = n$, and
$n_o = 6 + \log_2 n$.
In a sequential environment, the time $T_s$ needed to execute an iteration is given
by:
$T_s = k_1 n^2 + k_2 n$ (7)
where $k_1$ and $k_2$ depend on the environment and cannot be exactly evaluated a
priori. However, with a processor power equal to 1.4 Gflops, the minimum values
for $k_1$ and $k_2$ can be set at 20 nsec.
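The timing model of Equations (3)-(7) can be written down directly. The sketch below simply evaluates those formulas for given parameter values; the graph-dependent inputs (n_MPFU, nb_MPFU, n_t, n_s, n_o and the number of iterations) must be supplied by the caller, since in the paper they come from the graph description tables.

    // Parameters of the timing model, Equations (3)-(7); times in seconds.
    struct Params {
        double t_mpfu, t_b;          // MPFU execution time, per-byte transfer time
        int n_mpfu, nb_mpfu;         // clusters to configure, bytes per cluster
        int n_t, nb_t;               // tokens per cycle, bytes per token
        int n_o, n_s;                // elementary ops per x_i, steps per iteration
    };

    double T_com(const Params& p) {                          // Equation (4)
        return (p.n_mpfu * p.nb_mpfu + p.n_t * p.nb_t) * p.t_b * p.n_s;
    }
    double T_cal(const Params& p) {                          // Equation (5)
        return p.t_mpfu * p.n_o * p.n_s;
    }
    double T_iter(const Params& p) {                         // Equation (3)
        return T_com(p) + T_cal(p);
    }
    double T_tot(const Params& p, int n_i) {                 // Equation (6)
        return n_i * T_iter(p);
    }
    double T_seq(int n, double k1 = 20e-9, double k2 = 20e-9) {  // Equation (7)
        return k1 * double(n) * n + k2 * n;                  // n = 64 gives 83.2 usec,
    }                                                        // matching Table 1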
With $t_{MPFU}$ = 30 nsec, $t_b$ = 5 nsec, and $nb_t$ = 4, Table 1 shows the values
of $T_{com}$ and $T_{cal}$ for several values of $n$. Due to the fine grain of the dataflow operations,
most of the time goes into communication. Some performance indices defined
to compare the two methods and evaluate the proposed architecture are shown
in Table 2. In particular, CP is the communication penalty, defined as the ratio
between Equations (4) and (5), and the speedup Sp is the ratio between Equations (7)
and (3).
Table 1. Time evaluation (times expressed in μsec)

                       Gauss-Seidel           Jacobi
  n        T_s         T_com      T_cal       T_com      T_cal
  64       83.2        68.38      22.75       22.16      3.20
  256      1315.8      547.14     238.23      170.56     65.20
  1024     20992.0     7494.25    1436.61     1243.70    612.38
Table 2. Performance

                Gauss-Seidel        Jacobi
  n        CP        Sp         CP        Sp
  64       3.01      0.91       6.92      3.28
  256      2.30      1.68       2.62      5.58
  1024     5.22      2.35       2.03      11.31
5 Concluding remarks
In this paper the D3AS prototype is presented. Its basic building blocks are the
Many-core Dataflow Processors, able to execute dataflow graphs, including loops,
according to the static model hHLDS, using actors with homogeneous I/O condi-
tions so that no control tokens are required. To validate the D3AS prototype design
choices we employed the CODACS demonstrator, which allows the implementation of
the hybrid Communicating Dataflow Processes model. From the hardware point
of view it is based on a Gidel PROC20KE board with 5 FPGAs. As benchmarks
we used the Jacobi and Gauss-Seidel iterative algorithms because they approach
the same problem in parallel and in sequential mode, respectively.
References
1. L. Allison. Lazy dynamic-programming can be eager. Information Processing
Letters, 43(4):207–212, 1992.
2. J.W. Backus. Reduction languages and variable-free programming. Technical Report
RJ-1010, IBM, Yorktown Heights, NY, April 1972.
3. Koen Claessen, Colin Runciman, Olaf Chitil, John Hughes, and Malcolm Wallace.
Testing and Tracing Lazy Functional Programs using QuickCheck and Hat. In
4th Summer School in Advanced Functional Programming, number 2638 in LNCS,
pages 59–99, Oxford, August 2003.
4. D.E. Culler, K.E. Schauser, and T. von Eicken. Two fundamental limits on dataflow
multiprocessing. In Cosnard M., Ebcioglu K., and Gaudiot J.L., editors, Proc.
of the IFIP WG10.3 on Architectures and Compilation Techniques for Fine and
Medium Grain Parallelism, pages 153–164, Orlando, FL, January 20–22, 1993.
North-Holland.
5. K. Asanovic et al. The Landscape of Parallel Computing Research: A View from
Berkeley. Technical Report UCB/EECS-2006-183, EECS University of California
at Berkeley, 2006.
6. GIDEL LTD. PROC20KE board. www.gidel.com, May 1999.
7. R.A. Iannucci. Toward a dataflow/von Neumann hybrid architecture. In Proc. of
the 15th Annual Symposium on Computer Architecture, pages 131–140, Honolulu,
Haw., May/June 1988. IEEE Computer Society.
8. J.R. Kennaway and M.R. Sleep. The language first approach. In F.B. Cham-
bers, D.A. Duce, and G.P. Jones, editors, Distributed Computing, pages 111–123.
Academic Press, 1984.
9. H.W. Loidl, F. Rubio, N. Scaife, K. Hammond, S. Horiguchi, U. Klusik, R. Loogen,
G. J. Michaelson, R. Peña, S. Priebe, Á. J. Rebón, and P. W. Trinder. Comparing parallel
functional languages: Programming and performance. Higher Order and Symbolic
Computation, 16(3):203–251, 2003.
10. G. Tremblay and G.R. Gao. The impact of laziness on parallelism and the limits of
strictness analysis. In A. P. Wim Bohm and John T. Feo, editors, High Performance
Functional Computing, pages 119–133, April 1995.
11. L. Verdoscia and R. Vaccaro. A high-level dataflow system. Computing, 60(4):285–
305, 1998.
12. L. Verdoscia and R. Vaccaro. An adaptive routing algorithm for WK-recursive
topologies. Computing, 63(2):171–184, 1999.
13. Lorenzo Verdoscia. Codacs prototype: A platform-processor for chiara programs.
In IPDPS ’05: Proceedings of the 19th IEEE International Parallel and Distributed
Processing Symposium (IPDPS’05) - Workshop 13, page 255.1, Washington, DC,
USA, 2005. IEEE Computer Society.
14. Y. Wei and J.-L. Gaudiot. Demand-driven interpretation of FP programs on a data-
flow multiprocessor. IEEE Trans. on Computers, pages 946–966, August 1988.
15. Paul G. Whiting and Robert S.V. Pascoe. A history of data-flow languages. IEEE
Annals of the History of Computing, 16(4):38–59, Winter 1994.
Hardware Transactional Memory with Operating System
Support, HTMOS
Sasa Tomic, Adrian Cristal, Osman Unsal, and Mateo Valero
Barcelona Supercomputing Center
Universitat Politècnica de Catalunya
Abstract.
Hardware Transactional Memory (HTM) gives software developers the opportu-
nity to write parallel programs more easily compared to any previous programming method,
and yields better performance than most previous lock-based synchronizations.
Current implementations of HTM perform very well with small transactions. But when a
transaction overflows the cache, these implementations either abort the transaction as un-
suitable for HTM, and let software take over, or revert to some much more inefficient hash-like
in-memory structure, usually located in userspace.
We present a fast, scalable solution that has virtually no limit on transaction size, doesn't
prefer either directory-based coherence or snooping, has low transactional read and write
overhead, works with physical addresses, and doesn't require any changes inside the cache
subsystem.
This paper presents Operating System (OS) and architecture modifications that can leverage
the existing OS virtual memory mechanisms to support unbounded transaction sizes, and a
transaction execution speed that does not decrease when the transaction grows.
1 Introduction
1.1 Motivation
Transactional Memory (TM) is believed to be the key future technology, because of its ability to
give programmers a way to define a segment of code that should execute as only one instruction,
with completely invisible intermediate results until the transaction "commits". This should allow
easier writing of code for CMP shared memory machines. Compared to locks, programs written
with TM support are easier to develop, do not suffer from deadlocks, and yield better performance
than most previous lock-based synchronizations.
Essentially, if one is not using locks or TM and multiple threads start executing the same part
of code, the result might be something that would not be possible if they were executing in serial
fashion, and it would probably be incorrect. This is why synchronizing the access to parts of code
is necessary. TM makes this synchronization process trivial. Most of the time a shared data region
is accessed by a single transaction; in those cases the transaction can commit its updates
to this data region and make the update visible to all other transactions that might access this
shared data region in the future. However, if multiple transactions try to simultaneously access a
shared data region, TM systems recognize this as a conflict and abort one or more transactions
to arrive at a serialized, coherent system state. The TM implementation embeds the data region
inside a transaction, which then becomes an atomic unit of execution. Note that transactions could
be embedded in each other in a nested fashion.
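As a language-level illustration of the contrast drawn here, the sketch below shows the same update written with a conventional lock and with a hypothetical transactional construct; the atomic-block syntax is illustrative only and does not correspond to HTMOS or any particular TM interface.

    #include <mutex>

    struct Account { double balance = 0.0; };
    std::mutex m;

    // Lock-based version: the programmer must pick and order the locks correctly.
    void deposit_locked(Account& a, double amount) {
        std::lock_guard<std::mutex> g(m);
        a.balance += amount;
    }

    // Transactional version (illustrative syntax only, shown as a comment so the
    // file stays standard C++): the region executes as if it were a single
    // instruction; intermediate state is invisible until commit, and conflicting
    // transactions are aborted and retried by the TM system.
    //
    //   void deposit_tm(Account& a, double amount) {
    //       atomic {
    //           a.balance += amount;
    //       }
    //   }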
TM systems can be subdivided into two flavors: Hardware TM (HTM) and Software TM (STM).
HTM systems bind TM implementations to hardware to keep the speculatively updated state and
as such are fast but suffer from resource limitations. Recent hardware-oriented approaches propose
solutions that work for most cases, usually limiting the size of the transaction, in terms of the
size of the write-set (the size of a perfectly associative transactional cache), or the length of the transaction
(until an interrupt or page fault). Those approaches usually use existing cache coherency [1,9,10] or
directory-based coherency [12,14,15] for detecting transactional conflicts. STM systems are much
more flexible, however they are slow. Some researchers have proposed Hybrid TM [4] (HyTM)
implementations to address the resource limitations of HTM, but on the whole those approaches
are very complex to implement. In this work, we propose Hardware Transactional Memory with
Operating System support (HTMOS), which is complexity-effective, potentially performs on the
same order of magnitude as HTM, and is flexible like STM systems. We present a fast, scalable
solution that has virtually no limit on transaction size, does not prefer either directory-based
coherence or snooping, has low transactional read and write overhead, works with physical
addresses, and does not require any changes inside the cache subsystem. Instead, changes are made at
the Operating System (OS) level (in the virtual memory system) and inside the processor (in the
TLB and in the form of additional instructions/functionality).
HTMOS involves modest architectural and OS changes to keep additional copies of a page in
memory for transactional memory. Compared to previous HTM proposals, HTMOS has three
important advantages:
- HTMOS implicitly accommodates large transactions without getting bogged down with com-
plex implementations which decrease performance (as is the case for most HTMs). In fact,
for a 4GB address space, a transaction could be as large as 2GB with HTMOS.
- HTMOS is much more flexible than other HTMs. For example, different versioning schemes
such as in-place update versus lazy update are trivial to implement. Most other HTM proposals
embed a certain versioning scheme in silicon, which makes it very difficult to implement other
alternative schemes.
- HTMOS ensures strong atomicity. Non-transactional loads and stores do not
conflict with transactional ones. More specifically, each non-transactional load or store can be
seen as a single-instruction transaction by the transactional ones.
1.2 Previous work
Almost all current Hardware Transactional Memory implementations assume that transactions are
going to be very small in size. This assumption is based on current kernel applications, written
by expert computer scientists, which tend to be small in overall size and also respect the
fact that transactions cannot be very long. Actually, there are no large commercial applications in
the wild that use transactional memory. Our assumption is that ordinary programmers will try to
use transactions whenever they are not sure whether they should use them, and for as big segments of the
program as they can. Probably, even some database systems will be re-written to support HTM. If
this is all true, then transactions will become very long in total execution time (they will have to suffer
a couple of context switches) and start having very large read and write sets. To the best of our
knowledge, only one recent application suite uses large transactions [4].
Therefore, what we need is a hardware implementation that supports a virtually infinite size of
transactions, that is scalable to many processors, that has fast abort and fast commit, and that
doesn't require a special architecture, like a common bus, a directory, etc. It is very difficult to satisfy
all these requirements; this work is a first step in this direction.
In the current implementations, there are generally two approaches to version management,
lazy and eager, and two for conflict detection, lazy and eager. Two representatives of these are
LogTM [14], from the University of Wisconsin, which has eager conflict detection and eager version
management, and Transactional Memory Coherence and Consistency (TCC) [8], from Stanford University,
which has lazy conflict detection and lazy version management.
TCC stores the entire read and write set in a private, small, per-processor, perfectly associative
transactional cache. All transactional reads and writes are stored in that cache, and the processor
monitors the common bus for writes to its transactional variables. If any write to private
transaction variables is noticed, the processor restarts its transaction. When any processor comes
to the commit stage, it takes a global "commit token" for the commit (nobody else can start committing
until this commit is finished), and starts writing values from the private transactional write-set
to the common bus. Any processor that has a conflict with any of these variables restarts its
transactions. After writing all the values from the write-set, the processor releases the "commit
token". This approach allows the maximal possible concurrency because every processor that comes to
the commit point can commit, assuming that the "commit token" is available. TCC also requires as big
a private transactional cache as possible. If the transactional cache overflows, the processor
immediately requests commit permission, and the transaction is then guaranteed to never roll back. The
requirement for a common bus can be avoided with a directory, as is shown by Chafi et al. [5].
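The commit-token protocol described above can be caricatured in software as follows; this only illustrates the arbitration order (one committer at a time, others restart on conflict) and is not the TCC hardware or its coherence machinery.

    #include <atomic>
    #include <mutex>
    #include <unordered_set>
    #include <utility>
    #include <vector>

    using Addr = const void*;

    std::mutex commit_token;                     // global token: one committer at a time

    struct TccTx {
        std::unordered_set<Addr> read_set;                // addresses speculatively read
        std::vector<std::pair<Addr, int>> write_set;      // buffered speculative writes
        std::atomic<bool> must_restart{false};            // set when a commit conflicts
    };

    // Called for every other processor when a committer broadcasts its write-set.
    void on_remote_commit(TccTx& tx, const std::vector<std::pair<Addr, int>>& writes) {
        for (const auto& w : writes)
            if (tx.read_set.count(w.first)) { tx.must_restart = true; return; }
    }

    // Commit: take the token, broadcast the write-set, then release the token.
    void commit(TccTx& tx, std::vector<TccTx*>& others) {
        std::lock_guard<std::mutex> token(commit_token);
        for (TccTx* o : others) on_remote_commit(*o, tx.write_set);
        // ...here the buffered write-set would be applied to memory...
        tx.read_set.clear();
        tx.write_set.clear();
    }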
Another popular approach is LogTM, which provides very fast commit, virtually in one
cycle. In order to support transaction sizes not limited by the cache size, a directory-based coherency
is used instead. This way, unbounded transaction sizes are supported. All the values are written
"in place" and therefore conflict detection has to be done on every read or write. When a conflict
with some other processor is detected, the processor calls a software handler. Usually, this means
stalling until the transaction/processor that has locked the variable either commits or aborts. In
the current implementation, conflict detection is done with the help of main-memory directory-
based coherency. This could be a problem for the commercial, consumer computers that should be
the main target for transactional memory. Adding additional bits to the main memory for tracking
directory state can be expensive both in terms of hardware cost and in terms of the intellectual
effort needed to design a correct, efficient protocol [11]. Furthermore, it would require throwing away
memory chips with every major processor upgrade, i.e. when the user want