Strict and Relaxed Sieving for Multi-Core Programming
Anton Lokhmotov¹, Alastair Donaldson², Alan Mycroft¹, and Colin Riley²
¹ Computer Laboratory, University of Cambridge, 15 JJ Thomson Avenue, Cambridge, CB3 0FD, UK
² Codeplay Software, 45 York Place, Edinburgh, EH1 3HP, UK
Abstract. In Codeplay's Sieve C++, the programmer can place code inside a "sieve block", thereby instructing the compiler to delay writes to global memory and apply them in order on exit from the block. The semantics of sieve blocks makes code more amenable to automatic parallelisation. However, strictly queueing writes until the end of a sieve block incurs overheads and is typically unnecessary. If the programmer can assert that code within a sieve block does not write to and then subsequently read from a global memory location, the sieve semantics can be relaxed: writes can be executed at any point from their originally scheduled time until the end of the block. We present experimental results demonstrating the benefits of relaxed sieving on an x86 multi-core system.
1 Introduction
We are living in a time of change, where commodity computer systems are becoming
increasingly parallel and heterogeneous. General-purpose processor vendors, such as
Intel and AMD, have shifted their efforts away from boosting the clock frequency and
architectural complexity of single-core processors, concentrating instead on producing
processors consisting of multiple simpler cores.
Another growing trend is to supplement general-purpose “host” processors with
special-purpose co-processors, or accelerators. Co-processors can either be located on
the same chip as the host, or on a different chip (often on a separate plug-in board).
Examples include the Synergistic Processing Unit (SPU) of the Cell processor, as well
as graphics, physics and scientific computing boards. Accelerators comprising tens or hundreds of cores can be dubbed deca- and hecto-core respectively, to distinguish them from the currently offered dual- and quad-core general-purpose processors.
Parallel and heterogeneous systems are fast and efficient in theory but are hard to
program in practice. Unsurprisingly, efficient automatic parallelisation has been a pro-
grammer’s sweet dream for many decades. Ideally, the programmer would like to write
clear and concise code in a familiar, mainstream programming language assuming a
single processor and uniform memory; concentrate on computation rather than on com-
munication; and then sit back and relax, while the compiler automatically distributes
the program across the target system in an efficient and error-free manner.
(Footnote: This author gratefully acknowledges the financial support by a TNK-BP Cambridge Kapitza Scholarship and by an Overseas Research Students Award.)
But the dream is but a dream. Difficulties abound.
Compilers excel at doing mechanical tasks that are either inaccessible to program-
mers in high-level languages (such as register allocation or instruction scheduling) or
too tedious (such as common subexpression elimination or strength reduction). Com-
pilers cannot in general convert a sequential algorithm into a parallel one. Human inge-
nuity is required to invent a new parallel algorithm which can then be presented to the
compiler for optimisation. Expressing an explicitly parallel algorithm in a sequential
language, however, may not be natural.
These problems aside, semantics-preserving re-ordering of a sequential program
requires accurate dependence analysis which is difficult in practice. Mainstream pro-
gramming languages, particularly object-oriented ones, derive from the C/C++ model
in which objects are manipulated by reference, for example, by using pointers. While
such languages allow for efficient implementation on sequential computers, the pos-
sibility of aliasing between references complicates dependence analysis. Sadly, alias
analysis is undecidable in theory and intractable in practice for large programs. This
often precludes parallelisation, even when the programmer “knows” that computation
can proceed in parallel.
The compiler’s failure to notice opportunities for parallelisation which are obvious
to the programmer provides a strong argument against the use of sequential languages.
The programmer, however, can help the compiler by explicitly giving it more infor-
mation about the sequential program than the compiler can extract itself. The grateful
compiler can, in return, generate more efficient parallel code.
1.1 Restricted pointers in C99
In ANSI C99 [1], the programmer can declare a pointer with a restrict qualifier
to assert that the data pointed to by the pointer in a given scope will not be accessed
via any other pointer in that scope. Correct use of restrict has no effect on code
semantics, but may enable certain compiler optimisations which would otherwise be
precluded by the possibility of aliasing; incorrect use causes undefined behaviour.
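As a brief illustration (a hypothetical function of our own, not taken from the paper; C99 spells the qualifier restrict, while many C++ compilers accept __restrict as an extension):

void scale(float * __restrict dst, const float * __restrict src, int n)
{
  // The qualifier asserts that, in this scope, data reached through dst is
  // not also reached through src (and vice versa), so the compiler may
  // reorder or vectorise the loop without worrying about aliasing.
  for (int i = 0; i < n; ++i)
    dst[i] = 2.0f * src[i];
}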
1.2 Sieve blocks in Sieve C++
In Codeplay’s Sieve C++ [2–4], the programmer can place a code fragment inside a
sieve block – a new lexical scope prefixed with the sieve keyword – thereby instructing
the compiler to:
– delay writes to memory locations defined outside of the block (global memory), and
– apply them in order on exit from the block.
We illustrate the sieve semantics using the following code fragment:
float *a; ...
for (int i = 0; i < 8; ++i)
    a[i] = a[i] + a[4];
[Figure: two memory access schedules, offset from a (0–7) on the x-axis versus logical time (0–7) on the y-axis, with the sieve block marked by braces; panel (a) shows the original schedule, panel (b) the schedule with writes delayed to the end of the block.]
Fig. 1. The schedules of memory accesses for the example of §1.2: (a) original schedule; (b) delaying writes.
Fig. 1(a) represents the logical schedule of memory accesses. The x-axis shows the
offset from a; the y-axis shows the logical time. Blue boxes represent reads; red boxes
represent writes. For example, the row marked ‘0’ represents the first iteration of the
loop, during which a[0] and a[4] are read, and a[0] is written. The write happens
after the reads, hence the red box is placed lower than the blue boxes.
Placing this code fragment inside a sieve block:
sieve {
for(int i = 0; i < 8; ++i)
a[i] = a[i] + a[4];
} // writes to a[0:7] happen on exit
changes the schedule to that in Fig. 1(b). The side-effects are collected at the end of the
block, as if the code had been raked from the beginning to the end of the block using a
sieve, which is pervious to reads and impervious to writes (hence the name).
Note that this new schedule results in different values being written to a[5:7]
because the write to a[4] is delayed.
1.3 Declaring, not instructing
The use of sieve blocks eases parallelisation, because the compiler is free to reorder
computation involving reads from global memory. (The order of writes to global mem-
ory can be preserved by recording such writes in a FIFO queue and applying the queue
on exit from the sieve block.)
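To make the queueing mechanism concrete, the following sketch (ordinary C++ written by us for illustration; it is not Codeplay's runtime and all names are hypothetical) records delayed writes in a FIFO queue and applies them, in order, on exit from a block:

#include <cstddef>
#include <cstring>
#include <vector>

// One queued side-effect: a destination in global memory plus a copy of
// the value to be written there (small values only, for brevity).
struct DelayedWrite {
    void*       dst;
    std::size_t size;
    char        buf[16];
};

class SideEffectQueue {
    std::vector<DelayedWrite> queue;   // FIFO: insertion order is preserved
public:
    template <typename T>
    void delay_write(T* dst, const T& value) {   // used instead of "*dst = value"
        static_assert(sizeof(T) <= sizeof(DelayedWrite::buf),
                      "sketch handles small values only");
        DelayedWrite w;
        w.dst  = dst;
        w.size = sizeof(T);
        std::memcpy(w.buf, &value, sizeof(T));
        queue.push_back(w);
    }
    void apply_all() {                 // called on exit from the sieve block
        for (const DelayedWrite& w : queue)
            std::memcpy(w.dst, w.buf, w.size);
        queue.clear();
    }
};

// Hand-written equivalent of the sieve block of §1.2:
//   SideEffectQueue q;
//   for (int i = 0; i < 8; ++i)
//       q.delay_write(&a[i], a[i] + a[4]);   // reads still see old values
//   q.apply_all();                           // delayed writes applied in order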
It is easy to see that the sieve semantics is equivalent to the conventional semantics
if code within a sieve block does not write to and then subsequently read from a global
memory location. (We will also say that such code does not generate true dependences
on global memory.) In other words, for any column in the memory access schedule, no
blue box is located below a red box (unlike in Fig. 1).
In this paper, we advocate that the use of the sieve keyword on a block should be
treated as an assertion that code inside the sieve block generates no true dependences
on global memory, rather than the directive to delay side-effects (as in §1.2). As in the
case of restrict in C99, correct use of sieve will have no effect on code semantics;
incorrect use will cause undefined behaviour.
We further illustrate and contrast use of the restrict and sieve keywords (§2),
and explain that the assertion that code generates no true dependences not only makes
code easier to develop and maintain (§3) but also may reduce the overheads of side-
effect queueing (§4). We conclude by presenting experimental results (§5).
2 Mary Hope and the Delayed Side-Effects
In a free interpretation given in [3], the programmer informs the compiler that code
inside a sieve block generates no true dependences. This seems sensible: why would
the programmer ever want to write a new value into a global memory location and then
read from this location, knowing that the write will be delayed (by his/her own request)
and the read will return the old value of this memory location?
In this section, we illustrate that the programmer may want to exploit the sieve
semantics, but argue that he/she should be discouraged from so doing.
2.1 FIR filter design
Suppose Mary Hope, an expert in digital signal processing, has designed a one-dimensional
mean filter, for which the output at time $i$ is given by the formula:

$$y_i = \frac{1}{k} \sum_{j=0}^{k-1} x_{i+j-\lfloor k/2 \rfloor}, \qquad i = \lfloor k/2 \rfloor, \ldots, n-1-\lfloor k/2 \rfloor,$$

where $\{x_i\}$ and $\{y_i\}$ are, respectively, the input and output sequences (signals), $n$ is
the number of input samples, and $k$ is the number of input samples to compute the
mean over. Since the outputs of this finite impulse response filter can be computed in
parallel, Mary hopes that the compiler will exploit her multi-core computer to speed up
the processing.
2.2 FIR filter implementations
Implementation in C Mary decides to implement the filter in a familiar, efficient and
portable language, i.e. C. She represents the signals as arrays, and passes them into
the filter function in Fig. 2(a). Sadly, arrays in C are passed as pointers. As in most
signal-processing codes, if two arrays look different, they are different. This domain-
specific knowledge, however, is unknown to the compiler, which sees that the function
receives pointers to two memory regions that may overlap. That is, on one iteration an
assignment may write into a memory location that will be read on a subsequent iteration.
This creates a loop-carried data dependence, which prevents the compiler from running
loop iterations in parallel [5]. The compiler has to conservatively preserve the order of
iterations of the outer loop.
Implementation in C99 Mary is aware of this problem and annotates pointers x and
y with the C99 restrict qualifier as in Fig. 2(b), indicating to the compiler that the
input and output memory regions do not overlap. Thus, the compiler can deduce that the
outer loop is free of loop-carried dependences, and hence run its iterations in parallel.
(a) Implementation in C:

void mean1d(float *x, float *y,
            int n, int k)
{
  for (int i = k/2; i < n-k/2; ++i) {
    float sum = 0.0f;
    for (int j = -k/2; j < k-k/2; ++j)
      sum += x[i+j];
    // write to disjoint memory?
    y[i] = sum / (float)k;
  }
}

(b) Implementation in C99:

void mean1d(float *restrict x,
            float *restrict y,
            int n, int k)
{
  for (int i = k/2; i < n-k/2; ++i) {
    float sum = 0.0f;
    for (int j = -k/2; j < k-k/2; ++j)
      sum += x[i+j];
    // write to disjoint memory!
    y[i] = sum / (float)k;
  }
}

(c) Implementation in Sieve C++:

void mean1d(float *x, float *y,
            int n, int k)
{
  sieve {
    for (intitr i(k/2); i < n-k/2; ++i) {
      float sum = 0.0f;
      for (int j = -k/2; j < k-k/2; ++j)
        sum += x[i+j];
      // delay write to disjoint memory
      y[i] = sum / (float)k;
    }
  }
}

(d) Controversial implementation in Sieve C++:

void mean1d(float *x,
            int n, int k)
{
  sieve {
    for (intitr i(k/2); i < n-k/2; ++i) {
      float sum = 0.0f;
      for (int j = -k/2; j < k-k/2; ++j)
        sum += x[i+j];
      // delay write to same memory
      x[i] = sum / (float)k;
    }
  }
}

Fig. 2. Implementations of a one-dimensional mean filter in C, C99 and Sieve C++.
Implementation in Sieve C++ To Mary’s disappointment, her favourite compiler lags
behind the latest hardware trends and only generates single-threaded code, which ef-
fectively exploits instruction and subword-level parallelism, but underutilises Mary’s
multi-core computer. Mary sets out to explore alternatives and finds a paper on Sieve
C++ ([4] or [3]). Mary is pleased to learn that she only needs to enclose the function
body in a sieve block and use an instance of the Sieve C++ iterator class intitr (iterator classes are described in [4]) to
control the outer loop as in Fig. 2(c).
Controversial implementation in Sieve C++ After pondering a bit more on the sieve
semantics, Mary modifies her code: she is happy to discard the inputs after the results
are computed, so she assigns the results to the inputs as in Fig. 2(d). Since within the
sieve block the writes to global memory are delayed, the computation produces exactly
the same results as before, but – as Mary thinks – requires less memory.
In the next section, we will explain that this is not the case, and discourage Mary Hope
from writing code in this way.
3 Strict and relaxed sieving
3.1 Going from sequential to parallel
We may interpret the sieve construct as a means to add parallel semantics to a sequential
language. We draw an analogy with different semantics of vector assignment statements
in early PL/I and Fortran 90. Consider the statement: a[0:7] = a[0:7] + a[4];
One interpretation of this statement is the following loop (cf. PL/I before the ANSI
standard of 1976 [6]):
for(int i = 0; i < 8; ++i)
a[i] = a[i] + a[4];
which may look as if it adds a scalar to a vector, but does not quite: the value of a[4]
changes after five iterations, so one value (the original value of a[4], say t) is added to
the first five elements of the vector and another value (2t) to the rest. So iterations need
to be executed in the given order to ensure correctness.
Another interpretation is that no writes occur until all reads have completed (cf.
Fortran 90 [5]), as in the sieve construct. This can be expressed as:
float t = a[4];
for(int i = 0; i < 8; ++i)
a[i] = a[i] + t;
This loop indeed adds to vector a[0:7] a scalar t, the original value of its fifth
element. Moreover, any order of iterations produces the same result. For this reason, we
believe this interpretation is more natural in the age of parallelism.
Similarly, the sieve construct, which performs writes to global memory only after
all reads and computation have completed, provides a natural way to endorse parallel
semantics over a block of statements.
3.2 Block-based structure of the sieve construct
Sieve blocks generalise single-statement vector assignments in two ways:
– Sieve blocks can define local memory, writes to which are immediate.
– Sieve blocks can be nested.
Sieve blocks have a natural interpretation (“read in, compute, write out”, e.g. via DMA)
on heterogeneous systems having complex memory hierarchies [3]. For example,
ClearSpeed’s CSX [7] is a SIMD array processor consisting of a control unit (CU) core and
96 identical processing element (PE) cores operating in lock-step. The CSX processor
is located on a plug-in board together with large on-board memory. Each PE core has
its own local memory.
Thus, a host workstation equipped with a ClearSpeed board has main memory,
on-board memory, and PE memory. This complexity can be abstracted away by using
nested sieve blocks, as in the following example:
int a = 0; // host memory
sieve {
int b = 0; // on-board memory
sieve {
int c = 0; // PE memory
a++; b++; c++;
print(a,b,c); // prints 0,0,1
}
print(a,b); // prints 0,1
}
print(a); // prints 1
For each hierarchy level, delayed writes to non-local memory are queued. In this exam-
ple, writes to a are queued twice.
3.3 Disadvantages of strict sieving
Strictly following the original sieve semantics (§1.2) in general incurs space and time
overheads. The runtime system needs to maintain a FIFO queue of side-effects to non-
local memory (space overhead) and apply the queue on exit from a sieve block (time
overhead).
Moreover, delayed side-effects may need to be copied more than once. First, this
may happen in the case of nested sieve blocks. Second, this may happen if the queue
does not fit into local memory, in which case the runtime system needs to “spill” the
queue to the previous level of memory hierarchy. On exit from the block, the runtime
will have to apply the queue from the spill location.
Hopefully, the overheads will be compensated for by parallel execution. The com-
piler, however, may have a cost model and decide that it is not worth parallelising a
sieve block. Still, the sieve semantics must be preserved, so the program is likely to
suffer an execution overhead.
All this is clearly not in Mary Hope’s interests.
3.4 Benefits of relaxed sieving
We suggest that Mary should be able to assert explicitly that code she places inside a
sieve block does not generate true dependences on global data.
First, the explicit assertion gives the desired equivalence with the conventional se-
mantics. By analogy, this is as if Mary writes:
float t = a[4];
a[0:7] = a[0:7] + t;
without relying on any specific semantics of vector assignments. The benefit is that code
is easier to reason about, hence write and maintain.
Second, writes to non-local memory can be applied at any moment from their orig-
inal schedule to the end of the block. For example, if the compiler believes that it is not
worth parallelising a sieve block, it can generate sequential code without the overhead
of maintaining the side-effect queue. As we will show in §4, optimisation can reduce
this overhead on a parallel system.
We will say that the assertion that code generates no true dependences allows re-
laxed sieving (by this we emphasise that under this assertion side-effects do not need to
be strictly delayed until the end of the block).
3.5 Analogy in HPF
Interestingly, this distinction between strict and relaxed sieving has an analogy with
the INDEPENDENT directive in High Performance Fortran, which says that the loop it
is attached to is safe to execute in parallel [8]. This directive is essential for loops that
cannot be analysed (for example, if array subscripts are not affine functions of the loop
variables). If the loop has loop-carried dependences, however, running it in parallel may
produce different results from sequential execution.
While this conflicts with the design goal that an HPF program must always produce
the same results whether executing on a parallel system or on a sequential one (for
which the HPF directives are ignored), the standard defines the INDEPENDENT directive
as an assertion, and dictates that programs which violate this assertion do not conform
to the standard.
Similarly, under the assertion that a sieve block generates no true dependences on
global memory the sieve keyword can be ignored when compiling for a sequential
system.
3.6 Undefinedness of relaxed sieving
The ANSI C99 standard [1] specifies undefined behaviour as “behaviour, upon use of
a nonportable or erroneous program construct or erroneous data, for which this In-
ternational Standard imposes no requirements”. An example is the use of expression
++i + ++i, the result of which is compiler dependent.
Undefinedness is usually frowned upon, because it makes it harder to write portable
and reliable programs. Thus, introducing a new language construct with potentially
undefined behaviour upon erroneous use may seem undesirable.
We note, however, that in the case of relaxed sieving the programmer’s assertion
(that code in a sieve block generates no true dependences on global memory) can be
verified at run-time (and used for debugging) by additionally recording executed reads
in the queue and checking that no read from a global memory location is preceded by
a write to the same location. By contrast, it would be harder to verify at run-time an
erroneous use of the restrict keyword.
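A minimal sketch of such a check follows (ordinary C++ written by us for illustration, not part of the Sieve runtime; all names are hypothetical). Every read and write to global memory inside the block would be routed through the checker, which flags any read from a location already written in the block:

#include <cstdio>
#include <unordered_set>

// Debugging aid: detects a violation of the "no true dependences on global
// memory" assertion, i.e. a read from a global location that has already
// been written inside the current sieve block.
class DependenceChecker {
    std::unordered_set<const void*> written;   // global locations written so far
public:
    void on_write(const void* addr) { written.insert(addr); }

    bool on_read(const void* addr) const {     // true if the assertion is violated
        if (written.count(addr) != 0) {
            std::fprintf(stderr,
                         "relaxed sieve assertion violated: read after write at %p\n",
                         addr);
            return true;
        }
        return false;
    }

    void on_block_exit() { written.clear(); }  // delayed writes applied, log reset
};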
4 Optimising relaxed sieving
4.1 Vectorisation
Suppose Mary Hope writes something like:
float *a;
sieve __attribute__((nodep(RAW))) {
float t = a[5];
for(intitr i(0); i < 8; ++i)
a[i] = a[i+1] + t;
}
to inform the compiler that the enclosed code fragment does not generate a true depen-
dence on global data and hence the compiler may relax sieving.
Fig. 3(a) represents the logical schedule of memory accesses. Since no write to a
global memory location is followed by a read from the same location, the writes can
be arbitrarily delayed (and the reads can be arbitrarily advanced) from their original
schedule.
Assuming the architecture supports four-way vector instructions, the compiler may
vectorise code producing:
[Figure: two memory access schedules, offset from a (0–8) on the x-axis versus logical time (0–7) on the y-axis, with the sieve block marked by braces; panel (a) shows the original schedule, panel (b) the four-way vectorised schedule.]
Fig. 3. The schedules of memory accesses for the example of §4.1: (a) original schedule; (b) vector schedule.
float t = a[5];
for (int i = 0; i < 8; i += 4)
a[i:i+3] = a[i+1:i+4] + t;
which has the memory access schedule as in Fig. 3(b), having no side-effect queueing
overheads.
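On an x86 target, the four-way vectorised loop above might be spelled out with SSE intrinsics roughly as follows (a sketch of our own, not generated compiler output; it assumes a points to at least nine valid floats and uses unaligned loads and stores for simplicity):

#include <xmmintrin.h>

void shifted_add(float* a)                      // hypothetical name
{
    const __m128 t = _mm_set1_ps(a[5]);         // hoist and broadcast t = a[5]
    for (int i = 0; i < 8; i += 4) {
        __m128 v = _mm_loadu_ps(&a[i + 1]);     // read a[i+1 : i+4]
        _mm_storeu_ps(&a[i], _mm_add_ps(v, t)); // write a[i : i+3]
    }
}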
4.2 Speculation
Note that if we speculatively distribute the first iteration of the vectorised loop above
to one core and the second iteration to another, we cannot commit the side-effects of
the second iteration (writes to a[4:7]) until the first iteration has read its input data
a[1:4] (otherwise, the antidependence on a[4] is violated). In general, the side-effects
of a fragment [4] need to be held until all its preceding fragments have completed and
committed theirs, which also incurs overheads (although less than for strict sieving).
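As an illustration of this constraint (a sketch of our own in ordinary C++, not Codeplay's implementation), fragments may finish out of order, but a committer applies their buffered side-effects strictly in fragment order:

#include <atomic>
#include <thread>
#include <utility>
#include <vector>

// A speculatively executed fragment buffers its writes privately and sets
// 'done' (with release semantics) once it has finished computing.
struct Fragment {
    std::vector<std::pair<float*, float>> writes;  // (destination, value) pairs
    std::atomic<bool> done{false};
};

// Applies the fragments' side-effects strictly in original program order:
// fragment i commits only after fragments 0..i-1 have committed.
void commit_in_order(Fragment* fragments, int n)
{
    for (int i = 0; i < n; ++i) {
        Fragment& f = fragments[i];
        while (!f.done.load(std::memory_order_acquire))
            std::this_thread::yield();             // hold until fragment i completes
        for (const auto& w : f.writes)
            *w.first = w.second;                   // side-effects land in order
    }
}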
5 Experimental evaluation
We present experimental data averaged over multiple runs on a homogeneous x86 multi-
core system, with two 2GHz quad-core AMD Opteron (Barcelona) processors and 4GB
RAM, running under Windows Server 2003.
5.1 Implementation
We have implemented a prototype extension to Codeplay’s Sieve C++ compiler and
runtime system that allows sieve blocks to be annotated using the syntax of §4 to in-
dicate that writes can be arbitrarily delayed from their original schedule. Speculative
execution [4] of these annotated sieve blocks results in side-effects that are commit-
ted, in order, by the run-time as soon as a speculated fragment is confirmed to simulate
sequential execution.
[Figure: four speedup graphs, one per benchmark, plotting speedup with respect to sequential code (y-axis) against the number of active cores, 1 to 8 (x-axis). Each graph shows four bars per core count: strict sieving (w.r.t. Codeplay sequential), relaxed sieving (w.r.t. Codeplay sequential), OpenMP (w.r.t. Codeplay sequential), and OpenMP (w.r.t. Microsoft sequential). Panels: (a) Mandelbrot Set 8000×8000; (b) Julia Set 2000×2000; (c) Matrix Multiplication 2000×2000; (d) Cyclic Redundancy Check 32M.]
Fig. 4. Speedup w.r.t. sequential code.
5.2 Experimental setup
We use four benchmark programs. The matrix multiplication is performed for square
2000×2000 matrices. The cyclic redundancy check (CRC) is performed on a random
32M (1M = 2^20) word message. The Julia program ray traces a 2000×2000 3D slice of
a 4D quaternion Julia set. The Mandelbrot program calculates an 8000×8000 fragment
of the Mandelbrot set.
The graphs in Fig. 4 show (we used the most aggressive compiler optimisation flags):
– the speedup of Sieve C++ programs over the original (sequential) C++ programs using strict (first bar) and relaxed (second bar) sieving (all code generated by Codeplay's Sieve C++ compiler);
– the speedup of C++ programs with OpenMP directives compiled by Microsoft's C++ compiler (version 14.00.50727.42, shipped with Visual Studio 2005) over the original C++ programs compiled by Codeplay's compiler (third bar) and Microsoft's compiler (fourth bar).
5.3 OpenMP vs. Sieve C++
The third bar shows the performance of OpenMP code relative to that of Sieve C++
code. For example, for the Mandelbrot benchmark [Fig. 4(a)] code generated by Mi-
crosoft’s compiler to run on a single core is over two times slower than code gener-
ated by Codeplay’s compiler. On the other hand, for the Julia benchmark [Fig. 4(b)]
code generated by Microsoft’s compiler is over two times faster than code generated by
Codeplay’s compiler.
The fourth bar shows the scalability of OpenMP code with the number of engaged
cores. For the matrix multiplication [Fig. 4(c)] OpenMP code scales almost linearly
and appreciably better than Sieve C++ code. The CRC benchmark [Fig. 4(d)] behaves
similarly, except that on eight cores performance unexpectedly drops to almost the same
level as on five cores. OpenMP code for the Mandelbrot and Julia
benchmarks scales sublinearly, running on eight cores only four times as fast as sequen-
tial code. In contrast, Sieve C++ code shows consistently good scalability.
Preliminary profiling using AMD’s CodeAnalyst tool reveals no reasons for the
OpenMP performance anomalies, but confirms that all cores are utilised throughout exe-
cution, and shows no significant difference in L2 cache misses between configurations.
5.4 Strict sieving vs. relaxed sieving
Only the Mandelbrot benchmark [Fig. 4(a)] shows an appreciable performance im-
provement of relaxed sieving over strict sieving (up to 11% faster on eight cores), ap-
parently because of the improved temporal locality. When working with large data sets,
delaying side-effects until the end of a sieve block means that the side-effects of a spec-
ulated fragment get displaced from the cache by the side-effects of later fragments and
are brought back to the cache when committing them to global memory. Committing
the side-effects as soon as a speculated fragment is confirmed to simulate sequential
execution reduces this overhead.
Note that Fig. 4 shows that Sieve C++ code implemented with relaxed sieving suf-
fers a (small) performance overhead on a single core. As we have discussed in §3.4, the
compiler might have chosen instead to output sequential code with no sieving, thereby
incurring no overheads.
Fig. 5 shows the memory overhead of sieving (the memory overhead of OpenMP code is
negligible). Note that the more cores are active, the more fragments can be speculated
and the more side-effects need to be held under relaxed sieving (for strict sieving the
overhead is invariant of the number of active cores). We report the (maximum) memory
overhead for eight active cores.
For the Mandelbrot benchmark relaxed sieving requires almost 50 times less mem-
ory overhead than strict sieving. The CRC benchmark performs an XOR reduction of a
random message and generates a single value per fragment, hence creating small mem-
ory overheads in both strict and relaxed sieving.
[Figure: bar chart of memory overhead in MB (0–35 MB scale) for the CRC, Matrix Multiply, Mandelbrot and Julia benchmarks, with one bar for strict sieving and one for relaxed sieving per benchmark; one bar, labelled 680.5, lies off the scale.]
Fig. 5. Memory overhead.
6 Conclusion and future work
We have presented the concepts of strict and relaxed sieving. Automatic parallelisa-
tion based on sieving compares well to parallelisation via OpenMP pragmas. Relaxed
sieving reduces memory overhead, which can also result in performance improvement.
We are implementing relaxed sieving in the Sieve C++ backend for the Cell BE pro-
cessor (previously reported in [4]) and will investigate further optimisations of relaxed
sieving on heterogeneous multi-core platforms.
Acknowledgements
We thank Marcel Beemster for his constructive criticism that led to writing this paper
and Paul Keir for his advice on implementing the benchmark examples in OpenMP.
References

1. American National Standards Institute: ANSI/ISO/IEC 9899-1999: Programming Languages – C (1999)
2. Codeplay: Portable high-performance compilers. http://www.codeplay.com/
3. Lokhmotov, A., Mycroft, A., Richards, A.: Delayed side-effects ease multi-core programming. In: Proc. of the 13th European Conference on Parallel and Distributed Computing (Euro-Par). Volume 4641 of Lecture Notes in Computer Science, Springer (2007) 641–650
4. Donaldson, A., Riley, C., Lokhmotov, A., Cook, A.: Auto-parallelisation of Sieve C++ programs. In: Proc. of the Workshop on Highly Parallel Processing on a Chip (HPPC). Volume 4854 of Lecture Notes in Computer Science, Springer (2007)
5. Allen, R., Kennedy, K.: Optimizing Compilers for Modern Architectures. Morgan Kaufmann, San Francisco (2002)
6. Radin, G.: The early history and characteristics of PL/I. SIGPLAN Not. 13(8) (1978) 227–241
7. ClearSpeed Technology: The CSX architecture. http://www.clearspeed.com/
8. Kennedy, K., Koelbel, C., Zima, H.P.: The rise and fall of High Performance Fortran: an historical object lesson. In: Proc. of the 3rd ACM SIGPLAN History of Programming Languages Conference (HOPL-III), ACM (2007) 1–22