
2D Image Convolution using Three Parallel Programming Models on the Xeon Phi


Abstract

Image convolution is widely used for sharpening, blurring and edge detection. In this paper, we review two common algorithms for convolving a 2D image by a separable kernel (filter). After optimising the naive codes using loop unrolling and SIMD vectorisation, we choose the algorithm with better performance as the baseline for parallelisation. We then compare the parallel performance of the optimised code using OpenMP, OpenCL and GPRM implementations on the Intel Xeon Phi. We also measure the effects of optimisation techniques and demonstrate how they affect both sequential and parallel performance. Apart from comparing the code complexity as well as the performance of the chosen parallel programming models, we investigate the impact of a parallelisation technique, task agglomeration in GPRM.
Ashkan Tousimojarad
University of Glasgow
Wim Vanderbauwhede
University of Glasgow
W Paul Cockshott
University of Glasgow
1. Introduction
Throughput computing applications demand fast response times while dealing with large amounts of data. Image convolution is one such application. Convolution of an image by a matrix of real numbers can be used to sharpen or smooth an image, depending on the matrix used. If A is an image and K is a convolution matrix, then B, the convolved image, is calculated as:
$$B_{y,x} = \sum_{i}\sum_{j} A_{y+i,\,x+j}\, K_{i,j} \qquad (1)$$
If k is a convolution vector, then the corresponding matrix K is such that $K_{i,j} = k_i k_j$.
A separable convolution kernel is a vector of real numbers that
can be decomposed into horizontal and vertical projections and
hence can be applied independently to the rows and columns of
the spatial domain to provide filtering [1]. It is a specialisation of
the more general convolution, but is algorithmically more efficient
to implement.
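The relation $K_{i,j} = k_i k_j$ can be illustrated with a short C++ sketch. The helper name `outerProduct` and the binomial coefficients returned by `exampleKernel` are illustrative assumptions; the actual coefficients of the width-5 kernel used in the benchmark are not given in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Hypothetical helper: expands a 1D kernel k into the 2D matrix K with
// K[i][j] = k[i] * k[j], i.e. the outer product of k with itself.
Matrix outerProduct(const std::vector<float>& k) {
    assert(!k.empty());
    Matrix K(k.size(), std::vector<float>(k.size(), 0.0f));
    for (std::size_t i = 0; i < k.size(); ++i)
        for (std::size_t j = 0; j < k.size(); ++j)
            K[i][j] = k[i] * k[j];
    return K;
}

// Illustrative width-5 binomial approximation of a Gaussian (an assumption,
// not the coefficients used in the paper).
std::vector<float> exampleKernel() {
    return {1.0f / 16, 4.0f / 16, 6.0f / 16, 4.0f / 16, 1.0f / 16};
}
```

Storing the 5 entries of k instead of the 25 entries of K is what allows the rows and the columns to be filtered independently.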
The image convolution algorithms are taken from the real code
used in a stereo matching algorithm. Image convolution and scaling
take up most of the cycles in the stereo matching algorithm. For all
tests, separable kernels of width 5 are used. The algorithm uses 3
colour planes and is heavily memory-fetch bound.
McCool et al. [2] list three desired features for the parallel pro-
gramming models that intend to enable parallelism: I) Performance,
II) Productivity and III) Portability. It should be possible to pre-
dict good performance, tune it, and scale it to larger systems. Pro-
ductivity is not only about expressiveness and composability, but
also about maintainability. Supporting a range of targets and oper-
ating systems is another desirable property, known as portability.
We have considered these three features in choosing the parallel models for this study and we will refer to them throughout the paper.
We aim to explore a number of optimisation and parallelisation
techniques for enhancing the performance of 2D image convolu-
tion on a modern manycore architecture. For this purpose, we have
chosen three parallel programming models from different domains
and for different reasons: OpenMP, the de-facto standard for pro-
gramming shared memory architectures; OpenCL, known for being
portable across multiple platforms; and finally GPRM, a pure task-
based programming framework. It has been reported that GPRM
provides superior performance compared to OpenMP on manycore
architectures [3] [4].
This paper is structured as follows: we start by introducing the Xeon Phi and the three parallel models in sections 2 and 3, followed by the experimental setup in section 4. In section 5, we review the two algorithms used to solve the problem: the single-pass and two-pass algorithms. In this section, we also list different optimisation techniques to convert a naive code to an optimised one. We then discuss the implementation details of the parallel two-pass algorithm in each programming model. Results of parallelising the two-pass algorithm are presented in section 6. In section 7, another version of the single-pass algorithm is considered. We will show that although the sequential two-pass algorithm still outperforms the sequential single-pass algorithm, the parallel performance of the modified single-pass algorithm is better. Finally, we review a few related research studies, and the Conclusion section summarises our attempts at exploring the behaviour of the two algorithms, the pros and cons of the studied parallel models in solving this problem on a manycore architecture, and the effect of the optimisation and parallelisation techniques applied to improve the performance.
2. Hardware Platform: Intel Xeon Phi
Contemporary applications have increased the trend of integrating a
large number of processing cores in order to meet their performance
goals. Most of these applications need single-chip implementation
to satisfy their size and power consumption requirements. Multi-
core and manycore processors have emerged as promising archi-
tectures to benefit from increasing transistor numbers.
The Intel Xeon Phi coprocessor 5110P used in this study is an
SMP (Symmetric Multiprocessor) on-a-chip which is connected
to a host Xeon processor via a PCI Express bus interface. The
Intel Many Integrated Core (MIC) architecture used by the Intel
Xeon Phi coprocessors gives developers the advantage of using
standard, existing programming tools and methods. Our Xeon Phi
comprises 60 cores (240 logical cores) connected by a bidirectional
ring interconnect.
The Xeon Phi coprocessor provides four hardware threads sharing the same physical core and its cache subsystem in order to hide the latency inherent in in-order execution. As a result, the use of at least two threads per core is almost always beneficial [5]. The Xeon Phi has eight memory controllers supporting 2 GDDR5 memory channels each. The clock speed of the cores is 1.053 GHz. Each core has an associated 512 KB L2 cache. Data and instruction L1 caches of 32 KB are also integrated on each core. Another important feature of the Xeon Phi is that each core includes a 512-bit wide SIMD VPU (Vector Processing Unit). The VPU can be used to process 16 single-precision or 8 double-precision elements per clock cycle.
arXiv:1711.09791v1 [cs.DC] 27 Nov 2017
3. Parallel Programming Models
The design of manycore processors is strongly driven by demands
for greater performance at reasonable cost. To make effective use of
the available parallelism in such systems, the parallel programming
model is of great importance. There are several parallel program-
ming models, runtime libraries, and APIs that help developers to
move from sequential to parallel programming. For the purposes of
this study, we have chosen three parallel programming models that
the Xeon Phi supports: OpenMP, OpenCL and GPRM.
3.1 OpenMP
OpenMP is the de-facto standard for shared memory programming, and is based on a set of compiler directives or pragmas, combined with a programming API to specify parallel regions, data scope, synchronisation, and so on. It also supports runtime configuration through environment variables, e.g. OMP_NUM_THREADS to specify the number of threads at runtime. OpenMP is a portable parallel programming approach and is supported for C, C++, and Fortran. It has historically been used for loop-level and regular parallelism through its compiler directives. Since the release of OpenMP 3.0, OpenMP also supports task parallelism [6]. It is now widely used for both task and data parallelism.
The Intel OpenMP runtime library (as opposed to the GNU implementation) allocates a task list per thread for every OpenMP team. Whenever a thread creates a task that cannot be executed immediately, that task is placed into the thread's deque (double-ended queue). A random stealing strategy balances the load [7].
Since OpenMP is a language enhancement, every new construct requires compiler support. Therefore, its functionality is not as extensive as that of library-based models. Moreover, race conditions are a serious problem in OpenMP. Although it provides the user with a high level of abstraction, the onus is still on the programmer to avoid race conditions.
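As a minimal illustration of the kind of race the programmer has to rule out, consider summing the pixels of one plane in parallel. The function name `sumPixels` and the flat `std::vector<float>` layout are assumptions made for this sketch; they are not taken from the benchmark code. Without the `reduction` clause every thread would update `total` concurrently; with it, each thread accumulates a private copy that OpenMP combines at the end of the loop. If the code is compiled without OpenMP support, the pragma is ignored and the loop runs sequentially.

```cpp
#include <cassert>
#include <vector>

// Sums all pixels of one plane. The reduction clause gives every thread a
// private partial sum, which avoids a data race on `total`.
double sumPixels(const std::vector<float>& plane) {
    const long n = static_cast<long>(plane.size());
    double total = 0.0;
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += plane[i];
    }
    return total;
}
```

Dropping the clause still compiles without any diagnostic, which is why the responsibility stays with the programmer.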
3.2 OpenCL
OpenCL [8] is an open standard for heterogeneous architectures. One of OpenCL's main objectives is to increase portability across GPUs, multicore processors, and OS software via its abstract memory and execution model; however, its performance is not always portable across different platforms. It has been suggested that architectural specifics be considered in the algorithm design in order to address this performance portability issue. The use of auto-tuning heuristics could also improve the performance [9].
Although OpenCL is mostly compared against Nvidia’s CUDA
[10], we do not aim to cover discussions about GPUs in this work.
The reason why OpenCL is listed here is firstly because it allows
for sharing of workload between host and device with the same pro-
gram, hence increasing portability and productivity, and secondly
because the studied system, the Xeon Phi, supports the OpenCL
programming model.
3.3 GPRM
The Glasgow Parallel Reduction Machine (GPRM) [4] provides a
task-based approach to manycore programming by structuring pro-
grams into task code, written as C++ classes, and communication
code, written in GPC, a restricted subset of C++. The communica-
tion code describes how the tasks interact. Compiling a GPRM pro-
gram results in an Intermediate Representation (IR) including use-
ful information about tasks, their dependencies as well as the initial
task-to-thread mapping information. GPRM is not as well-known
as the other two approaches, but since it specifically targets many-
core architectures and has shown good performance compared to
OpenMP and other approaches on the Xeon Phi, we have used it to
see how well a “pure” task-based model performs for this benchmark.
The aim of GPRM is to abstract away all the details of threads,
promoting the idea that tasks are the actual computations, and the
threads are only their substrates. What this means in practice is that
the runtime system automatically creates as many threads as the
number of cores. The programmer only decides about the tasks’
details, i.e. the number of tasks (using a cutoff), the task code and
the communication pattern between the tasks.
GPRM promises good performance by combining compile-time (source to intermediate representation) and runtime (stealing) techniques. In other words, compile-time decisions form the initial task distribution and the runtime system adjusts it dynamically. As a result, some threads can be asleep during the execution, which means that the number of active threads is tuned automatically.
An implication of this model is that if the scheduling and/or com-
munication overhead is significant, one should create fewer tasks
than the number of cores.
4. Experimental Setup
OpenMP and GPRM programs are executed natively on the MIC.
The Intel compiler icpc (ICC) 14.0.2 is used with the -mmic,
-O2, -no-offload and -ansi-alias flags for compiling the
programs for native execution on the Xeon Phi. The OpenMP
programs should be compiled with the -openmp flag and the library is required. Unlike OpenMP, for the GPRM
framework there is no shared library to be copied to the MIC.
We also used OpenCL (Intel OpenCL for MIC v1.2) with
-cl-mad-enable and -cl-fast-relaxed-math. As we want to
compare the compute performance on the MIC, the time to transfer
the image between the host and device in the offload mode is not
taken into account.
Threads and tasks are two completely different concepts. How-
ever, in a pure task-based model such as GPRM, concurrency is
only controlled by tasks. Suppose that we want to parallelise a
simple loop on N elements. In OpenMP, using 100 threads sim-
ply means that each thread gets N/100 of the workload (if the static
scheduling is chosen). In GPRM for every program, the number of
tasks should be specified. For a loop, each chunk corresponds to a
task, therefore 100 tasks (i.e. cutoff=100) on the Xeon Phi means
that each thread gets N/100 of the workload. If we choose a cut-
off=480, since there are 240 threads on the Xeon Phi, each thread
gets 2 tasks. GPRM speedup is therefore achieved by changing the
number of tasks rather than the number of threads. In the case of
the Xeon Phi, the number of threads is set to 240 by the GPRM
runtime system.
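The arithmetic behind the cutoff can be made concrete with a small sketch. It assumes that tasks are dealt out to threads round-robin as the initial mapping; this is a simplification for illustration, since the actual GPRM mapping is produced at compile time and then adjusted by work stealing. Both function names are hypothetical.

```cpp
#include <cassert>

// Number of tasks initially mapped to thread `tid` when `cutoff` tasks are
// dealt round-robin over `threads` threads (assumed mapping, see lead-in).
int tasksOnThread(int cutoff, int threads, int tid) {
    assert(cutoff > 0 && threads > 0 && tid >= 0 && tid < threads);
    return cutoff / threads + (tid < cutoff % threads ? 1 : 0);
}

// Fraction of the loop's N iterations handled by thread `tid`, given that
// every task receives an equal chunk of N / cutoff iterations.
double workloadShare(int cutoff, int threads, int tid) {
    return static_cast<double>(tasksOnThread(cutoff, threads, tid)) / cutoff;
}
```

With cutoff=100 on 240 threads, only 100 threads receive a task and each covers N/100 of the loop; with cutoff=480 every thread receives 2 tasks, i.e. 2N/480 = N/240 of the loop.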
All speedup ratios are computed against the running time of the sequential code implemented in C++, which means they are inversely proportional to the absolute running times. There are 3 planes per image, and the benchmark is run 1000 times in order to measure a precise running time on the MIC. Therefore, the time for each image should be considered as running time/1000.
We have considered a Gaussian separable 5×5 kernel and a
range of square images from 1152×1152 to 8748×8748 for the
purposes of this study. We refer to the two implementations of the convolution algorithm as the single-pass and two-pass algorithms (implementations). The single-pass algorithm is the general algorithm
used for convolution, having a nested loop over the kernel, there-
fore comprised of four loops to compute the convolution. The two-
pass algorithm is specific to separable kernels, and uses a horizontal
1D convolution pass followed by a vertical 1D convolution pass to
convolve the image.
The authors of [11] have identified that the peak performance for the two-pass algorithm occurs at 100 threads. Our initial experiments have verified that, for the range of images from 1152×1152 to 8748×8748, 100 could be our magic number for both the OpenMP (optimal number of threads) and GPRM (optimal number of tasks) models. It is worth stating that because the convolution time for each image is so short, the communication overhead becomes significant, and using all of the available resources in the Xeon Phi is not advantageous. This is not the case for OpenCL, where using fewer than the available compute units and processing elements does not improve the performance.
5. Convolution Implementations
The convolution algorithm used in the real-world application only works on the central part of the image that is in sight of multiple cameras, and what happens at the far edges is ignored. Therefore, we can safely ignore the complications at the edges and start the convolution from the first pixel for which the kernel can access the required neighbours, i.e. pixel (2,2).
5.1 Single-Pass and Two-pass Algorithms
In order to solve the 2D convolution problem, the simplest approach is to loop over all the image pixels and all the kernel elements in one go. This algorithm is referred to as the single-pass algorithm in this study. It uses 4 nested loops: the 2 outer loops over the rows and columns of the image and the 2 inner loops over the rows and columns of the kernel. For a 5×5 kernel, according to Eq. 1, it needs 25 multiply-accumulate operations for each pixel.
An alternative comes into play when a separable kernel is used:
the two-pass algorithm. As stated in the Introduction, a separable
kernel can be decomposed into horizontal and vertical projections
and hence can be applied independently to the rows and columns
of the spatial domain. For a 5×5 kernel, it reduces the number of
multiplications per pixel to 10.
From the algorithmic point of view, the two-pass convolution algorithm should always be the preferred one if the kernel is separable. Per pixel, it has O(n) time complexity, while the complexity of the single-pass algorithm is O(n^2), where n is the kernel width.
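The equivalence of the two algorithms for a separable kernel can be checked with a small sequential sketch. The names `singlePass`, `twoPass` and `maxAbsDiff`, and the vector-of-vectors image layout, are assumptions made for illustration and not the benchmark code. For simplicity, the horizontal pass here runs over every row so that the vertical pass always reads valid data, and the result is written to a separate output image.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<float>>;

// Single-pass: 5 x 5 = 25 multiply-accumulates per interior pixel,
// using K[i][j] = k[i] * k[j].
Image singlePass(const Image& A, const std::vector<float>& k) {
    assert(k.size() == 5 && A.size() >= 5 && A[0].size() >= 5);
    const int rows = static_cast<int>(A.size());
    const int cols = static_cast<int>(A[0].size());
    Image B(rows, std::vector<float>(cols, 0.0f));
    for (int y = 2; y < rows - 2; ++y)
        for (int x = 2; x < cols - 2; ++x)
            for (int i = 0; i < 5; ++i)
                for (int j = 0; j < 5; ++j)
                    B[y][x] += A[y + i - 2][x + j - 2] * (k[i] * k[j]);
    return B;
}

// Two-pass: 5 multiplications in the horizontal pass plus 5 in the vertical.
Image twoPass(const Image& A, const std::vector<float>& k) {
    assert(k.size() == 5 && A.size() >= 5 && A[0].size() >= 5);
    const int rows = static_cast<int>(A.size());
    const int cols = static_cast<int>(A[0].size());
    Image T(rows, std::vector<float>(cols, 0.0f));  // horizontal result
    Image B(rows, std::vector<float>(cols, 0.0f));  // final result
    // Horizontal pass over every row, so rows y-2 .. y+2 are valid below.
    for (int y = 0; y < rows; ++y)
        for (int x = 2; x < cols - 2; ++x)
            for (int j = 0; j < 5; ++j)
                T[y][x] += A[y][x + j - 2] * k[j];
    // Vertical pass over the interior rows.
    for (int y = 2; y < rows - 2; ++y)
        for (int x = 2; x < cols - 2; ++x)
            for (int i = 0; i < 5; ++i)
                B[y][x] += T[y + i - 2][x] * k[i];
    return B;
}

// Largest absolute difference between two images of equal size.
float maxAbsDiff(const Image& P, const Image& Q) {
    assert(P.size() == Q.size());
    float worst = 0.0f;
    for (std::size_t y = 0; y < P.size(); ++y)
        for (std::size_t x = 0; x < P[y].size(); ++x)
            worst = std::max(worst, std::fabs(P[y][x] - Q[y][x]));
    return worst;
}
```

Calling `maxAbsDiff(singlePass(A, k), twoPass(A, k))` on a test image should return a value at the level of floating-point rounding, while the two-pass version performs 10 instead of 25 multiplications per pixel.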
5.2 From Naive to Parallelised Optimised Code
We have implicitly mentioned the two-pass algorithm as an opti-
misation for the single-pass algorithm. We discuss in this section
and section 7 that one should be careful about parallel performance
prediction based on the sequential runtime of algorithms. For the
purposes of this section we have chosen the largest 3 images of our
6 image test cases.
There are a number of other optimisations at different levels that could be considered for image convolution. Nevertheless, we do not consider our final optimised code as a Ninja code, i.e. the best optimised solution. The optimisations listed here can be achieved with little programming effort. The resulting optimised code, by
definition should have performance comparable to a Ninja imple-
mentation, with a little effort of algorithmically improving the naive
code (i.e. the compiler-generated code), or by using the latest com-
piler technology for parallelisation and vectorisation [12]. It has
also been reported that the difference between a sequential opti-
mised (similar to our single-pass optimised code) and a sequential
Ninja implementation for a 2D convolution algorithm on the Xeon
Phi is around 1% [13].
Another point to make is that in this study, we are also con-
cerned about parallelisation techniques. An optimised sequential
algorithm that utilises the vector units efficiently is important as the
baseline for parallelisation, and that is why we will apply the fol-
lowing optimisations, but the other aspect of this study is to identify
the pros and cons of parallel programming models in parallelising
the optimised code.
Here, we consider the single-pass algorithm with 4 nested loops
as the naive code. It is important to note that since this algorithm
convolves image array A to B, at the end of the algorithm we copy
back B to A to have the original image convolved. To make sure
that the naive code does not utilise automatic vectorisation, the
code should be compiled with the -no-vec flag (although, the Intel
compiler failed to auto-vectorise our naive code).
Opt-1: Single-pass, Unrolled The first optimisation is loop unrolling. An average (among the 3 images) benefit of 2.5× can be obtained by hand-unrolling the nested loop over the kernel into 25 multiplications. At this stage we change the statement in Eq. 2 inside the kernel nested loop into a sum of 25 products in the form of Eq. 3.
    B[pId][i][j] += A[pId][i+kx-2][j+ky-2] * K[kx][ky];    (2)

    B[pId][i][j] = A[pId][i-2][j-2] * K[0][0] +
                   A[pId][i-2][j-1] * K[0][1] + ... +
                   A[pId][i][j] * K[2][2] + ... +
                   A[pId][i+2][j+2] * K[4][4];              (3)
Opt-2: Single-pass, Unrolled, Vectorised After unrolling the kernel loops, only 2 out of the 4 initial loops remain. Utilising the compiler technology, we can enforce inner loop vectorisation using #pragma simd, which, if memory alias or dependence analysis fails, gives a hint to the compiler that the loop is safe to be vectorised (footnote 1). Vectorisation after unrolling gave us an average speedup of 22× over the baseline.
Opt-3: Two-pass, Unrolled The third optimisation is an algorithmic change due to the fact that the kernel is separable; hence, instead of 25 multiplications for each pixel (as a result of a 5×5 kernel nested loop), we can use a horizontal 1D convolution followed by a vertical 1D convolution. Therefore, the number of multiplications for each pixel becomes 5 + 5 = 10. This optimisation is first
combined with loop unrolling. Each of the two loops to be unrolled
in this case (one in the horizontal pass and the other in the verti-
cal pass) has the size of 5. An average speedup of 5.5×over the
baseline can be obtained at this stage.
Opt-4: Two-pass, Unrolled, Vectorised Repeating the second optimisation on the inner loops of the two-pass algorithm (i.e. those over the image columns), we can now get an average of 47.1× performance gain over the baseline, and so far we have only optimised the sequential code.
Par-1: Single-pass, Unrolled, 100 OpenMP Threads OpenMP provides the simplest way of parallelising the outer loop of the single-pass algorithm. We obtained an average speedup of 191.1× over the baseline.
Par-2: Single-pass, Unrolled, Vectorised, 100 OpenMP Threads On top of the previous optimisation, similar to “Opt-2”, we have enforced vectorisation on the inner loops over the image columns (for both convolution computation and the copy-back operation). Apart from that, the outer loops over the image rows are parallelised using #pragma omp parallel for. An average performance gain of 1268.8× over the baseline has been achieved.
Footnote 1: This always requires extra care, as enforcing SIMD vectorisation in the presence of a vector dependence produces incorrect results.
Par-3: Two-pass, Unrolled, 100 OpenMP Threads The parallelised version of the two-pass algorithm provides an average speedup of 393.7× over the baseline. This is almost 2.1× the speedup of the competing algorithm in “Par-1”.
Par-4: Two-pass, Unrolled, Vectorised, 100 OpenMP Threads The best parallelised vectorised approach has an average speedup of 1611.7×. This is only 1.3× the speedup of the competing algorithm in “Par-2”. This shows that the single-pass algorithm can benefit more from vectorisation when parallelised. This is an important finding, and we will see in section 7 that it helps another version of the single-pass algorithm (without copy-back) to outperform the two-pass algorithm with 100 threads.
The speedup results for all the stages from naive to a parallelised
optimised code are illustrated in Figure 1.
Figure 1: From Naive to Parallelised Optimised code (speedup by image size for stages Opt-0 to Opt-4 and Par-1 to Par-4)
Baseline: single-pass algorithm with copy-back to source
Opt-0: Naive, Single-pass, No-vec
Opt-1: Single-pass, Unrolled, No-vec
Opt-2: Single-pass, Unrolled, SIMD
Opt-3: Two-pass, Unrolled, No-vec
Opt-4: Two-pass, Unrolled, SIMD
Par-1: Single-pass, Unrolled, No-vec, 100 omp threads
Par-2: Single-pass, Unrolled, SIMD, 100 omp threads
Par-3: Two-pass, Unrolled, No-vec, 100 omp threads
Par-4: Two-pass, Unrolled, SIMD, 100 omp threads
5.3 OpenMP Implementation Details
An OpenMP implementation of the image convolution algorithm is
shown in List. 1.
This code corresponds to the last stage of the optimisations,
“Par-4”, as it implements the two-pass algorithm with a horizon-
tal pass followed by a vertical pass; the kernel loop is unrolled,
#pragma simd is used to enforce SIMD vectorisation, and the
outer loop is parallelised.
It is worth stating that #pragma omp parallel for has an
implicit barrier at the end.
    /* 2D convolution on each plane */
    void twoPassConv(float ***A, float ***B, float *k, int planeId, int rows, int cols) {
      // horizontal pass
      #pragma omp parallel for
      for (int i = 2; i < rows-2; i++) {
        #pragma simd
        for (int j = 2; j < cols-2; j++) {
          B[planeId][i][j] =
            A[planeId][i][j-2] * k[0] +
            A[planeId][i][j-1] * k[1] +
            A[planeId][i][j]   * k[2] +
            A[planeId][i][j+1] * k[3] +
            A[planeId][i][j+2] * k[4];
        }
      }
      // vertical pass
      #pragma omp parallel for
      for (int i = 2; i < rows-2; i++) {
        #pragma simd
        for (int j = 2; j < cols-2; j++) {
          A[planeId][i][j] =
            B[planeId][i-2][j] * k[0] +
            B[planeId][i-1][j] * k[1] +
            B[planeId][i][j]   * k[2] +
            B[planeId][i+1][j] * k[3] +
            B[planeId][i+2][j] * k[4];
        }
      }
      return;
    }

    /* calls twoPassConv on each plane */
    void conv(float ***A, float ***B, float *ker, pimage a) {
      #pragma novector
      for (int planeId = 0; planeId < a.planes; planeId++) {
        twoPassConv(A, B, ker, planeId, a.rows, a.cols);
      }
      return;
    }
Listing 1: Two-pass Image Convolution Algorithm, OpenMP
5.4 OpenCL Implementation Details
The concept of threads or tasks does not apply directly in OpenCL, but essentially OpenCL uses a model of compute units, loosely corresponding to hardware threads, and processing elements, which in a GPU map to the “cores” in a streaming multiprocessor, but in the Xeon Phi map to the vector units. However, the most common
programming model in OpenCL is to specify the “global range”,
i.e. the number of invocations of a kernel, and let the runtime
system allocate the threads. If one wants more fine-grained control,
one can specify the “local range” as well, which expresses the
number of processing elements to use per compute unit.
Our approach for creating the OpenCL version from the original version of the code is largely automated. We started from the “Opt-3” version discussed above. We use an annotation to mark the subroutine that will become the OpenCL kernel (i.e. twoPassConv). Our source-to-source compiler (footnote 2) generates the OpenCL API code as well as a single-threaded OpenCL kernel. We then manually optimise the kernel and, if required, the OpenCL API calls. The OpenCL API we use is our own OpenCL wrapper library (footnote 3), which provides convenient OpenCL integration in existing codebases for C, C++ and Fortran.
Deriving the parallel kernel from the generated single-threaded code is mostly a matter of replacing the loops by the OpenCL indexing calls (get_global_id, get_local_id etc.), and in the case where the original code has multiple loops, as is the case for the convolution, we use a conditional statement to identify the portion of the kernel code to be executed on each invocation (List. 2). The host code contains a corresponding loop over the subsequent stages.
Footnote 2: [omitted for blind review]
Footnote 3: [omitted for blind review]
    __kernel void twoPassConv(__global float *A, __global float *B,
                              __global const float *k, __global const int *pass,
                              const int cols, const int rows) {
      const int idx = get_global_id(0);
      const int c = idx % cols;
      const int r = (idx % (rows*cols)) / cols;
      /* 2D convolution on each plane */
      if (*pass == 1) {
        // horizontal pass
        if (c > 1) { if (c < cols-2) {
          A[idx] = B[idx-2] * k[0]
                 + B[idx-1] * k[1]
                 + B[idx]   * k[2]
                 + B[idx+1] * k[3]
                 + B[idx+2] * k[4];
        }}} else if (*pass == 2) {
        // vertical pass
        if (r > 1) { if (r < rows-2) {
          B[idx] =
            A[idx-2*cols] * k[0] +
            A[idx-cols]   * k[1] +
            A[idx]        * k[2] +
            A[idx+cols]   * k[3] +
            A[idx+2*cols] * k[4];
        }}}
    }
Listing 2: Two-pass Image Convolution Algorithm, OpenCL
In order to validate our assumption about the mapping of work items to threads and vector units, we implemented the kernel in two different ways. The first is the straightforward way, where only the global range is specified; it corresponds to the amount of work to be done, i.e. npoints = rows*cols*planes for the convolution. The index into the image array is in this case simply the global index, because the OpenCL kernel uses a 1D representation of the array.
The second is the more controlled approach, where the global range is ngroups*nths, the local range is nths, and each kernel invocation loops over niters = npoints/(ngroups*nths). Here, ngroups is the number of work groups to be used, usually the same as the number of compute units, and nths is the number of work items per work group, usually the same as the number of processing elements. The index into the image array is given by idx = niters*nths*group_id + nths*iter + local_id. The important point here is that the index is contiguous in the local_id, rather than in the loop iterator. In this way the operations over nths work items can be vectorised.
We found that the optimal performance is achieved for ngroups = 236 and nths = 16, which corresponds to 59 MIC cores with 4-way multithreading, i.e. 236 compute units, and 16 elements per SIMD vector (512-bit vectors). Indeed, this is the same performance as achieved with the first, much simpler approach.
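The index formula of the controlled approach can be checked on the host with a short sketch. It emulates the work items sequentially in C++ rather than in OpenCL, and it assumes that npoints is an exact multiple of ngroups*nths; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index formula from the text: contiguous in localId, advancing by nths per
// loop iteration and by niters*nths per work group.
long workItemIndex(long niters, long nths, long groupId, long iter, long localId) {
    assert(niters > 0 && nths > 0 && groupId >= 0);
    assert(iter >= 0 && iter < niters && localId >= 0 && localId < nths);
    return niters * nths * groupId + nths * iter + localId;
}

// Emulates all work items sequentially and counts how often each of the
// ngroups*nths*niters points is visited.
std::vector<int> visitCounts(long ngroups, long nths, long niters) {
    assert(ngroups > 0 && nths > 0 && niters > 0);
    std::vector<int> counts(static_cast<std::size_t>(ngroups * nths * niters), 0);
    for (long g = 0; g < ngroups; ++g)
        for (long l = 0; l < nths; ++l)
            for (long it = 0; it < niters; ++it)
                ++counts[static_cast<std::size_t>(workItemIndex(niters, nths, g, it, l))];
    return counts;
}
```

Every point is visited exactly once, and in a given iteration the nths work items of a group touch nths adjacent array elements, which with nths = 16 matches one 512-bit vector of single-precision values.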
5.5 GPRM Implementation Details
The GPRM implementation of the two-pass algorithm defines the two phases of the algorithm as two different types of tasks. Since all the tasks defined in the GPC code will be executed in parallel, a seq pragma is required to run the two phases sequentially. Each phase uses a partial continuous for, par_cont_for [3], in order to parallelise the outer loop over rows, and a #pragma simd (footnote 4) to help the compiler vectorise the inner loop over columns. par_cont_for is a sequential for loop that works as follows:
In GPRM, multiple instances of the same task are generated (specified by CUTOFF in List. 3), each with a different index (similar to the global_id in OpenCL). Each of these tasks calls par_cont_for, passing its own index to specify which parts of the work should be performed by its host thread.
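The behaviour described above can be sketched as a plain C++ function. This is a hypothetical stand-in, not the GPRM API: the name `parContForSketch`, its signature, and the ceiling-division chunking are assumptions chosen to illustrate how a task index can select one contiguous slice of the iteration range.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>

// Hypothetical stand-in for a "partial continuous for": task `ind` out of
// `cutoff` tasks runs `body` sequentially over its own contiguous chunk of
// the half-open range [start, end).
void parContForSketch(int start, int end, int ind, int cutoff,
                      const std::function<void(int)>& body) {
    assert(cutoff > 0 && ind >= 0 && ind < cutoff && start <= end);
    const int total = end - start;
    const int chunk = (total + cutoff - 1) / cutoff;  // ceiling division
    const int lo = std::min(end, start + ind * chunk);
    const int hi = std::min(end, lo + chunk);
    for (int i = lo; i < hi; ++i) {
        body(i);
    }
}
```

Calling it once for every `ind` from 0 to `cutoff - 1` covers each row exactly once; when the range does not divide evenly, the last few tasks receive a shorter or empty chunk.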
    #include "Conv.h"

    void Conv::horizPass(ind, CUTOFF, ...) {
      par_cont_for(2, rows-2, ind, CUTOFF, this, &Conv::horizPassInnerLoop, ...);
    }

    void Conv::vertPass(ind, CUTOFF, ...) {
      par_cont_for(2, rows-2, ind, CUTOFF, this, &Conv::vertPassInnerLoop, ...);
    }
Listing 3: Two-pass Image Convolution Algorithm, GPRM Task
    #include "GPRM/Task/ConvTask.h"

    void horizontalTasks(const int CUTOFF, ...) {
      /* GPC for with parallel evaluation */
      for (int ind = 0; ind < CUTOFF; ind++) {
        horizPass(ind, CUTOFF, ...);
      }
    }

    void verticalTasks(const int CUTOFF, ...) {
      /* GPC for with parallel evaluation */
      for (int ind = 0; ind < CUTOFF; ind++) {
        vertPass(ind, CUTOFF, ...);
      }
    }

    void GPRM::ConvTask::twoPassConv(...) {
      #pragma gprm seq
      {
        horizontalTasks(100, ...);
        verticalTasks(100, ...);
      }
    }
Listing 4: Two-pass Image Convolution Algorithm, GPC Code
6. Parallel Performance of the Two-pass Algorithm
The focus of this section is on the parallel performance of the three
implementations of the two-pass algorithm.
We start by disabling the vectorisation in the Xeon Phi. The
results for the parallelised non-vectorised cases are compared with
the vectorised ones in Table 1. In order to disable vectorisation
for OpenMP and GPRM, the code should be compiled with the
-no-vec flag. In OpenCL, there is no explicit vectorisation option,
but we can effectively disable vectorisation by using only a single
processing element per compute unit.
Footnote 4: Unlike OpenMP, in this case the use of #pragma simd for the innermost loop in the GPRM implementation is optional.
Table 1: The effect of vectorisation on the parallel performance of the two-pass algorithm. Running times are in ms; values in parentheses are the speedup of the SIMD version over the corresponding no-vec version.

Image size   OpenMP no-vec   OpenCL no-vec   GPRM no-vec   OpenMP SIMD    OpenCL SIMD    GPRM SIMD
1152x1152    3.9             5.4             27.2          0.8 (4.9×)     2.0 (2.7×)     26.1 (1.0×)
1728x1728    8.5             12.3            32.8          2.0 (4.2×)     3.8 (3.2×)     26.6 (1.2×)
2592x2592    16.7            26.9            40.5          4.1 (4.1×)     7.8 (3.4×)     27.8 (1.5×)
3888x3888    39.9            61.6            60.4          8.8 (4.5×)     16.5 (3.7×)    32.5 (1.9×)
5832x5832    86.7            146.2           105.8         19.6 (4.4×)    38.1 (3.8×)    36.8 (2.9×)
8748x8748    195.4           334             216.9         59.2 (3.3×)    91.5 (3.6×)    60.1 (3.6×)
The average speedup obtained through vectorisation for the
OpenMP code is about 4.2×. It is important to note that this
speedup for the sequential code was almost twice as much (8.6×).
Therefore, the reported performance gain is specific to the case with
100 threads and should not be generalised.
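The 4.2× figure can be checked against the OpenMP columns of Table 1. The sketch below assumes the average is the arithmetic mean of the per-image ratios (a ratio of summed times would give a different value); the function name `meanSpeedup` is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Arithmetic mean of the element-wise ratios before[i] / after[i].
double meanSpeedup(const std::vector<double>& before,
                   const std::vector<double>& after) {
    assert(before.size() == after.size() && !before.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < before.size(); ++i) {
        sum += before[i] / after[i];
    }
    return sum / static_cast<double>(before.size());
}

// OpenMP columns of Table 1 (ms), from the smallest to the largest image.
const std::vector<double> ompNoVec = {3.9, 8.5, 16.7, 39.9, 86.7, 195.4};
const std::vector<double> ompSimd = {0.8, 2.0, 4.1, 8.8, 19.6, 59.2};
```

Calling `meanSpeedup(ompNoVec, ompSimd)` gives a value of about 4.24, in line with the reported average.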
It is worth noting that the speedup due to vectorisation in GPRM
is much less pronounced, mostly due to the higher overhead of the
GPRM runtime for smaller images. The same is true for OpenCL,
but to a lesser extent. On average, the speedup obtained through
vectorisation for the OpenCL code is about 3.5×. Clearly, the
OpenMP vectorisation is more efficient, and this is a large factor in
the lower performance of OpenCL.
Figure 2 shows the speedup of the two-pass algorithm against
its optimised sequential implementation (i.e. version “Opt-4”). So
far, the algorithm is parallelised over each plane of size R×C, hence
the label R×C in Fig. 2. This means that for 3 colour planes, the
parallelised code is executed 3 times sequentially (see footnote 5).
Figure 2: Speedup of the Vectorised Two-pass Algorithm, R×C (x-axis: image size, 1152x1152 to 8748x8748)
It is possible to inspect the difference between OpenMP and
GPRM in more detail. Since GPRM provides a modular methodology
for expressing tasks and defining the communication patterns
between them, we can create empty tasks and measure the overhead
of distributing them across different threads and of the parallel
reduction. In other words, it is possible to measure the overhead
of communication between tiles in GPRM. If we deduct this
overhead from the total running time, we obtain the time
spent on the actual computation inside the framework. The
GPRM-compute time shown in Table 2 is obtained by deducting the
constant communication overhead of 25.5 ms from the total
execution time. We will discuss how this separation helps to find
a better task decomposition and hence better performance. Since the
OpenMP execution model is different from that of GPRM, we cannot
similarly separate the computation from the communication in
OpenMP. OpenCL also allows empty kernels to be run to study the
overhead; we found that the overhead is between 0.25 and 0.4 ms,
which is a small component of the total running time for all but the
smallest image size.

Footnote 5: Actually it is 3000 times, considering that we run the code 1000 times.
Table 2: Running time (ms) per image for the two-pass algorithm
Image Size  OpenMP  OpenCL  GPRM-total  OpenCL-compute  GPRM-compute
1152x1152   0.8     2.0     26.1        1.8             0.6
1728x1728   2.0     3.8     26.6        3.6             1.1
2592x2592   4.1     7.8     27.8        7.5             2.3
3888x3888   8.8     16.5    32.5        16.2            7.0
5832x5832   19.6    38.1    36.8        37.7            11.3
8748x8748   59.2    91.5    60.1        91.0            34.6
As a solution to mitigate the GPRM overhead, we have used
task agglomeration: combining tasks into larger tasks to improve
performance [14]. We therefore consider images with a width of
3 times the width of the original images, meaning that each row
includes information for all 3 colour planes. This way, we include
the 3 colour planes in the parallelisation. Using this technique,
the size of the tasks in GPRM is tripled and the overhead is reduced
to one third (8.5 ms per image). The speedup results for this
case, which we call 3R×C, are shown in Fig. 3. As expected, this
technique does not have a similarly significant impact on the OpenMP
Figure 3: Speedup of the Vectorised Two-pass Algorithm, 3R×C (x-axis: image size, 1152x1152 to 8748x8748)
Using task agglomeration, GPRM achieves better performance
than OpenCL for the three larger images, and the best performance
among all for the largest image.
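The effect of agglomeration on the GPRM totals can be estimated from Table 2. The sketch below assumes that the GPRM-compute time is unchanged by agglomeration and that only the constant overhead drops from 25.5 ms to 8.5 ms per image. Both this assumption and the name `projectedTotal` are illustrative, and the resulting numbers are estimates, not measurements from the paper.

```cpp
#include <cassert>

// Constant GPRM overheads per image reported in the text (ms).
constexpr double kOverheadRxC = 25.5;  // planes processed one after another
constexpr double kOverhead3RxC = 8.5;  // three planes agglomerated into one

// Estimated GPRM time (ms) after agglomeration, given a measured R×C total.
double projectedTotal(double measuredTotalMs) {
    assert(measuredTotalMs >= kOverheadRxC);
    const double compute = measuredTotalMs - kOverheadRxC;  // GPRM-compute
    return compute + kOverhead3RxC;
}
```

For the 8748x8748 image, `projectedTotal(60.1)` gives about 43.1 ms, below the 59.2 ms listed for OpenMP in Table 2; for the 2592x2592 image, `projectedTotal(27.8)` gives about 10.8 ms, above the 7.8 ms listed for OpenCL. Under the stated assumptions, this matches the ordering described above.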
7. Reconsidering the Single-pass Algorithm
In order to compare the single-pass and two-pass algorithms, it is
important to note that the two-pass algorithm uses an auxiliary
array to store the result of the first pass. In the second pass, it
uses the auxiliary array as the source and the original image as the
destination, thus at the end of the algorithm, the original image will
be replaced by the convolved one. It is convenient that the input
and output images can use the same array, but it comes at a price:
two assignment operations rather than one for every pixel. In order
to have a fair comparison, we expected the same from the single-
pass algorithm, i.e. overwriting the original image. This means that
although the single-pass algorithm can produce the result on an
output image by assigning new values for all the pixels only once, it
now needs to copy the convolved values back to the original image.
This copy-back operation constitutes a considerable extra overhead
and sometimes is not needed, e.g. when working with the
Xeon Phi as a co-processor. Suppose one runs some complex code
on the Xeon CPU and offloads the computation of the convolution
to the Xeon Phi, i.e. the typical model for OpenCL. In that
model, there are host-to-device and device-to-host copy costs. If
one copies an image array A to the MIC, convolves it into an array
B and copies that back to the host, there is of course no need
to copy the data back to the original array on the device itself.
Consequently, we have also tested the single-pass code without the
final “copy back to the original image” operation. We have measured
the results again only for the three larger images.
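The difference between the two algorithms, and the role of the copy-back, can be sketched as follows. This is a simplified, unoptimised illustration and not the code measured in the paper: the names `twoPass` and `singlePass`, the row-major `Image` type, and the choice to leave border pixels untouched are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Image = std::vector<float>;  // row-major storage, rows * cols pixels

// Two-pass: horizontal pass into `aux`, then vertical pass back into `img`.
// Pixels closer than h to an edge are left unchanged in this sketch.
void twoPass(Image& img, int rows, int cols, const std::vector<float>& k) {
    assert(k.size() % 2 == 1);
    assert(img.size() == static_cast<std::size_t>(rows) * cols);
    const int h = static_cast<int>(k.size()) / 2;
    Image aux = img;  // auxiliary array holding the result of the first pass
    for (int r = 0; r < rows; ++r)
        for (int c = h; c < cols - h; ++c) {
            float s = 0.0f;
            for (int j = -h; j <= h; ++j) s += k[j + h] * img[r * cols + c + j];
            aux[r * cols + c] = s;  // first assignment per pixel
        }
    for (int r = h; r < rows - h; ++r)
        for (int c = h; c < cols - h; ++c) {
            float s = 0.0f;
            for (int i = -h; i <= h; ++i) s += k[i + h] * aux[(r + i) * cols + c];
            img[r * cols + c] = s;  // second assignment overwrites the source
        }
}

// Single-pass: the 2D kernel k[i] * k[j] is applied directly into `out`.
// With copyBack set, the interior of `out` is copied over the source image.
void singlePass(Image& img, Image& out, int rows, int cols,
                const std::vector<float>& k, bool copyBack) {
    assert(k.size() % 2 == 1);
    assert(img.size() == static_cast<std::size_t>(rows) * cols);
    const int h = static_cast<int>(k.size()) / 2;
    out.resize(img.size());
    for (int r = h; r < rows - h; ++r)
        for (int c = h; c < cols - h; ++c) {
            float s = 0.0f;
            for (int i = -h; i <= h; ++i)
                for (int j = -h; j <= h; ++j)
                    s += k[i + h] * k[j + h] * img[(r + i) * cols + c + j];
            out[r * cols + c] = s;  // the only assignment needed per pixel
        }
    if (copyBack)  // extra pass, only needed to overwrite the source image
        for (int r = h; r < rows - h; ++r)
            for (int c = h; c < cols - h; ++c)
                img[r * cols + c] = out[r * cols + c];
}
```

For a separable kernel both functions produce the same interior pixels. The two-pass version does fewer multiplications per pixel (2n instead of n² for an n-tap kernel) but assigns every pixel twice, whereas the single-pass version with `copyBack` set to false assigns every pixel once and leaves the result in `out`.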
            
Figure 4: From Naive to Parallelised Optimised code
Baseline: single-pass algorithm without copy-back to source
Opt-0: Naive, Single-pass, No-vec
Opt-1: Single-pass, Unrolled, No-vec
Opt-2: Single-pass, Unrolled, SIMD
Opt-3: Two-pass, Unrolled, No-vec
Opt-4: Two-pass, Unrolled, SIMD
Par-1 : Single-pass, Unrolled, No-vec, 100 omp threads
Par-2 : Single-pass, Unrolled, SIMD, 100 omp threads
Par-3 : Two-pass, Unrolled, No-vec, 100 omp threads
Par-4 : Two-pass, Unrolled, SIMD, 100 omp threads
Par-5 : Single-pass, Unrolled, No-vec, 100 GPRM tasks, 3R×C
Par-6 : Single-pass, Unrolled, SIMD, 100 GPRM tasks, 3R×C
Par-7 : Single-pass, Unrolled, SIMD, OpenCL
Par-8 : Two-pass, Unrolled, SIMD, OpenCL
After unrolling the kernel loop(s), for both non-vectorised and
vectorised approaches, the results were as expected, i.e. the
two-pass algorithm had much better performance than the single-pass
algorithm (1.6× to 1.9×).
It is worth mentioning that in some cases, e.g. for 3888×3888
images, the performance of the optimised single-pass algorithm
with OpenMP could be improved by up to 15% (10% on average
for the largest three images) by tuning the number of threads, e.g.
with 120 threads. However, since 100 threads or tasks are used for
the C++-based approaches and we do not intend to compare multiple
configurations, we stick to this number.
Figure 4 shows that although the average sequential performance
of the optimised two-pass code is 1.6× better than the average
sequential performance of the optimised single-pass code
(without copy-back), the average parallel performance of the
optimised single-pass code (using OpenMP) is 1.2× better than that
of the optimised two-pass code. The reason can also be extracted
from Fig. 4: better utilisation of the vector units by the parallel
single-pass code (9.4× its parallel non-vectorised version) compared
to the parallel two-pass code (4.1× its parallel non-vectorised
version).
Since GPRM had shown good performance for the largest
array when we included parallelisation over planes in the tasks
(the 3R×C case), its results have been added to Fig. 4. As expected,
it produced the best result for the 8748×8748 image, using the
optimised single-pass algorithm with no copy-back. Its speedup
over the baseline naive code is 1850× with 100 tasks.
The results of the OpenCL kernel for the single-pass implementation
are on average about 50% slower than for the two-pass implementation,
and although the two-pass version still achieves over
1000× speedup over the baseline, it remains the worst of the three.
As the best result amongst all, we have been able to get up to
1970× speedup (for the 5832×5832 image) over the sequential
naive implementation of the algorithm, using only compiler
technology, a few algorithmic changes, and parallelisation (using
OpenMP). Also, a 2160× speedup over the baseline has been
observed with 120 OpenMP threads for 5832×5832 matrices with
the single-pass, no-copy approach.
8. Related Work
A similar 5×5 spatial kernel (filter) has been the focus of a number
of research papers [13] [12] [11] [15].
Petersen et al. [13] ported a subset of C benchmarks to Haskell
and measured their performance on parallel machines, including
the Xeon Phi. Considering three classes of naive, optimised, and
Ninja C implementations [12], our implementation of the image
convolution algorithm is classified as the optimised code, utilising
loop unrolling and SIMD vectorisation.
The reported Ninja gap for the Intel Labs Haskell Research
Compiler (HRC) for 8192×8192 images on the Xeon Phi using
the single-pass algorithm is 3.7× (for 57 threads) [13]. The authors
have disabled multithreading on the Xeon Phi, which is essentially
different from hyper-threading on the Xeon processors [5].
Authors in [11] explored this further and found that the
peak performance can be achieved with 100 threads. They have
also reported that the performance gap between the Vector Pascal
[11] and an optimised OpenMP implementation of the two-pass
algorithm with 100 threads is almost 6.4×.
Authors in [12] also focused on the optimisation techniques for
parallel applications, using both advancements in compiler tech-
nology and algorithmic techniques to bring down the Ninja perfor-
mance gap for throughput computing benchmarks, one of which is
the single-pass implementation of the convolution algorithm.
Tian et al. [15] focused on efficient utilisation of the SIMD vector
units on the Xeon Phi and proposed a number of effective techniques
to improve the performance of parallel programs, including a
single-pass image convolution. They have reported a speedup of
2000× using their vectorisation techniques along with parallelisation.
We have also observed a speedup of about 1970× (2160×
with 120 threads) without using any particular vectorisation technique.
However, we have also highlighted the importance of the
Xeon Phi vector units, especially their impact on parallel performance.
A similar study has been conducted using the TILEPro64
platform [16]. On the 64-core TILEPro64, GPRM outperformed
OpenMP in all cases.
9. Discussion
The OpenCL implementation of the two-pass convolution performs
worse than the OpenMP one by a factor of about two. This is ac-
tually not all that surprising because native OpenMP has very lit-
tle overhead in its use of the kernel threads on the MIC, whereas
OpenCL requires a runtime system for scheduling work on the
threads. Furthermore, OpenCL has no explicit vectorisation con-
trol, so achieving good vectorisation is harder than with the prag-
mas used in the OpenMP code, as shown by our results.
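The pragma-based control referred to here can be illustrated with a horizontal 5-tap pass whose kernel loop is unrolled by hand. This is a sketch under assumptions, not the code measured in the paper: the function name `horizPassOmp` and its signature are hypothetical, and the standard OpenMP 4.0 `#pragma omp simd` is used as a portable stand-in for the Intel-specific `#pragma simd` mentioned in footnote 4.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Horizontal pass for a 5-tap kernel with the kernel loop fully unrolled.
// `omp parallel for` distributes rows across threads; `omp simd` asks the
// compiler to vectorise the loop over columns. Border columns are skipped.
void horizPassOmp(const std::vector<float>& src, std::vector<float>& dst,
                  int rows, int cols, const float k[5]) {
    assert(rows > 0 && cols >= 5);
    assert(src.size() == static_cast<std::size_t>(rows) * cols);
    assert(dst.size() == src.size());
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r) {
        const float* in = &src[static_cast<std::size_t>(r) * cols];
        float* out = &dst[static_cast<std::size_t>(r) * cols];
        #pragma omp simd
        for (int c = 2; c < cols - 2; ++c) {
            out[c] = k[0] * in[c - 2] + k[1] * in[c - 1] + k[2] * in[c]
                   + k[3] * in[c + 1] + k[4] * in[c + 2];
        }
    }
}
```

The number of threads is left to the OpenMP runtime here; it can be set, for example, through the `OMP_NUM_THREADS` environment variable to match the 100 threads used in the experiments. If the code is compiled without OpenMP support, the pragmas are ignored and the function runs sequentially.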
The GPRM model also has a fixed overhead (tens of milliseconds
for hundreds of tasks) due to task creation and communication.
By including the 3 image planes in the parallelisation (similar
to the initial OpenCL approach), we reduced the overhead to one
third and thus the GPRM implementation achieved the best performance
for the largest image. We therefore conclude that GPRM is
not suitable for handling small tasks, but as soon as the tasks become
large enough to cover the scheduling overhead, it shows good
performance compared to the competing models.
GPRM naturally fits algorithms with task (functional) decom-
position. It has its own complications though when it comes to do-
main decomposition, as it requires restructuring certain parts of the
program to fit the GPRM structure.
We therefore conclude that for the studied algorithm, 2D image
convolution, OpenMP is the most productive approach, followed
by OpenCL and then GPRM. In general, GPRM is attractive for
task-based programming, but for algorithms like the convolution, it
offers few advantages in terms of ease of programming. However,
GPRM facilitates modular design, which is key to improving
productivity [17].
In terms of performance, OpenMP is the winning model, except
for very large images where GPRM shows better performance after
using task agglomeration. GPRM also outperformed OpenCL for
the three largest images, but OpenCL had better performance for
the three smaller images.
In terms of portability, GPRM and OpenMP only support con-
ventional multicore platforms and the MIC, and require the com-
mercial Intel compiler for the MIC. OpenCL has the advantage
of supporting many platforms including GPUs and FPGAs, and
its absolute performance is probably good enough for most pur-
poses. Furthermore, using our described approach for porting appli-
cations, OpenCL programming becomes actually quite easy (as is
clear from comparing the OpenCL kernel with the OpenMP code).
It can also be noted that to program the Xeon Phi with OpenCL, one
does not require the commercial Intel compiler, and the OpenCL
SDK is free, so it is a cost-effective solution.
10. Conclusion
In this study, we have chosen three very different parallel program-
ming models supported by the Xeon Phi (OpenMP, OpenCL and
GPRM) to solve a 2D image convolution problem over a test set of
6 square images, ranging from 1152×1152 to 8748×8748. For a
separable convolution kernel, two different algorithms can be considered:
Single-pass, which requires only a single assignment instead
of two, but needs an additional copy if the result is required
in the original array, and Two-pass, which requires fewer computations
and returns the result in the original array. After creating
optimised versions of both algorithms, we found that the choice between
these algorithms depends on which version of the single-pass
algorithm is required: if the result has to be copied back to the original
image, then the two-pass algorithm is always better. Otherwise,
the single-pass algorithm can provide better parallel performance,
even though its sequential performance is still worse.
We have explored a number of optimisation and parallelisation
techniques on the Xeon Phi which helped us achieve a speedup near
2000× over the baseline, but none of these techniques requires a
major rewrite of the original code. The optimisation techniques include
loop unrolling, vectorisation, and an algorithmic change from
single-pass to two-pass or vice versa. Task agglomeration is also used as
a parallelisation technique to improve the performance of GPRM.
Although the OpenMP implementation showed the best over-
all performance on the Xeon Phi, the task-based GPRM model
achieved better performance for large images, and although the
OpenCL performance was not as good, it is still impressive, and
thus a good choice if the Intel compiler is not available.
References
[1] S. Smith, Digital Signal Processing: A Practical Guide for Engineers and Scientists. Newnes, 2013.
[2] M. McCool, J. Reinders, and A. Robison, Structured parallel programming: patterns for efficient computation. Elsevier, 2012.
[3] A. Tousimojarad and W. Vanderbauwhede, “A parallel task-based approach to linear algebra,” in Parallel and Distributed Computing (ISPDC), 2014 IEEE 13th International Symposium on. IEEE, 2014, pp. 59–66.
[4] A. Tousimojarad and W. Vanderbauwhede, “Steal Locally, Share Globally: A strategy for multiprogramming in the manycore era,” International Journal of Parallel Programming, vol. 43, no. 5, pp. 894–917, 2015.
[5] J. Jeffers and J. Reinders, Intel Xeon Phi Coprocessor High Performance Programming. Newnes, 2013.
[6] E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang, “The design of openmp tasks,” Parallel and Distributed Systems, IEEE Transactions on, vol. 20, no. 3, pp. 404–418, 2009.
[7] J. Clet-Ortega, P. Carribault, and M. Pérache, “Evaluation of openmp task scheduling algorithms for large numa architectures,” in Euro-Par 2014 Parallel Processing. Springer, 2014, pp. 596–607.
[8] J. E. Stone, D. Gohara, and G. Shi, “Opencl: A parallel programming standard for heterogeneous computing systems,” Computing in science & engineering, vol. 12, no. 3, p. 66, 2010.
[9] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra, “From cuda to opencl: Towards a performance-portable solution for multi-platform gpu programming,” Parallel Computing, vol. 38, no. 8, pp. 391–407, 2012.
[10] J. Sanders and E. Kandrot, CUDA by example: an introduction to general-purpose GPU programming. Addison-Wesley Professional.
[11] M. Chimeh, P. Cockshott, S. B. Oehler, A. Tousimojarad, and T. Xu, “Compiling vector pascal to the xeonphi,” Concurrency and Computation: Practice and Experience, 2015.
[12] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey, “Can traditional programming bridge the ninja performance gap for parallel computing applications?” Communications of the ACM, vol. 58, no. 5, pp. 77–86, 2015.
[13] L. Petersen, T. A. Anderson, H. Liu, and N. Glew, “Measuring the haskell gap,” in Proceedings of the 25th symposium on Implementation and Application of Functional Languages. ACM, 2013, p. 61.
[14] I. Foster, Designing and building parallel programs. Addison Wesley Publishing Company, 1995.
[15] X. Tian, H. Saito, S. V. Preis, E. N. Garcia, S. S. Kozhukhov, M. Masten, A. G. Cherkasov, and N. Panchenko, “Effective simd vectorization for intel xeon phi coprocessors,” Scientific Programming, vol. 501, p. 269764, 2015.
[16] A. Tousimojarad, “GPRM: a high performance programming framework for manycore processors,” Ph.D. dissertation, University of Glasgow, 2016.
[17] J. Hughes, “Why functional programming matters,” The computer journal, vol. 32, no. 2, pp. 98–107, 1989.