2D Image Convolution using Three Parallel
Programming Models on the Xeon Phi
Ashkan Tousimojarad
University of Glasgow
ashkan.tousi@gmail.com
Wim Vanderbauwhede
University of Glasgow
wim.vanderbauwhede@glasgow.ac.uk
W Paul Cockshott
University of Glasgow
wpc@dcs.gla.ac.uk
Abstract
Image convolution is widely used for sharpening, blurring and edge
detection. In this paper, we review two common algorithms for con-
volving a 2D image by a separable kernel (filter). After optimising
the naive codes using loop unrolling and SIMD vectorisation, we
choose the algorithm with better performance as the baseline for
parallelisation. We then compare the parallel performance of the
optimised code using OpenMP, OpenCL and GPRM implementa-
tions on the Intel Xeon Phi. We also measure the effects of optimi-
sation techniques and demonstrate how they affect both sequential
and parallel performance. Apart from comparing the code complexity and the performance of the chosen parallel programming models, we investigate the impact of a parallelisation technique, task agglomeration, in GPRM.
1. Introduction
Throughput computing applications demand fast response times while dealing with large amounts of data. Image convolution is one such application. Convolution of an image by a matrix of real numbers can be used to sharpen or smooth an image, depending on the matrix used. If A is an image and K is a convolution matrix, then B, the convolved image, is calculated as:
$$B_{y,x} = \sum_{i}\sum_{j} A_{y+i,\,x+j}\,K_{i,j} \qquad (1)$$
If k is a convolution vector, then the corresponding matrix K is such that K_{i,j} = k_i k_j.
A separable convolution kernel is a vector of real numbers that
can be decomposed into horizontal and vertical projections and
hence can be applied independently to the rows and columns of
the spatial domain to provide filtering [1]. It is a specialisation of
the more general convolution, but is algorithmically more efficient
to implement.
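As a concrete illustration (the coefficients here are chosen for exposition and are not necessarily those used in our experiments), take the 5-tap binomial approximation to a Gaussian, k = (1/16)(1, 4, 6, 4, 1). The corresponding convolution matrix is the outer product

$$K = k\,k^{\mathsf{T}} = \frac{1}{256}\begin{pmatrix} 1 & 4 & 6 & 4 & 1 \\ 4 & 16 & 24 & 16 & 4 \\ 6 & 24 & 36 & 24 & 6 \\ 4 & 16 & 24 & 16 & 4 \\ 1 & 4 & 6 & 4 & 1 \end{pmatrix}$$

so convolving the rows and then the columns with k yields the same result as a single pass with K, at 2n instead of n² multiplications per pixel.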
The image convolution algorithms are taken from the real code
used in a stereo matching algorithm. Image convolution and scaling
take up most of the cycles in the stereo matching algorithm. For all
tests, separable kernels of width 5 are used. The algorithm uses 3
colour planes and is heavily memory-fetch bound.
McCool et al. [2] list three desired features for parallel programming models: I) Performance,
II) Productivity and III) Portability. It should be possible to pre-
dict good performance, tune it, and scale it to larger systems. Pro-
ductivity is not only about expressiveness and composability, but
also about maintainability. Supporting a range of targets and oper-
ating systems is another desirable property, known as portability.
We have considered these three features in choosing the parallel
models for this study and we will refer to them throughout the pa-
per.
We aim to explore a number of optimisation and parallelisation
techniques for enhancing the performance of 2D image convolu-
tion on a modern manycore architecture. For this purpose, we have
chosen three parallel programming models from different domains
and for different reasons: OpenMP, the de-facto standard for pro-
gramming shared memory architectures; OpenCL, known for being
portable across multiple platforms; and finally GPRM, a pure task-
based programming framework. It has been reported that GPRM provides superior performance compared to OpenMP on manycore architectures [3, 4].
This paper is structured as follows: we start by introducing the
Xeon Phi and the three parallel models in sections 2 and 3, fol-
lowed by the experimental setup in section 4. In section 5, we re-
view the two algorithms used to solve the problem: single-pass and
two-pass algorithms. In this section, we also list different optimi-
sation techniques to convert a naive code to an optimised one. We
then discuss the implementation details of the parallel two-pass al-
gorithm in each programming model. Results of parallelising the two-pass algorithm are presented in section 6. In section 7, another version of the single-pass algorithm is considered. We will show that although the sequential two-pass algorithm still outperforms the sequential single-pass algorithm, the parallel performance of the modified single-pass algorithm is better. Finally, we review a few related research studies, and the Conclusion summarises our attempts at exploring the behaviour of the two algorithms, the pros and cons of the studied parallel models in solving this problem on a manycore architecture, and the effect of the optimisation and parallelisation techniques applied to improve the performance.
2. Hardware Platform: Intel Xeon Phi
Contemporary applications have reinforced the trend of integrating a large number of processing cores in order to meet performance goals. Most of these applications need a single-chip implementation to satisfy their size and power consumption requirements. Multicore and manycore processors have emerged as promising architectures that benefit from increasing transistor counts.
The Intel Xeon Phi coprocessor 5110P used in this study is an
SMP (Symmetric Multiprocessor) on-a-chip which is connected
to a host Xeon processor via a PCI Express bus interface. The
Intel Many Integrated Core (MIC) architecture used by the Intel
Xeon Phi coprocessors gives developers the advantage of using
standard, existing programming tools and methods. Our Xeon Phi
comprises 60 cores (240 logical cores) connected by a bidirectional
ring interconnect.
The Xeon Phi coprocessor provides four hardware threads shar-
ing the same physical core and its cache subsystem in order to hide
the latency inherent in in-order execution. As a result, the use of at
least two threads per core is almost always beneficial [5]. The Xeon
Phi has eight memory controllers supporting 2 GDDR5 memory
channels each. The clock speed of the cores is 1.053GHz. Each core
has an associated 512KB L2 cache. Data and instruction L1 caches
of 32KB are also integrated on each core. Another important fea-
ture of the Xeon Phi is that each core includes a SIMD 512-bit wide
VPU (Vector Processing Unit). The VPU can be used to process 16
single-precision or 8 double-precision elements per clock cycle.
3. Parallel Programming Models
The design of manycore processors is strongly driven by demands
for greater performance at reasonable cost. To make effective use of
the available parallelism in such systems, the parallel programming
model is of great importance. There are several parallel program-
ming models, runtime libraries, and APIs that help developers to
move from sequential to parallel programming. For the purposes of
this study, we have chosen three parallel programming models that
the Xeon Phi supports: OpenMP, OpenCL and GPRM.
3.1 OpenMP
OpenMP is the de-facto standard for shared memory programming,
and is based on a set of compiler directives or pragmas, com-
bined with a programming API to specify parallel regions, data
scope, synchronisation, and so on. It also supports runtime con-
figuration through the use of runtime environment variables, e.g.
OMP_NUM_THREADS to specify the number of threads at runtime.
OpenMP is a portable parallel programming approach and is sup-
ported on C, C++, and Fortran. It has been historically used for
loop-level and regular parallelism through its compiler directives.
Since the release of OpenMP 3.0, OpenMP also supports task par-
allelism [6]. It is now widely used in both task and data parallel
scenarios.
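As a minimal, generic illustration of this task support (a sketch of the standard OpenMP task construct, not code from our benchmark), each iteration below spawns a task that any thread in the team may execute:

#include <cstdio>

void process(int i) { std::printf("item %d\n", i); }

int main() {
    #pragma omp parallel
    #pragma omp single        // one thread creates all the tasks
    for (int i = 0; i < 8; ++i) {
        #pragma omp task firstprivate(i)   // each iteration becomes a task
        process(i);
    }                          // tasks complete at the implicit barrier
    return 0;
}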
The Intel OpenMP runtime library (as opposed to the GNU
implementation) allocates a task list per thread for every OpenMP
team. Whenever a thread creates a task that cannot be executed
immediately, that task is placed into the thread’s deque (double-
ended queue). A random stealing strategy balances the load [7].
Since OpenMP is a language enhancement, every new construct
requires compiler support. Therefore, its functionality is not as
extensive as that of library-based models. Moreover, race conditions are a serious problem in OpenMP: although it provides the user with a high level of abstraction, the onus is still on the programmer to avoid them.
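As a generic illustration (again, not code from our benchmark), the following minimal sketch shows the classic race on a shared accumulator and the standard fix with a reduction clause:

#include <cstdio>

int main() {
    const int N = 1000000;
    long sum = 0;

    // WRONG: all threads update the shared 'sum' concurrently (a data race):
    //   #pragma omp parallel for
    //   for (int i = 0; i < N; ++i) sum += i;

    // RIGHT: each thread accumulates a private copy, combined at the end.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i) sum += i;

    std::printf("%ld\n", sum);   // always N*(N-1)/2
    return 0;
}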
3.2 OpenCL
OpenCL [8] is an open standard for heterogeneous architectures.
One of the main OpenCL’s objectives is to increase portability
across GPUs, multicore processors, and OS software via its abstract
memory and execution model; however its performance is not al-
ways portable across different platforms. It has been suggested to
consider the architectural specifics in the algorithm design, in order
to address its performance portability issue. The use of auto-tuning
heuristics could also improve the performance [9].
Although OpenCL is mostly compared against Nvidia’s CUDA
[10], we do not aim to cover discussions about GPUs in this work.
The reason why OpenCL is listed here is firstly because it allows
for sharing of workload between host and device with the same pro-
gram, hence increasing portability and productivity, and secondly
because the studied system, the Xeon Phi, supports the OpenCL
programming model.
3.3 GPRM
The Glasgow Parallel Reduction Machine (GPRM) [4] provides a
task-based approach to manycore programming by structuring pro-
grams into task code, written as C++ classes, and communication
code, written in GPC, a restricted subset of C++. The communica-
tion code describes how the tasks interact. Compiling a GPRM pro-
gram results in an Intermediate Representation (IR) including use-
ful information about tasks, their dependencies as well as the initial
task-to-thread mapping information. GPRM is not as well-known
as the other two approaches, but since it specifically targets many-
core architectures and has shown good performance compared to
OpenMP and other approaches on the Xeon Phi, we have used it to
see how well a “pure” task-based model performs for this bench-
mark.
The aim of GPRM is to abstract away all the details of threads,
promoting the idea that tasks are the actual computations, and the
threads are only their substrates. What this means in practice is that
the runtime system automatically creates as many threads as the
number of cores. The programmer only decides about the tasks’
details, i.e. the number of tasks (using a cutoff), the task code and
the communication pattern between the tasks.
GPRM promises a good performance by combining compile-
time (source to intermediate representation) and runtime (stealing)
techniques. In other words, compile-time decisions form the initial
task distribution and the runtime system adjusts dynamically. As
a result, some threads can be asleep during the execution, which
means that the number of active threads is tuned automatically. As
An implication of this model is that if the scheduling and/or com-
munication overhead is significant, one should create fewer tasks
than the number of cores.
4. Experimental Setup
OpenMP and GPRM programs are executed natively on the MIC.
The Intel compiler icpc (ICC) 14.0.2 is used with the -mmic, -O2, -no-offload and -ansi-alias flags for compiling the
programs for native execution on the Xeon Phi. The OpenMP
programs should be compiled with the -openmp flag and the
libiomp5.so library is required. Unlike OpenMP, for the GPRM
framework there is no shared library to be copied to the MIC.
We also used OpenCL (Intel OpenCL for MIC v1.2) with the -cl-mad-enable and -cl-fast-relaxed-math flags. As we want to
compare the compute performance on the MIC, the time to transfer
the image between the host and device in the offload mode is not
taken into account.
Threads and tasks are two completely different concepts. How-
ever, in a pure task-based model such as GPRM, concurrency is
only controlled by tasks. Suppose that we want to parallelise a
simple loop on N elements. In OpenMP, using 100 threads sim-
ply means that each thread gets N/100 of the workload (if static scheduling is chosen). In GPRM, the number of tasks must be specified for every program. For a loop, each chunk corresponds to a
task, therefore 100 tasks (i.e. cutoff=100) on the Xeon Phi means
that each thread gets N/100 of the workload. If we choose a cut-
off=480, since there are 240 threads on the Xeon Phi, each thread
gets 2 tasks. GPRM speedup is therefore achieved by changing the
number of tasks rather than the number of threads. In the case of
the Xeon Phi, the number of threads is set to 240 by the GPRM
runtime system.
All speedup ratios are computed against the running time of
the sequential code implemented in C++, which means they are
directly proportional to the absolute running times. There are 3 planes per image, and the benchmark is run 1000 times in order to measure a precise running time on the MIC. Therefore, the time for each image should be taken as the total running time divided by 1000.
We have considered a Gaussian separable 5×5 kernel and a
range of square images from 1152×1152 to 8748×8748 for the
purposes of this study. We refer to the two implementations of the convolution algorithm as the single-pass and two-pass algorithms (implementations). The single-pass algorithm is the general algorithm
used for convolution, having a nested loop over the kernel, there-
fore comprised of four loops to compute the convolution. The two-
pass algorithm is specific to separable kernels, and uses a horizontal
1D convolution pass followed by a vertical 1D convolution pass to
convolve the image.
The authors of [11] identified that the peak performance for the two-pass algorithm occurs at 100 threads. Our initial experiments verified that, for the range of images from 1152×1152 to 8748×8748, 100 could be our magic number for both the OpenMP (optimal number of threads) and GPRM (optimal number of tasks) models. It is worth stating that because the convolution time for each image is so short, the communication overhead becomes significant, and using all of the available resources of the Xeon Phi is not advantageous. This is not the case for OpenCL, where using fewer than the available resources in terms of compute units and processing elements does not improve the performance.
5. Convolution Implementations
The convolution algorithm used in the real world application only
works at the central part of the image that is in sight of multiple cameras; what happens at the far edges is ignored. Therefore,
we can safely ignore the complications at the edges and start the
convolution from the pixel for which the kernel can access the
required neighbours, i.e. pixel (2,2).
5.1 Single-Pass and Two-pass Algorithms
In order to solve the 2D convolution problem, the simplest ap-
proach is to loop over all the image pixels and all the kernel ele-
ments in one go. This algorithm is referred to as the single-pass algorithm in this study. It uses 4 nested loops: the 2 outer loops over the rows and columns of the image, and the 2 inner loops over the rows
and columns of the kernel. For a 5×5 kernel, according to Eq. 1, it
needs 25 multiply-accumulate operations for each pixel.
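For reference, a minimal sketch of this naive single-pass code (our own rendering, consistent with the description above; the paper's actual code additionally copies B back to A, as discussed in section 5.2):

// Naive single-pass 2D convolution of one plane: 4 nested loops,
// 25 multiply-accumulates per pixel for a 5x5 kernel K.
void singlePassConv(float ***A, float ***B, float K[5][5],
                    int planeId, int rows, int cols) {
    for (int i = 2; i < rows - 2; i++) {         // image rows
        for (int j = 2; j < cols - 2; j++) {     // image columns
            float acc = 0.0f;
            for (int kx = 0; kx < 5; kx++)       // kernel rows
                for (int ky = 0; ky < 5; ky++)   // kernel columns
                    acc += A[planeId][i + kx - 2][j + ky - 2] * K[kx][ky];
            B[planeId][i][j] = acc;
        }
    }
}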
An alternative comes into play when a separable kernel is used:
the two-pass algorithm. As stated in the Introduction, a separable
kernel can be decomposed into horizontal and vertical projections
and hence can be applied independently to the rows and columns
of the spatial domain. For a 5×5 kernel, it reduces the number of
multiplications per pixel to 10.
From the algorithmic point of view, the two-pass convolution
algorithm should always be the preferred one if the kernel is sepa-
rable. It has O(n) time complexity per pixel, while the complexity of the single-pass algorithm is O(n²), where n is the kernel width.
5.2 From Naive to Parallelised Optimised Code
We have implicitly mentioned the two-pass algorithm as an opti-
misation for the single-pass algorithm. We discuss in this section
and section 7 that one should be careful about parallel performance
prediction based on the sequential runtime of algorithms. For the
purposes of this section we have chosen the three largest of our six test images.
There are a number of other optimisations at different levels that could be considered for image convolution. Nevertheless, we do not consider our final optimised code a Ninja code, i.e. the best optimised solution. The optimisations listed here can be achieved with little programming effort. The resulting optimised code should, by definition, have performance comparable to a Ninja implementation, obtained with the small effort of algorithmically improving the naive code (i.e. the compiler-generated code), or by using the latest compiler technology for parallelisation and vectorisation [12]. It has
also been reported that the difference between a sequential opti-
mised (similar to our single-pass optimised code) and a sequential
Ninja implementation for a 2D convolution algorithm on the Xeon
Phi is around 1% [13].
Another point to make is that in this study, we are also con-
cerned about parallelisation techniques. An optimised sequential
algorithm that utilises the vector units efficiently is important as the
baseline for parallelisation, and that is why we will apply the fol-
lowing optimisations, but the other aspect of this study is to identify
the pros and cons of parallel programming models in parallelising
the optimised code.
Here, we consider the single-pass algorithm with 4 nested loops
as the naive code. It is important to note that since this algorithm
convolves image array A to B, at the end of the algorithm we copy
back B to A to have the original image convolved. To make sure
that the naive code does not utilise automatic vectorisation, the
code should be compiled with the -no-vec flag (although the Intel
compiler failed to auto-vectorise our naive code).
Opt-1: Single-pass, Unrolled The first optimisation is loop unrolling. An average benefit (over the 3 images) of 2.5× can be obtained by hand-unrolling the nested loop over the kernel into 25 multiplications. At this stage we change the statement in Eq. 2 inside the kernel nested loop into the 25-term sum of Eq. 3.
B[pId][i][j] += A[pId][i+kx-2][j+ky-2] * K[kx][ky];  (2)

B[pId][i][j] = A[pId][i-2][j-2] * K[0][0] +
               A[pId][i-2][j-1] * K[0][1] + ... +
               A[pId][i][j]     * K[2][2] + ... +
               A[pId][i+2][j+2] * K[4][4];  (3)
Opt-2: Single-pass, Unrolled, Vectorised After unrolling the kernel loops, only 2 of the 4 initial loops remain. Utilising the compiler technology, we can enforce inner-loop vectorisation using #pragma simd, which, if memory alias or dependence analysis fails, hints to the compiler that the loop is safe to vectorise. (This always requires extra care: enforcing SIMD vectorisation in the presence of a vector dependence produces incorrect results.) Vectorisation after unrolling gave us an average speedup of 22× over the baseline.
Opt-3: Two-pass, Unrolled The third optimisation is an algorithmic change exploiting the fact that the kernel is separable: instead of 25 multiplications for each pixel (the result of a 5×5 kernel nested loop), we can use a horizontal 1D convolution followed by a vertical 1D convolution. The number of multiplications for each pixel therefore becomes 5 + 5 = 10. This optimisation is first combined with loop unrolling; each of the two loops to be unrolled in this case (one in the horizontal pass and the other in the vertical pass) has size 5. An average speedup of 5.5× over the baseline can be obtained at this stage.
Opt-4: Two-pass, Unrolled, Vectorised Repeating the second optimisation on the inner loops of the two-pass algorithm (i.e. those over the image columns), we now get an average performance gain of 47.1× over the baseline; note that so far we have only optimised the sequential code.
Par-1: Single-pass, Unrolled, 100 OpenMP Threads OpenMP provides the simplest way of parallelising the outer loop of the single-pass algorithm. We obtained an average speedup of 191.1× over the baseline.
Par-2: Single-pass, Unrolled, Vectorised, 100 OpenMP Threads On top of the previous optimisation, similar to “Opt-2”, we have enforced vectorisation on the inner loops over the image columns (for both the convolution computation and the copy-back operation). Apart from that, the outer loops over the image rows are parallelised using #pragma omp parallel for. An average performance gain of 1268.8× over the baseline has been achieved.
Par-3: Two-pass, Unrolled, 100 OpenMP Threads The parallelised version of the two-pass algorithm provides an average speedup of 393.7× over the baseline. This is almost 2.1× the speedup of the competing algorithm in “Par-1”.
Par-4: Two-pass, Unrolled, Vectorised, 100 OpenMP Threads The best parallelised, vectorised approach has an average speedup of 1611.7×. This is only 1.3× the speedup of the competing algorithm in “Par-2”, which shows that the single-pass algorithm benefits more from vectorisation when parallelised. This is an important finding, and we will see in section 7 that it helps another version of the single-pass algorithm (without copy-back) to outperform the two-pass algorithm with 100 threads.
The speedup results for all the stages from naive to parallelised optimised code are illustrated in Figure 1.
[Plot: speedup (log scale) of stages Opt-0 to Par-4 over the baseline, for image sizes 3888×3888, 5832×5832 and 8748×8748]
Figure 1: From Naive to Parallelised Optimised code
Baseline: single-pass algorithm with copy-back to source
Opt-0: Naive, Single-pass, No-vec
Opt-1: Single-pass, Unrolled, No-vec
Opt-2: Single-pass, Unrolled, SIMD
Opt-3: Two-pass, Unrolled, No-vec
Opt-4: Two-pass, Unrolled, SIMD
Par-1: Single-pass, Unrolled, No-vec, 100 omp threads
Par-2: Single-pass, Unrolled, SIMD, 100 omp threads
Par-3: Two-pass, Unrolled, No-vec, 100 omp threads
Par-4: Two-pass, Unrolled, SIMD, 100 omp threads
5.3 OpenMP Implementation Details
An OpenMP implementation of the image convolution algorithm is
shown in List. 1.
This code corresponds to the last stage of the optimisations,
“Par-4”, as it implements the two-pass algorithm with a horizon-
tal pass followed by a vertical pass; the kernel loop is unrolled,
#pragma simd is used to enforce SIMD vectorisation, and the
outer loop is parallelised.
It is worth stating that #pragma omp parallel for has an
implicit barrier at the end.
/* 2D convolution on each plane */
void twoPassConv(float ***A, float ***B, float *k,
                 int planeId, int rows, int cols) {
  // horizontal pass
  #pragma omp parallel for
  for (int i = 2; i < rows-2; i++) {
    #pragma simd
    for (int j = 2; j < cols-2; j++) {
      B[planeId][i][j] =
        A[planeId][i][j-2] * k[0] +
        A[planeId][i][j-1] * k[1] +
        A[planeId][i][j]   * k[2] +
        A[planeId][i][j+1] * k[3] +
        A[planeId][i][j+2] * k[4];
    }
  }
  // vertical pass
  #pragma omp parallel for
  for (int i = 2; i < rows-2; i++) {
    #pragma simd
    for (int j = 2; j < cols-2; j++) {
      A[planeId][i][j] =
        B[planeId][i-2][j] * k[0] +
        B[planeId][i-1][j] * k[1] +
        B[planeId][i][j]   * k[2] +
        B[planeId][i+1][j] * k[3] +
        B[planeId][i+2][j] * k[4];
    }
  }
  return;
}

/* calls twoPassConv on each plane */
void conv(float ***A, float ***B, float *ker, pimage a) {
  #pragma novector
  for (int planeId = 0; planeId < a.planes; planeId++) {
    twoPassConv(A, B, ker, planeId, a.rows, a.cols);
  }
  return;
}

Listing 1: Two-pass Image Convolution Algorithm, OpenMP
5.4 OpenCL Implementation Details
The concept of threads or tasks does not apply directly in OpenCL; instead, OpenCL uses a model of compute units, loosely corresponding to hardware threads, and processing elements, which on a GPU map to the “cores” in a streaming multiprocessor, but on the Xeon Phi map to the vector units. However, the most common
programming model in OpenCL is to specify the “global range”,
i.e. the number of invocations of a kernel, and let the runtime
system allocate the threads. If one wants more fine-grained control,
one can specify the “local range” as well, which expresses the
number of processing elements to use per compute unit.
Our approach for creating the OpenCL version from the original
version of the code is largely automated. We started from the “Opt-
3” version discussed above. We use an annotation to mark the
subroutine that will become the OpenCL kernel (i.e. twoPassConv).
Our source-to-source compiler (reference omitted for blind review) generates the OpenCL API code
as well as a single-threaded OpenCL kernel. We then manually
optimise the kernel and if required the OpenCL API calls. The
OpenCL API we use is our own OpenCL wrapper library (reference omitted for blind review), which
provides convenient OpenCL integration in existing codebases for
C, C++ and Fortran.
Deriving the parallel kernel from the generated single-threaded
code is mostly a matter of replacing the loops by the OpenCL
indexing calls (get global id,get local id etc), and in the
case where the original code has multiple loops, as is the case
for the convolution, we use a conditional statement to identify
the portion of the kernel code to be executed on each invocation
(List. 2). The host code contains a corresponding loop over the
subsequent stages.
kernel void twoPassConv(global float *A, global float *B,
                        global const float *k, global const int *pass,
                        const int cols, const int rows) {
  const int idx = get_global_id(0);
  const int c = idx % cols;
  const int r = (idx % (rows * cols)) / cols;
  /* 2D convolution on each plane */
  if (*pass == 1) {
    // horizontal pass
    if (c > 1 && c < cols-2) {
      A[idx] = B[idx-2] * k[0]
             + B[idx-1] * k[1]
             + B[idx]   * k[2]
             + B[idx+1] * k[3]
             + B[idx+2] * k[4];
    }
  } else if (*pass == 2) {
    // vertical pass
    if (r > 1 && r < rows-2) {
      B[idx] = A[idx - 2*cols] * k[0] +
               A[idx - cols]   * k[1] +
               A[idx]          * k[2] +
               A[idx + cols]   * k[3] +
               A[idx + 2*cols] * k[4];
    }
  }
}

Listing 2: Two-pass Image Convolution Algorithm, OpenCL Kernel
In order to validate our assumption about the mapping of work items to threads and vector units, we implemented the kernel in two different ways. First, the straightforward way, where only the global range is specified and corresponds to the amount of work to be done, i.e. npoints = rows*cols*planes for the convolution. The index into the image array is in this case simply the global index, because the OpenCL kernel uses a 1D representation of the array.
Then, the more controlled approach, where the global range is ngroups*nths, the local range is nths, and each kernel loops over niters = npoints/(ngroups*nths). Here, ngroups is the number of work groups to be used, usually the same as the number of compute units, and nths is the number of work items per work group, usually the same as the number of processing elements. The index into the image array is given by idx = niters*nths*group_id + nths*iter + local_id; the important point here is that the index is contiguous in the local_id, rather than in the loop iterator. In this way the operations over nths work items can be vectorised.
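The contiguity claim is easy to check in isolation. The following standalone C++ sketch (with small illustrative values, not our actual configuration) enumerates the index computation and shows that, for a fixed group and iteration, idx runs contiguously over the local ids:

#include <cstdio>

int main() {
    // Illustrative values only; on the Xeon Phi we use ngroups=236, nths=16.
    const int ngroups = 2, nths = 4, niters = 3;
    for (int group_id = 0; group_id < ngroups; ++group_id)
        for (int iter = 0; iter < niters; ++iter)
            for (int local_id = 0; local_id < nths; ++local_id) {
                // idx is contiguous in local_id, so the nths work items
                // of one group touch adjacent elements and can be vectorised.
                int idx = niters * nths * group_id + nths * iter + local_id;
                std::printf("group %d iter %d local %d -> idx %d\n",
                            group_id, iter, local_id, idx);
            }
    return 0;
}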
We found that the optimal performance is achieved for ngroups = 236 and nths = 16, which corresponds to 59 MIC cores with 4-way multithreading, i.e. 236 compute units, and 16 elements per SIMD vector (512-bit vectors). Indeed, this is the same performance as achieved with the first, much simpler approach.
5.5 GPRM Implementation Details
The GPRM implementation of the two-pass algorithm defines the two phases of the algorithm as two different types of tasks. Since all the tasks defined in the GPC code will be executed in parallel, a seq pragma is required to run the two phases sequentially. Each phase uses a partial continuous for, par_cont_for [3], in order to parallelise the outer loop over rows, and a #pragma simd (which, unlike in OpenMP, is optional for the innermost loop in the GPRM implementation) to help the compiler vectorise the inner loop over columns. par_cont_for is a sequential for loop that works as follows: in GPRM, multiple instances of the same task are generated (specified by CUTOFF in List. 3), each with a different index (similar to the global id in OpenCL). Each of these tasks calls par_cont_for, passing its own index to specify which part of the work should be performed by its host thread; a sketch of this chunking is given below.
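The sketch below is a plausible C++ rendering of that behaviour (our own code, not GPRM's actual API; the real signature, shown in List. 3, also takes the task object and a member-function pointer): the iteration space is divided into CUTOFF contiguous chunks, and the task with index ind executes only its own chunk, sequentially, on its host thread.

#include <functional>

// Hypothetical stand-in for GPRM's par_cont_for: run chunk 'ind' of
// [lo, hi), where the range is split into CUTOFF contiguous chunks.
void par_cont_for_sketch(int lo, int hi, int ind, int CUTOFF,
                         const std::function<void(int)> &body) {
    const int n = hi - lo;
    const int chunk = (n + CUTOFF - 1) / CUTOFF;     // ceiling division
    const int begin = lo + ind * chunk;
    const int end = (begin + chunk < hi) ? begin + chunk : hi;
    for (int i = begin; i < end; ++i)                // sequential in the task
        body(i);
}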
#include "Conv.h"

void Conv::horizPass(ind, CUTOFF, ...) {
  par_cont_for(2, rows-2, ind, CUTOFF, this,
               &Conv::horizPassInnerLoop, ...);
}

void Conv::vertPass(ind, CUTOFF, ...) {
  par_cont_for(2, rows-2, ind, CUTOFF, this,
               &Conv::vertPassInnerLoop, ...);
}

Listing 3: Two-pass Image Convolution Algorithm, GPRM Task Code
#include "GPRM/Task/ConvTask.h"

void horizontalTasks(const int CUTOFF, ...) {
  /* GPC for with parallel evaluation */
  for (int ind = 0; ind < CUTOFF; ind++) {
    horizPass(ind, CUTOFF, ...);
  }
}

void verticalTasks(const int CUTOFF, ...) {
  /* GPC for with parallel evaluation */
  for (int ind = 0; ind < CUTOFF; ind++) {
    vertPass(ind, CUTOFF, ...);
  }
}

void GPRM::ConvTask::twoPassConv(...) {
  #pragma gprm seq
  {
    horizontalTasks(100, ...);
    verticalTasks(100, ...);
  }
}

Listing 4: Two-pass Image Convolution Algorithm, GPC Code
6. Parallel Performance of the Two-pass
Algorithm
The focus of this section is on the parallel performance of the three
implementations of the two-pass algorithm.
We start by disabling vectorisation on the Xeon Phi. The
results for the parallelised non-vectorised cases are compared with
the vectorised ones in Table 1. In order to disable vectorisation
for OpenMP and GPRM, the code should be compiled with the
-no-vec flag. In OpenCL, there is no explicit vectorisation option,
but we can effectively disable vectorisation by using only a single
processing element per compute unit.
Table 1: The effect of vectorisation on the parallel performance (ms) of the two-pass algorithm
Image Size OpenMP no-vec OpenCL no-vec GPRM no-vec OpenMP SIMD OpenCL SIMD GPRM SIMD
1152x1152 3.9 5.4 27.2 0.8 (4.9×) 2.0 (2.7×) 26.1 (1.0×)
1728x1728 8.5 12.3 32.8 2.0 (4.2×) 3.8 (3.2×) 26.6 (1.2×)
2592x2592 16.7 26.9 40.5 4.1 (4.1×) 7.8 (3.4×) 27.8 (1.5×)
3888x3888 39.9 61.6 60.4 8.8 (4.5×) 16.5 (3.7×) 32.5 (1.9×)
5832x5832 86.7 146.2 105.8 19.6 (4.4×) 38.1 (3.8×) 36.8 (2.9×)
8748x8748 195.4 334 216.9 59.2 (3.3×) 91.5 (3.6×) 60.1 (3.6×)
The average speedup obtained through vectorisation for the
OpenMP code is about 4.2×. It is important to note that this
speedup for the sequential code was almost twice as much (8.6×).
Therefore, the reported performance gain is specific to the case with
100 threads and should not be generalised.
It is worth noting that the speedup due to vectorisation in GPRM
is much less pronounced, mostly due to the higher overhead of the
GPRM runtime for smaller images. The same is true for OpenCL,
but to a lesser extent. On average, the speedup obtained through
vectorisation for the OpenCL code is about 3.5×. Clearly, the OpenMP vectorisation is more efficient, and this is a large factor in the lower performance of OpenCL.
Figure 2 shows the speedup of the two-pass algorithm against its optimised sequential implementation (i.e. version “Opt-4”). So far, the algorithm is parallelised over each plane of size R×C, hence the label R×C in Fig. 2. This means that for 3 colour planes, the parallelised code is executed 3 times sequentially (in fact 3000 times, since we run the code 1000 times).
[Plot: speedup vs. image size for OpenMP, OpenCL and GPRM]
Figure 2: Speedup of the Vectorised Two-pass Algorithm, R×C
It is possible to inspect the difference between OpenMP and GPRM in more detail. Since GPRM provides a modular methodology for expressing tasks and defining the communication patterns between them, we can create empty tasks and measure the overhead of distributing them across different threads and of the parallel reduction. In other words, it is possible to measure the overhead of communication between tiles in GPRM. If we deduct this overhead from the total running time, we can measure the time spent on the actual computation inside the framework. The GPRM-compute time shown in Table 2 is obtained by deducting the constant communication overhead of 25.5 ms from the total execution time. We will discuss how this separation helps to find a better task decomposition and hence better performance. Since the OpenMP execution model is different from that of GPRM, we cannot similarly separate the computation from the communication in OpenMP. OpenCL also allows us to run empty kernels to study the overhead; we found that the overhead is between 0.25 and 0.4 ms, so it is a small component of the total running time for all but the smallest image size.
Table 2: Running time (ms) per image for the two-pass algorithm
Image Size OpenMP OpenCL GPRM-total OpenCL-compute GPRM-compute
1152x1152 0.8 2.0 26.1 1.8 0.6
1728x1728 2.0 3.8 26.6 3.6 1.1
2592x2592 4.1 7.8 27.8 7.5 2.3
3888x3888 8.8 16.5 32.5 16.2 7.0
5832x5832 19.6 38.1 36.8 37.7 11.3
8748x8748 59.2 91.5 60.1 91.0 34.6
As a solution to mitigate the GPRM overhead, we have used task agglomeration: combining tasks into larger tasks to improve performance [14]. We therefore consider images whose width is 3 times the width of the original images, meaning that each row includes information for all 3 colour planes. This way, we include the 3 colour planes in the parallelisation. Using this technique, the size of the tasks in GPRM is tripled and the overhead drops to one third (8.5 ms per image). The speedup results for this case, which we call 3R×C, are shown in Fig. 3. As expected, this technique does not have a similarly significant impact on the OpenMP performance.
[Plot: speedup vs. image size for OpenMP, OpenCL and GPRM]
Figure 3: Speedup of the Vectorised Two-pass Algorithm, 3R×C
Using task agglomeration, GPRM achieves better performance
than OpenCL for the three larger images, and the best performance
among all for the largest image.
7. Reconsidering the Single-pass Algorithm
In order to compare the single-pass and two-pass algorithms, it is
important to note that the two-pass algorithm uses an auxiliary
array to store the result of the first pass. In the second pass, it
uses the auxiliary array as the source and the original image as the
destination, thus at the end of the algorithm, the original image will
be replaced by the convolved one. It is convenient that the input
and output images can use the same array, but it comes at a price:
two assignment operations rather than one for every pixel. In order
to have a fair comparison, we expected the same from the single-
pass algorithm, i.e. overwriting the original image. This means that
although the single-pass algorithm can produce the result on an
output image by assigning new values for all the pixels only once, it
now needs to copy the convolved values back to the original image.
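For concreteness, the copy-back is a plain element-wise copy over the convolved region; a minimal sketch (our own, reusing the array layout and pragmas of Listing 1) is:

// Copy the convolved plane back into the source image; in "Par-2"
// this loop is parallelised and vectorised along with the convolution.
void copyBack(float ***A, float ***B, int planeId, int rows, int cols) {
    #pragma omp parallel for
    for (int i = 2; i < rows - 2; i++) {
        #pragma simd
        for (int j = 2; j < cols - 2; j++)
            A[planeId][i][j] = B[planeId][i][j];
    }
}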
This copy-back operation constitutes considerable extra overhead and is sometimes not needed, e.g. when working with the Xeon Phi as a coprocessor. Suppose one runs some complex code on the Xeon CPU and offloads the computation of the convolution to the Xeon Phi, i.e. the typical model for OpenCL. In that model, there will be host-to-device and device-to-host copy costs. If one copies an image array A to the MIC, convolves it into an array B and copies that back to the host, there is of course no need to copy the data back to the original array on the device itself. Consequently, we have also tested the single-pass code without the final “copy back to the original image” operation. We have measured the results again only for the three larger images.
Figure 4: From Naive to Parallelised Optimised code
Baseline: single-pass algorithm without copy-back to source
Opt-0: Naive, Single-pass, No-vec
Opt-1: Single-pass, Unrolled, No-vec
Opt-2: Single-pass, Unrolled, SIMD
Opt-3: Two-pass, Unrolled, No-vec
Opt-4: Two-pass, Unrolled, SIMD
Par-1: Single-pass, Unrolled, No-vec, 100 omp threads
Par-2: Single-pass, Unrolled, SIMD, 100 omp threads
Par-3: Two-pass, Unrolled, No-vec, 100 omp threads
Par-4: Two-pass, Unrolled, SIMD, 100 omp threads
Par-5: Single-pass, Unrolled, No-vec, 100 GPRM tasks, 3R×C
Par-6: Single-pass, Unrolled, SIMD, 100 GPRM tasks, 3R×C
Par-7: Single-pass, Unrolled, SIMD, OpenCL
Par-8: Two-pass, Unrolled, SIMD, OpenCL
After unrolling the kernel loop(s), for both the non-vectorised and vectorised approaches, the results were as expected, i.e. the two-pass algorithm had much better performance than the single-pass algorithm (1.6× to 1.9×).
It is worth mentioning that in some cases, e.g. for 3888×3888 images, the performance of the optimised single-pass algorithm with OpenMP could be improved by up to 15% (10% on average for the largest three images) by tuning the number of threads, e.g. with 120 threads; but since 100 threads or tasks are used for the C++-based approaches and we do not intend to compare multiple configurations, we stick to this number.
Figure 4 shows that although the average sequential performance of the optimised two-pass code is 1.6× better than the average sequential performance of the optimised single-pass code (without copy-back), the average parallel performance of the optimised single-pass code (using OpenMP) is 1.2× better than that of the optimised two-pass code. The reason can also be extracted from Fig. 4: better utilisation of the vector units by the parallel single-pass code (9.4× its parallel non-vectorised version) compared to the parallel two-pass code (4.1× its parallel non-vectorised version).
Since GPRM had shown good performance for the largest array when we included the parallelisation over planes in the tasks (the 3R×C case), its results have been added to Fig. 4. As expected, it produced the best result for the 8748×8748 image, using the optimised single-pass algorithm with no copy-back. Its speedup over the baseline naive code is 1850× with 100 tasks.
The results of the OpenCL kernel for the single-pass implementation are on average about 50% slower than for the two-pass implementation, and although the two-pass OpenCL version still achieves over 1000× speedup over the baseline, OpenCL remains the worst of the three approaches.
As the best result amongst all, we have been able to get a speedup of up to 1970× (for the 5832×5832 image) over the sequential naive implementation of the algorithm, by utilising only the compiler technology, a few algorithmic changes, and parallelisation (using OpenMP). Also, a 2160× speedup over the baseline has been observed with 120 OpenMP threads for 5832×5832 matrices with the single-pass, no-copy-back approach.
8. Related Work
A similar 5×5 spatial kernel (filter) has been the focus of a number of research papers [11, 12, 13, 15].
Petersen et al. [13] ported a subset of C benchmarks to Haskell
and measured their performance on parallel machines, including
the Xeon Phi. Considering three classes of naive, optimised, and
Ninja C implementations [12], our implementation of the image
convolution algorithm is classified as the optimised code, utilising
loop unrolling and SIMD vectorisation.
The reported Ninja gap for the Intel Labs Haskell Research
Compiler (HRC) for 8192×8192 images on the Xeon Phi using
the single-pass algorithm is 3.7× (for 57 threads) [13]. The authors
have disabled multithreading on the Xeon Phi, which is essentially
different from hyper-threading on the Xeon processors [5].
The authors of [11] explored this further and found that the peak performance can be achieved with 100 threads. They also reported that the performance gap between the Vector Pascal [11] and optimised OpenMP implementations of the two-pass algorithm with 100 threads is almost 6.4×.
The authors of [12] also focused on optimisation techniques for
parallel applications, using both advancements in compiler tech-
nology and algorithmic techniques to bring down the Ninja perfor-
mance gap for throughput computing benchmarks, one of which is
the single-pass implementation of the convolution algorithm.
Tian et al. [15] focused on efficient utilisation of the SIMD vector units on the Xeon Phi and proposed a number of effective techniques to improve the performance of parallel programs, including a single-pass image convolution. They reported a speedup of 2000× using their vectorisation techniques along with parallelisation. We have observed a speedup of about 1970× (2160× with 120 threads) without using any particular vectorisation technique. However, we have also highlighted the importance of the Xeon Phi vector units, especially their impact on parallel performance.
A similar study has been conducted using the TILEPro64
platform [16]. On the 64-core TILEPro64, GPRM outperformed
OpenMP in all cases.
9. Discussion
The OpenCL implementation of the two-pass convolution performs
worse than the OpenMP one by a factor of about two. This is ac-
tually not all that surprising because native OpenMP has very lit-
tle overhead in its use of the kernel threads on the MIC, whereas
OpenCL requires a runtime system for scheduling work on the
threads. Furthermore, OpenCL has no explicit vectorisation con-
trol, so achieving good vectorisation is harder than with the prag-
mas used in the OpenMP code, as shown by our results.
The GPRM model also has a fixed overhead (tens of millisec-
onds for hundreds of tasks) due to task creation and communica-
tion. By including the 3 image planes into the parallelisation (simi-
lar to the initial OpenCL approach), we reduced the overhead to one
third and thus the GPRM implementation achieved the best perfor-
mance for the largest image. We therefore conclude that GPRM is not well suited to handling small tasks, but as soon as the tasks become large enough to cover the scheduling overhead, it shows good performance compared to the competing models.
GPRM naturally fits algorithms with task (functional) decom-
position. It has its own complications though when it comes to do-
main decomposition, as it requires restructuring certain parts of the
program to fit the GPRM structure.
We therefore conclude that for the studied algorithm, 2D image
convolution, OpenMP is the most productive approach, followed
by OpenCL and then GPRM. In general, GPRM is attractive for
task-based programming, but for algorithms like the convolution, it
offers few advantages in terms of ease of programming. However,
GPRM facilitates modular design, which is key to improve produc-
tivity [17].
In terms of performance, OpenMP is the winning model, except
for very large images where GPRM shows better performance after
using task agglomeration. GPRM also outperformed OpenCL for
the three largest images, but OpenCL had better performance for
the three small images.
In terms of portability, GPRM and OpenMP only support con-
ventional multicore platforms and the MIC, and require the com-
mercial Intel compiler for the MIC. OpenCL has the advantage
of supporting many platforms including GPUs and FPGAs, and
its absolute performance is probably good enough for most pur-
poses. Furthermore, using our described approach for porting appli-
cations, OpenCL programming becomes actually quite easy (as is
clear from comparing the OpenCL kernel with the OpenMP code).
It can also be noted that to program the Xeon Phi with OpenCL, one
does not require the commercial Intel compiler, and the OpenCL
SDK is free, so it is a cost-effective solution.
10. Conclusion
In this study, we have chosen three very different parallel program-
ming models supported by the Xeon Phi (OpenMP, OpenCL and
GPRM) to solve a 2D image convolution problem over a test set of
6 square images, ranging from 1152×1152 to 8748×8748. For a
separable convolution kernel, two different algorithms can be con-
sidered: Single-pass, which requires only a single assignment in-
stead of two, but needs an additional copy if the result is required
in the original array, and Two-pass, which requires fewer compu-
tations and returns the result in the original array. After creating
optimised versions of both algorithms, we found that the choice between these algorithms depends on which version of the single-pass algorithm is required: if the result has to be copied back to the original image, then the two-pass algorithm is always better. Otherwise, the single-pass algorithm can provide better parallel performance, even though its sequential performance is still worse.
We have explored a number of optimisation and parallelisation techniques on the Xeon Phi which helped us achieve a speedup of nearly 2000× over the baseline, and none of these techniques requires a major rewrite of the original code. The optimisation techniques include loop unrolling, vectorisation, and an algorithmic change from single-pass to two-pass or vice versa. Task agglomeration is also used as a parallelisation technique to improve the performance of GPRM.
Although the OpenMP implementation showed the best over-
all performance on the Xeon Phi, the task-based GPRM model
achieved better performance for large images, and although the
OpenCL performance was not as good, it is still impressive, and
thus a good choice if the Intel compiler is not available.
References
[1] S. Smith, Digital Signal Processing: A Practical Guide for Engineers and Scientists. Newnes, 2013.
[2] M. McCool, J. Reinders, and A. Robison, Structured parallel pro-
gramming: patterns for efficient computation. Elsevier, 2012.
[3] A. Tousimojarad and W. Vanderbauwhede, “A parallel task-based ap-
proach to linear algebra,” in Parallel and Distributed Computing (IS-
PDC), 2014 IEEE 13th International Symposium on. IEEE, 2014,
pp. 59–66.
[4] A. Tousimojarad and W. Vanderbauwhede, “Steal Locally,
Share Globally: A strategy for multiprogramming in the
manycore era,” International Journal of Parallel Programming,
vol. 43, no. 5, pp. 894–917, 2015. [Online]. Available:
http://dx.doi.org/10.1007/s10766-015-0350-0
[5] J. Jeffers and J. Reinders, Intel Xeon Phi Coprocessor High Perfor-
mance Programming. Newnes, 2013.
[6] E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang, “The design of openmp tasks,” Parallel and Distributed Systems, IEEE Transactions on, vol. 20, no. 3, pp. 404–418, 2009.
[7] J. Clet-Ortega, P. Carribault, and M. Pérache, “Evaluation of openmp task scheduling algorithms for large numa architectures,” in Euro-Par 2014 Parallel Processing. Springer, 2014, pp. 596–607.
[8] J. E. Stone, D. Gohara, and G. Shi, “Opencl: A parallel programming
standard for heterogeneous computing systems,” Computing in science
& engineering, vol. 12, no. 3, p. 66, 2010.
[9] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra,
“From cuda to opencl: Towards a performance-portable solution for
multi-platform gpu programming,” Parallel Computing, vol. 38, no. 8,
pp. 391–407, 2012.
[10] J. Sanders and E. Kandrot, CUDA by example: an introduction to
general-purpose GPU programming. Addison-Wesley Professional,
2010.
[11] M. Chimeh, P. Cockshott, S. B. Oehler, A. Tousimojarad, and T. Xu,
“Compiling vector pascal to the xeonphi,” Concurrency and Compu-
tation: Practice and Experience, 2015.
[12] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyan-
skiy, M. Girkar, and P. Dubey, “Can traditional programming bridge
the ninja performance gap for parallel computing applications?” Com-
munications of the ACM, vol. 58, no. 5, pp. 77–86, 2015.
[13] L. Petersen, T. A. Anderson, H. Liu, and N. Glew, “Measuring the
haskell gap,” in Proceedings of the 25th symposium on Implementation
and Application of Functional Languages. ACM, 2013, p. 61.
[14] I. Foster, Designing and building parallel programs. Addison Wesley
Publishing Company, 1995.
[15] X. Tian, H. Saito, S. V. Preis, E. N. Garcia, S. S. Kozhukhov, M. Mas-
ten, A. G. Cherkasov, and N. Panchenko, “Effective simd vectorization
for intel xeon phi coprocessors,” Scientific Programming, vol. 501, p.
269764, 2015.
[16] A. Tousimojarad, “GPRM: a high performance programming frame-
work for manycore processors,” Ph.D. dissertation, University of Glas-
gow, 2016.
[17] J. Hughes, “Why functional programming matters,” The computer
journal, vol. 32, no. 2, pp. 98–107, 1989.