OpenCLIPP: OpenCL Integrated Performance Primitives library for computer vision applications
Moulay Akhloufi (akhloufi@gel.ulaval.ca), Antoine W. Campagna
In recent years, there has been an increase of interest in GPGPU computing (General-Purpose computation on Graphics Processing Units). This domain aims to use the processing power of the GPU (Graphics Processing Unit) to accelerate general processing such as mathematics, 3D visualization, image processing, etc.
In past years, CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model invented by NVIDIA, was the main driver of this interest and the most used architecture for GPGPU computing. With the recent advent of the Open Computing Language (OpenCL), more and more work is being conducted using this new platform. OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group. It has been adopted by multiple companies, including NVIDIA (the inventor of CUDA).
With this increase of interest, the availability of a set of performance primitives for general-purpose applications can help accelerate the work of the research and industrial communities. Intel, for example, develops the Intel Integrated Performance Primitives (Intel IPP), a multi-threaded software library of functions for multimedia and data processing applications. On the other hand, NVIDIA offers the NVIDIA Performance Primitives library (NPP), a collection of GPU-accelerated image, video and signal processing functions that deliver faster performance than comparable CPU-only implementations.
In this work, we present the architecture and development of an open source OpenCL integrated performance primitives library called OpenCLIPP. This library aims to provide a free and open source set of OpenCL functions with a simple interface similar to Intel IPP and NVIDIA NPP. The first release includes mainly image processing and computer vision algorithms: convolution filters, thresholding, blobs, etc. The developed functions are introduced, and benchmarks against equivalent Intel IPP and NVIDIA NPP functions are presented. This library will be made available to the open source community.
M. Akhloufi, A. Campagna, "OpenCLIPP: OpenCL Integrated Performance Primitives library for computer vision applications", Proc. SPIE Electronic Imaging, Intelligent Robots and Computer Vision XXXI: Algorithms and Techniques, 9025-31, San Francisco, CA, USA, February 2014.
Introduction
Computer vision is used in more and more of today's applications. With ever higher resolutions and more demanding algorithms, applications are often limited by the processing power of CPUs. An alternative is the use of GPUs. We present a new library based on OpenCL to perform high-speed image processing on GPUs: OpenCLIPP.
The library is Open Source, LGPL licensed and free for commercial use. It can be downloaded from the project website: http://openclipp.wix.com/openclipp

What is OpenCL?
OpenCL is a framework that allows using the computing resources present in specialized computing devices such as GPUs. How it works, in three steps (a minimal sketch follows the list):
1. A program is written in a language similar to C
2. The program is compiled for the computing device used
3. The compiled program runs in parallel over all the computing resources of the device
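To make these three steps concrete, the following is a minimal, generic OpenCL host program (plain OpenCL, not OpenCLIPP code): it embeds a small kernel written in the C-like OpenCL language, builds it at run time for the first GPU found, and runs it in parallel with one work-item per data element. Error checking is omitted for brevity.

#include <stdio.h>
#include <CL/cl.h>

/* Step 1: a program written in a language similar to C (OpenCL C) */
static const char* kSource =
    "__kernel void add_one(__global float* data)                          \n"
    "{                                                                    \n"
    "    size_t i = get_global_id(0);  /* one work-item per element */    \n"
    "    data[i] += 1.0f;                                                 \n"
    "}                                                                    \n";

int main(void)
{
    float values[1024];
    for (int i = 0; i < 1024; i++) values[i] = (float)i;

    /* Select a platform and a device (here: the first GPU found) */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

    /* Step 2: the program gets compiled for the computing device used */
    cl_program program = clCreateProgramWithSource(context, 1, &kSource, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "add_one", NULL);

    /* Copy the data to the device and bind it to the kernel */
    cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   sizeof(values), values, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);

    /* Step 3: the compiled program runs in parallel over the device resources */
    size_t global_size = 1024;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);

    /* Read the result back to host memory (blocking call) */
    clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0, sizeof(values), values, 0, NULL, NULL);
    printf("values[0] = %f\n", values[0]);  /* prints 1.000000 */

    clReleaseMemObject(buffer);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    return 0;
}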
Library interface
There are two existing and popular image processing primitives libraries:
• Intel IPP, optimized for CPUs
• NVIDIA NPP, which provides an interface similar to Intel IPP but computes on NVIDIA CUDA GPUs
OpenCLIPP provides an interface in C inspired by the interface of these libraries but simplified, allowing many programming languages to use its capabilities. OpenCLIPP also provides a C++ interface.
The library supports images with:
• signed and unsigned integers of 8, 16 or 32 bits, or 32-bit floating point
• 1, 2, 3 or 4 channels
• almost any size (the maximum image size depends on the hardware)

How to use in C
// Variables
ocipContext Context = NULL;
ocipImage SourceImage, ResultImage;
SImage ImageInfo = {...}; // Fill with size, type, channels of image

// Initialize OpenCL
ocipInitialize(&Context, NULL, CL_DEVICE_TYPE_ALL);
ocipSetCLFilesPath("/path/to/cl files/");

// Create images in the OpenCL device
ocipCreateImage(&SourceImage, ImageInfo, SourceImageData, CL_MEM_READ_ONLY);
ocipCreateImage(&ResultImage, ImageInfo, ResultImageData, CL_MEM_WRITE_ONLY);

// Prepare the filters - compiles the OpenCL C program
// Optional (would otherwise be done upon the first filter call)
ocipPrepareFilters(SourceImage);

// Apply a filter (asynchronous)
ocipSobel(SourceImage, ResultImage);

// Transfer the result image back to the host (synchronous)
ocipReadImage(ResultImage);
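As a follow-up, the sketch below shows one way to time the filter from the listing above, mirroring the 30-run averaging used later in the performance results. It is a sketch under stated assumptions: it reuses only the ocip* calls shown above, the header name "OpenCLIPP.h" is assumed, and the final synchronous ocipReadImage is used as the synchronization point before stopping the timer.

#include <stdio.h>
#include <time.h>       /* C11 timespec_get for wall-clock timing */
#include "OpenCLIPP.h"  /* assumed header name for the C interface */

/* Average the execution time of ocipSobel over several runs.
 * The filter calls are asynchronous, so the synchronous ocipReadImage
 * acts as the final synchronization point (the measured time therefore
 * also includes one device-to-host transfer). */
void BenchmarkSobel(ocipImage SourceImage, ocipImage ResultImage)
{
    const int Runs = 30;
    struct timespec Start, End;

    /* Compile the OpenCL C program beforehand so that compilation
       time is not part of the measurement */
    ocipPrepareFilters(SourceImage);

    timespec_get(&Start, TIME_UTC);
    for (int i = 0; i < Runs; i++)
        ocipSobel(SourceImage, ResultImage);   /* queued asynchronously */
    ocipReadImage(ResultImage);                /* waits for completion */
    timespec_get(&End, TIME_UTC);

    double TotalMs = (End.tv_sec - Start.tv_sec) * 1000.0
                   + (End.tv_nsec - Start.tv_nsec) / 1e6;
    printf("Average Sobel time: %.3f ms\n", TotalMs / Runs);
}

Call this function after the setup shown in the listing above (context initialized and images created).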
How to use in C++
The library itself is implemented in C++, and C++ programs can use the C++ interface directly.

using namespace OpenCLIPP;

SImage ImageInfo = {...}; // Fill with size, type, channels of image

// Initialize OpenCL
COpenCL CL;
CL.SetClFilesPath("/path/to/cl files/");
Filters filters(CL);

// Create images in the OpenCL device
ColorImage SourceImage(CL, ImageInfo, SourceData);
ColorImage ResultImage(CL, ImageInfo, ResultData);

// Prepare the filters - compiles the OpenCL C program
// Optional (would otherwise be done upon the first filter call)
filters.PrepareFor(SourceImage);

// Apply a filter (asynchronous)
filters.Sobel(SourceImage, ResultImage);

// Transfer the result image back to the host (synchronous)
ResultImage.Read(true);
Why OpenCL?
Right now, there are two major frameworks for GPU computing: OpenCL and CUDA. CUDA has its advantages, but it works only on NVIDIA devices, while OpenCL works on all major high-performance devices. In our experiments, we found that OpenCL is as fast as CUDA on NVIDIA hardware. OpenCL may also become prevalent on mobile devices, where GPUs are increasingly powerful, which will widen the range of OpenCL applications.
Performance results
The library comes with a test and benchmarking program. The results below were obtained on a PC with the following specifications:
• Intel Core i7-3770, 8 GB RAM
• NVIDIA GeForce GTX 680
• Windows 7, 64-bit
Each primitive was run 30 times and the average of all runs is given. Image transfer and program compilation times are not included in the results. The tested image sizes are 512x512, 1024x1024, 2048x2048, HXGA (4096x3072), 4096x4096, HSXGA (5120x4096), HUXGA (6400x4800) and WHUXGA (7680x4800).

Here we see the performance advantage of GPUs, with OpenCLIPP performing up to 8 times faster than IPP when calculating the absolute difference between two images. OpenCLIPP also beats NPP by a small margin.
[Figure: AbsDiff U8 - time in ms (lower is better) vs. image size, for CPU (IPP), OpenCLIPP, NPP and OpenCV OCL]

The same results on a logarithmic scale better show the performance on small images. GPU operations have an overhead: about 0.01 ms for NPP, 0.03 ms for OpenCLIPP and 0.11 ms for OpenCV OCL. The CPU has no such overhead, so IPP beats the GPU libraries for small images.
[Figure: AbsDiff U8, log scale - time in ms (lower is better) vs. image size, for CPU (IPP), OpenCLIPP, NPP and OpenCV OCL]

AbsDiff is a very simple algorithm. The TopHat morphological operation, shown below, is more complex, with many memory accesses for each pixel. Here OpenCLIPP has a 2x lead over IPP and a slight lead over NPP.
[Figure: TopHat U8 - time in ms (lower is better) vs. image size, for CPU (IPP), OpenCLIPP, NPP and OpenCV OCL]

The last chart shows a statistical reduction, presented as processing bandwidth in GB/s. The CPU reaches a good 40 GB/s when the data fits in the cache and about 15 GB/s for images too big for the cache. The performance of OpenCLIPP increases with the size of the image, reaching 135 GB/s, 9 times faster than IPP and 50% faster than NPP. OpenCV OCL failed to calculate the mean in the current version.
[Figure: Processing bandwidth for Mean Reduction - F32, in GB/s (higher is better) vs. image size, for CPU (IPP), OpenCLIPP and NPP]
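The bandwidth figures above are obtained by dividing the number of bytes a primitive processes by its execution time. The helper below is an illustrative sketch of that arithmetic; the function name and the single-channel F32 assumption are ours, not part of OpenCLIPP.

#include <stdio.h>

/* Processing bandwidth in GB/s for a reduction over a single-channel
 * 32-bit float (F32) image: bytes processed divided by elapsed time. */
static double BandwidthGBs(int Width, int Height, double ElapsedMs)
{
    double Bytes = (double)Width * Height * sizeof(float); /* F32, 1 channel */
    return Bytes / (ElapsedMs / 1000.0) / 1e9;
}

int main(void)
{
    /* For example, if a 7680x4800 (WHUXGA) F32 image were reduced in about
     * 1.09 ms, that would correspond to roughly 135 GB/s. */
    printf("%.1f GB/s\n", BandwidthGBs(7680, 4800, 1.09));
    return 0;
}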
Supported primitives in version 1
Arithmetic: Add, AddSquare, Sub, AbsDiff, Mul, Div, Min, Max, AddC, SubC, AbsDiffC, MulC, DivC, RevDivC, MinC, MaxC, Abs, Exp, Log, Sqr, Sqrt, Sin, Cos
Logic: And, Or, Xor, AndC, OrC, XorC, Not
LUT: LUT, Linear LUT, Scale LUT
Morphology: Erode, Dilate, Open, Close, Gradient, TopHat, BlackHat
Transform: MirrorX, MirrorY, Flip, Transpose, Resize, SetAll
Conversions: Convert, Scale, Copy, ToGray, SelectChannel, ToColor
Thresholding: ThresholdGT, ThresholdLT, ThresholdGTLT, Compare
Filters: Gauss, Sharpen, Smooth, Median, Sobel, Prewitt, Scharr, HiPass, Laplace
Reductions: Min, Max, MinAbs, MaxAbs, Sum, Mean, MeanSqr
More functions: Histogram, Integral scan, Blob labeling and FFT (soon)

Conclusion
OpenCLIPP can provide a significant performance improvement to all image processing applications, regardless of the platform used (AMD or NVIDIA, Windows or Linux). The performance gain is substantial compared to even the most optimized CPU libraries when processing large (>10 MPixel) images on high-end GPUs. GPU processing is not a good choice for small images (<1 MPixel) due to the overhead. The library was made Open Source so that interested programmers can use it for free in their applications and contribute to improving it: http://openclipp.wix.com/openclipp

OpenCL and the OpenCL logo are trademarks of Apple Inc., used by permission by Khronos.