Experiences porting from CUDA to OpenCL

Matt Harvey
Imperial College London
CBBL IMIM
MEW 20
December 8, 2009
Introduction
Molecular dynamics is a widely used simulation methodology.
Classical treatment of molecules.
Biomolecular modelling (10-100k atoms) through to inorganic
polymer (1M+) modelling.
Timesteps of 2 fs, interesting timescales of µs (10^9 iterations)
Desmond & NAMD scaling on DHFR (29k atoms)
[Figure: speed-up vs. number of processors (Px) for Desmond and NAMD]
(Peak performance 170 ns/day, 1.4 ms/step.) Bowers et al., Proc. SC 2006
Accelerating MD
Why bother? (2007)
Lots of obvious parallelism – GPUs are designed for
data-parallel computation.
GPUs probably aren’t going away soon.
Future CPUs will be increasingly multithreaded; the architectures will
probably converge.
Focus on small systems O(10-100k atoms)
Concentrate on best performance on a single (multi-GPU)
machine – strong scaling will be much worse if GPUs get faster.
Which GPUs? - NVidia only choice
High level programming language (CUDA)
Alternative - ATI: CAL is assembler-level, the hardware is complex,
and the documentation is poor.
AceMD Performance - DHFR
[Figure: ms/step vs. number of processors (Px) for Desmond, NAMD, AceMD (1 GPU), and AceMD (3 GPU)]
AceMD: 1 GPU 17.3 ms/step (20 ns/day); 3 GPU 7.9 ms/step (43 ns/day)
GPUs are GTX 280
Harvey et al., JCTC (2009); Bowers et al., Proc. SC 2006
How are we using it?
Running since early 2008
Successor to PS3Grid
Uses BOINC middleware
1000+ G200-class GPUs attached
100 µs of trajectory/month
Studies include:
Protein-ligand binding affinities
HIV protease dynamics
Ion channel permeability
Giorgino et al, JCTC (submitted)
Saddiq & de Fabritiis, JACS (submitted)
http://www.gpugrid.org
State of play, Dec 2009
Still no large GPU installations in HPC centres
Would like faster simulation rates: O(100 ns)/day
No new hardware from Nvidia for 1 year (GTX 285, Jan 2009)
ATI hardware similar – much higher ’peak performance’
Lots of ATI cards going unused by GPUGRID
OpenCL - new standard for cross-platform programming of
GPUs
So, let’s try out OpenCL. Can we make an OpenCL code that
is as good as the current CUDA code?
runs well on ATI hardware?
Even if the answer is no, it won't have been a waste of time!
OpenCL - What is it?
OpenCL is a standard promoted by Khronos Group (of OpenGL
fame), supported by ATI, NVidia
Host C API for controlling and interacting with GPU/accelerator
devices (see the sketch below)
A C-based language for writing device kernels
An abstract device model that maps very well to NV and ATI
hardware (surprise!)
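To make the host-API bullet concrete, here is a minimal host-side setup sketch (illustrative only, not from the talk): it selects a platform and device, creates a context and command queue, and builds a program from source. Error checking is omitted.

#include <CL/cl.h>
#include <stdio.h>

/* Minimal OpenCL host setup sketch: platform/device selection, context,
   command queue, run-time program build, kernel object creation.        */
int main(void) {
    const char *src =
        "__kernel void vecadd(__global float *a, __global float *b,"
        "                     __global float *c) {"
        "  int i = get_global_id(0);"
        "  c[i] = a[i] + b[i];"
        "}";
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);
    (void)queue;  /* used later when enqueueing work */

    /* OpenCL kernels are compiled at run time from source (or loaded as
       prebuilt binaries).                                                */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vecadd", NULL);

    printf("vecadd kernel object created: %p\n", (void *)k);
    return 0;
}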
CUDA vs OpenCL models
CUDA term                    OpenCL term
GPU                          Device
Multiprocessor               Compute Unit
Scalar core                  Processing element
Global memory                Global memory
Shared (per-block) memory    Local memory
Local memory (automatic)     Private memory
Kernel                       Program
Block                        Work-group
Thread                       Work-item
Syntactic differences in kernel code
C host-side API like CUDA C API
Nothing like the CUDA language extensions!
From CUDA to OpenCL
Given the similarities it should be possible to
Abstract the differences between CUDA (C API) and OpenCL
API
Use simple source-to-source translation to convert kernels
Automatically generate source code for kernel entry point
functions – Need to make a tool that does explicitly what
nvcc does during a compilation
Swan - runtime library (CUDA or OpenCL) and compile-time
translator for kernels.
Swan kernel translator
Perl regular-expression based, e.g.:
s/threadIdx.x/get_local_id(0)/g (a few more illustrative rules are sketched below)
Compiles the kernel with nvcc (no translation needed) or the
OpenCL compiler
Generates entry points from code analysis
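The slide shows one such rule; purely to illustrate the flavour of the translation (this is not Swan's actual rule table), a few of the obvious one-to-one token rewrites can be collected as:

/* Illustrative only - the kind of token rewrites a regex-based
   CUDA-to-OpenCL kernel translator applies (x-dimension shown). */
static const char *cuda_to_opencl[][2] = {
    { "threadIdx.x",     "get_local_id(0)"              },
    { "blockIdx.x",      "get_group_id(0)"              },
    { "blockDim.x",      "get_local_size(0)"            },
    { "gridDim.x",       "get_num_groups(0)"            },
    { "__syncthreads()", "barrier(CLK_LOCAL_MEM_FENCE)" },
    { "__shared__",      "__local"                      },
};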
Kernel conversion example
__global__ void vecadd( float *a, float *b, float *c ){
int idx = threadIdx.x + blockDim.x * blockIdx.x;
c[idx] = a[idx] + b[idx];
__syncthreads();
}
becomes
#include <swan-macros.h>
__kernel void vecadd( __global float *a, __global float *b, __global float *c ) {
int idx = get_local_id(0) + get_local_size(0) * get_group_id(0);
c[idx] = a[idx] + b[idx];
barrier(CLK_LOCAL_MEM_FENCE);
}
Kernel invocation
float *a, *b, *c;
vecadd<<< n, m >>>( a, b, c )
Runs n blocks each with m threads
Arguments type-checked at compile time
We like the conciseness - we don't want to change it too much (a full host-side sketch follows below)
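For contrast with the driver-API version on the next slide, here is a hedged sketch of the complete host side around this concise launch, using the standard CUDA runtime API. It assumes the vecadd kernel from the earlier slide is declared in the same translation unit and that N is a multiple of the block size (that kernel has no bounds check).

#include <cuda_runtime.h>

/* Host-side sketch around the concise <<< >>> launch: allocate device
   buffers, copy inputs, launch n blocks of m threads, copy the result
   back. Error checks omitted for brevity.                              */
void vecadd_host(const float *ha, const float *hb, float *hc, int N) {
    float *a, *b, *c;
    size_t bytes = N * sizeof(float);

    cudaMalloc((void **)&a, bytes);
    cudaMalloc((void **)&b, bytes);
    cudaMalloc((void **)&c, bytes);
    cudaMemcpy(a, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, hb, bytes, cudaMemcpyHostToDevice);

    int m = 256;      /* threads per block                  */
    int n = N / m;    /* blocks (N assumed a multiple of m) */
    vecadd<<< n, m >>>(a, b, c);  /* arguments type-checked at compile time */

    cudaMemcpy(hc, c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a); cudaFree(b); cudaFree(c);
}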
CUDA C API Example
vecadd<<< grid, block >>>( a, b, c );
becomes..
CUmodule module; CUfunction f;
int offset = 0;
dim3 block, grid;
int len = sizeof(float4*);
cuModuleLoadData( &module, "module.cubin" );
cuModuleGetFunction( &f, module, "vecadd" );
cuParamSetv( f, offset, &a, len ); offset += len;
cuParamSetv( f, offset, &b, len ); offset += len;
cuParamSetv( f, offset, &c, len ); offset += len;
cuParamSetSize( f, offset );
cuFuncSetBlockShape( f, block.x, block.y, block.z );
cuFuncSetSharedSize( f, 0 );
cuLaunchGrid( f, grid.x, grid.y);
(The OpenCL code is very similar; a sketch follows below.)
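As a hedged sketch of that OpenCL equivalent (assuming the kernel, queue, and buffers were created as in the setup sketch earlier), arguments are set by index on the kernel object and an NDRange is enqueued:

#include <CL/cl.h>

/* Sketch of the equivalent OpenCL launch. grid/block mirror the CUDA
   launch configuration; error checks omitted.                         */
void launch_vecadd(cl_command_queue queue, cl_kernel k,
                   cl_mem a, cl_mem b, cl_mem c,
                   size_t grid, size_t block) {
    clSetKernelArg(k, 0, sizeof(cl_mem), &a);
    clSetKernelArg(k, 1, sizeof(cl_mem), &b);
    clSetKernelArg(k, 2, sizeof(cl_mem), &c);

    size_t local  = block;          /* work-group size (CUDA block size)   */
    size_t global = grid * block;   /* total work items (blocks x threads) */
    clEnqueueNDRangeKernel(queue, k, 1, NULL, &global, &local, 0, NULL, NULL);
}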
Problems with the C API
Not just more verbose - many more things can go wrong!
The kernel object could fail to load/compile.
No guarantee that "vecadd" exists (see the error-checking sketch below).
We could get the formal parameters wrong.
Want to replicate as much of the CUDA call syntax as possible.
Fortunately, this code can be machine-generated.
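As an illustration of the "more things to go wrong" point (a sketch, not the talk's code), each driver-API step returns a status that has to be checked by hand; the kernel lookup, for example, can simply fail at run time:

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: explicit status checking around module load and kernel lookup. */
static void check(CUresult rc, const char *what) {
    if (rc != CUDA_SUCCESS) {
        fprintf(stderr, "%s failed with error %d\n", what, (int)rc);
        exit(1);
    }
}

void load_vecadd(CUmodule *module, CUfunction *f, const void *image) {
    check(cuModuleLoadData(module, image),           "cuModuleLoadData");
    check(cuModuleGetFunction(f, *module, "vecadd"), "cuModuleGetFunction");
}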
Entry Point generation
__global__ void vecadd( float* a, float* b, float* c ) { ... }
becomes
static unsigned char __swan_program_source[] = {
0x00, 0x10, 0x12, ...
};
void k_vecadd( dim3 grid, dim3 block, int shmem,
float *a, float *b, float* c ) {
swanInit( __swan_program_source );
swanRunKernel( "vecadd", grid, block, shmem, a, b, c );
}
Change "vecadd<<< g, b >>>( a, b, c )" to "k_vecadd( g, b, 0, a, b, c )".
Swan Runtime library
Private functions (used by entry points)
swanInit( unsigned char *program_binary );
swanRunKernel( char *kernel, dim3 grid, dim3 block, int
shmem, ... );
swanMemcpy(), swanMalloc()
Two implementations - CUDA and OpenCL (an illustrative sketch of the CUDA-backed path follows below)
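Purely as an illustration of how the CUDA-backed implementation can wrap the driver-API sequence from the earlier slide (a simplified, hypothetical variant, not Swan's code: the real swanRunKernel is variadic, while here every argument is assumed to be a pointer-sized value):

#include <cuda.h>
#include <cuda_runtime.h>   /* for dim3 */

/* Hypothetical, simplified sketch of a CUDA-backed kernel launch.
   Assumes the module was loaded by swanInit() via cuModuleLoadData(). */
static CUmodule swan_module;

void swanRunKernelPtrs(const char *kernel, dim3 grid, dim3 block, int shmem,
                       void **args, int nargs) {
    CUfunction f;
    int offset = 0;

    cuModuleGetFunction(&f, swan_module, kernel);
    for (int i = 0; i < nargs; i++) {          /* marshal each argument */
        cuParamSetv(f, offset, &args[i], (unsigned int)sizeof(void *));
        offset += (int)sizeof(void *);
    }
    cuParamSetSize(f, offset);
    cuFuncSetBlockShape(f, block.x, block.y, block.z);
    cuFuncSetSharedSize(f, shmem);
    cuLaunchGrid(f, grid.x, grid.y);           /* asynchronous launch   */
}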
Using Swan
Two step compilation:
swan --cuda -o kernel.kh kernel.cu
gcc -o kernel.exe kernel.c -lswan.cuda
or
swan --ocl -o kernel.kh kernel.cu
gcc -o kernel.exe kernel.c -lswan.ocl
Can now use the system C/C++ compiler for host code.
Swanified code benefits
Gain platform independence (CUDA, OpenCL)
Lose almost none of the conciseness of CUDA
Most code translation automated
Swanified AceMD
AceMD refactored to use Swan (180 kloc, 250 files, 100
kernels) in 1 week of work (many OpenCL bugs encountered!)
All features supported in CUDA version.
Not all features supported in OpenCL version:
No FFT library, so no PME
No texturing (sampler) support, so no optimised NB kernel
No atomic operations, so no fast particle binning
But, enough to test with!
Performance
Compare the original CUDA code against:
Swan with CUDA target (should be no difference)
Swan with NVidia OpenCL target (hopefully no difference)
Swan with ATI GPU target (who knows?)
Swan with ATI CPU target (hopefully not much worse than a
reference CPU implementation)
Caveat: all of the OpenCL compilers are pre-release betas. No
guarantee that the compilers are optimal (and they are certainly not bug-free).
Using NVidia Cuda 3.0 beta, ATI Stream SDK 2.0 Beta 4
CUDA vs CUDA+swan
Stage              CUDA (original)   CUDA (Swan)   Speedup
Bonded terms             0.853           0.786      +1.1x
Binning                  0.557           0.966      -1.8x
Nonbonded terms         13.152          13.925      -1.1x
PME                      5.162           5.526      -1.1x
Integration              0.098           0.091      +1.1x
Total                   19.998          21.457      -1.1x
Times in ms. Tesla C1060, CUDA 3.0 beta, CentOS 5.4
<10% drop in speed
Due to changes to some data types and formal arguments required for OpenCL
CUDA vs Nvidia OpenCL
Stage              CUDA     NVidia OCL   Speedup
Bonded terms        0.396      0.477      -1.1x
Binning             0.863      3.833      -4.4x
Nonbonded terms    26.548     39.408      -1.5x
Integration         0.090      0.184      -2.0x
Total              28.506     43.924      -1.5x
Times in ms. NVidia Tesla C1060, HP xw6600, 2 x Xeon 5430, CentOS 5.4, CUDA 3.0 beta
Model: Gramicidin-A, 29,042 atoms, cutoff = 12 Å, switch = 10.5 Å
No PME (no FFT)
Optimised kernels fare the worst - different compilers
Unoptimised, high arithmetic intensity kernels (bonded) fare best
Uncoalesced memory accesses/atomic ops are slow
CUDA vs ATI OpenCL (GPU)
Stage              CUDA     NVidia OCL    ATI OCL   Speedup
Bonded terms        0.396      0.477        1.930    -2.2x
Binning*           16.438     21.160       61.981    -3.8x
Nonbonded terms    26.548     39.408      168.342    -6.3x
                                           137.94    -5.1x
Integration         0.090      0.184        0.489    -5.4x
Total              44.081     61.251      234.196    -5.3x
Times in ms. NVidia Tesla C1060, HP xw6600, 2 x Xeon 5430, CentOS 5.4, CUDA 3.0 beta
ATI 4850 (1 TFLOP), HP xw6600, 2 x Xeon 5430, CentOS 5.4, ATI OpenCL beta 4
Model: Gramicidin-A, 29,042 atoms, cutoff = 12 Å, switch = 10.5 Å
Slow algorithm for binning (no atomic memory operations)
Why so slow?
ATI hardware is much slower than expected (hoped for)! Why?
Random memory accesses are especially bad
No shared memory - faked up in global memory
Many kernels optimised to fit NV resource constraints
ATI cores are individually SIMD, NV’s are scalar, so mostly
(75%) idle?
4xxx series cards are RV770-based - old technology, pre-OpenCL.
Need to try a 5xxx series (RV870) card - designed with OpenCL in
mind.
CUDA vs ATI OpenCL (CPU)
ATI OpenCL will also generate code for multicore CPUs. Testing
on a dual Xeon 5530 (2.4GHz), 1333MHz memory
NAMD : 90ms/step
AceMD : 165ms/step -1.8x
AceMD (Tesla C1060) : 44ms/step +2.1x
AceMD (ATI 4850) : 203ms/step -2.3x
AceMD is doing twice as much computation as NAMD, so the speeds are
actually comparable. The CPU compiler is quite good; the kernels map onto
SSE well.
Model: Gramicidin-A, 29,042 atoms, cutoff = 12 Å, switch = 10.5 Å, no PME
Summary
Possible to maintain a single code base that works for both
CUDA and OpenCL.
NVidia OpenCL performance almost as good as CUDA –
highly tuned kernels fare worst.
Not all CUDA features are supported (C++ templates,
textures, cuFFT, etc).
ATI hardware performance much worse (5x) than hoped for -
probably need to wait for newer hardware and a release-quality runtime
Conclusions
CUDA still the most mature programming tool, but
OpenCL is a viable alternative to CUDA on NVidia hardware.
ATI OpenCL is only at its first release; it should get better.
Current ATI 4xxx hardware doesn't fit the OpenCL model well
(no dedicated shared/local memory).
ATI processor cores quite different to NVidia’s
Platform-independent OpenCL kernels possible now,
high-performance from single kernels unlikely to be possible
soon.
Swan available on request.
Acknowledgements
HPC Europa 2 (228398)
Virtual Physiological Human Network of Excellence (FP7)
Programa Ramón y Cajal
Obra Social Fundació "La Caixa"
National Science Foundation (OCI-0721124)
Imperial College London
UPF/IMIM-Hospital del Mar
Acellera Ltd
Thanks!
Any questions?
GPUGRID reliability
Client systems a mix of Win XP, Vista, 7 and Linux (32b/64b)
Typically 1000 hosts active
50% failure rate
Most failures not GPU-related - BOINC problems, timed out
WUs
GPU problems:
Factory overclocked cards
Persistent problem with cuFFT on original GTX260 cards
Few memory errors. Run a memory tester periodically.
AceMD Performance - ApoA1
Program   # GPU            ms/step
AceMD      1                  74
           3                  30
NAMD       4                  87
          16 (4 nodes)        27
          60 CPU              44
AceMD GPUs are GTX 280 (1 Tflops); NAMD GPUs are G80 (500 Gflops)
Harvey et al., JCTC (2009)
AceMD runs exclusively on the GPU(s), using host only for IO (+comms in
parallel version).
NAMD runs on the host, passes selected work to the GPU.
PCI bandwidth low (<6 GB/s), GPU memory 80+ GB/s, host memory 30 GB/s