GPU Programming - Science topic

Explore the latest questions and answers in GPU Programming, and find GPU Programming experts.
Questions related to GPU Programming
  • asked a question related to GPU Programming
Question
4 answers
Hello everyone,
I am trying to write an algorithm that can roughly estimate the support volume of an STL file at a specific orientation. Does anyone have any ideas on where and how to start? I am trying to do this in Python, but based on what I have read, this can be a GPU-computational and geometric-design problem, so I am not sure if Python is a good place to start. I appreciate any ideas/responses. Thank you.
  • asked a question related to GPU Programming
Question
4 answers
Dear Friends,
I would like to know the best method to follow for a MATLAB-based parallel implementation, using the GPU, of my existing sequential MATLAB code. My code involves several custom functions and nested loops.
I tried converting it to a CUDA MEX function using MATLAB's GPU Coder, but I observed that it takes much more time (than the CPU) to run the same function.
Proper suggestions will be appreciated.
Relevant answer
Answer
MATLAB may launch a parallel pool automatically based on your selections. To enable this option, go to the Home tab's Environment group and click Parallel > Parallel Preferences, followed by Automatically create a parallel pool. Set your solver to parallel processing mode.
Make a row vector with values ranging from -15 to 15. Use the gpuArray function to move it to the GPU and build a gpuArray object. To work with gpuArray objects, use any MATLAB function that supports gpuArray. MATLAB uses the GPU to do computations automatically.
To start a parallel pool on the cluster, use the parpool function. When you do this, parallel features like parfor loops and parfeval execute on the cluster workers. If you utilize GPU-enabled functions on gpuArrays, the operations will be executed on the GPU of the cluster worker.
  • asked a question related to GPU Programming
Question
4 answers
I have read descriptions of several functions from about 10 different packages allowing GPU acceleration. From what I have understood so far, GPU acceleration can only be applied to functions coded in C/CUDA.
Am I wrong in thinking that my own R functions, which themselves require functions from different R packages, cannot benefit from GPU acceleration until I "translate" my functions and all dependent functions into C/CUDA?
In other words, does it mean that almost all R packages created so far are unable to benefit from GPU acceleration?
Relevant answer
Answer
Basically, as you correctly stated, you can run specific algorithms implemented in R packages on the GPU, but general R code does not run there.
Still, R has some useful GPU libraries (https://cran.r-project.org/web/views/HighPerformanceComputing.html), even if it is not R itself that runs directly on the GPU but the underlying C/CUDA code.
Best wishes,
Francesco
  • asked a question related to GPU Programming
Question
5 answers
I am looking to solve for the first few eigenvectors of a really large sparse symmetric matrix (1M x 1M), and I would like suggestions as to which library will be best suited for my purpose. I need to compute the eigenvectors within a few seconds. I have made use of the library Spectra, which uses the iterative Arnoldi method to solve for the eigenvectors, but the computation time has let me down. I have also looked into CUDA/MKL, but these libraries tend to take advantage of existing architectures. Is there some alternative, or is CUDA/MKL/OpenCL the standard these days?
  • asked a question related to GPU Programming
Question
4 answers
I am working on continuous-time nonlinear dynamical systems which require a lot of numerical computation. Previously, I used 'for' loops to solve the system for a range of parameter values in order to investigate the bifurcations and regions of stability. Then I used vectorization, which made my code faster. I have an NVIDIA GTX 1060 6 GB card installed on my PC. I don't know whether GPU programming will help me get results faster or not, so I would like your suggestions on whether I should start learning GPU programming, and if the answer is yes, where I should start.
Thank you.
Relevant answer
Answer
Thank you @Kristoffer and @Antonio for your help. Also sorry for the late reply. I will follow your instructions and will let you know if I get positive results.
  • asked a question related to GPU Programming
Question
2 answers
I am using NVIDIA K20x cards on my machine. I am facing some problems in using the GPU-enabled version of LAMMPS to its full efficiency on the GPU node. The main role of the GPU is to offload the time-consuming steps of building the neighbour list and then calculating the forces between these pairs from the CPU. However, I am only able to use the force-calculation part; when I enable neighbour-list building on the GPU it gives an error. I have also tried re-compilation but face the same problem. So, please let me know the solution or the mistake I am making.
I am getting an error something like this: "Cuda driver error 700 in call at file 'geryon/nvd_timer.h' in line 76".
Relevant answer
Answer
Hi Mr Shrivastav,
Did you get any help? I have the same problem with LAMMPS.
Please help me if you can.
  • asked a question related to GPU Programming
Question
1 answer
I have an idea about name scope: it names a group of ops so they can be viewed in the graph. But I am losing my mind over variable scope. I can't get it right in order to execute my code on multiple GPUs.
Relevant answer
Answer
Variable scope allows you to create new variables and to share already created ones, while providing checks so that you do not create or share by accident. For details, see https://www.tensorflow.org/api_docs/python/tf/variable_scope
I think the best way to understand it is through the tensorgraph debugger.
  • asked a question related to GPU Programming
Question
2 answers
Hello everyone, I have a question regarding the title.
In your opinion, what are the hottest topics that I can work on for a master's thesis?
Thank you.
Relevant answer
Answer
Hi Hisham,
If you already have the serial code and you have already decided that the thesis framework is accelerating this specific code using OpenACC, then the "hottest" topics could be "which adaptations of the original code/algorithm can be done to help OpenACC improve the application's performance" and "how do your proposals work on different GPUs (if you have more than one available), i.e., how general are they".
Of course, a comparison with the corresponding CUDA implementation can also be interesting.
Good luck,
Eduardo
  • asked a question related to GPU Programming
Question
18 answers
After reading the proceedings "Best Practices in Running Collaborative GPU Hackathons: Advancing Scientific Applications with a Sustained Impact" I came across the paper "Porting the MPI Parallelized LES Model PALM to Multi-GPU Systems – An Experience Report". GPUs are single-instruction, multiple-thread (SIMT) machines and therefore not a cache-optimized computer architecture. My experience is that unless you restructure the whole kernel, the hybrid MPI+GPU approach hurts performance a lot. However, there is a boom in HPC with GPUs. In some instances, MPI alone performs better than the hybrid approach, since we do not need to move information back and forth between the CPU and the GPU. We have various good practices, but there is no standard or reference that we can always take to the bank. The picture becomes even worse when high-order schemes are used.
1) What is your common approach to these issues in CFD?
2) Should we keep the structure of the source code as functional or imperative programming, with a bunch of do .. end do loops in each subroutine, even though that affects the performance a lot?
3) Should we use a data region (where the data remains local to the GPU) in which we pack all the computations, even though we hurt the readability of the source code?
4) Should we update the ghost cells in every time step?
Again, the focus is only on CFD.
Relevant answer
Answer
While there is no universal answer to your question about hybrid computation, more and more people consider the directive approach (OpenACC / OpenMP) to be the nominal way to go for porting large legacy codes, mostly because OpenACC / OpenMP directives can be added to the code incrementally.
An interesting emerging alternative, which is probably better adapted to new codes aiming at both performance and portability across architectures (x86_64 multicore, NVIDIA GPU, AMD GPU, ARM, ...), is to use a "smart" library like Kokkos (developed at Sandia National Laboratories) or RAJA (developed at Livermore National Laboratory), or others, to implement shared-memory parallelism. These libraries provide "generic" concepts to express parallel loops (for, reduce, and so on). They focus on node-level parallelism and need to be used in combination with MPI for large-scale distributed computation.
Another major advantage of these libraries, when compared to the directive approach, and this is especially true for Kokkos, is that they also provide "architecture-aware" data containers. E.g. Kokkos provides arrays (called Kokkos::View) for CPU/GPU/... which let you access data intuitively as data(i,j,k) and simplify memory management when dealing with GPUs. Note that the Kokkos library is a core building block of the large parallel linear algebra package Trilinos.
I think that portability, and more precisely "performance portability", is a key concept today, that is, being able to develop codes that can run efficiently on multiple platforms, not just one.
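To make the Kokkos style a bit more concrete, here is a minimal sketch (not taken from any particular code; the array names and sizes are made up) of a View plus a parallel loop and reduction. The same source compiles for a multicore CPU or, with the CUDA backend, for an NVIDIA GPU.
```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1000000;

    // Architecture-aware containers: they live in device memory
    // when the code is built with the CUDA backend.
    Kokkos::View<double*> x("x", N);
    Kokkos::View<double*> y("y", N);

    // Parallel initialization: the lambda body is run once per index i.
    Kokkos::parallel_for("init", N, KOKKOS_LAMBDA(const int i) {
      x(i) = 1.0;
      y(i) = 2.0;
    });

    // Parallel reduction: a dot product combined across all threads.
    double dot = 0.0;
    Kokkos::parallel_reduce("dot", N, KOKKOS_LAMBDA(const int i, double& sum) {
      sum += x(i) * y(i);
    }, dot);

    printf("dot = %f\n", dot);   // expect 2 * N
  }
  Kokkos::finalize();
  return 0;
}
```
MPI would still be used on top of this for the distributed (multi-node) part, exactly as described above.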
  • asked a question related to GPU Programming
Question
2 answers
How can I read a .mhd 3D image? Can you please provide MATLAB example code?
Relevant answer
Answer
You can read/see them in a number of image viewers such as ITK-SNAP, Slicer, ImageJ etc. Or, you can play with the data in python/matlab.
  • asked a question related to GPU Programming
Question
7 answers
Are there any tools for creating CUDA projects using Qt Creator?
Relevant answer
Answer
I confirm what Sidi said.
Indeed, I have created a Docker image with CUDA and Qt Creator in my Docker Hub registry.
You can check and use this image: https://hub.docker.com/r/amine2733/qt-opencv-gpu/
  • asked a question related to GPU Programming
Question
4 answers
I have a question about the Keras library for deep learning.
I have 3 GPUs:
GTX 1080 TI
GTX 1080 TI
Tesla K20
I want to know how I can run one model on all three GPUs.
Relevant answer
Answer
I confirm that you can use Keras (>2.0) to exploit multiple GPUs. However, the three GPUs need to be from the same generation.
In your case, there is no problem using the two GTX 1080 Ti cards, but for the third one (Tesla K20) I'm not sure. It is a different card.
Please, if you test that, can you let me know whether it worked with all three?
Hope this helped.
Best regards,
Sidi,
  • asked a question related to GPU Programming
Question
4 answers
more explanation on implementation
Relevant answer
Answer
I don't think so; this is a device with 1 GB of main memory. If you are thinking of exploring small things, maybe you can do something, but if you are thinking of doing some real work then it won't be enough. Efehan's advice is definitely a good one, although expensive :)
  • asked a question related to GPU Programming
Question
9 answers
What would be a good starting point to learn GPU programming? 
Is CUDA the only programming language for it?
Relevant answer
Answer
Hi,
I don't think CUDA is that much easier than OpenCL. I have programmed in both since their early stages (2008). OpenCL has some extra boilerplate code (which you can write once and use in all your projects, like me!), but the kernels (the functions executed on the GPU) look very similar. You can use OpenCL on all GPUs (even NVIDIA ones), Intel and AMD CPUs, FPGAs, etc. CUDA is only for NVIDIA GPUs. Performance-wise they are very similar, though NVIDIA pushes towards CUDA.
There are several books on OpenCL now. See "OpenCL Programming by Example" or "OpenCL in Action". Also, take a look at the AMD OpenCL university kit (free, link below). After you get familiar with the concepts, read the OpenCL manual cover to cover, as well as the AMD and NVIDIA optimization guides, so you can write efficient programs.
Hope this helps!
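To illustrate the point about the kernels looking very similar, here is a minimal CUDA sketch of an element-wise vector addition (illustrative only); the OpenCL version differs mainly in the qualifiers (__kernel, __global) and in obtaining the index with get_global_id(0) instead of the block/thread built-ins.
```cpp
// Element-wise vector addition written as a CUDA kernel.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last partial block
        c[i] = a[i] + b[i];
}

// Host-side launch (device pointers d_a, d_b, d_c assumed allocated already):
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
```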
  • asked a question related to GPU Programming
Question
4 answers
Dear friends with experience in high(er) performance computing,
My lab is planning to purchase a computer specifically for GPU-based rendering (fancy picture / video generation based on CAD-type files).  Currently mainly using Blender software.  
We have spec’d out a machine (below).  Could you give me some feedback on any improvements you think would make it better?  We are working with a budget of about $7-10k.  
REALLY appreciate your help/advice!
 CPU:
Intel - Xeon E5-2697 V4 2.3Ghz 18 Core ~ $2495.99
CPU Cooler:
Corsair - H115i 104.7 CFM Liquid CPU Cooler ~ $129.99
Motherboard:
Asus-X99-E-10G WS SSI CEB LGA2011-3 ~ $639.99
Memory:
Corsair - Vengeance LPX 64GB (8x8GB) DDR4-2400 ~ $449.99
Storage:
• Samsung 850 EVO 1TB 2.5" SATA III Internal SSD (MZ-75E1T0B/AM) ~ $356.99 w/mount
• Toshiba 5TB X300 7200 rpm SATA III 3.5" Internal HDD ~ $146.84
Video Card:
2x - NVIDIA - Titan Xp 12GB Video Card (2-Way SLI) ~ $1432.15 each ~ $2864.30
Power Supply:
Corsair - AX1500i 1500W 80+ Titanium Certified Fully-Modular ATX ~ $409.99
Computer Tower:
Thermaltake ATX Full Tower Cases CA-1H1-00F6WN-00 ~ $249.99
Total cost for core computer parts ~ $7744.07
Relevant answer
Answer
Also, I forgot to comment on the rest of the components. Those are fine. The Samsung EVO drive is great. There are some SSD drives that have the M.2 interface, and that is significantly faster. You may consider adding a cheaper but large additional 7200 rpm drive for data storage... keep the OS on the SSD drive, as well as any data that needs to be accessed quickly.
  • asked a question related to GPU Programming
Question
8 answers
I know some libraries popular among researchers focusing on deep reinforcement learning, such as TensorFlow, Theano, Caffe, Torch and CNTK. I usually work on both CPU and GPU. I don't need much detailed configuration and customization, but computational time is important. Can anyone recommend which library is best for my case?
Thank you in advance
Relevant answer
Answer
Since computational time is your concern, I would recommend TensorFlow because it supports distributed computing and multiple GPUs.
  • asked a question related to GPU Programming
Question
5 answers
CUDA C/ C++ for Windows 7 PC.
Relevant answer
Answer
Me too.
  • asked a question related to GPU Programming
Question
5 answers
Since we don't know the expected generated code or execution context.
Relevant answer
Answer
You can use GPU profiling, which will tell you what portion of the GPU execution time is spent in a given piece of code.
But if you mean the theoretical efficiency, then you must analyze the primitive operations of your algorithm. Please refer to big-O notation and books on parallel algorithms if you want to analyze it theoretically.
Hope this helps.
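As one concrete (and hedged) example of measuring the GPU portion from inside the program, CUDA events can be used to time a kernel; profilers such as nvprof or Nsight report the same information without code changes. The kernel name below is only a placeholder.
```cpp
#include <cuda_runtime.h>
#include <cstdio>

void timeKernelLaunch()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // myKernel<<<blocks, threads>>>(...);   // the kernel you want to measure (placeholder)
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait until the recorded work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("GPU time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```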
  • asked a question related to GPU Programming
Question
7 answers
I need to buy a computer with a GPU for my thesis, which is about image segmentation with graph-cut algorithms.
Can anybody help me identify the best hardware and software for this application?
Relevant answer
Answer
You should not go for a GPU solution out of the blue, unless you have already made an analysis of the specifics of the problem and are SURE that it can benefit from the GPU's characteristics.
If you have access to the OpenACC compilers, the port from a normal CPU implementation to GPU is not very complicated.
  • asked a question related to GPU Programming
Question
4 answers
Hi all,
I have been trying to process EEG data in MATLAB 2009 using GPUmat. However, I am stuck right at the GPUstart command: it gets stuck at assigning the cubin path. I can see that while reading the major and minor revisions of the GPU, major = 5 and minor = 0. However, I can't see conditionals in the program for major = 5. The maximum I can see is major = 3, for the Kepler architecture.
Can anyone walk me through this process ??
Relevant answer
Answer
Thanks for your responses. However, I have already explored these options. The problem is that the PCT in MATLAB is supported only from R2010a onward, and GPUmat only supports up to the Kepler architecture.
I am looking for freeware to help me use MATLAB 2009 with a Quadro K620.
  • asked a question related to GPU Programming
Question
6 answers
I am trying to adaptively sample a uniform pixel grid, based on a specific pixel position and a linear (or close to linear, depending on the method) falloff in importance/sampling depth based on this position.
I already have several approaches in mind and currently we rely on a sorting-based method, but it won't be fast enough at higher resolutions (currently we run 1024^2, but it could be arbitrarily high, depending on the output device).
Is there any method allowing for quickly sampling such a distribution? The main challenge here seems to be that we absolutely do not want any duplicates, which is why there has not been much progress with standard importance sampling so far.
Also, the pixel position which is used for distance computation may change all the time (from frame to frame), and we need all that stuff to be done for hundreds of frames per second with tens of thousands of samples per frame. In addition, it would be absolutely great if such a method could guarantee a fixed number of samples being generated. This would be no problem with importance sampling, but the only way of avoiding duplicates when using this is to mark sampled pixels and reject a sample if it hits the area of a pixel that has already been sampled, making it hard to predetermine the time the sampling step is going to take.
Does anyone know of a more elegant approach to this issue? If that's important in any way, we use CUDA for all our stuff.
Relevant answer
Answer
You have quite an interesting problem there Thorsten. First of all, you're absolutely right that mailboxing (=sample marking and checking) is not something you want to do in a parallel environment if you have so many samples to generate.
There are two approaches you could consider. One is rejection sampling: first you compute an upper bound on the image value (should be easy from your description). Then for every pixel you generate a random number and decide whether to sample it or not. I'm not going to derive the formula, but based on a) the pixel value, b) the upper bound from the first step, c) the number of pixels in the image and d) the number of desired samples, you should be able to derive the probability of sampling every given location, so that overall the expected number of samples in the image is what you want.
Second approach could be seen as halftoning (dithering) the image. Look for Floyd-Steinberg if you're not familiar with this. Point is, you can interpret your problem as approximating the grayscale 'distance' image by a binary function (where '1' naturally means placing a sample in a given pixel). This is then precisely what halftoning via error diffusion does. Note that this will also require you to normalize the image, but that's doable in real-time I think.
Both of these should work, although the first one is much more parallelizable. The second one, on the other hand, will lead to a better spatial distribution of samples (i.e., it'll avoid sample clustering). Admittedly, neither will give you exactly the number of samples you specify, but the expected number of samples will correspond to the desired one, and at that number of samples the deviation will be quite tiny, I expect. On the other hand, both of the methods guarantee you 0 or 1 sample per pixel.
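A rough sketch of how the first (per-pixel, independent) approach could look in CUDA is shown below. It is not the exact formula hinted at above: here each pixel is kept with a probability proportional to its importance, scaled by a precomputed total so that the expected number of selected pixels is the target count. All names (d_importance, totalImportance, targetSamples) are illustrative assumptions.
```cpp
#include <curand_kernel.h>

// One thread per pixel: each pixel is selected at most once, so duplicates
// are impossible by construction, and no marking/mailboxing is needed.
__global__ void samplePixels(const float* d_importance, unsigned char* d_selected,
                             int numPixels, float totalImportance,
                             int targetSamples, unsigned long long seed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;

    // Independent RNG state per pixel (no inter-thread communication).
    curandState state;
    curand_init(seed, i, 0, &state);

    // Keep pixel i with probability proportional to its importance, scaled so
    // that the expected number of selected pixels is roughly targetSamples.
    float p = fminf(1.0f, targetSamples * d_importance[i] / totalImportance);

    d_selected[i] = (curand_uniform(&state) < p) ? 1 : 0;
}
```
totalImportance would be recomputed each frame with a single parallel reduction, and a compaction step (e.g. a prefix sum) can turn the selection mask into a dense list of sampled pixel indices.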
  • asked a question related to GPU Programming
Question
6 answers
I want to model a semi-submersible propeller, which is a multiphase problem. My target is to show the advantage of using a GPU accelerator in solving a complex problem and to discover how much GPUs can speed up the solution process. I am going to use ANSYS Fluent v16.0. Is Fluent capable of using a GPU like the Tesla K80 to meet my requirement? Do I have to write code, or are there any other considerations, in order to solve my problem? Any advice will be greatly appreciated.
Relevant answer
Answer
I found a document from ANSYS which declares that it is impossible to do this (see page 17).
  • asked a question related to GPU Programming
Question
1 answer
I have a very simple SAT solver, called CCC (http://fmv.ektf.hu/files/CCCv1.0.zip), and I would like to port it to the GPU. How should I start? Which GPU framework should I use? Are there any GPU-based SAT solvers?
Relevant answer
Answer
Yes, there are...
A quick search on GitHub found three SAT-solver projects which use the GPU.
For CUDA version you can check cuda-sat-solver:
For OpenCL you can check clsat and opancl-satsolver:
The framework is a hard decision.
You can either use CUDA or OpenCL.
CUDA is used by NVIDIA. Its advantage is that HPC environments also use NVIDIA GPU clusters.
OpenCL is an open standard, used by many companies. It should be the future. Some CPUs also support it.
  • asked a question related to GPU Programming
Question
19 answers
I suspect that the Xeon Phi gives slightly less accurate results in 32 bit floating point vector arithmetic than conventional Intel processors. Has anyone else got information on this?
Relevant answer
Answer
A brief follow-up... I just found two relevant articles, related to the topic discussed above.
One is "Differences in Floating-Point Arithmetic Between Intel® Xeon® Processors and the Intel® Xeon Phi™ Coprocessor", found at https://software.intel.com/en-us/articles/differences-in-floating-point-arithmetic-between-intel-xeon-processors-and-the-intel-xeon.
The other one is "Consistency of Floating-Point Results using the Intel® Compiler or Why doesn’t my application always give the same answer?", at https://software.intel.com/sites/default/files/managed/9d/20/FP_Consistency_300715.pdf.
These may be of interest to anyone using both Intel® Xeon® processors and Xeon Phi™ coprocessors, or the Intel® Compiler (or even if just performing floating-point computations in general).
  • asked a question related to GPU Programming
Question
3 answers
I want to use the GPU for calculating non-overlapping floating-point array values.
Relevant answer
Answer
Also, you can try OpenCL.
  • asked a question related to GPU Programming
Question
4 answers
I am using a Tesla K20. I got an error that the shared memory is limited to 16K, although the K20 supports up to 48K. How do I configure the GPU and the NVCC compiler to use 48K of shared memory instead of 16K?
Relevant answer
Answer
You will probably want to also take a look at cudaFuncSetCacheConfig, documented here:
For shared memory use on the K20 you may want to configure this to "cudaFuncCachePreferShared".
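A minimal sketch of that configuration call, assuming a kernel named myKernel that uses dynamic shared memory (the kernel itself is only illustrative):
```cpp
#include <cuda_runtime.h>

__global__ void myKernel(const float* in, float* out, int n)
{
    extern __shared__ float tile[];                 // dynamically sized shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // stage data in shared memory
    __syncthreads();                                // whole block reaches the barrier
    if (i < n)
        out[i] = tile[threadIdx.x];
}

int main()
{
    // Prefer the 48 KB shared memory / 16 KB L1 split for this kernel.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    // Or set a device-wide default for all kernels.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    return 0;
}
```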
  • asked a question related to GPU Programming
Question
7 answers
I'm doing equilibrium and steered molecular dynamics simulations with GROMACS 4.6.5 on a single computer with 2 GPUs and 8 CPUs. I have found that simulations on the GPU+CPU go almost 3 times faster than when they are run only on the CPU. I would like to know whether this performance is reasonable or whether it can be increased.
Relevant answer
Answer
People looking to understand the nuts and bolts of GROMACS performance and how to get the most out of GPUs should (1) use the most recent software, rather than 4.6.5 which is quite outdated at this point and (2) read the following papers:
The performance depends on the quality of the hardware and the balance between CPU and GPU workloads. One cannot necessarily slap on an arbitrary GPU and expect massive gains.
  • asked a question related to GPU Programming
Question
7 answers
The classification process takes a long time (11 hours for a picture) but gives excellent results. I have already shrunk my training bank to the minimum. The ClassificationKNN file stores many types of information, cells and arrays; uploading this bank to parallel processing or GPU processing causes errors. (MATLAB preferred.)
My background is more agricultural than computer science.
Thanks for the help
Relevant answer
Answer
I have three arrays of 20170320 pixels and a bank of 14500 points in 7 classes. If the bank is smaller, the classification is less good.
Thanks for your help.
I am currently trying a faster method... it takes time to get results.
  • asked a question related to GPU Programming
Question
3 answers
Relates to GPUs and Xeon Phi
Relevant answer
Answer
Generally, GPUs get through one instruction per cycle and then hide latency by switching threads. I think that the scatter/gather instructions in NVIDIA GPUs are one cycle if all the data is in the same cache line, otherwise it slows down a lot. There's a whole banking issue with the cache lines as well, which is complex. So, if you actually use the scatter/gather capability to actually scatter and gather, performance slows a *lot*.
  • asked a question related to GPU Programming
Question
6 answers
especially in Geant4 MPI
Relevant answer
Answer
Although you mention Geant4, you haven't really described which results you're trying to merge. Such merging obviously may depend on the physics, rather than on the trivial advice to use gather/reduce. Mostly, it's not really clear what you have done and what you want to do. My impression is that Geant4 has reasonable MPI support built in:
so are you already using G4MPImanager/G4MPIsession ?
  • asked a question related to GPU Programming
Question
13 answers
I have somewhat parallelized C++ code running on a compute server with 22 boards.  It does not use the GPU at all.  Is there a way to re-compile my code to use the GPU without my having to re-code anything?  It need not be freeware, and I can buy a new compiler if need be.
Relevant answer
Answer
No, that won't work as you intend.
There is a new compiler, Sambamba, that parallelizes automatically, but for CPUs, not for GPUs.
In GPU programming, a huge set of processors executes the same function, called the kernel. The threads that are running are only distinguished by coordinates (1, 2, or 3 dimensions of a grid), and these coordinates can be used to identify the part of the problem they have to work on.
See "Thread execution" and the following chapters in
And "GPU kernels" (page 16) in
Regards,
Joachim
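A small illustration of what Joachim describes (a hypothetical example, not related to your particular code): the CPU loop disappears and each GPU thread uses its grid coordinates to find the element it owns, which is why a plain recompilation cannot do this for you.
```cpp
// CPU version:
//   for (int i = 0; i < n; ++i) data[i] *= factor;

// GPU version: the same work expressed as a kernel.
__global__ void scale(float* data, float factor, int n)
{
    // 1-D grid: each thread derives the index of the one element it handles.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Launch with enough threads to cover all n elements:
//   scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
```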
  • asked a question related to GPU Programming
Question
7 answers
I often use Mathematica in my scientific work. Sometimes computations are very complex and take a lot of time on a standard PC (4-6 physical cores or 8-12 logical cores). I try to use parallel calculations, but they only accelerate things a few times. Generally, I numerically solve single or double integrals with nonlinear equations which can also contain integrals. Each such operation is calculated in non-parallel mode, but I need a series of results which I can calculate in parallel (for example with ParallelTable[]).
I have a question about CUDA in Mathematica. Theoretically, a very efficient graphics card should give significant acceleration. After reading the Mathematica manual we can state that it is easy to calculate matrix operations with CUDA, but integrals and nonlinear equations are rather difficult to handle. My question is whether somebody has good experience in solving such problems using GPU computing and could give me clear tips, literature, etc.?
Relevant answer
Answer
CUDA processors are SIMD (single instruction, multiple data) architectures. They are extremely efficient at computation on data elements that do not depend on one another. Depending on the algorithm, several dependencies may arise that give way to a break in parallelism. This break in parallelism usually takes place through conditionals.
You would need to be specific on the algorithm to get a more specific answer, since integration could benefit from reduction operations in CUDA, but again it depends on the algorithm.
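As a hedged illustration of the reduction remark, here is a sketch of a midpoint-rule integration in CUDA: every thread sums an independent subset of sub-intervals (no data dependencies), and the partial sums are combined at the end. The integrand is just an example.
```cpp
__device__ float f(float x) { return x * x; }     // example integrand

// Estimate the integral of f on [a, a + n*h] with the midpoint rule.
__global__ void integrate(float a, float h, int n, float* result)
{
    float partial = 0.0f;
    // Grid-stride loop: thread t handles sub-intervals t, t + stride, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        float x = a + (i + 0.5f) * h;             // midpoint of sub-interval i
        partial += f(x);
    }
    atomicAdd(result, partial * h);               // combine the partial sums
}
```
A shared-memory tree reduction per block would be faster than the atomic add, but the structure of the computation is the same; *result must be zeroed before the launch.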
  • asked a question related to GPU Programming
Question
8 answers
For a real-time visual attention system based on human visual attention, is it better to use only an FPGA, a GPU, or an FPGA + dual-core ARM?
Relevant answer
Answer
The key to succeeding in this kind of project is to properly partition the application: which blocks will be executed on the dual-core ARM and which blocks will be executed in the FPGA fabric. You must know the formulas of your algorithm(s) and see whether or not they show a dependency with respect to time. If the algorithm(s) do not show such a dependency, they can easily be implemented in the FPGA using VHDL. If the algorithm shows a dependency, you must look for ways to convert the serial algorithm into a parallel version and take advantage of languages such as VHDL. Another option is to write in a high-level language such as C/C++ and use the OpenCL implementation from Altera or Vivado HLS and see what hardware they produce, but note that both OpenCL and Vivado HLS have limitations, and I am not sure whether the hardware those tools produce will be efficient. Finally, use the dual-core ARM just as a control unit, to start/end tasks in your algorithm.
  • asked a question related to GPU Programming
Question
13 answers
I developed JPEG image compression using MATLAB. How can I execute the same program using the GPU? Please give suggestions regarding this.
Relevant answer
Answer
GPU programs are significantly different from CPU programs. In general, programs running on a CPU cannot be executed on the GPU. Most likely you cannot just run your Matlab code on the GPU.
The easiest way is to use special functions from the Parallel Computing Toolbox (which might not be included in your MATLAB version). These functions will then automatically be executed on the GPU. The only question is whether your algorithm is based on these functions. In any case you need to rewrite your code for the GPU.
Usually, GPU code is written in a special language, CUDA or OpenCL. The previous solution is just a wrapper to this. Matlab supports GPU functions (called kernels) written in CUDA to be executed. This is also part of the Parallel Toolbox. Note that CUDA only supports nVidia graphic cards. I haven't read up on this, but it might be possible that the Parallel Toolbox in general only works with nVidia graphic cards.
For further information I'd like to refer to the attached links. (BTW these are the first hits on Google!)
  • asked a question related to GPU Programming
Question
10 answers
I have a kernel which shows poor performance. nvprof says that it has low warp execution efficiency (page 3 in the attached PDF) and suggests reducing "intra-warp divergence and predication". Am I right that intra-warp divergence is any if-then statement which creates branching? For example, this causes divergence: if (x<0.5) then action; but this doesn't: if (threadId.x == 0) then action2. Right? If this is right, then my kernel doesn't have any divergence. Another issue I'm concerned about is that I'm using a lot of shared memory while all variables are double precision, which causes bank conflicts in the case of CC 2.x with subsequent serialisation of calculations. What can we do in this case? Could increasing the number of active warps help hide the latencies caused by serialisation?
Relevant answer
Answer
Sorry Mikhail, I forgot to mention that the answer was about intra-warp divergence. In CUDA, there are two kinds of control (execution) path divergence: inter-warp and intra-warp.
I'll try to give an illustrative example. Imagine there are two squads of (say) 32 soldiers each, executing orders of one officer. The officer can give only one order to only one of the squads at the given moment (limitation of SIMD machines). Their task is to to prepare dinner and dig trenches. The first squad is "armed" with knives and potatoes, while the second is "armed" with shovels. The officer gives the order to the first squad to lift their knives. While they are lifting their knives, they can not do anything else, so the officer gives the order to the second squad to lift their shovels. By then, the first squad finished lifting their knives, so the officer gives them the order to lift the first potato. By then, the second squad finished lifting shovels, so they are issued the order to thrust the shovels into ground. By then, the first squad finished lifting potatoes, so the officer gives them the order to put the blade of their knives on the potatoes they are holding. The officer turns his attention to the second squad. Unfortunately, one of the soldiers hit a solid rock with his shovel, so the officer tells the other soldiers of the squad to wait (with their shovels thrust into ground), and addresses specifically the soldier who hit the rock to bend down and reach for the rock. The officer turns his attention to the first squad and tells them to rotate their potatoes by 20 degrees (he'd probably have to specify the axis of rotation and direction, but that is irrelevant here :) Then the officer turns his attention to the second squad, tells the soldier who hit the rock to grab the rock, and the other soldiers of the second squad to wait. Back to the first squad, the officer tells them to rotate their potatoes by 20 degrees (with the same remark as before). Back to the second squad, the officer tells the solder who grabbed the rock to throw the rock away (he'd probably have to warn the soldier to avoid hitting the rest of the squad with the rock, but we'll ignore that :) and the other soldiers of the squad to wait. Back to the first squad, another 20 degrees. Back to the second squad, the officer tells the soldier who just threw the rock away to thrust his shovel into ground and the other soldiers to wait. Back to first squad, another 20 degrees. Back to the second squad, the officer issues the order to the whole squad to lift their shovels (hopefully, with some soil on them) - because the problematic solder is now in the same state as the others from his squad.
I guess you figured out from this illustration that each squad is actually a warp, and a soldier is actually a thread. Warps compete for the commander's attention (concurrent execution), but within each warp threads execute their work in parallel. That is, in parallel - until at least one of them has to execute a different instruction from the others. In that case, some threads will have to wait until the others finish their work. There is loss in parallelism until their execution paths converge into a single path once again. As you could see, there is no loss in efficiency among two different warps (squads) even though the work they did was completely different. Put simply, the work of the first squad was not influenced by the second squad and vice-versa. That is called inter-warp divergence. On the other hand, there was some loss in efficiency within the second squad because one of the soldiers was unable to catch up with the rest of the squad - so the rest of the squad was forced to wait for him. Soldiers in that squad had to execute, at least partially, different orders. That is called intra-warp divergence and it causes some penalty. In the example I gave, the penalty was not great. Imagine that there was another soldier who hit a gold bar, and another one who hit an electric cable. The officer would probably not give the first one the order to throw the gold bar away, so other soldiers (including the one who hit the rock and the one who hit the electric cable) would have to wait for the commander's attention. The worst possible situation would be when every single soldier within the squad hits something different with the shovel - different in the sense that it requires a different kind of action. The commander would have to tell each one of them how to solve the problem, while the others (within the same squad) would have to wait.
I hope I managed to explain (in essence) what is intra-warp divergence :)
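Translating the two cases from the question into code, as I understand the explanation above (a hypothetical kernel, for illustration only):
```cpp
__global__ void divergenceExamples(const float* x, float* out, int* blockFlags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Data-dependent branch: lanes of the same warp will usually see a mix of
    // values above and below 0.5, so both paths are executed one after the
    // other with part of the warp masked off -- intra-warp divergence.
    if (x[i] < 0.5f)
        out[i] = 2.0f * x[i];
    else
        out[i] = x[i];

    // Branch on the thread index: only the first warp of each block diverges
    // here (its lane 0 takes the branch while the other 31 lanes wait briefly);
    // in all other warps the condition is uniformly false, so they do not
    // diverge at all.
    if (threadIdx.x == 0)
        blockFlags[blockIdx.x] = 1;   // e.g. some per-block bookkeeping
}
```
So, as I read it, the first example from the question does diverge within a warp, while the second one costs almost nothing; the warp execution efficiency metric reflects exactly this masking.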
  • asked a question related to GPU Programming
Question
4 answers
Hello friends.
Like always, you have helped me. I am asking another question.
We are conducting a survey on GPU-based databases. For that we need SQL processing algorithms and the respective papers/articles published so far.
I have already tried and found the following databases.
So please help me by replying with links to other databases (SQL query processing algorithms).
Thanks in advance...
Relevant answer
Answer
There is a project to speed up PostgreSQL with GPGPU; this is a link to a pretty good presentation about it: http://www.slideshare.net/kaigai/gpgpu-accelerates-postgresql
  • asked a question related to GPU Programming
Question
3 answers
I have to solve a set of non-linear equations in a 200x200x200x1000 variable space. I developed parallel code in MATLAB for that, but it takes a very, very long calculation time. The solution could be GPU-based algorithms. However, it is not cheap to get the license for commercial GPU code.
Do you have any idea of home-made code for a GPU-based algorithm to solve a set of 7 linear equations over a 4D variable space in MATLAB with MPI?
Thank you
Relevant answer
Answer
In order to solve large non-linear systems in parallel with MPI, you can use PETSc; it is very efficient.
  • asked a question related to GPU Programming
Question
6 answers
Is there any proposal to add timeouts to MPI?
Let's say that I would like to perform an MPI_Recv which could block.
In some cases I would like to limit the time that the MPI_Recv can be blocked. It would be useful to have a receive timeout in those cases.
I know that this can actually be done with MPI_Irecv & MPI_Test, but I wanted to know if anyone was able [somehow] to define a Rx Timeout parameter to the MPI environment and then get a [new] MPI_ERR_TIMEOUT when that occurs in a MPI_Recv.
I guess the same could be applied to other MPI calls.
Relevant answer
Answer
Hey David. The answer to your question is no. Not many people in the MPI forum are fans of the idea. When you think of how difficult it is to orchestrate MPI_Cancel, you can easily see why the idea is a hard sell.
As you said, you can implement it yourself on top of MPI... but then you have to watch out for things like spurious message matching; that is, messages that you thought were "dead" (but are not) get matched internally by the middleware, and at the application level you think they are messages subsequent to the one you cancelled (just one issue). I worked on that and can point to references if you are interested.
  • asked a question related to GPU Programming
Question
4 answers
I am unable to find a function to retrieve a web page inside a kernel. Could anyone share his/her work or point me to a useful tutorial?
Relevant answer
Answer
The webpage needs to be retrieved by the CPU. It does not really make sense to make this parallel. The reason for this is that you need to request the webpage via your network card (and your internet connection). There is a speed-limit to this. You will always need to wait for the content first. Todays CPUs will be fast enough to handle this amount of incoming data (just speaking of receiving and not analyzing). It would make sense, though, to send several requests in parallel. The reason for this is the implementation of TCP/IP which will start each request at a low rate and then over time will increase the speed according to your bandwidth. When you are only requesting the text of a webpage you will never reach the full potential of your internet bandwidth.
Analyzing the content of a webpage could be done in parallel, though. You would need to do tests to figure out if this would make sense. If the webpages are not coming in fast enough then a CPU might be enough to do the computations until the next webpage arrives. Only if the CPU cannot handle analyzing incoming webpages fast enough it makes sense to explore GPGPU implementations.
  • asked a question related to GPU Programming
Question
2 answers
We are accelerating some MATLAB code. We use a mix of gpuArrays and CUDA to achieve this. Does anyone have a best-practices document for that?
Does gpuArray provide any way to manage the number of blocks in a grid, the number of threads per block, shared memory, pinned memory, etc.?
Relevant answer
Dear Ahmed, this is a very general question. I have used the CUDA-Matlab integration for solving several types of problems. For instance, I have used it for solving large and sparse linear systems. Furthermore, I have used it for variable selection application in the context of multivariate calibration. In both cases, it was possible to demonstrate that the parallel version is faster than sequential one.
As far as I know, the Parallel Computing Toolbox (PCT) plugin still does not provide support for setting the number of blocks and threads per block. This is automatically defined by PCT in runtime through an optimized way.
I suggest you to read some papers of mine that describe this issue. Moreover, this link may be useful for you: https://www.ll.mit.edu/HPEC/agendas/proc07/Day3/11_Fatica_Poster.pdf
Best wishes and good luck with your research!
  • asked a question related to GPU Programming
Question
6 answers
On a standard multi-core machine, it is easy to specify the number of cores you want to use. As a comparison, it would be interesting to know how CUDA programs scale with the number of GPU cores that the program uses. How does one alter the number of cores that will be used to handle a task?
Relevant answer
Answer
Please read the first few chapters of this document; surely you can gain good knowledge of CUDA from it.
  • asked a question related to GPU Programming
Question
52 answers
I have noticed that CUDA is still preferred for parallel programming despite it only being possible to run the code on an NVIDIA graphics card. On the other hand, many programmers prefer to use OpenCL because it may be considered a heterogeneous framework and be used with GPUs or multicore CPUs. So I would like to know: which one is your favorite, and why?
Relevant answer
Answer
I think this debate is quite similar to DirectX vs. OpenGL. The only difference is that DirectX is bound to a single operating system, whereas CUDA is not. Both DirectX and CUDA are proprietary solutions whereas OpenGL and OpenCL are open standards (actually, both managed by the Khronos Group).
The question is really about your goals. Do you want maximum performance? Then CUDA might be the best choice. Updates for CUDA occur quite often supporting the most recent features of NVIDIA GPUs. But, you have vendor lock-in and can only use NVIDIA cards. Looking at other discussions, AMD graphics cards can provide "more bang for the buck", especially when it is about double precision operations. NVIDIA graphic cards artificially reduce the bandwidth for double precision computations. So, if you want to use CUDA with double precision you need to buy TESLA compute cards. These are more expensive. In summary, depending on your problem and if maximum performance (i.e. every last percent) is of your concern (and money is not) then CUDA is the best choice.
If you don't buy a new graphic card once a year to get the best performance or you don't have the money for an expensive card, OpenCL (and maybe AMD) is a valid alternative. Also, if you don't spend too much time on optimization and don't learn every new language/GPU feature as soon as it comes out, there is not much of a performance difference between OpenCL and CUDA in general.
My choice is really simple: I've chosen OpenCL. The reason for this is that I'm writing commercial software which should work with the hardware the client already has. He should not be forced to buy an NVIDIA card. Furthermore, hardware as old as back to 2010/2012 should still be supported, which means that I cannot use modern hardware features. With these restrictions there is not much of a performance difference between CUDA and OpenCL. But, the good thing is that I don't have to write my code twice--once for the GPU and once for the CPU. OpenCL kernels will support the CPU as well. (The reason for this is that even if a client does not have an OpenCL-capable GPU the software should still work.)
Looking at the battle DirectX vs. OpenGL, OpenGL has a good chance to win in the long run as Valve is switching to OpenGL and starts bringing out their own operating system SteamOS. The reasons are portability and compatibility. For the same reasons OpenCL might win over CUDA in the long run. Not now, but maybe 10-20 years from now. OpenCL will not die because it is necessary for Mac OS and for AMD hardware. The question is if NVIDIA will fight for CUDA the same way as Microsoft does for DirectX. In my opinion the cost of DirectX for Microsoft is already too high.
  • asked a question related to GPU Programming
Question
4 answers
I am running this GPU news group on Facebook and would like to have access to the latest research with GPUs: http://www.facebook.com/gpucomputing
Relevant answer
Hello, Martin! I have used GPUs for solving linear systems and for variable selection in multivariate calibration problems. In my profile you can access some of my papers.
Best regards!
  • asked a question related to GPU Programming
Question
1 answer
I want to read a video file using OpenCV CUDA C++. However, when I tried the code at the given link, it gave "OpenCV was built without CUDA Video decoding support". It has been suggested that I should decode the video using the CPU and process the raw frames using the GPU instead. I need to understand these terms and how I can achieve this.
Relevant answer
Answer
Video is usually highly compressed (e.g. MPEG, AVI) in order to save disc space or transmission bandwidth. If you decompress it (a suitable codec is needed), it will be converted to raw frames, i.e. a time series of normal images. Such uncompressed (= raw) images are 2D matrices of pixel values and are suitable to feed to the GPU.
  • asked a question related to GPU Programming
Question
22 answers
Any suggestions?
Relevant answer
Answer
It highly depends on the problem which method will be better. And then you also need to define what you actually mean by "better": Better convergence? Better speed? Sometimes a slightly better convergence is not important because of bad performance. Currently, in our simulation software we are using BiCGSTAB(2). But, we have several problems which we cannot simulate because the solver will not converge. We will try AMG methods for these kinds of problems. But probably it will be too slow in the general case.
  • asked a question related to GPU Programming
Question
4 answers
I want to analyse the improvement in processing time of a video on the GPU. For that, I need to know how to read a video file (or from a webcam) using OpenCV CUDA on a Linux OS.
Relevant answer
Answer
I think you can't read directly from a video stream or webcam to GPU memory. You have to first read each frame normally and then upload each frame to GPU memory. Then, you can perform operations on the GPU.
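A hedged C++ sketch of that workflow, assuming OpenCV was built with its CUDA modules (the file name and the grayscale conversion are just placeholders for whatever processing you need):
```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/cudaimgproc.hpp>

int main()
{
    cv::VideoCapture cap("input.avi");      // frames are decoded on the CPU
    cv::Mat frame;
    cv::cuda::GpuMat d_frame, d_gray;

    while (cap.read(frame))                 // read the next raw frame
    {
        d_frame.upload(frame);              // copy the frame to GPU memory
        cv::cuda::cvtColor(d_frame, d_gray, cv::COLOR_BGR2GRAY);  // process on the GPU
        // ... further CUDA-accelerated processing on d_gray ...
    }
    return 0;
}
```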
  • asked a question related to GPU Programming
Question
2 answers
I need to transfer relatively large raw frame-grabber images to the GPU for subsequent CUDA-enhanced signal processing, and it seems that GPUDirect / RDMA is the way to go, but all the examples I can find are related to Linux applications.
Relevant answer
Answer
GPUDirect RDMA is only usable on Linux systems
  • asked a question related to GPU Programming
Question
2 answers
I have parallelized (on the GPU) and used the SPA for variable selection in multivariate calibration problems, and would like to know whether there are other parallelized algorithms that have been used for the same purpose.
Relevant answer
Thanks, Guilherme! I will research about them too.
  • asked a question related to GPU Programming
Question
5 answers
Meteorological models generally benefit from multiple processors, depending on their optimization. As GPUs are capable of massively parallel computing, they're often seen as a solution for cheaper and faster model runs. Are we anywhere close to running full meteorological models such as WRF on personal PCs using their (consumer-grade) GPUs?
Relevant answer
Answer
In my opinion, no. The biggest problem is getting data on and off the GPU. (The CPU has that problem too, as it spends much of its time waiting on data which is not in the registers or cache.)
The newest (and most expensive) NVIDIA card has only 12 GB of on-board memory. If you can parallelize the problem well for the GPU, and fit the entire problem within 12 GB, and your I/O requirements are not large, then the problem can be done on a GPU very quickly.
Alternatively, if you can come up with a way to fill up the 12 GB of memory, compute with only that for several minutes, then move a different 12 GB piece of data in and compute on that (the old memory overlay method), then you can do well.
Does WRF have these capabilities? (I don't think so.)
  • asked a question related to GPU Programming
Question
21 answers
CUDA C/C++, CUDA Fortran, PyCUDA, OpenCL, etc,...
Relevant answer
Answer
From the experience I have of playing around with GPGPU, I would most definitely agree with all other responses saying that it depends on what your aim is.
Do you want to port an existing application or are you starting on a new project, or are you just wanting to experiment / prototype to ascertain potential benefits? Will the code always run on a single vendor's hardware?
For what it's worth, here are my few cents on the various possibilities
- cuda/opencl. Having used both, the kernel code (the instructions actually executed on the GPU) is basically the same, both in terms of functionality and in terms of coding style and syntax, and performance on a single GPU is much the same for all tests I have performed. Host-side code is more different, with opencl being perhaps a little more complicated, but in most cases it is "boiler-plate code" that can be copy-pasted, and its intricacies don't need to be mastered for a first foray into GPGPU programming.
OpenCL allows portability -albeit with some effort- across vendors, and even across widely differing types of hardware, both current and future (HSA-enabled hardware for example looks very promising). I have a CFD code written in OpenCL which runs well on NVIDIA & AMD GPUs, and also on cpus. This is not possible with cuda. Incidently, if performance/$ is a metric of interest, AMD hardware should absolutely not be discounted too quickly.
On the other hand, the one really interesting feature cuda currently has over opencl is for multi-GPU communication. Memory can be transferred between cards without being buffered in CPU memory, which can significantly speed up transfers, and simplify programming. Various implementations of Cuda-aware MPI exist, allowing high-performance MPI communications between cards. While things like this may appear in the future for OpenCL, to my knowledge none of the current implementations allow it.
- OpenACC-style directives; they allow existing fortran or C code to be adapted progressively while (usually) retaining the initial purely CPU codebase. In my experience however, the effort required to obtain good performance using directive in "real" CFD code is on par with, or at least not an order of magnitude smaller than that required to start again from scratch in GPGPU language.
While current working compilers are commercial, GCC is on its way to allowing OpenACC-accelerated code to be compiled.
- code translation tools such as par4all (the only one I have tested) seem promising; with some tuning after the translation, I was able to obtain good performance on reasonably realistic tests
- scripting languages interfaced with cuda/opencl: they are GREAT for prototyping/testing, and indeed more and more complete codes seem to use python as "glue" to call high-perfomance GPU-accelerated kernels
Python is also fun to use, and a great way to start with cuda/opencl without all the annoying host-side code.
  • asked a question related to GPU Programming
Question
3 answers
I would like to know if there is an application for human genome assembly with ALLPATH or "DE NOVO" algorithms to run on a GPU using CUDA.
Relevant answer
Thank you very much, Marco! This is going to be very helpful!!! Best Regards!
  • asked a question related to GPU Programming
Question
3 answers
I wish to achieve parallelism in video processing by using OpenGL. Can we stream multiple videos on the GPU using OpenGL and perform edge detection and object segmentation on them effectively? Are there any codes/functions available in OpenGL for tasks such as edge detection and object segmentation? Can MATLAB code be converted to OpenGL?
Relevant answer
Answer
In this situation I would prefer to use OpenCV instead of OpenGL.
OpenCV has modules through which you can leverage GPUs.
OpenCV has fairly good functions available for edge detection.
For segmentation there is a function for the watershed algorithm.
These links might help you to write segmentation algorithms:
  • asked a question related to GPU Programming
Question
6 answers
GPU-based image registration?
Relevant answer
Answer
Phase correlation works very fast and flawlessly on GPUs.
  • asked a question related to GPU Programming
Question
2 answers
For Laptop.
Relevant answer
Answer
It would be better if you could choose NVIDIA instead of AMD, due to the extensive support of CUDA libraries for image rendering/implementation. For a laptop, I recommend the newly released Maxwell mobile series 860M for light-purpose use, and the Quadro 4100M for workstation kind of work.
I always recommend desktop graphics cards, depending on budget. I strongly suggest waiting 2-3 months for the release of the GeForce 880 series, with major architectural improvements along with CUDA 6 support.
  • asked a question related to GPU Programming
Question
1 answer
I observe no change in behavior if I switch to using the default GCC 4.6.3 on Ubuntu. I don't know whether my employer's IT department enabled any special configuration options when they built and installed GCC 4.5.3 on the RHEL5 system.
Relevant answer
Answer
Can you provide any sample code?
Thanks.
  • asked a question related to GPU Programming
Question
5 answers
I want to learn about GPU programming. I want to know about platforms that are being used currently, and also what libraries to use. How should I begin?
Relevant answer
Answer
I think that first we need to understand what you mean by GPU programming. You can either use the GPU for computations or graphics. The first is called GPGPU computing (General Purpose GPU) and you would use OpenCL, CUDA or OpenACC for this. The latter is traditional graphics programming using OpenGL or DirectX (though DirectX is restricted to Windows).
If you want to dive into graphics programming I would recommend OpenGL over DirectX, mainly because of portability. OpenGL is also the standard in scientific research. The best part for computer graphics is that the coordinate system has the 'correct' orientation for common mathematical descriptions. If you want to learn OpenGL then it does not work without learning to write shaders using GLSL. Look for books and tutorials that directly start with using shaders since with OpenGL versions 3 and 4 the so-called fixed function pipeline has been deprecated.
If you want to get into using the GPU for computations then you have the choice between OpenCL, CUDA, and OpenACC. OpenCL and CUDA are direct competitors. CUDA is more wide-spread, slightly easier to use, but only restricted to NVIDIA hardware (and x86 processors if you are using the PGI compiler). OpenCL on the other hand is supported by NVIDIA graphics cards, AMD GPUs, CPUs and APUs, and Intel CPUs, some of their on chip graphics and their Xeon Phi. I've also read about support of ARM CPUs(mostly used in mobile devices). OpenACC is probably easier to start with, especially if you are familiar with OpenMP. But, it is only supported by few compilers. It uses OpenCL in the background, so it is also quite portable. Since it does not give you a lot fine-grained control it is usually a little bit slower than fully optimized OpenCL or CUDA code.
Let us know what area of GPU programming you are interested in and we can then give you a few additional pointers.
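If you decide on the GPGPU route with CUDA, a complete first program is quite small; the sketch below (squaring a vector, chosen arbitrarily) shows the typical pattern of allocate, copy in, launch, copy out. The OpenCL and OpenACC versions follow the same overall structure.
```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void square(float* v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = v[i] * v[i];
}

int main()
{
    const int n = 1024;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));                              // allocate device memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);    // copy input to the GPU

    square<<<(n + 255) / 256, 256>>>(d, n);                         // launch the kernel

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);    // copy the result back
    cudaFree(d);

    printf("h[10] = %f\n", h[10]);                                  // expect 100.0
    return 0;
}
```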
  • asked a question related to GPU Programming
Question
21 answers
This is code for tracking a GPS signal. There are 8 software channels created, and the tracking for each channel is performed sequentially. Overall, the tracking loop takes about 5 minutes to track the 5 channels in which signals are detected. I need to optimize the code to make it work faster and maybe get it down to 2 minutes. Can anyone suggest a solution? I have already done pre-allocation of arrays and similar minor optimizations.
Relevant answer
Answer
The first chapter of "Accelerating MATLAB with GPU Computing: A Primer with Examples" has several useful suggestions for speeding up MATLAB code without using GPUs, pre-allocation of arrays and vectorization included, but there are several other tricks that are useful to know and use (I didn't know that the order of double loops, lines then columns or columns then lines, affects processing speed, for instance).
  • asked a question related to GPU Programming
Question
18 answers
I am implementing some GPU computing code under CUDA; however, NVIDIA's non-professional GPUs (which I am using for testing) are very (artificially) limited in double-precision performance. The NVIDIA TITAN is an exception, but its price is in another range entirely. Conversely, AMD's mid-range gaming boards are, at least on paper, not limited in DP calculations. For example, the HD7970 3GB is rated at 947 GFLOPS in double precision for $300, which is very good for my resource economy. However, I don't have any experience with OpenCL, so I would appreciate some 'expert' comparison between the NVIDIA and AMD solutions before starting to climb the learning curve of the AMD solution.
Thanks
-Please rate the questions/answers you find useful-
Relevant answer
Answer
@Ivan
I cannot entirely agree with your statements, and there are certainly other sides to the coin in many of the points you make.
1) Two or three years ago OpenCL finally caught up to CUDA performance-wise. CUDA is probably still faster on NVIDIA hardware, but only by a very small percentage. In comparison to AMD's gaming hardware, NVIDIA loses big time for double precision; but that is actually the question that was asked initially. Furthermore, OpenCL allows for extensions, which in turn allow access to special hardware features if necessary. In the end both languages are equally powerful, and the implementation of the respective drivers for CUDA and OpenCL determines the actual performance.
2) There are indeed different tools and libraries for CUDA. Most people only know the CUDA libraries, and quite frankly, these are very popular. But with a quick search on Google I found the following two links:
I don't know about books and guides for OpenCL, but I prefer quality over quantity, and there are certainly good OpenCL books out there. Having fewer books also makes it easier to choose one ;)
3) The question does not seem to be directed towards supercomputers, and there are many computers with AMD hardware. Considering performance per dollar, AMD is a viable choice compared to NVIDIA, especially when it comes to double-precision computation.
4) OpenCL only has an initial overhead during the setup phase. Usually, this is mostly Copy&Paste to get it working in the first place. The actual implementation for just one specific type of hardware does not have any overhead compared to CUDA.
5) The question is not about writing portable code. (And CUDA is not portable at all, so this is not a fair point.) When targeting only one platform, OpenCL is as easy to tune as CUDA.
Coming back to the original question: AMD is a viable choice. It is true that OpenCL might be slightly slower on NVIDIA hardware, but then again you wouldn't be switching from CUDA to OpenCL if you were going to keep using NVIDIA hardware. The only one of Ivan's points that applies to your situation is that there is quite an overhead during initialization, but this is countered by the lower cost for higher performance. In the end, you can save a lot of time on every computation in exchange for a one-time programming overhead when writing the initial setup.
  • asked a question related to GPU Programming
Question
5 answers
I would like to use GPU's to run micromagnetic simulations. What factors should I consider before buying one? Does anybody know of good source of info on the subject?
Relevant answer
Answer
Simulation work takes a lot of computing, so in my view the first priority is the number of CUDA cores (processing cores), followed by the processing power of the GPU, i.e. its GFLOPS rating (for single- and double-precision floating-point numbers). GFLOPS is related to the number of cores.
Next is the memory system (size and bandwidth). More memory means you can store more data for the particles being simulated; more memory bandwidth means faster movement of that data between the GPU and its memory.
The bus interface is another factor. It determines the data flow rate between the CPU and the GPU (PCIe, for example). If the amount of data to be transferred between the CPU and the GPU is large, this factor should not be ignored.
As for the cards available nowadays, the maximum CUDA core count is 2880, yielding 5300+ GFLOPS for single-precision and 1400+ GFLOPS for double-precision calculations, with 12 GB of GDDR5 memory giving 288 GB/s of theoretical bandwidth (this is the NVIDIA Tesla K40, which costs $5000+). On the other hand, the NVIDIA GeForce GTX TITAN Black packs similar theoretical performance at around $1200.
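As a practical aside (not part of the original answer), most of these specs can be read straight off a card you already have access to with a few CUDA runtime calls. A minimal sketch:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, dev);

        // Peak bandwidth (GB/s) for DDR-type memory:
        // 2 * memory clock (kHz) * bus width (bytes) / 1e6
        double bandwidthGBs = 2.0 * p.memoryClockRate * (p.memoryBusWidth / 8.0) / 1.0e6;

        printf("Device %d: %s\n", dev, p.name);
        printf("  Compute capability : %d.%d\n", p.major, p.minor);
        printf("  Multiprocessors    : %d\n", p.multiProcessorCount);
        printf("  GPU clock          : %.0f MHz\n", p.clockRate / 1000.0);
        printf("  Global memory      : %.1f GB\n", p.totalGlobalMem / 1073741824.0);
        printf("  Peak mem bandwidth : %.0f GB/s\n", bandwidthGBs);
    }
    return 0;
}

NVIDIA's deviceQuery sample prints the same information in more detail. Note that the number of cores per multiprocessor depends on the architecture, so only the multiprocessor count is reported directly.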
  • asked a question related to GPU Programming
Question
8 answers
CUDA provides a sort of memory that can be indexed by two floating-point indices. When you index between integer points it returns the bilinear interpolation for the position you asked for. On an Intel Xeon Phi you would do this by combining gather instructions with the appropriate multiplications; how does NVIDIA hardware do it?
Relevant answer
Answer
As some of the others have pointed out, the texture cache is a read-only cache in front of DRAM; you can request values between the integer points, and the texture unit returns the appropriately linearly, bilinearly, or trilinearly interpolated value. The interpolation is done in hardware and in my experience is as fast as a single memory access; however, the interpolation is performed using only 8 fractional bits (see section E.2 of the CUDA C Programming Guide), so it is a bit coarser than what you would achieve on the Xeon Phi.
While I know that the interpolation is performed in hardware, I am not aware of a detailed description of the hardware implementation.
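For completeness, here is a rough sketch (not from the original answer) of how the hardware interpolation is used from CUDA code, using the texture object API available on compute capability 3.0 and later; the image size and sample coordinates are arbitrary. Setting filterMode to cudaFilterModeLinear is what turns on the hardware bilinear interpolation discussed above, and the +0.5f offsets account for texels being centred on half-integer coordinates.

#include <cuda_runtime.h>
#include <cstdio>

// Fetch one bilinearly interpolated value at (x, y); interpolation happens in the texture unit.
__global__ void sample(cudaTextureObject_t tex, float x, float y, float *out)
{
    *out = tex2D<float>(tex, x + 0.5f, y + 0.5f);
}

int main()
{
    const int W = 4, H = 4;
    float h_img[H][W];
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            h_img[y][x] = (float)(x + y);        // simple ramp image

    // Copy the image into a CUDA array (required for texture filtering)
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, W, H);
    cudaMemcpy2DToArray(arr, 0, 0, h_img, W * sizeof(float),
                        W * sizeof(float), H, cudaMemcpyHostToDevice);

    // Describe the texture: linear filtering, clamped addressing, unnormalized coordinates
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = arr;

    cudaTextureDesc texDesc = {};
    texDesc.filterMode     = cudaFilterModeLinear;   // hardware bilinear interpolation
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.readMode       = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

    float *d_out, h_out;
    cudaMalloc(&d_out, sizeof(float));
    sample<<<1, 1>>>(tex, 1.5f, 0.0f, d_out);        // halfway between texels (1,0) and (2,0)
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("interpolated value: %f\n", h_out);       // expect about 1.5

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(d_out);
    return 0;
}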
  • asked a question related to GPU Programming
Question
5 answers
Which compilers can be used to compile OpenCL and OpenACC programs?
Relevant answer
Answer
Samsung published an implementation of OpenACC for GCC:
  • asked a question related to GPU Programming
Question
5 answers
I just set up CUDA 5.5 on Ubuntu 12.04 and my GPU is the NVIDIA NVS 3100M. I have set up a page where I describe my experience doing it and the potential applications of parallel programming using CUDA that I am looking at right now: https://sites.google.com/site/niravadesaishomepage/home/research/nvidia-gpu-programming. Any suggestions you might have are welcome. I am looking for the CUBLAS libraries right now, especially libcublas.so. If any of you know where to find them, it would be very helpful.
Relevant answer
Answer
CUBLAS is located in the lib or lib64 directory of your CUDA installation. If you have chosen the default install location, that would probably be:
/usr/local/cuda/lib64/libcublas.so
Be sure to add the lib or lib64 directory to your LD_LIBRARY_PATH to ensure your binary can find it when running your application.
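Once the library is on the linker and loader paths, a minimal cuBLAS program looks roughly like the sketch below (a single SAXPY, y = alpha*x + y, on made-up data); this is just an illustrative sketch, not code from the question. Compile with something like nvcc saxpy.cu -lcublas.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int n = 4;
    float h_x[n] = {1, 2, 3, 4};
    float h_y[n] = {10, 20, 30, 40};
    const float alpha = 2.0f;

    // Move the vectors to the device
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // y = alpha * x + y, computed entirely on the GPU
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expect 12)\n", h_y[0]);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}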
  • asked a question related to GPU Programming
Question
5 answers
My question stems from the fact that single-precision algorithms are vastly faster in GPU implementations. I would like to understand which parts of an optical FDTD code would be critically affected by switching from double- to single-precision computation.
I would also like to understand whether it would be worthwhile (computation-wise) to perform some parts of the calculation in double precision and other parts in single precision, storing the data in single precision (my understanding is that under CUDA, for example, conversion between single and double can be performed efficiently on large data using SIMD-style instructions).
Finally, GPU hardware is still limited in terms of available memory (assuming that transfers to main memory are impractical for many fast applications); in addition, NVIDIA, for example, strongly limits double-precision computation on mid-range hardware, charging a hefty premium for double-precision-enabled hardware.
Relevant answer
Answer
Hello,
I am solving my conserved FDTD equation using spectral methods. When I used single precision, the average of the field shifted by a few percent, which is why I always use double precision; the problem comes from the Fourier transforms. I now use double precision all the time, and I think the performance loss is only a factor of 2-3.
My opinion is that if you are doing a simple Euler algorithm and evaluate all the Laplacians as f(i+1) + f(i-1) - 2*f(i), single precision will be enough.
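One way to act on the mixed-precision idea from the question, sketched below under the assumption that the stencil update itself tolerates single precision: keep the field in float to save memory and bandwidth, but accumulate sensitive diagnostics such as the field average (the quantity that drifted above) in double precision. The kernel, sizes, and coefficient are placeholders.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Single-precision explicit Euler step with the f(i+1) + f(i-1) - 2*f(i) Laplacian
__global__ void eulerStep(const float *f, float *fNew, float c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        fNew[i] = f[i] + c * (f[i + 1] + f[i - 1] - 2.0f * f[i]);
}

int main()
{
    const int n = 1 << 20;
    float *d_f, *d_fNew;
    cudaMalloc(&d_f, n * sizeof(float));
    cudaMalloc(&d_fNew, n * sizeof(float));
    cudaMemset(d_f, 0, n * sizeof(float));        // toy initialisation

    int threads = 256, blocks = (n + threads - 1) / threads;
    eulerStep<<<blocks, threads>>>(d_f, d_fNew, 0.1f, n);

    // The diagnostic (field average) is accumulated in double precision,
    // even though the field itself is stored in single precision.
    float *h_f = (float *)malloc(n * sizeof(float));
    cudaMemcpy(h_f, d_fNew, n * sizeof(float), cudaMemcpyDeviceToHost);
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += (double)h_f[i];
    printf("field average = %g\n", sum / n);

    free(h_f);
    cudaFree(d_f);
    cudaFree(d_fNew);
    return 0;
}

For the FFT part identified above as the real precision problem, cuFFT offers both single- and double-precision transforms, so the two can be mixed per stage if needed.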
  • asked a question related to GPU Programming
Question
2 answers
GPU-like accelerators are a disruptive technology in parallel computing. Do you think it is possible to use GPU accelerators to speed up your favorite ensemble algorithm? What actual speed-ups do you think are realistic? For computer scientists: think of a matrix-vector multiplication where the ensemble matrix is 10^9 by 100. We also often have algorithms where a large problem is split into small problems of size < 1000x1000.
Relevant answer
Answer
GPUs are certainly promising, but I am not convinced they are that useful in a "real and complex" model setup. What I mean is that if you are using ROMS or a similar atmospheric model like WRF, there are zillions of lines of code, and to benefit fruitfully from GPUs (this is my opinion) one has to invest time into re-coding, i.e. OpenCL or OpenACC. There were some attempts to port only part of the expensive dynamics (the WRF microphysics, if I recall correctly), but they didn't get too far. The Intel Phi looks more promising to me, as it gets to the speed improvement without a big re-coding effort; basically you just need to recompile your model.
I would like to hear the opinion of the ocean/atmosphere community about the Intel Phi versus "standard" Xeons: benchmarks, speedups, bottlenecks, etc.
Cheers,
Ivica
  • asked a question related to GPU Programming
Question
11 answers
Using the cuSPARSE library's tridiagonal solver on a CUDA-compatible GPU with compute capability 1.1 has decreased performance drastically, running up to 50 times slower than a traditional serial solver on a Core 2 Duo CPU.
Is this because of the low compute capability, or is my implementation wrong?
  • asked a question related to GPU Programming
Question
18 answers
I currently have OpenCL code that uses double-precision floating point in massively parallel filtering functions (i.e. a lot of multiply-accumulate instructions). Since it does not meet a time deadline, I was wondering how much time I could save by using 64-bit fixed-point integers rather than 64-bit floating point. The accuracy would only be marginally affected; the main problem I have is execution time. Before making the transition I would like to hear from some experts.
Relevant answer
Answer
Mario, I am very glad to hear you raising this question. 64-bit precision is being blindly used as a panacea in so many applications, when the input arguments have only three or four decimals of accuracy and the only accuracy needed in the outputs is also three or four decimals. But numerically-ignorant programmers invoke 15 decimals of accuracy throughout a calculation, just to be safe. YES, there is massive savings possible over using 15-decimal "scratch" variables for the entire calculation, which is usually done because no one has the time or expertise to think about rounding error any more. You can increase speed and accuracy at the same time as you decrease memory footprint, pressure on bandwidth, energy usage, and power consumption, just by right-sizing the precision and dynamic range to what you need.
Any possibility of quantifying the error in your output as a function of errors in inputs? It's usually not easy to figure out, but if you can, you may be astonished at the economy of storage that the knowledge presents to you.
Double-precision IEEE floats provide over 600 orders of magnitude of dynamic range; I suspect you need fewer than 10 orders of magnitude in your calculation. That excess dynamic range consumes an 11-bit exponent when you probably only need about 5 bits of exponent. If you know the largest and smallest values that your algorithm will ever require, by all means scale the problem to fit that range and use fixed-point math to get more bits of fraction, since you really don't need to spend precious bits on an oversized exponent. Nor do you likely need the 52 bits of fraction provided by IEEE doubles. There is probably a way to get excellent results, better than what you're getting, using 32-bit fixed point. If you can figure out how, you'll nearly double your speed, because almost all current architectures are bandwidth-limited, not FLOPS-limited.
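To illustrate the right-sizing argument, below is a minimal sketch of 32-bit fixed-point arithmetic in a Q16.16 format (16 integer bits, 16 fractional bits) with 64-bit intermediates, plus a toy FIR-style multiply-accumulate kernel. The format, scale factor, and tap count are assumptions to be replaced by whatever range and resolution the actual filter needs; the same arithmetic carries over directly to an OpenCL kernel (using long for the 64-bit intermediate). The sketch is written as CUDA C++ for concreteness.

#include <cstdio>
#include <cstdint>

// Q16.16 fixed point: value = raw / 2^16
typedef int32_t q16_16;

__host__ __device__ inline q16_16 toFixed(float x)   { return (q16_16)(x * 65536.0f); }
__host__ __device__ inline float  toFloat(q16_16 x)  { return x / 65536.0f; }

// Multiply two Q16.16 numbers: widen to 64 bits, then shift back down
// (arithmetic shift assumed for negative products, as on mainstream compilers).
__host__ __device__ inline q16_16 fxMul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * (int64_t)b) >> 16);
}

// A tiny FIR-style multiply-accumulate kernel on fixed-point data.
__global__ void firFixed(const q16_16 *x, const q16_16 *coeff, q16_16 *y, int n, int taps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - taps) {
        int64_t acc = 0;                            // accumulate in 64 bits to avoid overflow
        for (int t = 0; t < taps; ++t)
            acc += (int64_t)x[i + t] * (int64_t)coeff[t];
        y[i] = (q16_16)(acc >> 16);
    }
}

int main()
{
    // Host-side sanity check of the arithmetic itself
    q16_16 a = toFixed(1.5f), b = toFixed(-2.25f);
    printf("1.5 * -2.25 = %f (expect -3.375)\n", toFloat(fxMul(a, b)));
    return 0;
}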
  • asked a question related to GPU Programming
Question
20 answers
I have a working pseudo-spectral code that's well-tested, but too slow for my needs. My plan is to re-implement in C or C++ with GPU acceleration via CUDA using my existing code as a guide and check. Immediate applications are to problems in heliospheric physics but my goal is to produce something as broadly adaptable as possible given my time and resources, so I want to put a lot of careful thought into the design phase.
Concerning implicit methods - any suggested references are welcome. I've found a few, but I want to be thorough.
Concerning my code-design - I want to produce something that's fast, scalable, and adaptable while still being well-organized without being "over-written", and easy for other users to learn and modify. I'm especially interested in learning what pitfalls to avoid - particularly in regard to the use of GPU acceleration, but also in regard to any memory-management or efficiency problems that could arise due to a programming style that errs on the side of over-modularization (if that's possible).
I gladly welcome and appreciate all forms of input including wish-lists, cautionary tales, stories of bitter-experience, and anything else you might be inclined to offer. Thanks!
Relevant answer
Answer
If you're planning to use C++, it may be useful to start by coding in C and using Python scripts to handle some of the brute-force calculations. I'm suggesting this because CUDA can be bulky and hard to handle, specifically when trying to get the memory usage on the device right and optimizing block sizes. PyCUDA is a library I've had some success with for doing CUDA-based calculations without actually having to code in CUDA. There is a trade-off in speed, since you are not coding exactly what you want, but it is helpful for seeing whether your idea works on the GPU. Good luck!
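If you do end up writing the performance-critical part directly in CUDA C/C++ rather than through pyCUDA, note that the FFT half of a pseudo-spectral solver maps almost directly onto cuFFT. A rough sketch (transform length and data are placeholders; link with -lcufft):

#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int N = 1024;                        // transform length (placeholder)

    // Allocate complex data on the device and zero it (real code would fill it)
    cufftComplex *d_data;
    cudaMalloc(&d_data, N * sizeof(cufftComplex));
    cudaMemset(d_data, 0, N * sizeof(cufftComplex));

    // Plan once, reuse many times inside the time-stepping loop
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);

    // Forward transform in place; spectral derivatives / dealiasing would go here,
    // followed by cufftExecC2C(..., CUFFT_INVERSE) to return to physical space.
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    printf("FFT done\n");

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}

For double precision, the same structure uses cufftDoubleComplex, CUFFT_Z2Z, and cufftExecZ2Z.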
  • asked a question related to GPU Programming
Question
5 answers
We have witnessed many advances over the last years regarding the automatic mapping of high-level languages to hardware, namely C-to-HDL translators. Examples are PICO-NPA, LegUp, OpenRCL, AutoPilot, FCUDA and quite a few others. More recently, we had contact with the Silicon-to-OpenCL (SOpenCL) tool from the University of Thessaly. Even more recently the Vivado compiler from Xilinx has been released and the OpenCL-RTL program from Altera has been announced to come out soon. Maxeler has also proposed a quite interesting solution based on Java.
My question is: Do you believe in any of them? How do you see the future in this area of automatic translation and synthesis (for FPGAs and ASICs) over the next, let us say, 10 years? Can you share your experience (and vision) on the subject?
Relevant answer
Answer
The answer to your question could be "yes" or "no", depending on how you define "high-level" language (and "widely").
Traditionally, HLS has been on a bit of an alchemic quest in trying to translate ordinary sequential programs in traditional sequential programming languages (such as C and Java) into circuits. In that narrow interpretation, I think the answer must be "no." I agree with Mario that that goal is wishful thinking. Even tools that advertise this capability typically bait-and-switch their users --- instead of translating C to gates, they really translate "well-prepped" (as Mario put it) C to gates, and "prepping" here can mean many things, such as selecting a small subset of C that is amenable to synthesis, or significantly annotating segments of code with information required to produce reasonable implementations.
On the other hand, if you interpret "high-level language" as one that might abstract from specific details of circuits (clocks, specialized architectural elements such as ALUs and memory etc.), while at the same time exposing features that would be relevant to the algorithm designer (most importantly parallelism), then I think the answer could well be "yes". It would mean shifting some of the focus of HLS research from ways of *extracting* parallelism from sequential code to economical ways of letting users *express* concurrency and build significant applications doing so.
  • asked a question related to GPU Programming
Question
8 answers
CPU programming gives good results, especially when a fast response is required. But in my programming tests, the GPU's vector calculations show significant improvements in overall processing time. No less important is the fact that you can do a lot of calculations while consuming much less power.
Relevant answer
Answer
The "real benefit" of moving "everything possible" to GPU is close to none. In order to get benefit out of accelerators such as GPUs one has to find computationally expensive (= calculation dominated) parts of the programm which can run independently and seperate them into so called kernels. These kernels are then executed by the GPU.
This process is not always possible and will in no programm be "everything possible" ;) Some programmes have very little computation and a lot of copying memory around. These applications are bandwidth limited (= data dominated) and will not perform well on external accelerators.
So like in any development process: Find out what the properties of your problem are and find a better way to do it.
And finally: Don't use GPUs just because everybody seems to do that nowadays
And really finally: If you want to use GPUs, do so using OpenACC instead of CUDA!
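To make the kernel-separation point concrete, here is a schematic sketch (the loop body and sizes are made up): a calculation-dominated loop is moved into a CUDA kernel, with one copy in and one copy out so the memory traffic stays small relative to the arithmetic. With OpenACC, the same structure would simply be a #pragma acc parallel loop (plus data clauses) around the original loop.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cmath>

// The calculation-dominated inner loop, moved into a kernel.
// Each thread does a lot of arithmetic per element it loads and stores,
// which is exactly the ratio that makes a GPU offload worthwhile.
__global__ void heavyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        for (int k = 0; k < 200; ++k)          // artificial compute intensity
            x = sinf(x) * cosf(x) + 0.5f;
        out[i] = x;
    }
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = (float)i / n;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);      // copy in once
    int threads = 256, blocks = (n + threads - 1) / threads;
    heavyKernel<<<blocks, threads>>>(d_in, d_out, n);         // heavy compute on the GPU
    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);      // copy out once

    printf("out[0] = %f\n", h[0]);
    cudaFree(d_in); cudaFree(d_out); free(h);
    return 0;
}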
  • asked a question related to GPU Programming
Question
8 answers
Generally optical flow is used for motion tracking
Relevant answer
Answer
Optical flow estimation algorithms are best suited for moving-camera applications.
  • asked a question related to GPU Programming
Question
13 answers
How can CUDA be integrated with Code::Blocks?
Relevant answer
Answer
You can use NetBeans with CUDA; it is very efficient. If you like, I can send you a tutorial for creating NetBeans projects that use CUDA.
  • asked a question related to GPU Programming
Question
6 answers
Can we overlap data transfers with GPU kernel executions in OpenCL?
Relevant answer
Answer
Thanks Mostafa,
Yes, I have just tested it.
Best regards,
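For later readers of this thread: in OpenCL the overlap is typically achieved with multiple command queues (or an out-of-order queue) and non-blocking enqueue calls. The sketch below shows the equivalent pattern in CUDA terms, with streams, pinned host memory, and cudaMemcpyAsync, since that may make the idea easier to see; the chunk count and sizes are arbitrary.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int chunks = 4, chunkN = 1 << 20;
    size_t chunkBytes = chunkN * sizeof(float);

    // Pinned host memory is required for truly asynchronous copies
    float *h;
    cudaMallocHost(&h, chunks * chunkBytes);
    for (int i = 0; i < chunks * chunkN; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, chunks * chunkBytes);

    cudaStream_t s[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&s[c]);

    // Copy-in, kernel, and copy-out of chunk c all go into stream c;
    // work in different streams can overlap (copy engines vs. compute).
    int threads = 256, blocks = (chunkN + threads - 1) / threads;
    for (int c = 0; c < chunks; ++c) {
        float *hc = h + c * chunkN, *dc = d + c * chunkN;
        cudaMemcpyAsync(dc, hc, chunkBytes, cudaMemcpyHostToDevice, s[c]);
        process<<<blocks, threads, 0, s[c]>>>(dc, chunkN);
        cudaMemcpyAsync(hc, dc, chunkBytes, cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();

    printf("h[0] = %f (expect 2)\n", h[0]);

    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(s[c]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}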
  • asked a question related to GPU Programming
Question
3 answers
Are there specific applications that require motion tracking in 4K video? I have developed a GPU-based optical flow motion estimation method that enables real-time processing of 4K video.
Relevant answer
Answer
Dear Sidi,
I think that video surveillance applications would be a good use case for motion tracking in 4K video.
  • asked a question related to GPU Programming
Question
6 answers
What are the main advantages of using SLI for GPU (CUDA) or multi-GPU programming?
Relevant answer
Answer
"SLI Frame Rendering: Combines two identical NVIDIA Quadro PCI Express graphics cards with an SLI connector to transparently scale application performance on a single display by presenting them as a single graphics card to the operating system."Therefore, we can use SLI in conjunction with CUDA to have two identical cards on my machine (any 8800 or Quadro 5600 or 4600) and program 256 multiprocessors as though they were one GPU?
SLI and CUDA are orthogonal concepts. The first is for automatic distribution of rasterization, the second is for addressing direct execution of code on the GPU. CUDA is not used for rendering (on- or offscreen). That is when using CUDA you can simply list all available cards in the machine and directly submit code to execute. This code has nothing to do with shader code - it is C-like. So you have a lot more control of what happens where and when.No, in general you don't want to use SLI if you plan on using the GPUs for compute instead of pure graphics applications. You will be able to access both GPUs as discrete devices from within your CUDA program. Note that you will need to explicitly divide work between the GPUs.
I don't have an explanation for why SLI isn't desirable for compute applications,
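A minimal sketch of the explicit work division mentioned above (the kernel and data sizes are placeholders): enumerate the devices, give each GPU its own slice of the data with cudaSetDevice, launch on each, and gather the results.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main()
{
    int devCount = 0;
    cudaGetDeviceCount(&devCount);
    if (devCount == 0) { printf("no CUDA devices\n"); return 1; }

    const int n = 1 << 22;
    int perDev = n / devCount;                       // naive equal split (remainder ignored)
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float **d = (float **)malloc(devCount * sizeof(float *));

    // Each GPU gets its own slice; SLI plays no role here.
    for (int dev = 0; dev < devCount; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&d[dev], perDev * sizeof(float));
        cudaMemcpy(d[dev], h + dev * perDev, perDev * sizeof(float), cudaMemcpyHostToDevice);
        int threads = 256, blocks = (perDev + threads - 1) / threads;
        scale<<<blocks, threads>>>(d[dev], perDev);
    }

    // Collect results from every device
    for (int dev = 0; dev < devCount; ++dev) {
        cudaSetDevice(dev);
        cudaMemcpy(h + dev * perDev, d[dev], perDev * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d[dev]);
    }

    printf("h[0] = %f (expect 2)\n", h[0]);
    free(d); free(h);
    return 0;
}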
  • asked a question related to GPU Programming
Question
5 answers
Are there any tools dedicated to this?
Relevant answer
Answer
Yes, Intel has provided an SDK for Xeon Phi architecture hardware.