Questions related to GPU Programming
I am trying to develop an algorithm that can roughly estimate the support volume of an STL file at a specific orientation. Does anyone have any ideas on where and how to start? I am trying to do this in Python, but based on what I have read, this may be a GPU-computation and geometric-design problem, so I am not sure whether Python is a good place to start. I appreciate any ideas/responses. Thank you.
Related material: https://www.sciencedirect.com/science/article/abs/pii/S0097849315000564
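To make the problem concrete, here is a minimal pure-Python sketch of one common coarse heuristic (an assumption on my part, not the method from the linked paper): for every downward-facing facet steeper than the overhang angle, extrude its XY projection down to the build plate and sum the column volumes. Overlapping columns are double-counted, so this is only an upper-bound estimate.

```python
import math

def support_volume(triangles, overhang_deg=45.0, z_floor=0.0):
    """Rough support-volume estimate for a triangle soup (e.g. parsed STL).

    For every facet whose normal points downward past the overhang
    threshold, extrude its XY projection down to the build plate.
    Overlapping columns are double-counted: a coarse upper bound,
    not an exact Boolean volume."""
    cos_thresh = math.cos(math.radians(overhang_deg))
    total = 0.0
    for (a, b, c) in triangles:
        u = [b[i] - a[i] for i in range(3)]
        v = [c[i] - a[i] for i in range(3)]
        # facet normal as an un-normalised cross product
        nx = u[1] * v[2] - u[2] * v[1]
        ny = u[2] * v[0] - u[0] * v[2]
        nz = u[0] * v[1] - u[1] * v[0]
        norm = math.sqrt(nx * nx + ny * ny + nz * nz)
        if norm == 0.0:
            continue  # degenerate facet
        if nz / norm < -cos_thresh:      # facing down past the threshold
            proj_area = abs(nz) / 2.0    # area of the facet's XY projection
            centroid_z = (a[2] + b[2] + c[2]) / 3.0
            total += proj_area * max(centroid_z - z_floor, 0.0)
    return total
```

The per-facet work is independent, which is what makes this style of estimate a natural fit for a GPU later; a NumPy version over vertex arrays is usually the next step before any CUDA.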
I would like to know the best method for parallelizing my existing sequential MATLAB code on a GPU. My code involves several custom functions and nested loops.
I tried converting it to a CUDA MEX function using MATLAB's GPU Coder, but I observed that the generated function takes much more time to run than the CPU version.
Proper suggestions will be appreciated.
I have read the descriptions of functions from about 10 different packages offering GPU acceleration. From what I have understood so far, GPU acceleration can only be applied to C/CUDA-coded functions.
Am I wrong in thinking that my own R functions, which themselves call functions from various R packages, cannot benefit from GPU acceleration until I "translate" my functions and all their dependencies into C/CUDA?
In other words, does it mean that almost all R packages created so far are unable to benefit from GPU acceleration?
I am looking to solve for the first few eigenvectors of a really large sparse symmetric matrix (1M x 1M), and I would like suggestions as to which library is best suited for this purpose; I need the eigenvectors within a few seconds. I have leveraged the Spectra library, which uses the iterative Arnoldi method, but its computation time has let me down. I have also looked into CUDA/MKL, but these libraries tend to take advantage of existing architectures. Is there some alternative, or are CUDA/MKL/OpenCL the standard these days?
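For reference, the kind of computation meant here can be sketched with SciPy's ARPACK wrapper (a CPU baseline, not a GPU answer; the 1-D Laplacian below is just a stand-in for the real matrix). The key trick is shift-invert, which targets the smallest eigenpairs far faster than asking ARPACK for the smallest magnitude directly; Spectra supports the same mode.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 2000
# 1-D Dirichlet Laplacian: sparse, symmetric, positive definite
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")

# shift-invert around 0 targets the smallest eigenpairs; much faster
# than which="SM" for large matrices (one sparse factorization up front)
vals, vecs = eigsh(A, k=4, sigma=0)

# analytic eigenvalues of this Laplacian, for checking
exact = 4 * np.sin(np.arange(1, 5) * np.pi / (2 * (n + 1))) ** 2
```

At the 1M x 1M scale the factorization cost of shift-invert becomes the bottleneck, which is where preconditioned iterative methods (e.g. LOBPCG, also in SciPy) or GPU libraries enter the picture.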
I am working on continuous-time nonlinear dynamical systems which require a lot of numerical computation. Previously, I used 'for' loops to solve the system over a range of parameter values to investigate bifurcations and regions of stability. Then I used vectorization, which made my code faster. I have an NVIDIA GTX 1060 6 GB card installed in my PC, but I don't know whether GPU programming will help me get results faster. So I would like your suggestions on whether I should start learning GPU programming, and if so, where I should start.
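Before committing to GPU programming, it is worth seeing how far vectorizing over the parameter axis goes, since that is exactly the structure a GPU would exploit too. A minimal NumPy sketch, using the logistic map as a placeholder for the reader's own system (an assumption for illustration only):

```python
import numpy as np

# Sweep a 1-D map over many parameter values at once: one array
# operation per iteration replaces the per-parameter 'for' loop.
r = np.linspace(2.5, 4.0, 10000)      # all parameter values at once
x = np.full_like(r, 0.5)
for _ in range(1000):                 # burn off the transient
    x = r * x * (1.0 - x)

# Below the first bifurcation (r < 3) the map settles onto the
# fixed point x* = 1 - 1/r, so that region is easy to verify.
stable = r < 2.95
```

If a sweep like this saturates all CPU cores and is still too slow, the same "one array op per step, parameters along an axis" layout ports almost directly to CuPy or MATLAB's gpuArray.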
I am using NVIDIA K20X cards in my machine, and I am having trouble running the GPU-enabled version of LAMMPS at full efficiency on a GPU node. The main role of the GPU is to offload the time-consuming step of building the neighbour list, and then the calculation of forces between these pairs, from the CPU. However, I am only able to use the force-calculation part; building the neighbour list on the GPU gives an error. I have also tried recompiling, but I face the same problem. Please let me know the solution, or the mistake I am making.
The error is: "Cuda driver error 700 in call at file 'geryon/nvd_timer.h' in line 76".
After reading the proceedings paper "Best Practices in Running Collaborative GPU Hackathons: Advancing Scientific Applications with a Sustained Impact", I came across the paper "Porting the MPI Parallelized LES Model PALM to Multi-GPU Systems – An Experience Report". GPUs follow a single-instruction, multiple-thread (SIMT) model, and theirs is therefore not a cache-optimized architecture. My experience is that, unless you rewrite the whole kernel, the hybrid MPI+GPU approach hurts performance considerably. Nevertheless, there is a boom in GPU-based HPC. In some instances MPI alone performs better than the hybrid approach, since there is no need to move data back and forth between CPU and GPU. We have various good practices, but no standard reference that we can always rely on. The picture becomes even worse when high-order schemes are used.
1) What is your usual approach to these issues in CFD?
2) Should we keep the source code structured as functional or imperative programming, with a bunch of do .. end do loops in each subroutine, which hurts performance considerably?
3) Should we use data regions (where the data remains local to the GPU) in which we pack all the computations, even though this hurts the readability of the source code?
4) Should we update the ghost cells in every time step?
Again, the focus is only CFD.
Dear friends with experience in high(er) performance computing,
My lab is planning to purchase a computer specifically for GPU-based rendering (fancy picture/video generation based on CAD-type files). We currently mainly use Blender.
We have spec’d out a machine (below). Could you give me some feedback on any improvements you think would make it better? We are working with a budget of about $7-10k.
REALLY appreciate your help/advice!
Intel - Xeon E5-2697 V4 2.3Ghz 18 Core ~ $2495.99
Corsair - H115i 104.7 CFM Liquid CPU Cooler ~ $129.99
Asus-X99-E-10G WS SSI CEB LGA2011-3 ~ $639.99
Corsair - Vengeance LPX 64GB (8x8GB) DDR4-2400 ~ $449.99
Samsung 850 EVO 1TB 2.5" SATA III Internal SSD (MZ-75E1T0B/AM) ~ $356.99 w/mount
Toshiba 5TB X300 7200 rpm SATA III 3.5" Internal HDD ~ $146.84
2x - NVIDIA - Titan Xp 12GB Video Card (2-Way SLI) ~ $1432.15 each ~ $2864.30
Corsair - AX1500i 1500W 80+ Titanium Certified Fully-Modular ATX ~ $409.99
Thermaltake ATX Full Tower Cases CA-1H1-00F6WN-00 ~ $249.99
Total cost for core computer parts ~ $7744.07
I know of some popular libraries among researchers focusing on deep reinforcement learning, such as TensorFlow, Theano, Caffe, Torch and CNTK. I usually work on both CPU and GPU. I don't need much detailed configuration and customization, but computation time is important. Can anyone recommend which library is best for my case?
Thank you in advance
I need to buy a computer with a GPU for my thesis, which is about image segmentation with graph-cut algorithms.
Can anybody help me identify the best hardware and software for this application?
I have been trying to process EEG data in MATLAB 2009 using GPUmat. However, I am stuck right at the GPUstart command: it gets stuck at assigning the cubin path. I can see that, while reading the major and minor revisions of the GPU, major = 5 and minor = 0. However, I can't see conditionals in the program for major = 5; the maximum I can see is major = 3, for the Kepler architecture.
Can anyone walk me through this process?
I am trying to adaptively sample a uniform pixel grid, based on a specific pixel position and a linear (or close-to-linear, depending on the method) falloff in importance/sampling density away from this position.
I already have several approaches in mind and currently we rely on a sorting-based method, but it won't be fast enough at higher resolutions (currently we run 1024^2, but it could be arbitrarily high, depending on the output device).
Is there any method allowing for quickly sampling such a distribution? The main challenge here seems to be that we absolutely do not want any duplicates, which is why there has not been much progress with standard importance sampling so far.
Also, the pixel position used for the distance computation may change from frame to frame, and all of this needs to run at hundreds of frames per second with tens of thousands of samples per frame. In addition, it would be great if the method could guarantee a fixed number of samples. That would be no problem with standard importance sampling, but the only way to avoid duplicates there is to mark sampled pixels and reject a sample that hits an already-sampled pixel, which makes it hard to predict how long the sampling step will take.
Does anyone know of a more elegant approach to this issue? If that's important in any way, we use CUDA for all our stuff.
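One approach that guarantees exactly k distinct samples and parallelises well is Efraimidis–Spirakis weighted sampling, a.k.a. the Gumbel top-k trick: each pixel independently draws one random key scaled by its weight, and a single top-k selection keeps the winners. A NumPy sketch (grid size, falloff kernel and focus pixel are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_k_distinct(weights, k):
    """Weighted sampling of exactly k distinct indices.

    Gumbel-perturbed log-weights followed by one top-k selection;
    key generation is embarrassingly parallel per pixel, and the
    result never contains duplicates by construction."""
    g = np.log(weights) + rng.gumbel(size=weights.shape)
    return np.argpartition(-g, k)[:k]     # indices of the k largest keys

# toy setup: 256x256 grid, importance falls off linearly with
# distance from a (moving) focus pixel
h = w = 256
ys, xs = np.mgrid[0:h, 0:w]
focus = (40, 200)
dist = np.hypot(ys - focus[0], xs - focus[1])
weights = np.maximum(1.0, dist.max() - dist).ravel()  # linear falloff, > 0

idx = sample_k_distinct(weights, 10000)
```

Since the keys are independent per pixel and top-k reduces to a partial sort, this maps directly onto CUDA (one thread per pixel plus a device-side select/sort), with a fixed sample count and no rejection loop.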
I want to model a semi-submersible propeller, which is a multiphase problem. My goal is to show the advantage of using a GPU accelerator in solving a complex problem and to find out how much GPUs can speed up the solution process. I am going to use ANSYS Fluent v16.0. Is Fluent capable of using a GPU like the Tesla K80 to meet my requirement? Do I have to write any code, or take any kind of special consideration, to solve my problem? Any advice will be greatly appreciated.
I am using a Tesla K20. I got an error that shared memory is limited to 16 KB, although the K20 supports up to 48 KB. How do I configure the GPU and the NVCC compiler to use 48 KB of shared memory instead of 16 KB?
I'm running equilibrium and steered molecular dynamics simulations with GROMACS 4.6.5 on a single computer with 2 GPUs and 8 CPU cores. I have found that simulations on GPU+CPU run almost 3 times faster than on the CPU alone. I would like to know whether this performance is reasonable, or whether it can be improved.
The classification process takes a long time (11 hours for one picture) but gives excellent results. I have already shrunk my training set to the minimum. The ClassificationKNN object stores many types of information in cells and arrays, and moving it to parallel or GPU processing causes errors (MATLAB preferred).
My background is more agricultural than computer science.
Thanks for the help.
I have somewhat parallelized C++ code running on a compute server with 22 boards, but it does not use the GPU at all. Is there a way to re-compile my code to use the GPU without having to re-code anything? It need not be freeware; I can buy a new compiler if need be.
I often use Mathematica in my scientific work. Some computations are very complex and take a long time on a standard PC (4-6 physical cores, or 8-12 logical cores). I have tried parallel computation, but it only gives a few-fold speedup. Generally I numerically solve single or double integrals combined with nonlinear equations, which may themselves contain integrals. Each such operation runs in non-parallel mode, but I need series of results that I can compute in parallel (for example with ParallelTable).
I have a question about CUDA in Mathematica. In theory, a very efficient graphics card should give significant acceleration. From the Mathematica manual, matrix operations are easy to compute with CUDA, but integrals and nonlinear equations are rather difficult. Does anybody have good experience solving such problems with GPU computing, and could you give me clear tips, literature, etc.?
For a real-time visual attention system based on human visual attention, is it better to use an FPGA alone, a GPU, or an FPGA plus a dual-core ARM?
I have a kernel which shows poor performance; nvprof says that it has low warp execution efficiency (page 3 in the attached PDF) and suggests reducing "intra-warp divergence and predication". Am I right that intra-warp divergence is any if-then statement which creates branching? For example, this causes divergence: if (x < 0.5) then action; but this doesn't: if (threadIdx.x == 0) then action2. Right? If so, my kernel doesn't have any divergence. Another issue I'm concerned about is that I'm using a lot of shared memory while all variables are double precision, which causes bank conflicts on CC 2.x, with subsequent serialisation of calculations. What can be done in this case? Could increasing the number of active warps help hide the latencies caused by the serialisation?
As always, you have helped me before, so I am asking another question.
We are conducting a survey on GPU-based databases. For that, we need SQL query processing algorithms and the respective papers/articles published so far.
I have already searched and found the databases below.
Please help me by replying with links to other databases (SQL query processing algorithms).
Thanks in advance.
I have to solve a set of nonlinear equations over a 200x200x200x1000 variable space. I developed parallel MATLAB code for this, but it takes a very long time. GPU-based algorithms could be the solution; however, licenses for commercial GPU codes are not cheap.
Do you have any idea of a homemade GPU-based code for solving a set of 7 linear equations over a 4-D variable space in MATLAB with MPI?
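One GPU-friendly way to formulate this is to batch the small 7x7 solves over the flattened parameter grid instead of looping over grid points. A NumPy sketch of the batched layout (the random systems and batch size are placeholders; in MATLAB the same idea is pagefun/pagemldivide on a gpuArray, and on CUDA it maps to the batched cuBLAS/cuSOLVER routines):

```python
import numpy as np

rng = np.random.default_rng(1)

# one 7x7 system per grid point; N stands in for the flattened 4-D
# parameter grid (200*200*200*1000 in the full problem)
N, n = 5000, 7
A = rng.standard_normal((N, n, n)) + 3.0 * np.eye(n)  # diagonally boosted batch
b = rng.standard_normal((N, n))

# np.linalg.solve broadcasts over the leading batch axis: all N
# systems are solved in one call, with no Python-level loop
x = np.linalg.solve(A, b)
```

The point is the data layout: once the problem is expressed as "many independent small solves in one array", both vectorized CPU libraries and GPU batched solvers apply without restructuring the code again.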
Is there any proposal to add timeouts to MPI?
Let's say that I would like to perform an MPI_Recv which could block.
In some cases I would like to limit the time that the MPI_Recv can be blocked. It would be useful to have a receive timeout in those cases.
I know that this can actually be done with MPI_Irecv & MPI_Test, but I wanted to know if anyone has been able [somehow] to define an Rx timeout parameter for the MPI environment and then get a [new] MPI_ERR_TIMEOUT when that occurs in an MPI_Recv.
I guess the same could be applied to other MPI calls.
I am unable to find a function to retrieve a web page inside a kernel. If anyone could share their work or point me to a useful tutorial, I would appreciate it.
We are accelerating some MATLAB codes using a mix of gpuArrays and CUDA. Does anyone have a best-practices document for that?
Does gpuArray provide any way to manage the number of blocks in a grid, the number of threads per block, shared memory, pinned memory, etc.?
On a standard multi-core machine, it is easy to specify the number of cores you want to use. By comparison, it would be interesting to know how CUDA programs scale with the number of GPU cores the program uses. How does one alter the number of cores that will be used to handle a task?
I have noticed that CUDA is still preferred for parallel programming, despite the code only being able to run on NVIDIA graphics cards. On the other hand, many programmers prefer OpenCL because it supports heterogeneous systems and can be used with GPUs or multicore CPUs. So I would like to know: which one is your favorite, and why?
I want to read a video file using OpenCV with CUDA in C++. However, when I tried the code at the link given, it reported "OpenCV was built without CUDA Video decoding support". It has been suggested that I should decode the video on the CPU and process the raw frames on the GPU instead. I need to understand these terms and how to achieve this.
I want to analyse the improvement in processing time of a video on a GPU. For that, I need to know how to read a video file (or a webcam stream) using OpenCV with CUDA on Linux.
I need to transfer relatively large raw frame-grabber images to the GPU for subsequent CUDA-accelerated signal processing. It seems that GPUDirect / RDMA is the way to go, but all the examples I can find relate to Linux applications.
I have parallelized (on a GPU) the SPA (successive projections algorithm) and used it for variable selection in multivariate calibration problems. I would like to know whether any other parallelized algorithms have been used for the same purpose.
Meteorological models generally benefit from multiple processors, depending on their optimization. As GPUs are capable of massively parallel computing, they are often seen as a path to cheaper and faster model runs. Are we close to running full meteorological models such as WRF on personal PCs using their (consumer-grade) GPUs?
I would like to know if there is an application for human genome assembly with ALLPATHS or other de novo assembly algorithms that runs on a GPU using CUDA.
I wish to achieve parallelism in video processing by using OpenGL. Can we stream multiple videos to the GPU using OpenGL and perform edge detection and object segmentation on them effectively? Are there any codes/functions available in OpenGL for tasks such as edge detection and object segmentation? Can MATLAB code be converted to OpenGL?
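For context, OpenGL itself ships no edge-detection functions; such filters are written as shaders. The kernel itself is tiny. Here is a NumPy sketch of Sobel gradient magnitude (illustrative only, not OpenGL code) to show why it maps so well onto a fragment or compute shader:

```python
import numpy as np

def sobel_magnitude(img):
    """Sobel gradient magnitude on the interior of a 2-D image.

    Every output pixel depends only on its 3x3 neighbourhood, so each
    pixel can be computed independently -- exactly the data-parallel
    pattern a GPU shader executes, one invocation per pixel."""
    p = img
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.hypot(gx, gy)

# vertical step edge -> strong response only along the edge columns
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = sobel_magnitude(img)
```

In a GLSL fragment shader the same arithmetic becomes eight texture fetches and a few multiply-adds per pixel; object segmentation, being more global, usually needs compute shaders or an external library rather than plain OpenGL.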
I observe no change in behavior if I switch to the default GCC 4.6.3 on Ubuntu. I don't know if my employer's IT department enabled any special configuration options when they built and installed GCC 4.5.3 on the RHEL5 system.
I want to learn GPU programming. I would like to know which platforms are currently in use, and also which libraries to use. How should I begin?
This is code for tracking a GPS signal. Eight software channels are created, and tracking for each channel is performed sequentially. Overall, the tracking loop takes about 5 minutes to track the 5 channels in which signals are detected. I need to optimize the code to make it work faster, ideally getting it down to 2 minutes. Can anyone suggest a solution? I have already done pre-allocation of arrays and similar minor optimizations.
I am implementing some GPU computations under CUDA; however, NVIDIA's non-professional GPUs (which I am using for testing) are heavily (and artificially) limited in double-precision performance. The NVIDIA TITAN is an exception, but its price is in another range altogether. Conversely, AMD's mid-range gaming boards are, at least on paper, not limited in DP calculations. For example, the HD 7970 3GB is rated at 947 GFLOPS in double precision for $300, which is very good for my budget. However, I don't have experience with OpenCL, so I would appreciate some expert comparison between the NVIDIA and AMD solutions before starting to climb the learning curve of the AMD option.
Please rate the questions/answers you find useful.
I would like to use GPUs to run micromagnetic simulations. What factors should I consider before buying one? Does anybody know a good source of information on the subject?
CUDA provides a kind of memory that can be indexed by two floating-point indices; when you index between integer points, it returns the bilinear interpolation at the position you asked for. On an Intel Xeon Phi you would do this by emitting gather instructions along with the appropriate multiplications; how does NVIDIA's hardware do it?
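For reference, the arithmetic the texture unit performs in hardware is just the standard four-texel blend; a small pure-Python sketch of what one such fetch computes (note that real texture units use low-precision fixed-point fractions, so the hardware result is slightly coarser than this):

```python
import math

def bilinear(img, x, y):
    """What a 2-D texture fetch computes: read the four texels
    surrounding (x, y) and blend them with the fractional parts of
    the coordinates. img is a row-major 2-D array, img[row][col]."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    fx, fy = x - x0, y - y0            # fractional offsets in [0, 1)
    t00 = img[y0][x0]
    t10 = img[y0][x0 + 1]
    t01 = img[y0 + 1][x0]
    t11 = img[y0 + 1][x0 + 1]
    top = t00 * (1 - fx) + t10 * fx    # blend along x, top row
    bot = t01 * (1 - fx) + t11 * fx    # blend along x, bottom row
    return top * (1 - fy) + bot * fy   # blend along y

img = [[0.0, 1.0],
       [2.0, 3.0]]
```

The dedicated texture hardware does the address calculation, the four reads (served by a texture cache optimized for 2-D locality) and the weighted blend in fixed-function units, which is why it is essentially free compared with doing the same gather-and-multiply sequence in software.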
I just set up CUDA-5.5 on Ubuntu 12-04 and my GPU is the NVIDIA NVS 3100M. I have set up a page where I uploaded my experience doing it and potential applications of parallel programming using CUDA that I am looking at right now: https://sites.google.com/site/niravadesaishomepage/home/research/nvidia-gpu-programming. Any suggestions you might have, are welcome. I am looking for the CUBLAS libraries right now, esp. the libcublas.so. If any of you know where to find them, it will be very helpful.
My question stems from the fact that single-precision algorithms are vastly faster in GPU implementations. I would like to understand which parts of an optical FDTD code would be critically affected by switching from double- to single-precision computation.
I would also like to understand whether it would be convenient (computationally) to perform some parts of the computation in double precision and some in single precision, storing the data in single precision (my understanding is that under CUDA, for example, conversion between single and double precision can be performed simultaneously on large data using SIMD instructions).
After all, GPU hardware is still limited in terms of available memory (assuming that transfers to main memory are impractical for many fast applications); in addition, NVIDIA, for example, strongly limits double-precision throughput on mid-range hardware, charging a steep premium for double-precision-capable hardware.
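To make the precision question concrete, here is a toy NumPy illustration (not FDTD itself) of the failure mode that usually decides it: an increment that falls below the single-precision accumulator's ulp is lost entirely, which is what can happen when a small field update is added to a much larger running value.

```python
import numpy as np

# float32 carries ~7 decimal digits, float64 ~16. A leapfrog field
# update of the form E += dt * curl adds a small increment to a large
# accumulator; once the increment drops below half the accumulator's
# ulp, it is rounded away completely.
big = 1.0e4
small = 1.0e-4           # 8 orders of magnitude apart

s32 = np.float32(big) + np.float32(small)
s64 = np.float64(big) + np.float64(small)

lost32 = (s32 == np.float32(big))   # the float32 increment vanished
lost64 = (s64 == np.float64(big))   # float64 still resolves it
```

This is why mixed-precision schemes typically keep the accumulated fields (or a correction term, as in Kahan summation) in double while doing the bulk arithmetic and storage in single.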
GPU-like accelerators are a disruptive technology in parallel computing. Do you think it is possible to use GPU accelerators to speed up your favorite ensemble algorithm? What actual speed-ups do you think are realistic? For computer scientists: think of a matrix-vector multiplication where the ensemble matrix is 10^9 by 100. We also often have algorithms where a large problem is split into small problems of size < 1000x1000.
Using the cuSPARSE library's tridiagonal solver on a CUDA-compatible GPU with compute capability 1.1 has decreased performance drastically: up to 50 times slower than a traditional serial solver on a Core 2 Duo CPU.
Is this because of the low compute capability, or is my implementation wrong?
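For context, the serial baseline being compared against here is presumably the Thomas algorithm, which solves a single tridiagonal system in O(n) with almost no overhead; GPU tridiagonal solvers generally only pay off for very large n or many independent systems. A sketch of that baseline:

```python
import numpy as np

def thomas(a, b, c, d):
    """Thomas algorithm: O(n) serial tridiagonal solve.

    a = sub-diagonal (len n-1), b = diagonal (len n),
    c = super-diagonal (len n-1), d = right-hand side (len n).
    Assumes no pivoting is needed (e.g. b diagonally dominant)."""
    n = len(b)
    cp = np.empty(n - 1)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                 # forward elimination
        m = b[i] - a[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = c[i] / m
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

n = 500
a = np.full(n - 1, -1.0)
b = np.full(n, 2.5)
c = np.full(n - 1, -1.0)
d = np.arange(n, dtype=float)
x = thomas(a, b, c, d)
```

The recurrence is inherently sequential, which is exactly why GPU solvers switch to cyclic-reduction-style algorithms that do more arithmetic in exchange for parallelism; on a CC 1.1 part with few cores and no caches, that trade can easily lose to a CPU.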
I currently have OpenCL code that uses double-precision floating point in massively parallel filtering functions (i.e. a lot of multiply-accumulate instructions). Since it does not meet a timing deadline, I was wondering how much time I could save by using 64-bit fixed-point integers rather than 64-bit floating point. The accuracy would only be marginally affected; the main problem I have is execution time. Before making the transition, I would like to hear from some experts.
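To pin down the semantics of the proposed change, here is a small sketch of a Q-format fixed-point multiply-accumulate (Python's unbounded ints stand in for the 64-bit integers an OpenCL kernel would use; the format and values are placeholders, and real kernel code must handle the widening 64x64 multiply and overflow explicitly):

```python
FRAC_BITS = 32
SCALE = 1 << FRAC_BITS        # Q31.32-style signed fixed point

def to_fix(x):
    """Real value -> fixed point (round to nearest)."""
    return int(round(x * SCALE))

def fix_mul(a, b):
    """Fixed-point multiply: the raw product carries 2*FRAC_BITS
    fractional bits, so shift back down. Python's >> floors, which
    matches a truncating hardware shift for these exact values."""
    return (a * b) >> FRAC_BITS

def to_float(a):
    return a / SCALE

# fixed-point MAC chain vs. the floating-point reference
coeffs = [0.25, -0.5, 0.125, 0.75]
data = [1.5, 2.25, -3.0, 0.5]

acc = 0
for cf, x in zip(coeffs, data):
    acc += fix_mul(to_fix(cf), to_fix(x))

ref = sum(cf * x for cf, x in zip(coeffs, data))
err = abs(to_float(acc) - ref)
```

Whether this is faster than double-precision floating point is hardware-dependent: on GPUs, 64-bit integer multiplies are themselves emulated with multiple 32-bit operations, so the win is not automatic and is worth micro-benchmarking before porting the whole filter.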
I have a working pseudo-spectral code that's well-tested, but too slow for my needs. My plan is to re-implement in C or C++ with GPU acceleration via CUDA using my existing code as a guide and check. Immediate applications are to problems in heliospheric physics but my goal is to produce something as broadly adaptable as possible given my time and resources, so I want to put a lot of careful thought into the design phase.
Concerning implicit methods - any suggested references are welcome. I've found a few, but I want to be thorough.
Concerning my code-design - I want to produce something that's fast, scalable, and adaptable while still being well-organized without being "over-written", and easy for other users to learn and modify. I'm especially interested in learning what pitfalls to avoid - particularly in regard to the use of GPU acceleration, but also in regard to any memory-management or efficiency problems that could arise due to a programming style that errs on the side of over-modularization (if that's possible).
I gladly welcome and appreciate all forms of input including wish-lists, cautionary tales, stories of bitter-experience, and anything else you might be inclined to offer. Thanks!
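One concrete design habit from pseudo-spectral practice worth considering: keep the transform-space operators behind a thin interface with tiny analytic unit tests, so the FFT backend (FFTW on CPU, cuFFT on GPU) can be swapped without touching the physics. The core operation is small enough to test exactly; a NumPy sketch of a spectral derivative used that way (an illustration of the testing idea, not the poster's code):

```python
import numpy as np

# Core pseudo-spectral operation: differentiate by multiplying by ik
# in Fourier space. For a band-limited input such as sin(x), the
# result is exact to machine precision, which makes a sharp unit test.
n = 64
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
u = np.sin(x)

k = 1j * np.fft.fftfreq(n, d=1.0 / n)   # wavenumbers 0, 1, ..., -1 times i
du = np.fft.ifft(k * np.fft.fft(u)).real
```

Tests like this (derivative of a single mode, Parseval check, round-trip transform) catch the classic porting pitfalls, wavenumber ordering, normalization conventions, and the Nyquist mode, before any GPU-specific debugging starts.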
We have witnessed many advances over the last years in the automatic mapping of high-level languages to hardware, namely C-to-HDL translators. Examples are PICO-NPA, LegUp, OpenRCL, AutoPilot, FCUDA and quite a few others. More recently, we had contact with the Silicon-to-OpenCL (SOpenCL) tool from the University of Thessaly. More recently still, the Vivado compiler from Xilinx has been released, and the OpenCL-RTL program from Altera has been announced to come out soon. Maxeler has also proposed a quite interesting Java-based solution.
My question is: do you believe in any of them? How do you see the future of this area of automatic translation and synthesis (for FPGAs and ASICs) over the next, let us say, 10 years? Can you share your experience (and vision) on the subject?
CPU programming gives good results, especially when a fast response is required. But in our programming tests, the GPU's vector computations showed significant improvements in overall processing time. No less important is the fact that you can do a lot of calculations at a much lower power level.