Parallel Algorithms - Science topic

Explore the latest questions and answers in Parallel Algorithms, and find Parallel Algorithms experts.
Questions related to Parallel Algorithms
  • asked a question related to Parallel Algorithms
Question
5 answers
I am looking to solve for the first few eigenvectors of a really large sparse symmetric matrix (1M x 1M), and I would like suggestions as to which library is best suited for this purpose. I need to compute the eigenvectors within a few seconds. I have used the Spectra library, which applies the iterative Arnoldi method to solve for the eigenvectors, but the computation time has let me down. I have also looked into CUDA/MKL, but these libraries tend to take advantage of existing architectures. Is there some alternative, or are CUDA/MKL/OpenCL the standard these days?
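For reference, a minimal sketch of the call pattern described above, using Spectra 1.x names (the matrix M, the number of eigenpairs nev, and the Krylov subspace size ncv are placeholders, not values from the question):

#include <Eigen/SparseCore>
#include <Spectra/SymEigsSolver.h>
#include <Spectra/MatOp/SparseSymMatProd.h>

// Computes the first few eigenpairs of a large sparse symmetric matrix.
void first_eigenpairs(const Eigen::SparseMatrix<double>& M)
{
    // Wrap M so the solver only ever needs matrix-vector products.
    Spectra::SparseSymMatProd<double> op(M);

    const int nev = 4;   // eigenpairs wanted
    const int ncv = 20;  // Krylov subspace size, nev < ncv <= n
    Spectra::SymEigsSolver<Spectra::SparseSymMatProd<double>> eigs(op, nev, ncv);

    eigs.init();
    eigs.compute(Spectra::SortRule::LargestAlge);
    if (eigs.info() == Spectra::CompInfo::Successful)
    {
        Eigen::VectorXd evals = eigs.eigenvalues();
        Eigen::MatrixXd evecs = eigs.eigenvectors();  // one column per eigenpair
    }
}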
  • asked a question related to Parallel Algorithms
Question
3 answers
I was wondering whether anyone knows about an automated tool to collect GPU kernels features, i.e., stencil dimension, size, operations, etc. Such tools are widely available for CPU kernels.
Relevant answer
Answer
Well, I only work with CUDA GPUs for servers, so I'm not aware of how things work on embedded/mobile platforms.
But basically you would need a profiler. If you are dealing with smartphones, you could try the Snapdragon/Adreno profiler.
Also, looking at their software I see that they provide an LLVM compiler. If there is an open-source version available, you can modify it to extract the features you want.
  • asked a question related to Parallel Algorithms
Question
4 answers
In our research institute, we are developing a new automatic model selection technique using HPC in econometrics. Massive parallelism is applied to reduce the selection algorithm's running time. To examine our code's properties (speedup, latency, crashing probabilities, etc.), we will focus on alternative oil-price models. We would be very grateful if anyone could provide us with specific information about benchmarks or forecast comparisons on this subject.
Relevant answer
Answer
Demian, please send your question to my colleague Nicolás Di Sbroiavacca, who is the specialist on that subject: ndisbro@fundacionbariloche.org.ar
Thanks, Adrian
  • asked a question related to Parallel Algorithms
Question
7 answers
How can I convert NetCDF4 files to NetCDF3 files with nccopy?
My system is Ubuntu 14.04, and netcdf-4.3.3.1 has been installed.
_____________________terminal message_________________
root@xx-desktop:~/Desktop/cc# nccopy
nccopy: nccopy [-k kind] [-[3|4|6|7]] [-d n] [-s] [-c chunkspec] [-u] [-w] [-[v|V] varlist] [-[g|G] grplist] [-m n] [-h n] [-e n] [-r] infile outfile
[-k kind] specify kind of netCDF format for output file, default same as input
kind strings: 'classic', '64-bit offset',
'netCDF-4', 'netCDF-4 classic model'
[-3] netCDF classic output (same as -k 'classic')
[-6] 64-bit-offset output (same as -k '64-bit offset')
[-4] netCDF-4 output (same as -k 'netCDF-4')
[-7] netCDF-4-classic output (same as -k 'netCDF-4 classic model')
[-d n] set output deflation compression level, default same as input (0=none 9=max)
[-s] add shuffle option to deflation compression
[-c chunkspec] specify chunking for dimensions, e.g. "dim1/N1,dim2/N2,..."
[-u] convert unlimited dimensions to fixed-size dimensions in output copy
[-w] write whole output file from diskless netCDF on close
[-v var1,...] include data for only listed variables, but definitions for all variables
[-V var1,...] include definitions and data for only listed variables
[-g grp1,...] include data for only variables in listed groups, but all definitions
[-G grp1,...] include definitions and data only for variables in listed groups
[-m n] set size in bytes of copy buffer, default is 5000000 bytes
[-h n] set size in bytes of chunk_cache for chunked variables
[-e n] set number of elements that chunk_cache can hold
[-r] read whole input file into diskless file on open (classic or 64-bit offset format only)
infile name of netCDF input file
outfile name for netCDF output file
netCDF library version 4.3.3.1 of Nov 6 2015 20:09:00 $
root@xx-desktop:~/Desktop/cc# nccopy -k classic pres.nc pres3.nc
NetCDF: Unknown file format
Location: file nccopy.c; line 1354
root@xx-desktop:~/Desktop/cc#
________________________________________________________
Relevant answer
Answer
Please try the latest version of NCO (4.5.4). The "Unknown file format" error usually means the nccopy you are running was built without netCDF-4/HDF5 support, so it cannot read the input at all; NCO's ncks tool can rewrite the file in classic format, e.g. ncks -3 pres.nc pres3.nc
git clone https://github.com/nco/nco.git; cd nco; git checkout 4.5.4
  • asked a question related to Parallel Algorithms
Question
3 answers
Can anyone please suggest partitioning algorithms to partition a vision algorithm (its computations or workload) so as to expose opportunities for parallel execution by decomposing the computations into small tasks?
Relevant answer
Answer
In a CNN, each neuron can be processed independently and corresponds only to a small portion of the input. Why not make the neuron (or a small area of adjacent neurons) the processing block for each thread? I presume you might need a synchronization event once their dot products are done, to share results and inputs from neighbors, but a CNN, like other neural nets, is fairly easy to parallelize, one layer at a time. A sketch of this idea follows.
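A minimal C++ sketch of that decomposition, under the simplifying assumption of a single-channel "valid" convolution (all names are illustrative): rows of the output map become independent tasks, and the only synchronization is the thread join at the layer boundary.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Each output neuron reads only a small kW x kH window of the input, so
// disjoint row bands of the output can be computed fully independently.
void conv2d_layer(const std::vector<float>& in, std::size_t inW, std::size_t inH,
                  const std::vector<float>& k, std::size_t kW, std::size_t kH,
                  std::vector<float>& out, unsigned nThreads)
{
    const std::size_t outW = inW - kW + 1, outH = inH - kH + 1;
    out.assign(outW * outH, 0.0f);

    auto rows = [&](std::size_t y0, std::size_t y1) {
        for (std::size_t y = y0; y < y1; ++y)
            for (std::size_t x = 0; x < outW; ++x) {
                float acc = 0.0f;
                for (std::size_t ky = 0; ky < kH; ++ky)
                    for (std::size_t kx = 0; kx < kW; ++kx)
                        acc += in[(y + ky) * inW + (x + kx)] * k[ky * kW + kx];
                out[y * outW + x] = acc;  // dot product of window and kernel
            }
    };

    std::vector<std::thread> pool;
    const std::size_t chunk = (outH + nThreads - 1) / nThreads;
    for (unsigned t = 0; t < nThreads; ++t) {
        const std::size_t y0 = t * chunk, y1 = std::min(outH, y0 + chunk);
        if (y0 < y1) pool.emplace_back(rows, y0, y1);
    }
    for (auto& th : pool) th.join();  // synchronize once, at the layer boundary
}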
  • asked a question related to Parallel Algorithms
Question
8 answers
The system contains multiple processors of different types (CPU, GPU, and possibly soft cores).
Relevant answer
Answer
Maybe not the "implementation", but the considerations of these "old" publications are still valid :)
Looking at your list some things seem to be clear:
1. Video preprocessing is best scheduled on CUDA. The same could apply to feature extraction (from the video stream). This could be an application with or without explicit scheduling - depending on the computing power available.
2. Image processing would also fit CUDA, but could as well be executed on a DSP or CPU (if the image rate is not too high).
3. Operating systems with explicit scheduling are typically implemented on CPUs. This may also be the unit to implement the voter - presuming that the more specialized execution units make their results available, e.g., memory mapped.
Whether you need "partitioned" scheduling (which I'm interpreting as "unix/win style" scheduling) depends on a lot of issues such as software/driver availability, communication requirements, etc. Thinking of at least 2 CPUs, one might run on unix, the other on an AUTOSAR-like OS (non-partitioned), which is more suitable for hard real-time operations.
As you see, you not only have quite a number of execution units - you are also free to implement a multi-OS system to best serve your needs :)
  • asked a question related to Parallel Algorithms
Question
6 answers
I have found a paper on a parallel version of the rate-monotonic algorithm, so it should be possible to obtain parallel versions of DM/EDF. I would like confirmation from a knowledgeable person on this before starting an M.Tech project along the same lines.
The paper on the parallel version of the rate-monotonic algorithm is attached herewith.
Relevant answer
Answer
Zakaria, please go through the paper attached to the original question and tell us what more could be done beyond it. Thanks.
  • asked a question related to Parallel Algorithms
Question
4 answers
MARE is a programming model and a run-time system that provides simple yet powerful abstractions for parallel, power-efficient software:
– Simple C++ API allows developers to express concurrency
– User-level library that runs on any Android device, and on Linux, Mac OS X, and Windows platforms
Relevant answer
Answer
Hi there,
I'm working on a similar library. Mine is called HPX. It provides a C++11/14-compliant interface for task-based programming (async, future, etc.). We extended the standard in a straightforward manner to also support distributed-memory parallelism through the same interfaces. In addition, we added some more API functions to make everything slightly more composable. Check it out at http://github.com/STEllAR-GROUP/hpx - a small example of the interface style it follows is sketched below.
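A minimal sketch of the std::async/std::future interface style the answer refers to; HPX mirrors these names in its own namespace (hpx::async, hpx::future), so the same pattern carries over to it. The split into two tasks is just for illustration.

#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// Task-based parallel sum: launch one half of the range as an asynchronous
// task, compute the other half on the calling thread, compose via the future.
double parallel_sum(const std::vector<double>& v)
{
    const auto mid = v.begin() + v.size() / 2;
    std::future<double> lower = std::async(std::launch::async,
        [&v, mid] { return std::accumulate(v.begin(), mid, 0.0); });
    const double upper = std::accumulate(mid, v.end(), 0.0);
    return lower.get() + upper;  // blocks until the task finishes
}

int main()
{
    std::vector<double> v(1 << 20, 1.0);
    std::cout << parallel_sum(v) << '\n';  // prints 1048576
}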
  • asked a question related to Parallel Algorithms
Question
52 answers
I have noticed that CUDA is still preferred for parallel programming despite it only being possible to run the code on NVIDIA graphics cards. On the other hand, many programmers prefer OpenCL because it targets heterogeneous systems and can be used with GPUs or multicore CPUs. So I would like to know: which one is your favorite, and why?
Relevant answer
Answer
I think this debate is quite similar to DirectX vs. OpenGL. The only difference is that DirectX is bound to a single operating system, whereas CUDA is not. Both DirectX and CUDA are proprietary solutions whereas OpenGL and OpenCL are open standards (actually, both managed by the Khronos Group).
The question is really about your goals. Do you want maximum performance? Then CUDA might be the best choice: updates for CUDA occur quite often, supporting the most recent features of NVIDIA GPUs. But you have vendor lock-in and can only use NVIDIA cards. Looking at other discussions, AMD graphics cards can provide "more bang for the buck", especially for double-precision operations. NVIDIA graphics cards artificially reduce the throughput of double-precision computations, so if you want to use CUDA with double precision you need to buy the more expensive Tesla compute cards. In summary, if maximum performance (i.e., every last percent) is your concern (and money is not), then CUDA is the best choice for your problem.
If you don't buy a new graphics card every year to get the best performance, or you don't have the money for an expensive card, OpenCL (and maybe AMD) is a valid alternative. Also, if you don't spend too much time on optimization and don't learn every new language/GPU feature as soon as it comes out, there is not much of a performance difference between OpenCL and CUDA in general.
My choice is really simple: I've chosen OpenCL. The reason is that I'm writing commercial software which should work with the hardware the client already has; the client should not be forced to buy an NVIDIA card. Furthermore, hardware as far back as 2010/2012 should still be supported, which means that I cannot use modern hardware features. With these restrictions there is not much of a performance difference between CUDA and OpenCL. But the good thing is that I don't have to write my code twice - once for the GPU and once for the CPU. OpenCL kernels will run on the CPU as well. (The reason for this is that even if a client does not have an OpenCL-capable GPU, the software should still work.) The device-selection fallback this relies on is sketched below.
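A minimal sketch of that GPU-then-CPU device fallback using the standard OpenCL host API (error handling trimmed; a production version would iterate over all platforms, since GPU and CPU drivers often live on different ones):

#include <CL/cl.h>

// Prefer a GPU device; fall back to a CPU device so the same OpenCL
// kernels still run when no OpenCL-capable GPU is present.
cl_device_id pick_device()
{
    cl_platform_id platform = nullptr;
    clGetPlatformIDs(1, &platform, nullptr);  // first available platform

    cl_device_id dev = nullptr;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr)
            != CL_SUCCESS)
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &dev, nullptr);
    return dev;  // nullptr if neither device type exists on this platform
}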
Looking at the battle of DirectX vs. OpenGL, OpenGL has a good chance to win in the long run, as Valve is switching to OpenGL and bringing out its own operating system, SteamOS. The reasons are portability and compatibility. For the same reasons OpenCL might win over CUDA in the long run - not now, but maybe 10-20 years from now. OpenCL will not die, because it is necessary for Mac OS and for AMD hardware. The question is whether NVIDIA will fight for CUDA the same way Microsoft does for DirectX. In my opinion the cost of DirectX for Microsoft is already too high.
  • asked a question related to Parallel Algorithms
Question
3 answers
Please suggest materials for parallel programming for HPC for the above requirement.
Relevant answer
Answer
Hello, Rashmi! If you don't have an NVIDIA graphics card in your PC, it is going to be more difficult for you to learn parallel programming using CUDA. However, you can find the PDF of the CUDA C Programming Guide on the internet; search for it. I would be grateful if my answers were appreciated (VOTE UP) by you in order to help increase my RGSCORE.
  • asked a question related to Parallel Algorithms
Question
6 answers
AdaBoost is an algorithm for increasing the accuracy of a classifier. It is generally used in data warehousing and mining.
Relevant answer
Answer
The training of an AdaBoost ensemble in its purest form can hardly be parallelized. The problem is that the weighting of the examples in the next iteration of the algorithm depends on the performance of the previous iteration. This implies that a new iteration cannot be started before the previous one finishes. On the contrary, classification through an AdaBoost ensemble is easily parallelized: each weak classifier can work independently of the others, and thus it can be executed on its own thread or remotely (see the sketch below). I've skimmed through the paper suggested by Stephane Genaud, and indeed it seems the authors experiment with a variation on the algorithm, not with its original formulation (they create a number of workers and a master; the workers train a number of classifiers and return the best to the master, and the master builds the final ensemble).
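A minimal C++17 sketch of that parallel classification step (the Weak struct and predict function are illustrative, not from any library): each weak classifier votes independently, so the weighted vote can be reduced in parallel, whereas training would still have to execute its rounds in order.

#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// One trained weak classifier: its vote h(x) in {-1, +1} and its weight.
struct Weak {
    double alpha;
    std::function<int(const std::vector<double>&)> h;
};

// Ensemble prediction: a parallel weighted sum of independent votes.
int predict(const std::vector<Weak>& ensemble, const std::vector<double>& x)
{
    const double score = std::transform_reduce(
        std::execution::par, ensemble.begin(), ensemble.end(), 0.0,
        std::plus<>{},                                    // combine partial sums
        [&x](const Weak& w) { return w.alpha * w.h(x); }  // one vote per task
    );
    return score >= 0.0 ? +1 : -1;
}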