# Parallel Computing - Science method

Explore the latest questions and answers in Parallel Computing, and find Parallel Computing experts.
Questions related to Parallel Computing
• asked a question related to Parallel Computing
Question
I want to solve an ODE system while varying a suitable parameter, say \alpha. I have divided the job across workers, and for each value of \alpha I need to write a certain measure of the system to a data file. From googling and reading some documentation, I learned that having all workers write to a single file is a recipe for a corrupt file; that is not the way to do it, and I need to create multiple files. My question: is there a way (within parallel programming) to create files with distinct, predictable names, and then scan the data from all of them to build a single combined data file?
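One standard pattern, sketched here in Python (the question doesn't name a language), is to give each worker its own uniquely named output file and let the main process scan and merge them afterwards. Here run_ode() is a placeholder for the real solver; the file names are illustrative.

```python
# Sketch: each worker handles one parameter value, writes its own uniquely
# named file, and the main process merges the per-worker files afterwards.
import os
from multiprocessing import Pool

def run_ode(alpha):
    """Placeholder for the real ODE solve; returns some scalar measure."""
    return alpha ** 2  # stand-in for the computed measure

def worker(alpha):
    # One file per parameter value -> no two processes share a file handle.
    fname = f"result_alpha_{alpha:.4f}.dat"
    with open(fname, "w") as f:
        f.write(f"{alpha}\t{run_ode(alpha)}\n")
    return fname

if __name__ == "__main__":
    alphas = [0.1, 0.2, 0.3, 0.4]
    with Pool(4) as pool:
        files = pool.map(worker, alphas)
    # Merge: scan all per-worker files into one combined data file.
    with open("combined.dat", "w") as out:
        for fname in sorted(files):
            with open(fname) as f:
                out.write(f.read())
            os.remove(fname)  # optional cleanup of the partial files
```

The same idea carries over to MATLAB: inside a parfor loop, build the file name from the parameter value (or worker index) with sprintf, then concatenate the files in a serial loop afterwards.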
• asked a question related to Parallel Computing
Question
Hello everyone,
I've implemented an FEM code in MATLAB, and now I'd like to make it faster. Since I have to perform nonlinear dynamic analysis, I have a lot of iterations, each one requiring assembly of the full stiffness matrix K.
What I would like to do is parallelize the assembly of K. So far I have tried parpool and spmd: the latter with poor results, the former performing nicely (a speedup factor of about 2, despite using 10 cores) but only up to a certain number of elements. Beyond that threshold, the parallel computation (14 cores) takes as much as 10 times longer than the single-core version.
I understand this may be related to communication overhead between the "master" and the workers and/or to slicing, but I can't seem to get the hang of it.
Does anyone have suggestions, or can anyone point me to some useful material on this specific matter?
Jacopo
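The usual cause is that each element-level task is too small, so the per-task dispatch cost dominates. A common fix is to batch many elements into one task per worker. A Python sketch of the idea (not MATLAB; assemble_element() is a toy stand-in for one element's stiffness contribution):

```python
# Batching many tiny element-level tasks into a few large chunks reduces
# master/worker communication overhead, which is often what makes a naive
# parallel assembly slower than the serial loop.
from multiprocessing import Pool

def assemble_element(e):
    """Stand-in for computing one element's contribution to K."""
    return e * 0.5

def assemble_chunk(elements):
    # Each worker message carries a whole chunk of elements, so the per-task
    # dispatch cost is paid once per chunk, not once per element.
    return sum(assemble_element(e) for e in elements)

def parallel_assemble(elements, nworkers=4):
    # Split the element list into one contiguous chunk per worker.
    n = max(1, len(elements) // nworkers)
    chunks = [elements[i:i + n] for i in range(0, len(elements), n)]
    with Pool(nworkers) as pool:
        return sum(pool.map(assemble_chunk, chunks))

if __name__ == "__main__":
    total = parallel_assemble(list(range(1000)))
```

In MATLAB terms: have each parfor iteration assemble a block of elements (returning triplet lists for sparse()), rather than one iteration per element.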
• asked a question related to Parallel Computing
Question
I have a problem: measurements show the opposite of what convention assumes.
I test soil specimens. We try to decode how much deformation a certain loading (force) sequence will generate.
After 6 years of testing, I noticed that the reaction force behaves as a function of deformation, not the way convention describes it.
It's a big problem. All software is designed to model deformation as a function of the applied force. But in reality, the stiffness hysteresis loops behave such that force (reaction) is a function of deformation.
The observations, the empirical evidence, pointed to a theory abandoned 40 years ago (strain-space plasticity, by P. J. Yoder). His theory seems to be not only compatible with the observed physical behaviour, but also amenable to GPU-parallel computation.
So we have something that is both:
1. Potentially more physically correct.
2. For the first time, supercomputer compatible.
I am stuck building robots for industrial projects, which are meant to provide "quick profit" to the faculty. The research is not financed. All observations were made in spare time, at times using life savings...
When experts are shown the test results, they become red in the face, angry, and say they "have not seen anything like it". After an hour of questions they cannot find any flaw in the testing machines. And... leave, never to be heard from again.
P. J. Yoder's theory was defended in public defenses multiple times in the 80s. No one found flaws in it or proved it wrong. But it was forgotten, ignored and abandoned.
Industry asks for code compatible with existing software (return on investment)... and I alone cannot code a full software package. Frankly, I would rather keep testing and try to prove my assumptions wrong. But the more I test, the more anomalies and paradoxes are observed, exposed and resolved on the topic.
After a quick look at your research: do you know the sand-pile model and Per Bak's self-organized criticality? He wrote a book about it.
It seems close to your own research. My first research project was on dynamic recrystallization. It happened that I explained the non-linearity of stress-strain curves theoretically, for the first time, with a very primitive program (see my PhD and the open software that was rewritten in C++ and Qt by Jakub). The reaction of my colleagues was total refusal. Now it has become the standard.
• asked a question related to Parallel Computing
Question
How do I run NAMD 2.14 on cores from multiple nodes in Slurm? A batch file would be appreciated.
This is the one I am using:
#!/bin/sh
#SBATCH --job-name=grp_prod
#SBATCH --partition=cpu
#SBATCH --time=30-50:20:00
mpirun -np 48 /lfs01/workdirs/cairo029u1/namd2_14/NAMD_2.14_Linux-x86_64-multicore/namd2 +auto-provision step4_equilibration.inp > eq.log
This batch file produces the attached log file (only the first part is attached).
I'm not exactly sure of the answer, but have you considered reading the "Choosing CPU queue" section at https://hpc.bibalex.org/quick-start#submission? Alternatively, you could email the HPC support team and ask them.
• asked a question related to Parallel Computing
Question
I am looking for a user-friendly neuro-evolution package with good parallelization. I wish to explore some ideas quickly (a short learning curve is preferred), ideally in the R or Python programming languages.
• asked a question related to Parallel Computing
Question
Hello,
I am running a finite element (FE) simulation using the FE software PAM-STAMP. I have noticed that when I run the solver in parallel with 4 cores, the repeatability of the results is poor; when I press 'run' twice for the same model, it gives me inconsistent answers.
However, when I run the simulation on a single core, it gives consistent answers over multiple runs of the same model.
This has led me to believe that the simulation is divided differently between the cores each time the solver is run.
Is there a way to still run the simulation in parallel (multiple cores) but have the solver divide the calculation in the same manner each time, to ensure consistency across runs?
Thanks,
Hamid
I had similar problems simulating crash boxes in parallel. Although I am no expert in HPC, I think it is because the round-off behaviour depends on how the work is split across cores, and the interactions between cores at the domain boundaries also play an important role. If you could force the exact same domain assignments to the exact same cores as in a previous run, you should get the same results. However, I doubt the PAM software is that flexible.
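The round-off point can be shown in a few lines (independent of PAM-STAMP): floating-point addition is not associative, so summing the same numbers in a different order, as different domain decompositions do, gives slightly different answers.

```python
# Two orderings of the same four numbers give two different sums in IEEE
# double precision, because 1e16 + 1.0 rounds back to 1e16.
a = [1e16, 1.0, -1e16, 1.0]

left_to_right = (((a[0] + a[1]) + a[2]) + a[3])   # the 1.0 is absorbed
pairwise      = (a[0] + a[2]) + (a[1] + a[3])     # like two "domains" merged

print(left_to_right)  # 1.0
print(pairwise)       # 2.0
```

This is an extreme example, but the same mechanism, applied billions of times across differently partitioned domains, produces the run-to-run scatter described above.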
• asked a question related to Parallel Computing
Question
Parallel computing in MATLAB is not supported for the fminsearch (Nelder-Mead simplex search) method. We would be grateful if you could share your experience running this method in parallel in order to speed up the optimisation of expensive problems.
Parallel Computing with MATLAB by: Tim Mathieu
Use Parallel Computing for Response Optimization
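Since the simplex iteration itself is inherently serial, the usual workaround is to parallelize *around* it: run many independent starts at once and keep the best. A stdlib Python sketch of that multi-start pattern (a crude random local search stands in for the real Nelder-Mead call, and objective() is a toy placeholder for the expensive model):

```python
# Multi-start optimisation: each start runs on its own worker process;
# only the best (x, f) pair is kept at the end.
import random
from multiprocessing import Pool

def objective(x):
    """Placeholder cost function (minimum at x = 3)."""
    return (x - 3.0) ** 2

def local_search(x0, steps=2000, scale=0.5, seed=0):
    """Crude stand-in for one fminsearch run from starting point x0."""
    rng = random.Random(seed)
    best_x, best_f = x0, objective(x0)
    for _ in range(steps):
        cand = best_x + rng.uniform(-scale, scale)
        f = objective(cand)
        if f < best_f:
            best_x, best_f = cand, f
    return best_x, best_f

def multistart(starts):
    with Pool(len(starts)) as pool:
        results = pool.starmap(local_search, [(x0, 2000, 0.5, i)
                                              for i, x0 in enumerate(starts)])
    return min(results, key=lambda r: r[1])

if __name__ == "__main__":
    x, f = multistart([-10.0, 0.0, 10.0, 20.0])
```

In MATLAB the equivalent is a parfor loop over starting points, each iteration calling fminsearch; MultiStart from the Global Optimization Toolbox with UseParallel does the same thing.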
• asked a question related to Parallel Computing
Question
Hello,
I'm trying to run my Abaqus simulation on GPUs. I have a PC with an AMD Ryzen™ 5 2400G with Radeon™ RX Vega 11 Graphics.
Calling the GPU from CAE and from the command window doesn't work. I found a possible solution using CUDA, which I have not tried yet since it refers to NVIDIA hardware. Other posts suggest using OpenCL, but I cannot find where to download it.
Abaqus GPU acceleration is built on NVIDIA's CUDA, so I'm not sure AMD graphics cards are supported for Abaqus parallel computing at all.
• asked a question related to Parallel Computing
Question
In the literature I find a lot of papers discussing the problems of parallel computing, but I didn't find anything relevant on how to define the optimal number of compute nodes that will give us the maximum speedup.
#HPC #parallel_computing #scientific_simulation
#graph_partitioning
In the field of high-order finite element methods, we have a good idea of how to achieve 80% parallel efficiency on HPC systems. The rule of thumb we follow is 2,000 points per MPI rank for CPU systems (e.g. BlueGene/Q, Sequoia, etc.). For GPU systems (e.g. Summit, Sierra, etc.) the number is 2 million points (since latency is very expensive). This is based on many years of development. You can read this paper, which has many benchmarks and ideas you can follow to get an answer to your question.
• asked a question related to Parallel Computing
Question
I am currently interested in solving large linear systems on a distributed parallel computing grid using the ScaLAPACK library. I wonder if there is a quadruple precision version (Or any other alternative library which provides quadruple precision for parallel distributed computing)?
Hi
Interesting question. My guess is that if you want double-double (quad) precision, the compiler must support the data type, and it must be supported for your AMD or Intel processor. Both seem to be available, and you will need the relevant flags when compiling the library.
• asked a question related to Parallel Computing
Question
Hi, I am trying to compile Quantum ESPRESSO for the AMD Ryzen 7 4800H CPU in my personal laptop. Since Intel MKL isn't well optimised for AMD processors, I was wondering if there are other alternatives. For instance, is OpenBLAS a good option, or should I just go ahead with the internal libraries already in QE? Also, has anyone tested AOCL 2.2 with QE? I read recently on the QE website that AOCL 2.1 gave some weird results. Finally, which MPI should I choose for parallel computing? Is Open MPI fine, or is there a better alternative anyone would like to suggest? Thank you.
I have no idea about the software, nor about how you are using your device. If you are going to run on shared memory, then OpenMP should be preferred over MPI. While Open MPI is easier to use, my preference is MPICH2. Further, the BLAS should not be OpenBLAS, which I found to clash with AMD drivers; compile the BLAS yourself using GNU compilers in combination with the MPICH wrappers.
• asked a question related to Parallel Computing
Question
I am using Fluent 14.5 on a parallel computing system. I am using a UDF for property calculation as well as for nucleation and growth in a subdomain/cell zone.
After solving one iteration it gives errors like "primitive error", and a Cortex error 'Ansys.Fluent.Cortex.CortexNotAvailableException' is thrown.
Can anyone please tell me why this is happening?
Hi everyone,
I faced the same problem too. It only occurs if you are using sheet bodies (2D bodies) in a 3D geometry. (If you are working in 2D you should change the analysis type to 2D.)
I solved the problem as follows:
1) Before starting, right-click Geometry > Properties and select Analysis Type 2D.
2) Save after meshing.
Then you can go to Setup.
Could you please indicate if it worked or not?
Thanks,
Emin
• asked a question related to Parallel Computing
Question
Hello,
I am using GROMACS 5 and want to choose how many processors to use in a simulation. For choosing GPUs, the mdrun documentation makes it easy, but for CPUs I haven't found how to do it.
regards
gmx mdrun -v -nt x -deffnm md_0_100
where x = the number of threads to use;
or install the MPI version for even faster calculations.
• asked a question related to Parallel Computing
Question
Hi, I am looking for some good online (preferably free) resources where I can learn parallel computing from scratch using the Python or R programming languages. Thanks.
For Python, the packages xarray and dask can parallelize lots of data-crunching tasks, either on your own computer or on an HPC system.
A good starting point is this (long) video:
or the dask tutorial on GitHub: https://github.com/dask/dask-tutorial
• asked a question related to Parallel Computing
Question
I want to apply parallel computing techniques to decrease the calibration time. Using the code below, I can evaluate 10 sets of 11 parameters simultaneously with a PARFOR loop, but I cannot get the cost-function value simultaneously for each set of parameter values during the GA calibration. Your helpful comments and suggestions are highly appreciated.
% GA Parameteres
MaxIt=20; % Maximum Number of Iterations
nPop=10; % Population Size
NoVar=11; % No. of parameters to be calibrated
delete(gcp('nocreate')) % shut down any previous parallel pool
parpool('local') % activate parallel computing
parfor i=1:nPop
    pop(i).Position=unifrnd(VarMin,VarMax,VarSize);
    pop(i).Cost=CostFunction(pop(i).Position);
end
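The pattern in the snippet above, sketched in Python for comparison: evaluate the cost of every individual in the GA population with a pool of workers, one cost evaluation per worker task. Here cost_function() is a toy placeholder for the real 11-parameter calibration model.

```python
# Parallel evaluation of a GA population's cost values: the direct
# analogue of putting CostFunction inside a parfor loop.
import random
from multiprocessing import Pool

N_VAR = 11     # parameters per individual
N_POP = 10     # population size

def cost_function(position):
    """Placeholder cost: sum of squares (the real code runs the model)."""
    return sum(p * p for p in position)

def random_individual(rng, lo=-1.0, hi=1.0):
    return [rng.uniform(lo, hi) for _ in range(N_VAR)]

if __name__ == "__main__":
    rng = random.Random(42)
    population = [random_individual(rng) for _ in range(N_POP)]
    with Pool() as pool:
        costs = pool.map(cost_function, population)
```

Note that this only pays off when one cost evaluation is expensive relative to the dispatch overhead, which is the case for calibration runs like the one described.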
• asked a question related to Parallel Computing
Question
I am running various Abaqus simulations with a model size of about 1.5 million degrees of freedom in total. To speed up the calculations, I am trying to decide what number of CPUs would be optimal and what the influencing factors are (model size, steps, time steps, outputs, hardware, etc.). I'm interested in the question of at what CPU count the writing and merging of partial results between the different cores outweighs the benefit of using multiple CPUs.
Surprisingly, though, the HPC Advisory Council studies seem to be a bit out of date.
• asked a question related to Parallel Computing
Question
Hi,
I am trying to do a Quantum ESPRESSO SCF calculation on an Intel Xeon Gold 5120 CPU @ 2.20 GHz (2 processors). It has 56 cores and 96 GB of RAM.
I am trying to do a parallel calculation on this workstation by using:
mpirun -np (no of cores) pw.x -npool (no of pools) -inp si.pw.in
According to internet sources, I have tried to improve the performance by setting the OMP_NUM_THREADs=1 and I_MPI_PIN_DOMAIN=1.
Can anyone please guide me on how to choose the optimum number of cores and the number of pools on which I should run the calculation?
The input file is attached below.
The FFT grid dimensions are (48 48 48) and the maximum number of k-points is 40000.
Subsidiary Questions:
1. Should the subspace diagonalization in the iterative solution of the eigenvalue problem be run by a serial algorithm or by an ELPA distributed-memory algorithm?
2. Should the maximum dynamical memory per process be high or low for better performance?
There are different ways to get the best parallelization performance from QE. Here are a few tips:
1. If your calculation has a large number of k-points, -npool can strongly speed up the run. It must be a divisor of the total number of k-points. I personally set it to 4.
2. In combination with -npool, you can further improve performance using band groups (-bgrp). This allows the program to split the KS states across the selected group of processors.
For example, assuming you have 32 cores, you can try something like this:
mpirun -np 32 pw.x -npool 4 -bgrp 4 -input pw.in
3. You can also parallelize the matrix diagonalization routines using -ndiag. Choose it as a square number no larger than the number of cores per pool (here 32/4 = 8).
For example:
mpirun -np 32 pw.x -npool 4 -ndiag 4 -input pw.in
4. Finally, you can use the task-group option -ntg. But since you don't have many cores, I think this is not relevant.
• asked a question related to Parallel Computing
Question
I have built a DEM model in Abaqus in which three cubes fall onto the floor. The particles in two of the cubes (named Solid-1 and Solid-2 in the model) carry a Beam-MPC constraint, while the third cube (named Particle-3 in the model) has no constraint among its particles. The three cubes fall onto the rigid floor under gravity.
The purpose of the example is to test whether a DEM model with both particle clusters and free particles can run in parallel.
However, while the example runs on 1 CPU, it fails in parallel. Attached are the .inp file and a snapshot of the error message. I tried using *DOMAIN DECOMPOSITION to decompose the Particle-3 domain, but that failed too.
I would really appreciate it if anyone could help me fix this problem.
Hi Alexander,
I have tried it many times and asked some experts for help. It seems that a DEM simulation with MPC constraints cannot run on multiple CPUs. For DEM problems, the particles are not allocated to CPUs in order or evenly, meaning some particles of a cubic cluster may be allocated to CPU A while the other particles of that cluster go to CPU B. Because of this allocation strategy, Abaqus cannot decouple the model and pops up the error.
If you are interested in modeling a particle cluster, maybe you should try another constraint type, but I don't know which kind of constraint works with parallel computing.
Zehua
• asked a question related to Parallel Computing
Question
We are undergraduate students and would like to work on simulating tsunamis. Could you suggest study material/ebooks/lectures to learn SPH basics from?
For 2D/3D simulation, the free code accompanying the book by G. Liu & M. Liu, Smoothed Particle Hydrodynamics, is useful. For example, 3D water-drop formation can be reproduced fairly well with the book's methodology and code.
…
• asked a question related to Parallel Computing
Question
Amdahl's law claims a constant 1/f upper limit on speedup, even if infinitely many processors are available, where f is an assumed inherently sequential fraction of the workload. A paper entitled "The Refutation of Amdahl’s Law and Its Variants" (full text available on ResearchGate) shows that inherently sequential fractions of workloads do not exist in theory and that parts of workloads possibly remaining sequential in practice are irrelevant to speedup if the growth rate of the time requirement of the part executed in parallel is higher than that of the sequential part.
Polynomial speedup with a polynomial number of processors exists for a huge number of problems in the class NC. Polynomial speedup is possible even for some problems not known to be in NC. We say that a problem has polynomial speedup if there exists an eps > 0 such that the speedup is at least n^eps, where n is the problem size.
Graphics Processing Units (GPUs) were introduced for the solution of the hidden-surface problem. The best possible sequential algorithm for the hidden-surface problem takes a*n² time, while a parallel algorithm takes b*log n time using n²/log n processors, where n is the number of polygon edges in the input and a and b are positive constants. Therefore, the hidden-surface problem has polynomial speedup a*n²/(b*log n) = Θ(n²/log n).
The successful application of GPUs for general-purpose computations demonstrates that Amdahl's law is also broken in practice.
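The argument above can be checked numerically: with a fixed serial fraction f, Amdahl's bound 1/f holds no matter how many processors are used, but if the parallelizable part grows faster with problem size n than the serial part, the effective serial fraction shrinks and the bound recedes. A small sketch (the linear/quadratic growth rates are illustrative assumptions):

```python
# Amdahl's cap vs. a workload whose parallel part grows faster than its
# serial part, as in the hidden-surface example above.
def amdahl_speedup(f, p):
    """Classical Amdahl speedup with serial fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

def effective_serial_fraction(n, c_serial=1.0, c_parallel=1.0):
    """Serial part ~ c_serial * n, parallel part ~ c_parallel * n**2."""
    serial, parallel = c_serial * n, c_parallel * n * n
    return serial / (serial + parallel)

# Fixed f = 0.05: speedup approaches 1/0.05 = 20 no matter how many cores.
print(amdahl_speedup(0.05, 10 ** 6))      # ~20
# Growing workload: f falls like 1/(1 + n), so the bound 1/f grows with n.
print(effective_serial_fraction(10))      # ~0.09
print(effective_serial_fraction(10000))   # ~0.0001
```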
• asked a question related to Parallel Computing
Question
Is it possible, in some way, to parallelize a script in MATLAB that has dependent variables? Would it be possible to use parfor?
Dear Marcelle Vargas, you can execute for-loop iterations in parallel on workers by using the parfor instruction, provided the iterations are independent: https://fr.mathworks.com/help/parallel-computing/parfor.html
• asked a question related to Parallel Computing
Question
Hi,
I'm trying to run my Abaqus simulation using gpus. I've downloaded the Nvidia Cuda Toolbox, but my simulation still doesn't seem to run on the gpu.
If I run following command in Abaqus Command:
"abaqus job=Job-8 inp=Job-8.inp user=umat_1d_linear_viscoelastic.for cpus=2 gpus=1"
Abaqus starts the simulation but never finishes it, and the .lck file remains.
Does anybody know how I can enable Abaqus to use my GPU?
If so, find the abaqus_v6.env file on your C: drive and add the following at the end:
os.environ["ABA_ACCELERATOR_TYPE"]="PLATFORM_CUDA" # Nvidia
• asked a question related to Parallel Computing
Question
Hello everyone,
I'm running MM/PBSA using the g_mmpbsa tool for GROMACS. However, it runs extremely slowly, even though it is claimed that by default it uses all processors: https://rashmikumari.github.io/g_mmpbsa/How-to-Run.html
How can I check whether it really uses all of the processors? And how can I increase the number if possible/necessary?
Try writing a bash or Python script that incorporates the number of processors as well as the g_mmpbsa module.
HTH!
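A rough sketch of that scripting suggestion: split the trajectory into frame ranges and launch one run per range in parallel. The command line here is illustrative only (the "echo" prefix keeps the sketch harmless, and you should check the g_mmpbsa documentation for the real flags before adapting it):

```python
# Split [0, N_FRAMES) into near-equal chunks and process them concurrently,
# one external command per chunk.
import subprocess
from multiprocessing import Pool

N_FRAMES = 1000
N_PROCS = 4

def frame_ranges(n_frames, n_procs):
    """Split [0, n_frames) into n_procs near-equal [start, end) ranges."""
    step = n_frames // n_procs
    ranges = [(i * step, (i + 1) * step) for i in range(n_procs)]
    ranges[-1] = (ranges[-1][0], n_frames)  # last range absorbs the remainder
    return ranges

def run_chunk(rng):
    start, end = rng
    # Hypothetical invocation; replace "echo" and the flags with the real
    # g_mmpbsa command for your setup.
    cmd = ["echo", "g_mmpbsa", "-b", str(start), "-e", str(end)]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    with Pool(N_PROCS) as pool:
        pool.map(run_chunk, frame_ranges(N_FRAMES, N_PROCS))
```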
• asked a question related to Parallel Computing
Question
I want to launch a parallel simulation of the CQN model on OMNeT++ using multiple instances of OMNeT++ on different PCs. Does anyone have any ideas about this?
• asked a question related to Parallel Computing
Question
I am trying to run an SCF calculation on a small cluster of 3 CPUs with a total of 12 processors. But when I run my calculation with the command mpirun -np 12 '/home/ufscu/qe-6.5/bin/pw.x' -in scf.in > scf.out, it takes longer than it did on a single CPU with 4 processors. It would be really great if you could guide me on this, since I am new to this area. Thank you so much for your time.
It's not running on the other CPUs.
I think there is some mistake in the command.
• asked a question related to Parallel Computing
Question
Hi,
I want to set up a 4-node (each with 4 cores) Raspberry Pi cluster and install Quantum ESPRESSO on it for simulations. Does anyone know how well it performs compared with a Core i7 PC for simulating nanostructures?
Any tests on it?
Thanks
One Raspberry Pi 4B has 235 GFLOPS.
I think four of them would give about 0.9 TFLOPS?
• asked a question related to Parallel Computing
Question
In many research papers I read that dependency between iterations or tasks can be identified using some type of dependency test, such as Banerjee's test. Can anyone provide a real example, applied to a for loop, that shows how such a dependency test works?
Do you have an example, Dr. Salameh A. Mjlae?
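A runnable example of the simplest classical dependence test, the GCD test (a weaker cousin of Banerjee's test). For a loop of the form `for i: a[c1*i + k1] = ... a[c2*i + k2] ...`, a cross-iteration dependence requires an integer solution of c1*i1 + k1 == c2*i2 + k2, which is only possible if gcd(c1, c2) divides (k2 - k1):

```python
# GCD dependence test: returns True when a dependence between iterations is
# *possible* (the test cannot disprove it), False when iterations are
# provably independent.
from math import gcd

def gcd_test(c1, k1, c2, k2):
    g = gcd(c1, c2)
    return (k2 - k1) % g == 0

# for i: a[2*i] = a[2*i + 1] + ...  -> gcd(2,2)=2 does not divide 1:
print(gcd_test(2, 0, 2, 1))   # False -> iterations are independent
# for i: a[2*i] = a[2*i + 4] + ...  -> 2 divides 4: dependence possible
print(gcd_test(2, 0, 2, 4))   # True -> must assume a dependence
```

Banerjee's test sharpens this by also using the loop bounds to rule out solutions outside the iteration range; like the GCD test, it can only ever prove independence, never prove a dependence exists.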
• asked a question related to Parallel Computing
Question
Hi,
I want to use MPICH2 for parallel computing with MCNP. I tried it, but MPIEXEC runs MCNP multiple times instead of running it in parallel.
Can anyone help me?
My command is:
mpiexec -n 2 -noprompt mcnpx.exe i=Sp1w_05.TXT
Hi,
I use Windows 7 64-bit.
And I found in the output that MCNP is indeed running multiple separate times.
• asked a question related to Parallel Computing
Question
I installed AmberTools18 and the MPI libraries as well. Everything works, but when I try to run sander in parallel like this:
mpirun -np 32 sander.MPI -O -i min.in -o min.out -p ras-raf_solvated.prmtop -c ras-raf_solvated.inpcrd \
-r min.rst -ref ras-raf_solvated.inpcrd
it doesn't work. Then I realized that sander.MPI is not present in my AmberTools package. So how can I run it? Or, if I didn't install it, how can I install it?
Hi Baris,
In $AMBERHOME, try ./configure -mpi gnu, and then make install.
• asked a question related to Parallel Computing
Question
I wrote a shallow-water-equation code in MATLAB, but it solves sequentially and takes a lot of time, so I was wondering how I can transform my code into a parallel one so that it can utilize multiple cores.
HPE's "The Machine" follow-ons and the Gen-Z Consortium are working on a much larger shared-memory fabric system using load/store byte-addressable cache-line-granularity memory-channel semantics, but you won't be able to use virtual addressing across hundreds of computing nodes (no remote snooping of cache). Access to large sections of non-coherent memory would be via mmap() and base + offset addressing, with the overlap halo regions shared using fabric atomics or an in-memory byte-addressable mail system between the sub-grids. Any number of cores on compute nodes should be able to access the memory, modulo some choreography to prevent traffic jams on memory nodes. Progress on Gen-Z by the consortium and HPE is being made; memory-fabric error rates are being reduced by former SGI, now HPE, personnel: http://genzconsortium.org/wp-content/uploads/2019/05/Highly-Resilient-25G-Gen-Z-PHY.pdf
What does that get you? No more marshaling and buffering for network sharing across sub-grids. No more marshaling and buffering for those 10 minutes of checkpoint writes every hour to maintain a consistent grid fallback point at a timestep an hour ago. All of that compute time and expensive network and fabric hardware goes away. MPI or MPT and OpenMP will still work. Your stencil code and meshes will have to change to be resilient to memory-node and compute-node failure, but that was foreseeable anyway. MATLAB would have to change. But it should scale up and out.
• asked a question related to Parallel Computing
Question
I tried to detect faces in video with the help of machine learning.
Now I want to accelerate the running time of my algorithm via some abstraction layer (a high-level language). Is there any high-level language for OpenCL to do that? Does someone have experience with SYCL; is it possible to implement this with it? Many thanks in advance for your help.
Hi. Yes, there is: https://futhark-lang.org/
• asked a question related to Parallel Computing
Question
Special issue: Parallel computing and co-simulation in railway research, International Journal of Rail Transportation. Submissions are due on 15th April 2019.
• asked a question related to Parallel Computing
Question
Minuscule devices need to process huge amounts of data. Currently, data from such devices is offloaded to larger machines such as GPUs to be processed. However, this setup poses many issues, such as exposure of human tissue to radiation, as well as distance/security issues. Thus, such data needs to be processed at the source.
It may depend on the size you can afford to use. I would say that, yes, FPGAs and ASICs are some options. You can also use small IoT-oriented boards with multi-core ARM processors and GPUs (e.g. take a look at devices such as the Nvidia Jetson), and there are some new devices called MPPAs (Massively Parallel Processor Arrays), which are many-core processors (e.g. see Kalray processors). We managed to adapt our own CEP (Complex Event Processing) engine for one of these MPPAs, and it was interesting to parallelize some operations in OpenMP, but they're rather expensive.
• asked a question related to Parallel Computing
Question
I read some scientific papers, and most of them use data-dependency tests to analyse their code for parallel optimization. One such test is Banerjee's test. Are there other tests that give better results for testing data dependency?
And is it hard to do a test on control-dependent code? If we can, what are some of the techniques we can use? Thank you.
Hisham Alosaimi, sorry, I forgot to share the link for the slides. Here you go.
• asked a question related to Parallel Computing
Question
Hello everyone, I have a VUMAT code for a plasticity model, and I want to run a job for this model on a cluster with ABAQUS 6.12. Because my VUMAT code is time-consuming, I have to run my model on the cluster with 16 processors. I receive the following error when running the model, and the job is terminated:
***ERROR: Floating Point Exception detected in increment 86485. Job exiting. 86485 1.469E-01 1.469E-01 08:58:41 1.699E-06 436 NaN NaN 3.230E+04 INSTANCE WITH CRITICAL ELEMENT: BILLET-1 A special ODB Field Frame with all relevant variables was written at 1.469E-01 THE ANALYSIS HAS NOT BEEN COMPLETED
Many thanks, ALI
Hello, have you solved it? I have the same problem.
• asked a question related to Parallel Computing
Question
Hi, I need to solve an eigenvalue problem of the form [A]{x} = e [B]{x}, where A and B are sparse matrices. Since my matrices are quite huge, some 20,000 x 20,000, the only approach is to employ a parallel code. Can anyone suggest such a solver or point me in the right direction? Thanks in advance!
Hi Nishanth, I think you can solve your problem using PETSc (https://www.mcs.anl.gov/petsc/). This is a powerful scientific computation environment that includes SLEPc for working with eigenvalues. Best, Eduardo
• asked a question related to Parallel Computing
Question
Hello, I currently work on an FSI (fluid-structure interaction) problem using Ansys 18.1. After a lot of effort I succeeded at a working simulation, but it takes a long time to compute. With Ansys Fluent, a simple simulation with around 500k elements in CFD and 30k elements in Mechanical takes 67 h with only 10 time steps (75 iterations).
I want to improve the speed of the computation, but I cannot find any setting to do this. In Fluent I selected parallel computing with 4 cores. Even with GPGPU support, there is no significant improvement. During the project I see a maximum CPU utilisation of 20%. The CFD simulation with Fluent alone, without FSI, takes around 1 h. Does anyone have an idea for an FSI simulation? I use the pressure-based solver with the coupled setting and second-order implicit transient. The dynamic mesh is generated by smoothing (diffusion). On request, there is more information if needed. Kind regards,
Hi Karl, the first question here is: are you doing one-way or two-way FSI? That is, on each time step is data transferred from Fluent to Ansys Structural, or are you calculating only in Fluent and transferring your results to Ansys Structural (or vice versa)? Because of the computation time I assume you are doing two-way, but please make sure about it. I also recommend running Ansys Structural separately, as you did with Fluent (you didn't mention how many time steps you solve in Fluent alone; only 10?), by mimicking the load from Fluent, and see how much time it takes. If that is also not taking much time, then the issue is in transferring data between the two Ansys modules.
• asked a question related to Parallel Computing
Question
I want to run a stereo vision algorithm on two computers. What is the best way to write a parallel algorithm?
Thanks a lot, Mr. Hosein.
• asked a question related to Parallel Computing
Question
Is there a way I can access a Hadoop Map/Reduce cluster available to researchers for free to run some experiments?
You can access a free MapReduce framework in the HDP sandbox. You just need to download and install HDP and run it in VM Player.
• asked a question related to Parallel Computing
Question
Is there an A* implementation for Nvidia GPUs? Or at least a library for HPC on GPGPUs? Some real-life parallel algorithms implemented on them would be good too.
I think, if you are interested in GPU computing applications, the best place to find fully implemented CUDA codes is the CUDA samples from NVIDIA. These samples are available online and cover many diverse applications.
• asked a question related to Parallel Computing
Question
My research topic is resource management in cloud computing. My supervisor asked me to relate it to parallel computing. Can anyone tell me how I can do so?
• asked a question related to Parallel Computing
Question
We live in a world of computations, there is no doubt about it. Computations whose computational demand increases every day. What do you think is the future of solving large-scale numerical models? Some say OpenMP, others MPI, and the romantics say cloud computing. What is your opinion, and why?
PS: Any other methods or techniques used to solve in parallel are more than welcome to be presented.
Well, the most common model used today is MPI+X. MPI is used for distributed-memory communication and X is one of the shared-memory methods (OpenMP, OpenACC, CUDA or OpenCL). I would recommend a recent presentation by Bill Gropp on "MPI+X on The Way to Exascale".
• asked a question related to Parallel Computing
Question
I want to simulate a discrete random medium with the FDTD method. The simulation environment is air filled with random spherical particles (small relative to the wavelength) with a defined size distribution. What is an efficient and simple way to create random scatterers in large numbers in an FDTD code? I have put some random scatterers in a small area, but I have problems producing scatterers in large numbers.
Any tips, experiences and suggestions would be appreciated. Thanks in advance.
Parallelizing a code is not easy. You may want to get hold of a commercial tool like Lumerical or an open-source tool like Meep if you need parallelization. It is possible. You will need to "voxelize" the STL file in order to identify points on the grid that are inside or outside of your particles. Search the MathWorks website for anything called Voxelize or something like that. There is a little more information about this in Lecture 19 here:
• asked a question related to Parallel Computing
Question
Does anyone have experience deploying a Windows HPC private cluster, or executing an Abaqus job on two computers without a queuing server using MPI parallelisation? I have managed to set up the cluster and to join the head node running on Windows Server 2012 R2 and one compute node "W510" running on Windows 7 64-bit Ultimate. Unfortunately I keep getting the following error message: Failed to execute "hostname" on host "w510". Please verify that you can execute commands remotely on "w510" without a password. Command used: "C:\Program Files\Microsoft MPI\Bin\mpiexec.exe -wdir C:\Windows -hosts 1 w510 1 -env PATH C:\SIMULIA\Abaqus\6.13-4\code\bin;C:\Windows\system32 C:\SIMULIA\Abaqus\6.13-4\code\bin\dmpCT_rsh.exe hostname".
I also got a similar problem. Did you overcome it?
• asked a question related to Parallel Computing
Question
I have a project about atherosclerosis with a huge amount of computation. I don't know how I can use parallel computing in ADINA, or how to use a cluster or distributed computation. The information on the internet is confusing. Does anyone know an introductory book?
ADINA has options to run either as SMP or on a cluster as DMP, but there is separate code and licensing for DMP.
• asked a question related to Parallel Computing
Question
I read many books and searched the Internet, but I did not find any answer.
Relevant answer
Answer
Dear Nitin Mathur, I hope you have your answer by now from the previous answers. I will still try to explain why a conditional branching instruction takes 7 T-states if the condition is not met. Why isn't it 10 instead?
The explanation: first we need to keep two things in mind.
1. Each instruction has an opcode fetch cycle: in the first two clock pulses the memory is read for the opcode and it is placed on the data bus, and in the next clock pulses it is loaded into the instruction register and subsequently into the instruction decoder for interpreting the instruction.
2. The program counter always points to the next memory location from which an 8-bit value will be fetched. The processor always fetches the location the program counter points to, either as an opcode fetch or as a normal memory read depending on the previous fetch, and the program counter is automatically incremented by 1.
Now, in connection with the question: when the opcode for a conditional jump is being fetched, the PC points to the lower byte of the jump address. After the 4th clock pulse the processor automatically starts fetching the memory location pointed to by the PC (the lower byte) as a simple memory read, and the PC is auto-incremented by 1. But as soon as the instruction decoder completes the interpretation of the conditional jump and finds the condition to be false, the processor finds it unnecessary to fetch the higher byte of the address, which is currently pointed to by the program counter, and the PC is incremented by 1 again to point to the next instruction (i.e. an opcode instead of the higher byte). So on a false condition there are two machine cycles, i.e. opcode fetch + memory read (unnecessary, but the cycle has to finish). Hence it is 4 + 3 = 7 T-states.
I will be glad if this info helps. Thanks
• asked a question related to Parallel Computing
Question
Please share the steps.
It would be highly appreciated.
Relevant answer
Answer
Newer Ubuntu versions do not need Bumblebee for this purpose.
• asked a question related to Parallel Computing
Question
More explanation on implementation.
Relevant answer
Answer
I don't think so; this is a device with 1 GB of main memory. If you are thinking of exploring small things, maybe you can do something, but if you are thinking of doing some real work then it won't be enough. Efehan's advice is definitely a good one, although expensive :)
• asked a question related to Parallel Computing
Question
Removed question (account inactive).
Relevant answer
Answer
I'd highly recommend using a higher timestep for your goal of getting the most ns/day possible. I don't recall the details, but a colleague of mine is using a timestep of 6 fs, which is a 3x speedup compared to your setting of 2 fs. Have a look at "virtual sites" or "hydrogen mass repartitioning" in the context of GROMACS.
• asked a question related to Parallel Computing
Question
As I understand it, a YARN container specifies a set of resources (i.e. vCores, memory, etc.), and we may also allocate containers of the required size based on the type of task they execute. In the same sense, what does a map or reduce slot specify in MRv1? Does it specify a CPU core, a fixed amount of memory, or both? And is it possible to specify the size of map/reduce slots? Further, suppose we have two nodes with the same CPU power but different disk bandwidth, one with an HDD and the other with an SSD. Can we say that the power of a slot on the first node (HDD) is less than on the node with the SSD?
Relevant answer
Answer
Not in all cases: if you have a massive RAID storage array of SATA HDDs, you can write data once and avoid repeated rewrite and move operations, writing only the output of your program, so that disk space and bandwidth are not the limiting factors for the related experiments.
• asked a question related to Parallel Computing
Question
I want to find the total memory access time for a matrix program. The formula is
MAT = HitRate × CacheAccessTime + MissRate × RAMAccessTime
I have calculated the cache miss and hit rates with cachegrind, but I don't know how to find the cache access time and RAM access time. Is there any tool, like perf or cachegrind, which finds the cache and RAM access times on Ubuntu?
Relevant answer
Answer
As a practical matter, the measurements are intrinsic to the hardware. There are testing tools which boot into their own RTOS and test the system. In other words, this is a borrow, not build, situation. But if you insist on building, here is how I would approach it (apologies in advance for the too many words):
To Samir's point, doing this in an arbitrary live OS is considerably more tricky. You will need some method of entering a non-preemptable, privileged execution state, such as you might need during an interrupt service routine. The name used to describe this execution environment varies from OS to OS. In Linux they call it an "atomic context", for obvious reasons, but also because if you mess up, the kernel goes nuclear. For example, you cannot call any blocking functions, and using swapped-out virtual memory? Kaboom! There are several examples of writing ISRs for Linux; find one you like. Depending on the OS, you may be able to use an atomic context outside of an ISR, in which case you might be able to implement a kernel module exposing a procfs interface to drive the test code; that makes running the tests and getting the results easy. You would basically disable interrupts while your test code beats up on the RAM. You can still use the on-CPU high-resolution timer, which advances even during an interrupt blackout. Provided the amount of time you stall the kernel is not too long (single-digit seconds), you should be OK.
• asked a question related to Parallel Computing
Question
I have run the CacheBench benchmark. Each test performs repeated accesses to data items over varying vector lengths; timings are taken for each vector length over a number of iterations. Computing the product of iterations and vector length gives the total amount of data accessed in bytes, and dividing this total by the total time gives a bandwidth figure in megabytes per second (here a megabyte is defined as 1024^2 = 1048576 bytes). In addition to this figure, the average access time in nanoseconds per data item is computed and reported. But I only get the result in the following form, and I wonder why there is a sudden decrease in the time as the vector size increases.
Output:
C Size    Nanosec
256       11715.798191
336       11818.694309
424       11812.024314
512       11819.551479
...       ...
4194309   9133.330071
I need the results in the following form. How do I get this result? The output should look like this:
Read Cache Test
C Size    Nanosec   MB/sec    % Change
-------   -------   -------   -------
4096      7.396     515.753   1.000
6144      7.594     502.350   1.027
8192      7.731     493.442   1.018
12288     17.578    217.015   2.274
Relevant answer
Answer
1. Dear Sir, thank you so much for your response. Why is there a sudden decrease in the time with increasing vector length?
C Size    Nanosec
256       11715.798191
336       11818.694309
424       11812.024314
512       11819.551479
...       ...
4194309   9133.330071
I am confused: if we increase the vector size, then for writing or reading, the data will be coming from main memory, because it will no longer be present in the cache, and in that case the time should increase. So why is it decreasing when we increase the vector size? If I am wrong, please do correct me.
• asked a question related to Parallel Computing
Question
I was wondering whether anyone knows about an automated tool to collect GPU kernel features, i.e., stencil dimension, size, operations, etc. Such tools are widely available for CPU kernels.
Relevant answer
Answer
Well, I only work with CUDA GPUs for servers, so I'm not aware of how things work on embedded/mobile platforms, but basically you need a profiler. If you are dealing with smartphones, you could try the Snapdragon/Adreno profiler. Also, looking at their software I see that they provide an LLVM compiler; if there is an open-source version available, you can modify it to obtain the features you want.
• asked a question related to Parallel Computing
Question
I am using Ubuntu 64-bit 14.04 and 32-bit 8.04. I am running a source code in Fortran, and I get different results on the two machines. What can be the reasons? I tried the gfortran and Intel compilers: the two compilers give identical results on their respective machines, but the results differ between the two machines.
Relevant answer
Answer
Dear Colleagues,
It is known that when checking real numbers for equality, it is necessary to compare them with some tolerance. Please post a fragment of the source code.
BW, Alexander.
• asked a question related to Parallel Computing
Question
I am using OpenMP for parallel programming and I want to execute my C code (with OpenMP) in gem5, but I was unable to do so. I have tried to install m5threads, but there are lots of errors when installing it. Can anybody explain how to successfully install m5threads and how to execute my C code (with OpenMP) in gem5? Please help.
Relevant answer
Answer
OpenMP and pthreads are not supported in SE mode; you may use m5threads or FS mode. These are the steps for installing m5threads:
1. Go to the directory where you want m5threads installed.
2. Download m5threads.
3.
Compile libpthread:
make
(this command will do the compilation; as you have not stated the exact error you are getting, if you can upload a snapshot or details it will be easier to solve the problem)
Sometimes the make command gives errors because of architecture dependency. If you are using a 64-bit machine (the uname -a command shows whether it is 64- or 32-bit), you should edit the Makefile for an x86 target. Change:
# 64-bit compiles
#Uncomment to use sparc/alpha cross-compilers
#CC := sparc64-unknown-linux-gnu-gcc
#CPP := sparc64-unknown-linux-gnu-g++
#CC := alpha-unknown-linux-gnu-gcc
#CPP := alpha-unknown-linux-gnu-g++
CC := arm-linux-gnueabi-gcc
CPP := arm-linux-gnueabi-g++
#CC := gcc
#CPP := g++
...
CFLAGS := -g -O3 $(ARM_FLAGS)
To
# 64-bit compiles
#Uncomment to use sparc/alpha cross-compilers
#CC := sparc64-unknown-linux-gnu-gcc
#CPP := sparc64-unknown-linux-gnu-g++
#CC := alpha-unknown-linux-gnu-gcc
#CPP := alpha-unknown-linux-gnu-g++
#CC := arm-linux-gnueabi-gcc
#CPP := arm-linux-gnueabi-g++
CC := gcc
CPP := g++
4. Compile tests for x86
make
5. Run a test
cd test
./test_atomic 4
Hope this will work for you
• asked a question related to Parallel Computing
Question
Hi,
We have several isolated GPU units and plan to build a cluster and a queuing system. Does anyone have relevant experience? Can you tell me the prerequisites for the software and hardware, and how to complete the installation step by step?
Cheng
How you build your cluster depends on what you want to do with it.
Do you already have software? What are the requirements of the software?
If you intend to use CUDA or similar libraries with GPUs (as opposed to CPUs) you should refer to the recommendations in the Nvidia documentation. Not all systems are created equally.
Another aspect is that, yes, people are using the Raspberry Pi for clustering. **However**, what are you doing? Is it just an experiment to try clustering, or is it a production cluster for doing actual work?
Beowulf does not automatically use your GPU cores without the proper libraries; the same goes for the others. There are also differences between MPICH, OpenMP, and GPU programming that you should research before spending your money. Those are all different things that can be used to develop parallel programs, but you need to understand where to use each to make them effective.
From my experience it's better to start with a clear plan for your software implementation and your goals. Decide what libraries you are using, etc..
Boost libraries are very good, but like all libraries you will probably need to write some other code to fully support your GPUs.
• asked a question related to Parallel Computing
Question
If I have 3 nodes, each with 8 cores and hyper-threading enabled: we found that running the code on 3 nodes with 16 threads is faster than on 4 nodes with 16 threads, and that 5 nodes with 16 threads is faster than 3 or 4 nodes! Any explanation?
I'll try to restate your problem: you have up to 5 cluster nodes, probably interconnected through Gbit Ethernet, and an application running as a single process with 16 parallel threads. A possible alternative interpretation of your text is that your code has as many running processes as the number of nodes, each process with 16 threads.
Each of your nodes has a single 8-core x86-64 device or two 4-core x86-64 devices; if not too outdated, this 2nd alternative configures a NUMA memory organization, with impact on performance.
Your x86-64 device could be an i5, i7 or Xeon device that supports 2-way Simultaneous MultiThreading (SMT) at the hardware level (Intel calls this HyperThreading, HT). This support at the hardware level means that the core architecture has some duplicated units, namely the register bank (including the PC/IP), which allows the core to simultaneously run two independent pieces of code that may share the same memory address space (we call these threads), provided it has enough replicated arithmetic units. Note that current core architectures are superscalar, i.e., they can launch 2 to 6 parallel machine-code instructions at the same time, and this helped Intel design their multi-threaded architectures.
With so few cores, we can assume you are not using any of the Intel Many Integrated Core (MIC) devices, neither the earlier Xeon Phi Knights Corner co-processor (with up to 61 cores) nor the newer Xeon Phi Knights Landing (KNL) processor (up to 72 cores), which support up to 4-way multi-threading (HT in the new KNL).
If your code is a single process with 16 threads it could achieve the best performance running on a single node, with OpenMP or OpenMPI.
If you are running your code using 3, 4 or 5 nodes (with MPI?) you should have also specified your input data set, if the same for all cases, if you accounted the time to transfer code & data for the other nodes, if they all fit in local/shared caches or share external RAM, and if the application is compute-bound or memory-bound.
As you can see, there is no simple answer for your questions, since there are too many variables that may impact the overall performance of your code…
• asked a question related to Parallel Computing
Question
I have a problem with parallel computing. I can't open matlabpool. When I type matlabpool 2, I get the following error:
Error using matlabpool (line 144) Failed to open matlabpool. (For information in addition to the causing error, validate the profile 'local' in the Cluster Profile Manager.)
Caused by: Error using distcomp.interactiveclient/start (line 61) Failed to locate and destroy old interactive jobs.
This is caused by: The storage metadata file does not exist or is corrupt
In linux do:
rm -rf .matlab/local_cluster_jobs
• asked a question related to Parallel Computing
Question
Hi all,
I would like to establish parallel computing using the GPU of my M2000 Nvidia graphics card. I was wondering whether this works for all software, in my case bioinformatics software like PASTA (a Python platform), or whether it is limited to certain programming platforms or to particular software/libraries provided by the parallel computing APIs (e.g. CUDA)?
If it is limited, is there a way to achieve parallel computing without being restricted to certain software?
Thanks.
Writing code using CUDA already makes your program limited to NVIDIA GPU's, and hence not very portable.  (Exactly the desire of NVIDIA, of course).  Most of the GPU codes I've seen use the OpenACC set of directives (Fortran) / pragmas (C/C++) instead. These have the advantage of (relative) ease of use, and of portability to non-GPU systems, where the directives are just ignored.  The latest versions of OpenMP also have support for GPU computing.
• asked a question related to Parallel Computing
Question
Is it possible to load the operating system on a Jetson TK1/TX1 using SD cards? To the best of our knowledge, it is only possible using the internal 16 GB flash memory.
Yes, it is possible to load the operating system on a Jetson TK1/TX1 from an SD card, just like on any Linux computer. As on any Linux system, the recommended way to shut down or turn off the Jetson TK1 is to click Shutdown in the GUI or run the shutdown command in a terminal, to ensure the filesystem will not be corrupted.
• asked a question related to Parallel Computing
Question
We have already used Trimaran, but it seems to be hard to work with.
Is there any alternative to Trimaran compiler?
we need the following features:
Very Long Instruction Word Processors (VLIW)
Explicitly Parallel Instruction Computing (EPIC)
Instruction-Level Parallelism (ILP)
You are talking about two separate problems here: Obtaining a WCET estimate and minimizing the WCET. For the former - WCET analysis - there are basically two approaches: static WCET analysis and measurement-based WCET analysis. Static WCET analysis is required if you want to have safe and tight bounds, and usually required in certifiable, safety-critical (hard) real-time systems. A model for the platform (processor, memory subsystem, ..) is necessary, so there is no general solution for arbitrary processors, as the others have mentioned. You can find an overview of the WCET problem in [1].
But as you are asking about a compiler, I assume you have a platform at hand for which you are building a compiler. Here you need tight integration of the compiler and a WCET analysis tool. Things are difficult because slight changes in the generated code may have a huge impact on the WCET (memory layout changes etc).
There were efforts in research to produce a compiler that reduces the WCET. One prominent example is WCC [2], but it is a closed-source compiler and AFAIK interfaces aiT at intermediate level.
In an effort to create a time-predictable platform within an EU FP7 project (http://www.t-crest.org), we used LLVM. Our processor was a dual-issue VLIW processor. The sources are on GitHub [3]. We interfaced aiT and also have a basic WCET analysis tool in the package.
Also Trimaran was used recently [4]. The researchers employed WCET-oriented hyperblock formation, and built their own IPET-based timing analyzer for a simple architecture model.
I hope this helps you to get started!
[1] The worst-case execution-time problem—overview of methods and survey of tools
[4] A Compile-Time Optimization Method for WCET Reduction in Real-Time Embedded Systems through Block Formation
• asked a question related to Parallel Computing
Question
I know that you use mpicc to compile MPI programs, as > mpicc abc.cpp, and pgc++ for compiling OpenACC directives. Is there a way to compile MPI+OpenACC, i.e. to combine pgc++ and mpicc? Thanks
The following link provides step-by-step instructions for what you are trying to accomplish. However, I would recommend that you consider OpenMPI/MPICH + OpenCL in order to avoid vendor lock-in for either your GPU or your compiler. Things could have changed since I last considered OpenACC, but I believe the PGI acceleration directives only work for NVIDIA GPUs for now. The GPUs from AMD provide much higher double-precision performance and work well with OpenCL.
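One common way to combine the two toolchains is to point Open MPI's wrapper compiler at the OpenACC-capable compiler, so mpicc handles the MPI side while pgcc handles the directives. A sketch, assuming Open MPI built against the PGI/NVHPC compilers; the flags and file names are illustrative:

```shell
# Assumption: Open MPI is installed and pgcc (or nvc in newer NVHPC) is on PATH.
export OMPI_CC=pgcc        # make the mpicc wrapper invoke pgcc
export OMPI_CXX=pgc++      # same for mpic++

# -acc enables the OpenACC directives; -Minfo=accel reports what was offloaded.
mpicc -acc -Minfo=accel my_mpi_acc_code.c -o my_mpi_acc_code
mpirun -np 4 ./my_mpi_acc_code
```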
• asked a question related to Parallel Computing
Question
How can i create small cloud computing lab? ( I am now working in task scheduling and balancing in cloud computing, so i want know what are i need requirements ( no. of physical machines and types of operating system, how can i install hypervisor, types of hypervisor, how can i create virtual machines and send the task to vms , where can i add my task scheduling algorithm ) is there any site or video can clear that.
You can use one or more of following:
DevStack
RocksOS
Eucalyptus
• asked a question related to Parallel Computing
Question
I would like to emulate various n-bit binary floating-point formats, each with a specified e_max and e_min, with p bits of precision. I would like these formats to emulate subnormal numbers, faithful to the IEEE-754 standard.
Naturally, my search has led me to the MPFR library, being IEEE-754 compliant and able to support subnormals with the mpfr_subnormalize() function. However, I've run into some confusion using mpfr_set_emin() and mpfr_set_emax() to correctly set up a subnormal-enabled environment. I will use IEEE double precision as an example format, since this is the example used in the MPFR manual:
mpfr_set_default_prec (53);
mpfr_set_emin (-1073); mpfr_set_emax (1024);
The above code is from the MPFR manual linked above; note that neither e_max nor e_min is equal to the expected value for double. Here, p is set to 53, as expected for the double type, but e_max is set to 1024, rather than the correct value of 1023, and e_min is set to -1073, well below the correct value of -1022. I understand that setting the exponent bounds too tightly results in overflow/underflow in intermediate computations in MPFR, but I have found that setting e_min exactly is critical for ensuring correct subnormal numbers; too high or too low causes a subnormal MPFR result (updated with mpfr_subnormalize()) to differ from the corresponding double result.
My question is how should one decide which values to pass to mpfr_set_emax() and (especially) mpfr_set_emin(), in order to guarantee correct subnormal behaviour for a floating-point format with exponent bounds e_max and e_min? There doesn't seem to be any detailed documentation or discussion on the matter.
Here is a small program which demonstrates the choice of e_max and e_min for single-precision numbers.
#include <iostream>
#include <cmath>
#include <float.h>
#include <mpfr.h>
using namespace std;
int main (int argc, char *argv[]) {
cout.precision(120);
// Actual float emin and emax values don't work at all
//mpfr_set_emin (-126);
//mpfr_set_emax (127);
// Not quite
//mpfr_set_emin (-147);
//mpfr_set_emax (127);
// Not quite
//mpfr_set_emin (-149);
//mpfr_set_emax (127);
// These float emin and emax values work in subnormal range
mpfr_set_emin (-148);
mpfr_set_emax (127);
cout << "emin: " << mpfr_get_emin() << " emax: " << mpfr_get_emax() << endl;
float f = FLT_MIN;
for (int i = 0; i < 3; i++) f = nextafterf(f, INFINITY);
mpfr_t m;
mpfr_init2 (m, 24);
mpfr_set_flt (m, f, MPFR_RNDN);
for (int i = 0; i < 6; i++) {
f = nextafterf(f, 0);
mpfr_nextbelow(m);
cout << i << ": float: " << f << endl;
//cout << i << ": mpfr: " << mpfr_get_flt (m, MPFR_RNDN) << endl;
mpfr_subnormalize (m, 1, MPFR_RNDN);
cout << i << ": mpfr: " << mpfr_get_flt (m, MPFR_RNDN) << endl;
}
mpfr_clear (m);
return 0;
}
With thanks,
James
There are different conventions for expressing significands and the associated exponents. IEEE 754 chooses to consider significands between 1 and 2, while MPFR (like the C language, see DBL_MAX_EXP for instance) chooses to consider significands between 1/2 and 1 (for practical reasons related to multiple precision). For instance, the number 17 is represented as 1.0001·2^4 in IEEE 754 and as 0.10001·2^5 in MPFR. As you can see, this means that exponents are increased by 1 in MPFR compared to IEEE 754, hence emax = 1024 instead of 1023 for double precision.
Concerning the choice of emin for double precision, one needs to be able to represent 2^−1074 = 0.1·2^−1073, so that emin needs to be at most −1073 (as in MPFR, all numbers are normalized).
As documented, the mpfr_subnormalize function considers that the subnormal exponent range is from emin to emin+PREC(x)−1, so that for instance, you need to set emin = −1073 to emulate IEEE double precision.
• asked a question related to Parallel Computing
Question
Dear all
• Would anyone give me a brief and useful review of VASP and Quantum Espresso? I need to know about their performance, speed, parallel computing, free licensing and tools. Which one is more applicable in solid-state physics and computational nanoscience? Please rank them 0-100.
VASP is a commercial code and Quantum-Espresso (QE) is an open source code.
Both have their own issues:
1. VASP is faster than QE.
2. QE has a very good mailing support.
3. Being free (under GNU license), QE is accessible to everybody.
4. For calculations of vibrational properties, QE contains implementations of DFPT (Density Functional Perturbations Theory), whereas VASP uses another third party software.
5. There are well-tested pseudopotentials (PP) for VASP, whereas you have to choose a proper PP for QE, as there are many for the same element.
• asked a question related to Parallel Computing
Question
I am looking for works or researches where the CPU actually outperforms the GPU , or the integrated GPU (APU) .
GPU algorithms don't scale very well in two situations:
1. I/O bound cases, where it takes longer to send the data to the GPU and read it back to the CPU than it takes to do the computation in serial. For example, pairwise adding of two sets of numbers.
2. High branch/divergence cases, where some work units take much longer than others. For example, some types of tree traversals and loop constructs with nonlinear iteration counts. In these cases, the GPU may block waiting for the slowest job to finish. If there is one "long pole", this will determine the runtime, and will ruin scalability. The CPU is generally much faster at doing this type of serial work.
• asked a question related to Parallel Computing
Question
Can you suggest how to calculate the expected time to compute a task which requires multiple heterogeneous resources?
For example: if we consider only CPU requirement of a task, then the execution time of the task is task length in MI divided by CPU speed in terms of MIPS.
Can you suggest how to account for multiple resources like CPU, RAM, bandwidth, etc.?
Hi Mohammed Benalla,
We have a set of tasks, and those tasks have different resource requirements (CPU time, RAM, bandwidth, etc.). How do we calculate the expected execution time of a task?
• asked a question related to Parallel Computing
Question
Can I run a multi-threaded program and specify how many cores should be used: 1, 2, 4 etc. so as to enable a performance comparison?
Parallel processing
Multi-core systems
Hi Yigal sir,
You can map the threads to specific cores. You can Refer to https://software.intel.com/en-us/node/522691 for details. I have been looking for this answer too and this article helped me. I hope it works.
• asked a question related to Parallel Computing
Question
Using OpenMP I divided a simple algorithm into different threads, but the execution time is increasing drastically. This may be because all the threads run on the same CPU core. I am aware that if my CPU is dual-core or quad-core, then assigning more threads than CPU cores will not help much, but even with two threads the execution time increases.
Hi Avipriyo,
you can use the GOMP_CPU_AFFINITY variable to set which thread uses which CPU core. For example, if you have 16 cores and use only 4 threads, you can bind them to cores 0-3, or to 0, 4, 8, 12.
Try to analyze the cache-line utilization in parallel and sequential execution; a huge trap can hide there. Dividing an array of 100 elements into ranges 0-49 and 50-99 on two corresponding cores will be much faster than distributing it dynamically (odd indices to core 0, even indices to core 1).
The latter case generates more cache misses, especially if the cache line is small and the elements are 4 or 8 bytes.
• asked a question related to Parallel Computing
Question
I am using Hidden markov model model to predict the volatility in a financial time series. I want to speed up my computations with matlab parallel computing toolbox.
To change my code correctly I must know how MATLAB runs my code in parallel, i.e., what is MATLAB's parallelism methodology?
I'm familiar with Open MPI and its way of running code in parallel, but I don't know anything about MATLAB, and I couldn't find anything in its documentation!
Using Matlab for this is a terrible idea.  Yes, Matlab will take advantage of threads for basic matrix operations (if sufficiently large).  But it's just using Intel's MKL for that - you can use MKL directly.  Similarly with parfor and DCE - Matlab will expose capabilities provided by other standard/open-source packages like MPI, but will charge you for it.
These days, compilation is basically instantaneous, and performance will be better than Matlab if you write using a high-level interface like Boost.  Matlab is really only for the weak (who can't imagine leaving a GUI) or those who insist on the convenience of Matlab's library of toolboxes (which, again, are just repackaging techniques freely available elsewhere.)
• asked a question related to Parallel Computing
Question
Hi all,
There exist certainly quite a good number of static scheduling algorithms for applications that can be described using Directed Acyclic Graphs (DAGs).
If the goal is to speed up the execution of the application, what should one look at when choosing the scheduling algorithm. What criteria should  be considered in order to select one algorithm over the others?
I know the question may seem too general, but I would like to know if anyone has an experience on this subject, for example, if he/she has tested different algorithms on one problem and what were the learned lessons?
Any experience feedback/contribution is much appreciated.
Thanks,
Salahh Eddine SAIDI
Thank you for the article.
• asked a question related to Parallel Computing
Question
I am planning to work on a multi-GPU system based on CUDA, but I only have a single-GPU machine. Is there any way to simulate a GPU cluster to get results? Also, is there any way I can combine normal laptops with CUDA GPUs to create a cluster?
In order to run in a cluster, you have to check if the simulator has support for a distributed computing paradigm like MPI, and calculate the overhead according the amount of communication between the different threads.
You could use profiling tools like nvprof to gauge the performance and scalability of the problem.
However, if the problem is not suitable for a distributed paradigm, running it on a cluster will slow it down due to the communication overhead. For multi-GPU solutions it is more common to run on a single machine with several GPUs.
• asked a question related to Parallel Computing
Question
Hi guys,
I posted several days ago on the Intel MKL forum and have not yet got any information. I hope someone here can share some experience or perspective. Thank you all in advance!
I have been looking for an efficient linear solver (to be used in my Abaqus UMAT) for medium-sized (between 6000x6000 and 20000x20000), non-symmetric, sparse (density between 0.6 and 0.8) systems.
I have compared PARDISO, DGESV and Matlab using randomly generated 6000*6000 matrix on my Linux desktop with i7-3720QM CPU @ 2.60GHz and 8G RAM, ifort version 16.0.2.  Matlab backslash took 2.x seconds, DGESV took 5.x seconds while PARDISO took 12.x seconds. This does not seem to be reasonable to me as I am expecting PARDISO to be faster than DGESV.  Can someone share some experience/perspective on this or provide some solutions for which solver to choose for this type of problem (I am hoping to get somewhat close to MATLAB backslash if possible)?
The following are my PARDISO parameters. I also found that manually setting mkl_set_num_threads from 1 to 8 does not make any obvious change in the wall time. Is that reasonable, or have I done something wrong?
ALLOCATE( iparm ( 64 ) )
iparm=0

iparm(1) = 1 ! no solver default
iparm(2) = 3 ! fill-in reordering from METIS
iparm(4) = 0 ! no iterative-direct algorithm
iparm(5) = 0 ! no user fill-in reducing permutation
iparm(6) = 0 ! =0 solution on the first n compoments of x
iparm(8) = 1 ! numbers of iterative refinement steps
iparm(10) = 13 ! perturbe the pivot elements with 1E-13
iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS
iparm(13) = 1 ! maximum weighted matching algorithm switched on (switched off by default for symmetric matrices); try iparm(13) = 1 in case of inappropriate accuracy
iparm(14) = 0 ! Output: number of perturbed pivots
iparm(18) = -1 ! Output: number of nonzeros in the factor LU
iparm(19) = -1 ! Output: Mflops for LU factorization
iparm(20) = 0 ! Output: Numbers of CG Iterations
iparm(27) = 1 ! check the integer arrays ia and ja
error  = 0 ! initialize error flag
msglvl = 0 ! print statistical information
mtype  = 11 ! real and nonsymmetric
call mkl_set_dynamic(0)