Parallel Computing - Science method

Explore the latest questions and answers in Parallel Computing, and find Parallel Computing experts.
Questions related to Parallel Computing
  • asked a question related to Parallel Computing
Question
14 answers
Hello everyone,
I am facing an issue with Abaqus and parallel computing. I am using a VUHARD subroutine with a COMMON block and VEXTERNALDB, but I am getting different results with different numbers of cores. I start each analysis with the same microstructure and the same subroutine, changing only the number of cores. The results show triangular regions where calculations seem to be happening. For example, in the attached document, I start with an initial microstructure of 10,000 elements and run it with cpus=4, 8, and 12, and I get different results each time. Could someone please explain what could be going on, and how I can achieve a correct analysis of the full model?
Thanks,
Akanksha
Relevant answer
Answer
Hello Akanksha,
The problem comes from using common blocks while running a process on multiple cores. The VUHARD subroutine is accessed simultaneously from every core and potentially modifies the values inside the common block. If that is the case, then you are dealing with a thread-safety problem: every thread (core) accesses the memory addresses of the common block in an arbitrary order and potentially overwrites data.
The bad news is that your current implementation may only work on a single core.
I hope you find it useful!
Best regards,
Miguel
  • asked a question related to Parallel Computing
Question
6 answers
..
Relevant answer
Answer
Dear doctor
"Hadoop is a distributed file system that lets you store and handle massive amounts of data on a cloud of machines, handling data redundancy. The primary benefit of this is that since data is stored in several nodes, it is better to process it in a distributed manner."
"Hadoop is a distributed file system, which lets you store and handle massive amount of data on a cloud of machines, handling data redundancy. Go through this HDFS content to know how the distributed file system works. The primary benefit is that since data is stored in several nodes, it is better to process it in distributed manner. Each node can process the data stored on it instead of spending time in moving it over the network.
On the contrary, in Relational database computing system, you can query data in real-time, but it is not efficient to store data in tables, records and columns when the data is huge."
Dr. Sundus Fadhil Hantoosh
  • asked a question related to Parallel Computing
Question
7 answers
I want to simulate a discrete random medium with the FDTD method.
The simulation environment is air filled with randomly placed spherical particles (small relative to the wavelength) with a defined size distribution.
What is an efficient and simple way to create random scatterers in large numbers in an FDTD code?
I have placed some random scatterers in a small area, but I have problems producing scatterers in large numbers.
Any tips, experience and suggestions would be appreciated.
Thanks in advance.
Relevant answer
Answer
Are you asking how to make your particles a helix/spiral shape? First, before doing that, I recommend running simulations that just use the effective properties that your helices should give. That will let you explore device ideas without needing the complexity and inefficiency of having to resolve the spirals in your grid. Second, you will want to create a small simulation of a helix to retrieve the effective properties. This is usually called homogenization or parameter retrieval. Third, if you need to, move on to your more complicated 3D simulation.
That still leaves the question of how to build a helix in a 3D grid. One way is to create the helix in CAD software such as SolidWorks or Blender. You can export that model as an STL file, which is just a surface mesh that MATLAB (or whatever software you are using) can import. From there, there are codes available that can import those STL files and "voxelize" them into a 3D array. If you are using MATLAB, search the MathWorks website for "voxelize" and you will find multiple solutions. Once you have that, you can create copies of the spiral in your FDTD grid. This approach will let you import even more complicated shapes relatively easily.
Alternatively, you can create the spiral directly in your grid. I would do this by creating an array that holds a list of points along the centerline of your spiral. From there, you can decide which points of your FDTD grid should be assigned to the spiral by calculating their distance to the points on the centerline. If they are within a certain distance, assign the material properties of your spiral; otherwise, assign the material properties of whatever medium the spirals reside in.
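For reference, here is a minimal MATLAB sketch of the distance-to-centerline approach just described (grid size, helix dimensions and permittivity are illustrative assumptions, not values from this thread):
N = 100; dx = 1e-3;                       % assumed grid size and spacing
rhelix = 10e-3; pitch = 8e-3; rwire = 2e-3;
t = linspace(0, 6*pi, 600).';             % parameter along the helix
C = [rhelix*cos(t), rhelix*sin(t), pitch*t/(2*pi)];
C = C + (0.5*N*dx - mean(C));             % roughly center the helix in the grid
[X, Y, Z] = ndgrid((0:N-1)*dx);           % FDTD grid point coordinates
P = [X(:), Y(:), Z(:)];
ER = ones(N, N, N);                       % background relative permittivity
for i = 1:size(C, 1)
    d = sqrt(sum((P - C(i,:)).^2, 2));    % distance of every grid point to this centerline point
    ER(d <= rwire) = 9.0;                 % assign the spiral's material where close enough
end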
I am sure there are plenty of other ways to do this. If you are interested, I dedicated almost all of Chapter 1 to describing techniques for building geometries into arrays for simulation using the finite-difference method. I do not specifically talk about spirals, but you may find some of that chapter very helpful. Here is a link to the book website:
Hope this helps!!
  • asked a question related to Parallel Computing
Question
5 answers
Is it possible, in some way, to parallelize a script in MATLAB that has dependent variables? Would it be possible to use parfor?
Relevant answer
Answer
I think you mean that each loop iteration depends on the prior iteration, which cannot be done using the parfor feature, as parfor distributes the workload to different workers that do not communicate with each other.
For example:
parfor k = 2:10
    x(k) = x(k-1) + k;
end
will not work because x is indexed in two different ways within the same parfor loop. However, I know it can be done when communication between the workers is set up. I do not know how to do this, but I would be really interested if anyone does!
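As a hedged sketch of what communication between workers can look like (this shows the mechanism with a toy workload rather than fixing the recurrence above; labSend/labReceive are Parallel Computing Toolbox functions, renamed spmdSend/spmdReceive in recent MATLAB releases):
spmd
    local = labindex^2;                    % toy local result on each worker
    if labindex > 1
        carry = labReceive(labindex - 1);  % wait for the running total from the previous worker
    else
        carry = 0;
    end
    total = carry + local;                 % running total up to this worker
    if labindex < numlabs
        labSend(total, labindex + 1);      % pass it on to the next worker
    end
end
Note that the chain serializes the workers, which is exactly why a loop-carried recurrence gains little from this kind of parallelization.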
  • asked a question related to Parallel Computing
Question
1 answer
I am currently studying the application of simulated annealing techniques to optimization problems. In particular, I am interested in applying the optimization method to the study of Ising models. It is well known that simulated annealing is an intrinsically sequential method. Over the past 20 years, many parallel algorithms based on simulated annealing have emerged. At this time, I don't understand their pros and cons at all. I ask for a recommendation for a good and clear overview article or book chapter on parallel methods based on simulated annealing.
Relevant answer
Answer
The "Encyclopedia of Parallel Computing" can be searched. This 4 volume set covers many techniques and systems for parallel computing. It's published by Springer Reference. Here are the search results. Your library would need full ACM book access to read the answers I think.
  • asked a question related to Parallel Computing
Question
7 answers
I am a researcher who works with Abaqus on a parallel computing system (HPC).
I have very heavy files, and I just noticed that when I run them, Abaqus is not using the full capacity of the memory.
It is so slow that a file might take months to run!
Note: I have around 10 million elements in the mesh.
Is there anything I can do to make Abaqus actually use the full capacity of the memory?
Note: the parallel computation option is turned on and it actually uses several cores, but it is not using the full capacity of each core.
See the screenshot!
It is using only around 50 GB out of 376 GB.
How can I increase this value?
Any help is appreciated!
Best,
Yasmeen
Relevant answer
Answer
While running simulations in ABAQUS I have made a few observations; they might be of some use to you.
The memory setting can be assigned as mentioned by Seyed Sadjad Abedi-Shahri, and it acts as an upper threshold. However, memory is used to store results data during a step: if you have set the field/history output to be saved every increment, more memory will be used before writing to disk, and vice versa. If your analysis has very small time increments, it will use more memory.
Processor allocation is done through the parallelization settings. Based on the number of physical cores, you can assign the number of cores for the analysis. Parallelization works by breaking the problem into a number of processes and later patching the individual solutions together into the final solution, so it uses more resources. Only certain problem types let you take full advantage of parallelization.
I hope these two points are of some use. They are based on observation and probably lack the proper programming/computation jargon.
Good luck.
  • asked a question related to Parallel Computing
Question
1 answer
Hello,
I am trying to enable parallel computing in Abaqus using my AMD GPU with OpenCL, but it is not working.
I get this error in the log file:
Error: Call of undefined function
Compiler log: There is a call to an undefined label
Error: HSAIL program is not finalized successfully
Codegen phase failed compilation
BUILD LOG
*************************
There is a call to an undefined label
Error: HSAIL program is not finalized successfully
Codegen phase failed compilation
Error: BRIG finalization to ISA failed.
**************************
Error: Kernel compilation Failure
Warning: GPUAcceleration disabled
My GPU is an AMD Radeon PRO WX 3100.
Could someone help me?
Relevant answer
  • asked a question related to Parallel Computing
Question
4 answers
I am working on a research project in which we are doing a comparative analysis of reinforcement learning (RL) and evolutionary algorithms for solving a nonconvex, nondifferentiable optimization problem, with respect to solution quality and computation time.
We are using Python implementations, but one difficulty is that, although we can use GPUs to execute the reinforcement learning algorithm, there is not much support for using GPUs with evolutionary algorithms in Python.
On the other hand, if we want to compare the algorithms with respect to computation time, we have to execute them on the same hardware (parallel computing system).
However, we cannot run the RL algorithm on a CPU-based parallel system because of our resource constraints.
Can anyone tell us how to establish equivalent parallel computing systems, one based on CPUs and GPUs (for the RL algorithms) and the other based on CPUs only (for the evolutionary algorithms), so that we can compare their computation times fairly?
Thanks in advance,
Best Regards
Relevant answer
Answer
Peter, although you are right that predicting the optimal number of CPUs for running an algorithm is tough, I am currently building my knowledge base in parallel computing, algorithms and advanced computer architecture.
Regarding compute resources, I am currently running my machine learning and other optimization algorithms on Google Colab, and as you probably know, we only get one GPU or TPU with a dual-core CPU there.
I am a graduate student at UET, Lahore (which is a big organization), and as it is a government university in a third-world country, I am not able to get HPC resources here. By the way, amazing story at the University of Gondar; I will try my best to hunt down that kind of compute resource!
However, I will join LUMS as a research associate (for AI research) very soon, and there I will probably get the resources I want.
  • asked a question related to Parallel Computing
Question
9 answers
I am running various Abaqus simulations with a model size of about 1.5 million degrees of freedom in total. To speed up the calculations, I am trying to decide what number of CPUs would be optimal and what the influencing factors are (model size, steps, time steps, outputs, hardware, etc.). In particular, I'm interested in at what number of CPUs the writing and merging of partial results and data between the different cores outweighs the benefit of using multiple CPUs.
Relevant answer
Answer
For Abaqus/Standard, the benefit of additional cores tends to level off significantly after 16 cores, though more is always better for larger problems. Using the MP_HOST_SPLIT parameter and setting it to an integer that divides evenly into the number of cores or sockets helps immensely, more so than even GPU computing.
GPU computing can help with larger problems, but you need the number of GPUs installed in a machine to equal the MP_HOST_SPLIT parameter. If MP_HOST_SPLIT = 1 you'll get poor performance, and a single GPU can help a lot then. But with MP_HOST_SPLIT set to the proper number (say 2 or 4 for 16 cores) you'll get a good speedup, better than a single GPU with MP_HOST_SPLIT = 1. All the hype on GPU computing is totally overblown, believe me, I know. I have 3 GPUs installed and I still don't get much improvement, if at all. I spent many an hour with the Abaqus people on this. You do well with MP_HOST_SPLIT = 1 and a GPU, but just set MP_HOST_SPLIT to the proper number and you'll do better with no GPU. Did I mention they don't support or document this much? Silly.
Abaqus/Explicit is a different animal: more cores is always much better for larger problems.
Good luck,
Steve
  • asked a question related to Parallel Computing
Question
1 answer
I want to run an ODE system while varying a parameter, say \alpha. I have divided the job among workers, and for each parameter value \alpha I need to write a certain measure of the system to a data file. From googling and reading some documentation, I realized that having all workers write to a single file is just a route to a corrupt file, so that is not the way to do it; I need to create multiple files. My question is whether there is a way to create files with exact, predictable names (it must be in parallel programming) and then scan the data from all the files to create another data file that contains all the data.
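For reference, the usual pattern is one uniquely named file per parameter value, merged afterwards. A minimal sketch, assuming MATLAB's Parallel Computing Toolbox (run_measure is a hypothetical stand-in for solving the ODE system and extracting the measure):
alphas = linspace(0.1, 1.0, 10);           % assumed parameter sweep
parfor i = 1:numel(alphas)
    m = run_measure(alphas(i));            % hypothetical: solve the ODE system, return the measure
    writematrix([alphas(i), m], sprintf('measure_%03d.dat', i));   % unique file name per parameter
end
rows = cell(numel(alphas), 1);             % afterwards, scan all files into one combined data file
for i = 1:numel(alphas)
    rows{i} = readmatrix(sprintf('measure_%03d.dat', i));
end
writematrix(vertcat(rows{:}), 'measures_all.dat');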
  • asked a question related to Parallel Computing
Question
10 answers
Hello everyone,
I have a VUMAT code for a plasticity model, and I want to run a job for this model on a cluster with ABAQUS 6.12. Because my VUMAT code is time-consuming, I have to run the model on the cluster with 16 processors. I receive the following error when running the model, and the job is terminated:
***ERROR: Floating Point Exception detected in increment 86485. Job exiting.
86485 1.469E-01 1.469E-01 08:58:41 1.699E-06 436 NaN NaN 3.230E+04
INSTANCE WITH CRITICAL ELEMENT: BILLET-1
A special ODB Field Frame with all relevant variables was written at 1.469E-01
THE ANALYSIS HAS NOT BEEN COMPLETED
Many Thanks,
ALI
Relevant answer
Answer
Hello, Hongyan,
Yes, I have solved this problem. My problem was related to Fortran modules. I had defined several variables in a module so that they could be used globally. For example:
module B
parameter (L=10)
real*8,save :: A(L)
end module
This runs on ABAQUS 2016, but not on 2019. By debugging the subroutine on ABAQUS 2019, I found that when the module variable A(L) is used, its value is NaN (not a number); that is, no initial value had been given to A(L). I wrote the same statement in VS 2015 without giving the variable an initial value, printed it, and found that its value was 0.
So when ABAQUS reads A(L) it is NaN, which leads to the floating-point exception and makes the calculation impossible. I then gave variable A(L) an initial value of 0, and the program ran without errors. As follows:
module B
parameter (L=10)
real*8,save :: A(L) =0.0
end module
I don't know if you have the same problem. My suggestion is to build a one-element model, use WRITE statements to output each variable in the subroutine for debugging, check which variable has the problem (NaN), and then adjust the program. Of course, the whole process may be time-consuming.
I hope it will help you.
  • asked a question related to Parallel Computing
Question
4 answers
Hello everyone,
I've implemented an FEM code in MATLAB, and now I'd like to make it faster. Since I have to perform nonlinear dynamic analysis, I have a lot of iterations, each one requiring assembly of the full stiffness matrix K.
What I would like to do is parallelize the assembly of K. So far I have tried parpool and spmd, the latter with poor results, the former performing nicely (speedup factor of about 2, despite using 10 cores) but only below a certain number of elements. Above a certain threshold, the parallel computation (14 cores) takes as much as 10 times longer than the single-core version.
I understand this may be related to communication overheads between the client and the workers and/or slicing, but I can't seem to get the hang of it.
Does anyone have suggestions and/or can point me to some useful material on this specific matter?
Thank you all in advance,
Jacopo
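For reference, a common pattern for parallel assembly in MATLAB is to build per-element triplets inside parfor and call sparse once at the end, which keeps the data returned by each worker small. A minimal sketch (element_stiffness is a hypothetical stand-in for the element routine; sizes are assumptions):
nel = 20000; ndof = 30000;                 % assumed problem size
I = cell(nel, 1); J = cell(nel, 1); V = cell(nel, 1);
parfor e = 1:nel
    [ke, dofs] = element_stiffness(e);     % hypothetical: element matrix and its global dofs
    [jj, ii] = meshgrid(dofs, dofs);       % ii(i,j) = dofs(i), jj(i,j) = dofs(j)
    I{e} = ii(:); J{e} = jj(:); V{e} = ke(:);
end
K = sparse(vertcat(I{:}), vertcat(J{:}), vertcat(V{:}), ndof, ndof);   % single global assembly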
  • asked a question related to Parallel Computing
Question
4 answers
I have a problem. Measurements show the opposite of what convention assumes.
I test soil specimens. We try to decode how much deformation a certain loading (force) sequence will generate.
After 6 years of testing, I noticed that the reaction force behaves as a function of deformation, not the way convention tries to describe it.
It's a big problem. All software is designed to model deformation as a function of the force applied. But in reality, the stiffness hysteresis loops behave such that the force (reaction) is a function of deformation.
The observations, empirical evidence, pointed to an abandoned theory from 40 years ago (strain space plasticity, by P. J. Yoder). His theory seems to be not only compatible with the observed physical properties, but also GPU - parallel computation compatible.
So, we have something that is both:
1. Potentially more physically correct
2. For the first time - super computer compatible.
I am stuck building robots for industrial projects. Which are meant to provide "quick profit" to faculty. Research is not financed. All observations were made in spare time. At times - using life savings...
When experts are shown the test results, they become red in the face, angry, and say they "have not seen anything like it". After an hour of questions they cannot find any flaws in the testing machines. And... leave, never to be heard from again.
The theory of P. J. Yoder was defended in public defenses multiple times in the 80s. No one found flaws in it or proved it wrong. But it was forgotten, ignored and abandoned.
Industry asks for code compatible with existing software (return on investment)... and I alone cannot code a full software package. Frankly, I would rather keep testing and try to prove my assumptions wrong. But the more I test, the more anomalies and paradoxes are observed, exposed and resolved on the topic...
Is
Relevant answer
Answer
After a quick look at your research: do you know the sandpile model and Per Bak's self-organized criticality? He wrote a book about it.
It seems to be close to your own research. My first research project was on dynamic recrystallization. It happened that I was the first to explain the non-linearity of stress-strain curves theoretically with a very primitive program (see my PhD and the open software, which was rewritten in C++ and Qt by Jakub). The reaction of my colleagues was total refusal. Now it has become the standard.
  • asked a question related to Parallel Computing
Question
8 answers
How do I run NAMD 2.14 on cores from multiple nodes in Slurm? I need a batch file, please.
This is the one I am using:
#!/bin/sh
#SBATCH --job-name=grp_prod
#SBATCH --partition=cpu
#SBATCH --ntasks=48
#SBATCH --cpus-per-task=1
#SBATCH --time=30-50:20:00
mpirun -np 48 /lfs01/workdirs/cairo029u1/namd2_14/NAMD_2.14_Linux-x86_64-multicore/namd2 +auto-provision step4_equilibration.inp > eq.log
This batch file produces the attached log file (only the first part is attached).
Relevant answer
Answer
I'm not exactly sure of the answer, but have you considered reading the "Choosing CPU queue" section here: https://hpc.bibalex.org/quick-start#submission? Or you could send the HPC support team an email and ask them.
  • asked a question related to Parallel Computing
Question
2 answers
I am looking for a user-friendly neuro-evolution package with good parallelization. I wish to explore some ideas quickly (with a short learning curve), preferably in the R or Python programming language.
Relevant answer
Answer
  • asked a question related to Parallel Computing
Question
4 answers
Hello,
I am running a finite element (FE) simulation using the FE software PAM-STAMP. I have noticed that when I run the solver in parallel with 4 cores, the repeatability of the results is poor: when I press 'run' for the same model twice, I get inconsistent answers.
However, when I run the simulation on a single core, I get consistent answers across multiple runs of the same model.
This has led me to believe that the simulation is divided differently between the cores each time the solver is run.
Is there a way to still run the simulation in parallel (on multiple cores) but have the solver divide the calculation the same way each time, to ensure consistency across runs?
Thanks,
Hamid
Relevant answer
Answer
I had similar problems when simulating crash boxes in parallel. Although I am no expert in HPC, I think it is because round-off accumulates differently depending on how the work is split between cores, and the interactions between cores at the domain boundaries also play an important role. I guess that if you could force exactly the same domain assignments to exactly the same cores as in a previous run, you would get the same results. However, I doubt the PAM software is that flexible.
  • asked a question related to Parallel Computing
Question
6 answers
Parallel computing in MATLAB is not supported for the fminsearch (Nelder-Mead simplex search) method. We would be grateful if you could share your experience of running this method in parallel in order to speed up the optimisation of expensive problems.
Relevant answer
Answer
Read these links:
Parallel Computing with MATLAB by: Tim Mathieu
Use Parallel Computing for Response Optimization
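Beyond those links, one practical workaround (a sketch, not taken from the linked material): since Nelder-Mead itself is sequential, run independent fminsearch instances from several starting points inside parfor and keep the best result. The objective below is an illustrative stand-in:
f = @(x) (x(1) - 1)^2 + 100*(x(2) - x(1)^2)^2;   % stand-in for the expensive objective
nStarts = 8;
X0 = randn(nStarts, 2);                           % random starting points
xs = cell(nStarts, 1); fs = zeros(nStarts, 1);
parfor s = 1:nStarts
    [xs{s}, fs(s)] = fminsearch(f, X0(s, :));     % each worker runs its own Nelder-Mead search
end
[~, k] = min(fs);
xBest = xs{k};                                    % best of the independent runs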
  • asked a question related to Parallel Computing
Question
7 answers
Hello,
I'm trying to run my Abaqus simulation using GPUs. I have a PC with an AMD Ryzen 5 2400G with Radeon RX Vega 11 graphics.
Calling the GPU from CAE and from the command window doesn't work. I have found a possible solution using CUDA, which I have not tried yet since it refers to NVIDIA hardware. Other posts suggest using OpenCL, but I cannot find how to download it.
Any ideas would be helpful!
Relevant answer
Answer
You should use OpenCL for an AMD graphics card. I'm not sure AMD GPUs are supported for Abaqus parallel computing.
  • asked a question related to Parallel Computing
Question
8 answers
In the literature I find a lot of papers that discuss the problems of parallel computing, but I haven't found anything relevant on how to determine the optimal number of compute nodes that will give the maximum speedup.
Thanks for your help
#HPC #parallel_computing #scientific_simulation
#graph_partitioning
Relevant answer
Answer
In the field of high-order finite element methods, we have a good idea of how to achieve 80% parallel efficiency on HPC systems. The rule of thumb we follow is 2,000 points per MPI rank for CPU systems (e.g. Blue Gene/Q, Sequoia, etc.). For GPU systems (e.g. Summit, Sierra, etc.) the number is 2 million points (since latency is very expensive). That is based on many years of development. You can read this paper, which has many benchmarks and ideas you can follow to get an answer to your question
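For reference, the "80% parallel efficiency" above uses the standard definitions (textbook definitions, not specific to the cited paper): speedup S(p) = T(1)/T(p) and parallel efficiency E(p) = S(p)/p, so the rule of thumb amounts to adding processors only while E(p) >= 0.8.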
  • asked a question related to Parallel Computing
Question
1 answer
I am currently interested in solving large linear systems on a distributed parallel computing grid using the ScaLAPACK library. I wonder if there is a quadruple-precision version (or any other library that provides quadruple precision for parallel distributed computing)?
Relevant answer
Answer
Hi
Interesting question. A guess: if you want double-double (quad) precision, then the compiler has to support the data type, and you need the appropriate processor support for your AMD or i386 hardware. Both seem to be available, and you will need the relevant flags when compiling the library.
  • asked a question related to Parallel Computing
Question
1 answer
Hi, I am trying to compile Quantum ESPRESSO for an AMD Ryzen 7 4800H CPU, which is in my personal laptop. Since Intel MKL isn't well suited or optimised for AMD processors, I was wondering if there are other alternatives. For instance, is OpenBLAS a good option? Or should I just go ahead with the internal libraries that already ship with QE? Also, has anyone tested AOCL 2.2 with QE? I read recently on the QE website that AOCL 2.1 gave some weird results. Finally, which MPI should I choose for parallel computing? Is OpenMPI fine, or is there a better alternative anyone would like to suggest? Thank you
Relevant answer
Answer
I have no idea about the software, nor about how you are using your device. If you are going to run on shared memory, then OpenMP should be preferred over MPI. While OpenMPI is easier to use, my preference is MPICH2. Furthermore, the BLAS should not be OpenBLAS, which I found to clash with AMD drivers; compile the BLAS yourself using GNU compilers in combination with the MPICH wrappers.
  • asked a question related to Parallel Computing
Question
20 answers
I am using Fluent 14.5 with a parallel computing system. I am using a UDF for property calculation as well as for nucleation and growth in the subdomain/cell zone.
After solving one iteration, it gives errors like "primitive error", and the cortex error 'Ansys.Fluent.Cortex.CortexNotAvailableException' is thrown.
Can anyone please tell me why this is happening?
Relevant answer
Answer
Hi everyone,
I faced the same problem too. It only occurs if you are using sheet bodies (2D bodies) in a 3D geometry. (If you are working in 2D, you should change the analysis type to 2D.)
I solved the problem as follows:
1) Before starting, right-click Geometry > Properties and select Analysis Type 2D
2) Save after meshing
Then you can go to Setup.
Could you please indicate if it worked or not?
Thanks,
Emin
  • asked a question related to Parallel Computing
Question
14 answers
Hi, I am looking for some good online (preferably free) resources where I can learn parallel computing from scratch using the Python or R programming languages. Thanks.
Relevant answer
Answer
For Python, the packages `xarray` and `dask` can parallelize lots of data-crunching tasks, either on your own computer or on an HPC system.
A good starting point is this (long) video:
or the `dask` tutorial on GitHub: https://github.com/dask/dask-tutorial
  • asked a question related to Parallel Computing
Question
7 answers
I want to apply parallel computing techniques to decrease the calibration time. Using the code below, I can evaluate 10 sets of 11 parameters simultaneously using a PARFOR loop, but I cannot get the cost function value for each parameter set during the calibration process with the GA. Your helpful comments and suggestions are highly appreciated.
% GA parameters
MaxIt = 20;               % maximum number of iterations
nPop = 10;                % population size
NoVar = 11;               % number of parameters to be calibrated
delete(gcp('nocreate'))   % shut down any previous parallel pool
parpool('local')          % activate parallel computing
parfor i = 1:nPop
    pop(i).Position = unifrnd(VarMin, VarMax, VarSize);
    pop(i).Cost = CostFunction(pop(i).Position);
end
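One possible cause, offered as a sketch under the assumption that this is the issue: parfor cannot classify pop(i).Position and pop(i).Cost as sliced variables because they index into struct fields, so a common workaround is to collect the results in plain sliced arrays first (VarMin, VarMax, VarSize and CostFunction as in the code above):
positions = cell(nPop, 1);
costs = zeros(nPop, 1);
parfor i = 1:nPop
    positions{i} = unifrnd(VarMin, VarMax, VarSize);
    costs(i) = CostFunction(positions{i});    % every worker returns its cost value
end
for i = 1:nPop                                % copy back into the GA's struct array
    pop(i).Position = positions{i};
    pop(i).Cost = costs(i);
end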
  • asked a question related to Parallel Computing
Question
2 answers
Hello everyone,
I'm running MM-PBSA using the g_mmpbsa tool for GROMACS. However, it runs extremely slowly, even though it is claimed to use all the processors by default: https://rashmikumari.github.io/g_mmpbsa/How-to-Run.html
How can I be sure it is really using all of the processors? And how can I increase the number it uses, if that is possible/necessary?
Thanks in advance.
Relevant answer
Answer
Hi Lalehan,
Yes, I think you can. I noticed you use GROMACS (which I have no experience with); I use Amber. They may have something in common.
Here is my example using Amber with MPI support. I hope it gives you some clues.
mpirun -np 8 MMPBSA.py.MPI -O -i mmpbsa.in -sp wet.complex.prmtop -cp gas.complex.prmtop -rp gas.receptor.prmtop -lp gas.ligand.prmtop -y md.mdcrd -o FEB.dat
Good luck
  • asked a question related to Parallel Computing
Question
6 answers
Hi,
I am trying to do a Quantum ESPRESSO SCF calculation on an Intel Xeon Gold 5120 CPU @ 2.20 GHz (2 processors). It has 56 cores and 96 GB of RAM.
I am trying to do a parallel calculation on this workstation using:
mpirun -np (no of cores) pw.x -npool (no of pools) -inp si.pw.in
According to internet sources, I have tried to improve the performance by setting OMP_NUM_THREADS=1 and I_MPI_PIN_DOMAIN=1.
Can anyone please guide me on how to choose the optimum number of cores and the number of pools for the calculation?
The input file is attached below.
The FFT grid dimensions are (48 48 48) and the maximum number of k-points is 40,000.
Subsidiary Questions:
1. Should the subspace diagonalization in the iterative solution of the eigenvalue problem be run with a serial algorithm or with an ELPA distributed-memory algorithm?
2. Should the maximum dynamical memory per process be high or low for better performance?
Relevant answer
Answer
There are different ways to get the best parallelization performance from QE. Here are a few tips:
1. If your calculation has a large number of k-points, -npool can strongly speed up the process. It must be a divisor of the total number of k-points. I personally set it to 4.
2. In combination with -npool, you can further improve performance using the band-group option. It allows the program to split the KS states across the selected group of processors.
For example, assuming you have 32 cores, you can try something like this:
mpirun -np 32 pw.x -npool 4 -bgrp 4 -input pw.in
3. You can also parallelize the matrix diagonalization routines using the -ndiag option. Choose it such that ndiag <= (number of cores / npool)^2.
For example:
mpirun -np 32 pw.x -npool 4 -ndiag 36 -input pw.in
4. Finally, you can use the task-group option -ntg. But since you don't have many cores, I think this is not relevant.
  • asked a question related to Parallel Computing
Question
1 answer
I have built a DEM model in Abaqus in which three cubes fall onto the floor. The particles in two of the cubes (named Solid-1 and Solid-2 in the model) have a Beam-MPC constraint applied, and the third cube (named Particle-3 in the model) has no constraints among its particles. The three cubes fall onto the rigid floor under gravity.
The purpose of the example is to test whether a DEM model with particle clusters and loose particles can be run in parallel.
However, while the example runs on 1 CPU, it fails with parallel computing. Attached are the .inp file and a snapshot of the error message. I tried to use *DOMAIN DECOMPOSITION to decompose the Particle-3 domain, but it failed.
I will really appreciate it if anyone can help me fix this problem.
Relevant answer
Answer
Hi Alexander,
I have tried it many times and asked some experts for help. It seems that a DEM simulation with MPC constraints cannot run on multiple CPUs. For DEM problems, the particles are not allocated to CPUs in order or evenly, which means some particles in a cubic cluster may be allocated to CPU A while other particles in the same cluster are allocated to CPU B. Hence, due to this allocation strategy, Abaqus cannot decompose the model and pops up the error.
If you are interested in modelling a particle cluster, maybe you should try another constraint type. But I don't know which kind of constraint works with parallel computing.
Zehua
  • asked a question related to Parallel Computing
Question
4 answers
We are undergraduate students and would like to work on simulating tsunamis. Could you suggest study material/ebooks/lectures to learn SPH basics from? 
Relevant answer
Answer
In addition to all the interesting posts in this thread, I would like to add that
for 2D/3D simulation, the free code accompanying the book by G. Liu & M. Liu, Smoothed Particle Hydrodynamics, is useful. For example, 3D water drop formation can be reproduced fairly well with this book's methodology and code.
… 
  • asked a question related to Parallel Computing
Question
5 answers
Relevant answer
Answer
Amdahl's law claims a constant 1/f upper limit on speedup, even if infinitely many processors are available, where f is an assumed inherently sequential fraction of the workload. A paper entitled "The Refutation of Amdahl’s Law and Its Variants" (full text available on ResearchGate) shows that inherently sequential fractions of workloads do not exist in theory and that parts of workloads possibly remaining sequential in practice are irrelevant to speedup if the growth rate of the time requirement of the part executed in parallel is higher than that of the sequential part.
Polynomial speedup with a polynomial number of processors exists for a huge number of problems in the NC class of problems. Polynomial speedup is possible even for some problems not known in NC. We say that a problem has polynomial speedup if there exists an eps > 0 such that the speedup is at least n^eps, where n is the problem size.
Graphics Processing Units (GPUs) were introduced for the solution of the hidden-surface problem. The best possible sequential algorithm for the hidden-surface problem takes a*n² time, while a parallel algorithm takes b*log n time using n²/log n processors, where n is the number of polygon edges in the input and a and b are positive constants. Therefore, the hidden-surface problem has polynomial speedup (a*n²)/(b*log n) = Θ(n²/log n).
The successful application of GPUs for general-purpose computations demonstrates that Amdahl's law is also broken in practice.
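For reference, the bound under discussion is the textbook form of Amdahl's law: with sequential fraction f and p processors, the speedup is S(p) = 1/(f + (1 - f)/p), which tends to the constant 1/f as p grows; this is the limit the cited paper argues against.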
  • asked a question related to Parallel Computing
Question
3 answers
Hi,
I'm trying to run my Abaqus simulation using GPUs. I've downloaded the NVIDIA CUDA toolkit, but my simulation still doesn't seem to run on the GPU.
If I run following command in Abaqus Command:
"abaqus job=Job-8 inp=Job-8.inp user=umat_1d_linear_viscoelastic.for cpus=2 gpus=1"
Abaqus starts the simulation but never finishes it, and the .lck file remains.
Does anybody know how I can enable Abaqus to use my GPU?
Relevant answer
Answer
Have you downloaded CUDA?
If so, find the abaqus_v6.env file on your C: drive and add the following at the end of the file:
os.environ["ABA_ACCELERATOR_TYPE"]="PLATFORM_CUDA" # Nvidia
  • asked a question related to Parallel Computing
Question
2 answers
I want to launch a parallel simulation of the CQN model in OMNeT++ using multiple instances of OMNeT++ on different PCs. Does anyone have any ideas about this?
Relevant answer
  • asked a question related to Parallel Computing
Question
2 answers
I am trying to run an SCF calculation on a small cluster of 3 CPUs with a total of 12 processors. But when I run my calculation with the command mpirun -np 12 '/home/ufscu/qe-6.5/bin/pw.x' -in scf.in > scf.out, it takes longer than it did on a single CPU with 4 processors. It would be really great if you could guide me on this, since I am new to this area. Thank you so much for your time.
Relevant answer
Answer
It's probably not running on the other CPUs.
I think there is some mistake in the command.
Thank you @Masab Ahmad
  • asked a question related to Parallel Computing
Question
8 answers
Hi,
I want to set up a 4-node Raspberry Pi cluster (each node with 4 cores) and install Quantum ESPRESSO on it for simulations. Does anyone know how well it works compared with a Core i7 PC for simulating nanostructures?
Are there any tests of this?
thanks
Relevant answer
Answer
One Raspberry Pi 4B has 235 GFLOPS, so I think 4 of them would give about 0.9 TFLOPS?
  • asked a question related to Parallel Computing
Question
2 answers
In many research papers, I read that dependences between iterations or tasks can be identified using dependency tests such as Banerjee's test. Can anyone please provide a real example, applied to a for loop, that shows how the dependency test works?
Relevant answer
Answer
Do you have an example, Dr. Salameh A. Mjlae?
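As a worked illustration of the kind asked for (a hypothetical loop, standard GCD-test reasoning): consider
for i = 1:N
    A(2*i) = A(2*i - 1) + 1;
end
Each iteration writes A(2*i) and reads A(2*i - 1), so a cross-iteration dependence would need integers i1 and i2 with 2*i1 = 2*i2 - 1. The GCD test says an integer solution can exist only if gcd(2, 2) = 2 divides the constant term -1; it does not, so the test proves the iterations independent and the loop can safely run in parallel.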
  • asked a question related to Parallel Computing
Question
3 answers
Hi dear,
I want to use MPICH2 for parallel computing with MCNP. I tried it, but MPIEXEC runs multiple independent copies instead of running in parallel.
Can anyone help me?
my command is here:
mpiexec -n 2 -noprompt mcnpx.exe i=Sp1w_05.TXT
Relevant answer
Answer
Hi dear,
I use Windows 7 64-bit.
And I found in the output that multiple copies of MCNP are running.
  • asked a question related to Parallel Computing
Question
12 answers
I installed AmberTools18 and I installed the MPI libraries as well. Everything is working, but when I want to run sander in parallel like this:
mpirun -np 32 sander.MPI -O -i min.in -o min.out -p ras-raf_solvated.prmtop -c ras-raf_solvated.inpcrd \
-r min.rst -ref ras-raf_solvated.inpcrd
it didn't work. Then I realized that sander.MPI doesn't seem to be in my AmberTools package. So how can I run it? Or, if I didn't install it, how can I install it?
Relevant answer
Answer
Hi Baris,
In $AMBERHOME, try ./configure -mpi gnu, and make install.
  • asked a question related to Parallel Computing
Question
12 answers
I wrote a shallow water equation code in MATLAB, but it solves sequentially and takes a lot of time, so I was wondering how I can transform my code into a parallel one so that it can utilize multiple cores.
Relevant answer
Answer
HPE's "The Machine" follow-ons and the Gen-Z Consortium are working on a much larger shared memory fabric system using load/store byte-addressable cache-line granularity memory channel semantics, but you won't be able to use virtual addressing across 100s of computing nodes (no remote snooping of cache). The access to large sections of no-coherent memory would be mmap() and the addressing base + offset, and the overlap halo regions shared using fabric atomics or sharing by an in memory byte-addressable mail system between the sub-grids. Any number of cores on compute nodes should be able to access the memory, modulo some choreography to prevent traffic jams on memory nodes. Progress on Gen-Z by the consortium and HPE is being made ...
Memory fabric error rates are being reduced by former SGI/now HPE personnel: http://genzconsortium.org/wp-content/uploads/2019/05/Highly-Resilient-25G-Gen-Z-PHY.pdf
What does that get you? No more marshalling and buffering for network sharing across sub-grids. No more marshalling and buffering for those 10 minutes of checkpoint writes every hour to maintain a consistent grid fallback point at a timestep an hour ago. All of that compute time and the expensive network and fabric hardware go away. MPI or MPT and OpenMP will still work. Your stencil code and meshes will have to change to be resilient to memory-node and compute-node failure, but that was foreseeable anyway. MATLAB would have to change. But it should scale up and out.
  • asked a question related to Parallel Computing
Question
5 answers
I tried to detect faces in video with the help of machine learning. Now I want to accelerate the running time of my algorithm via some abstraction layer (a high-level language).
Is there any high-level language on top of OpenCL for doing that?
Does someone have experience with SYCL? Would it be possible to implement this with it?
Many thanks in advance for your help
Me
Relevant answer
Answer
Hi. Yes there is https://futhark-lang.org/
  • asked a question related to Parallel Computing
Question
2 answers
Special issue
Parallel computing and co-simulation in railway research
International Journal of Rail Transportation
Submissions are due on 15th April 2019.
  • asked a question related to Parallel Computing
Question
4 answers
Minuscule devices need to process huge amounts of data. Currently, data from such devices is offloaded to larger machines such as GPUs to be processed. However, this setup poses many issues such as exposure of human tissue to radiation, as well as distance/security issues. Thus, such data needs to be processed at the source.
Relevant answer
Answer
It may depend on the size you can afford to use. I would say that, yes, FPGAs and ASICs are options. You can also use small IoT-oriented boards with multi-core ARM processors and GPUs (e.g. devices such as the NVIDIA Jetson), and there are some newer devices called MPPAs (Massively Parallel Processor Arrays), which are many-core processors (e.g. see the Kalray processors).
We managed to adapt our own CEP (Complex Event Processing) engine to one of these MPPAs, and it was interesting to parallelize some operations with OpenMP, but these devices are rather expensive.
  • asked a question related to Parallel Computing
Question
3 answers
I have read some scientific papers, and most of them use data dependency tests to analyse code for parallel optimization purposes; one such test is Banerjee's test. Are there other tests that give better results for testing data dependences? Also, is it hard to test code with control dependences, and if it can be done, what are some of the techniques we can use?
Thank you
Relevant answer
Answer
Hisham Alosaimi, sorry, I forgot to share the link to the slides. Here you go
  • asked a question related to Parallel Computing
Question
5 answers
Hi,
I need to solve an eigenvalue problem of the form [A]{x} = e [B]{x}, where A and B are sparse matrices. Since my matrices are quite large, around 20,000 x 20,000, the only practical approach is a parallel code. Can anyone suggest such a solver or point me in the right direction?
Thanks in advance !
Relevant answer
Answer
Hi Nishanth,
I think you can solve your problem using PETSc (https://www.mcs.anl.gov/petsc/). This is a powerful scientific computing environment, and SLEPc, which is built on top of it, handles eigenvalue problems.
Best,
Eduardo
  • asked a question related to Parallel Computing
Question
5 answers
Hello,
I currently work on an FSI (fluid-structure interaction) problem using Ansys 18.1. After a lot of effort I succeeded in getting a working simulation, but it takes a long time to compute. With Ansys Fluent, a simple simulation with around 500k elements in CFD and 30k elements in Mechanical takes 67 h for only 10 time steps (75 iterations).
I want to improve the speed of the computation, but I cannot find any setting for this. In Fluent I selected parallel computing with 4 cores. Even with GPGPU support, there is no significant improvement. During the run, I see a maximum CPU utilization of 20%. The CFD simulation in Fluent alone, without FSI, takes around 1 h.
Does anyone have an idea for an FSI simulation? I use the pressure-based solver with the coupled scheme and a second-order implicit transient formulation. The dynamic mesh is generated by smoothing (diffusion).
More information is available on request.
Kind regards,
Relevant answer
Answer
Hi Karl,
The first question here is: are you doing one-way or two-way FSI? That is, is data transferred from Fluent to Ansys Structural (or vice versa) at each time step, or do you only compute in Fluent and transfer the results to Ansys Structural afterwards? Given the computation time, I assume you are doing two-way FSI, but please make sure.
I also recommend running Ansys Structural separately, as you did with Fluent (you didn't mention how many time steps you solved in Fluent alone; only 10?), mimicking the load from Fluent, and seeing how much time that takes. If that is also fast, then the issue is in the data transfer between the two Ansys modules.
  • asked a question related to Parallel Computing
Question
5 answers
I want to run a stereo vision algorithm on two computers. What is the best way to write a parallel algorithm for this?
Relevant answer
Answer
thanks a lot Mr. Hosein
  • asked a question related to Parallel Computing
Question
3 answers
Is there a way I can access a Hadoop Map/Reduce cluster that is available to researchers for free, to run some experiments?
Relevant answer
Answer
You can access a free MapReduce framework in the HDP sandbox. You just need to download and install HDP and run it in VM Player.
  • asked a question related to Parallel Computing
Question
4 answers
AoA. My research topic is resource management in cloud computing. My supervisor asked me to relate it to parallel computing. Can anyone tell me how I can do so?
  • asked a question related to Parallel Computing
Question
23 answers
We live in a world of computations, there is no doubt about it, and their computational demands increase every day. What do you think is the future of solving large-scale numerical models? Some say OpenMP, others MPI, and the romantics say cloud computing. What is your opinion, and why?
PS: Any other parallel solution methods or techniques are more than welcome.
Relevant answer
Answer
Well, the most common model used today is MPI+X. MPI is used for distributed memory communication and X is one of the shared-memory methods (OpenMP, OpenACC, CUDA or OpenCL).
I would recommend a recent presentation by Bill Gropp on "MPI+X on The Way to Exascale"
  • asked a question related to Parallel Computing
Question
4 answers
Does anyone have experience deploying a Windows HPC private cluster, or executing an Abaqus job on two computers without a queuing server using MPI parallelisation?
I have managed to set up the cluster and to join the head node, running on Windows Server 2012 R2, with one compute node, "W510", running on Windows 7 64-bit Ultimate.
Unfortunately I keep getting the following error message:
Failed to execute "hostname" on host "w510". Please verify that you can execute commands remotely on "w510" without a password. Command used: "C:\Program Files\Microsoft MPI\Bin\mpiexec.exe -wdir C:\Windows -hosts 1 w510 1 -env PATH C:\SIMULIA\Abaqus\6.13-4\code\bin;C:\Windows\system32 C:\SIMULIA\Abaqus\6.13-4\code\bin\dmpCT_rsh.exe hostname". 
Relevant answer
Answer
I also ran into a similar problem. Did you manage to overcome it?
  • asked a question related to Parallel Computing
Question
3 answers
I have a project about atherosclerosis with a huge amount of computation. I don't know how to use parallel computing in ADINA, or how to use a cluster or distributed computation. The information on the internet is confusing. Does anyone know an introductory book?
Relevant answer
Answer
ADINA has options to run either as SMP or on a cluster as DMP, but there is separate code and licensing for DMP.
  • asked a question related to Parallel Computing
Question
6 answers
I have read many books and searched on the Internet, but I did not find any answer.
Relevant answer
Answer
Dear Nitin Mathur,
I hope you have your answer by now from the previous responses.
I am still trying to explain why a conditional branch instruction takes 7 T-states if the condition is not met. Why isn't it 10 instead?
The explanation:
First of all, we need to keep two things in mind:
1. Each instruction has an opcode fetch cycle: in the first clock pulses the memory is read and the opcode is placed on the data bus, and in the following clock pulses it is loaded into the instruction register and passed to the instruction decoder for interpretation.
2. The program counter always points to the next memory location from which an 8-bit value will be fetched. The processor always fetches from the address in the program counter, either as an opcode fetch or as a normal memory read, and the program counter is automatically incremented by 1.
Now, connecting this to the question:
While the opcode of a conditional jump is being fetched, the PC points to the lower byte of the jump address. After the 4th clock pulse the processor automatically fetches the memory location pointed to by the PC (the lower byte) as a simple memory read, and the PC is auto-incremented by 1. But as soon as the instruction decoder finishes interpreting the conditional jump and finds the condition false, the processor finds it unnecessary to fetch the higher byte of the address (which the PC now points to), and the PC is incremented by 1 again to point at the next instruction (i.e. an opcode instead of the higher byte). So on a false condition there are two machine cycles: opcode fetch + memory read (unnecessary, but the cycle has to finish).
Hence it is 4 + 3 = 7 T-states.
I will be glad if this info helps.
Thanks
  • asked a question related to Parallel Computing
Question
4 answers
Please share the steps.
It would be highly appreciated.
Relevant answer
Answer
Newer Ubuntu versions do not need bumblebee for this purpose.
  • asked a question related to Parallel Computing
Question
4 answers
more explanation on implementation
Relevant answer
Answer
I don't think so; this is a device with 1 GB of main memory. If you are thinking of exploring small things, maybe you can do something, but if you are thinking of doing some real work then it won't be enough. Efehan's advice is definitely a good one, although expensive :)
  • asked a question related to Parallel Computing
Question
2 answers
Removed Question as account is inactive?
Relevant answer
Answer
I'd highly recommend using a larger timestep for your goal of getting the most ns/day possible. I don't recall the details, but a colleague of mine is using a timestep of 6 fs, which is a 3x speedup compared to your setting of 2 fs.
Have a look at "virtual sites" or "hydrogen mass repartitioning" in the context of GROMACS.
  • asked a question related to Parallel Computing
Question
2 answers
As I understand it, a YARN container specifies a set of resources (vCores, memory, etc.), and we can allocate containers sized for the type of task they execute. In the same sense, what does a map or reduce slot specify in MRv1? Does it specify a CPU core, a small amount of memory, or both? And is it possible to specify the size of map/reduce slots?
Further, suppose we have two nodes with the same CPU power but different disk bandwidth, one with an HDD and the other with an SSD. Can we say that a slot on the first node (HDD) is less powerful than a slot on the second (SSD)?
Relevant answer
Answer
Not in all cases. If you have a massive RAIDx4 array of SATA HDDs, you can write data once and avoid frequent rewrite and move operations, so you can simply write the output of your program without running out of disk space for the related research.
  • asked a question related to Parallel Computing
Question
3 answers
I want to find the total memory access time (MAT) for a matrix program. The formula is
MAT = HitRate × CacheAccessTime + MissRate × RAMAccessTime
I have calculated the cache miss and hit rates with cachegrind, but I don't know how to find the cache access time and the RAM access time. Is there a tool like perf or cachegrind that can measure the cache and RAM access times on Ubuntu?
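As a worked example with assumed (illustrative) timings: with a 95% hit rate, a 2 ns cache access time and a 60 ns RAM access time, MAT = 0.95 × 2 + 0.05 × 60 = 1.9 + 3.0 = 4.9 ns.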
Relevant answer
Answer
As a practical matter, these measurements are intrinsic to the hardware. There are testing tools which boot into their own RTOS and test the system. In other words, this is a borrow, not build, situation. But if you insist on building, here is how I would approach it (apologies in advance for the many words):
To Samir's point, doing this in an arbitrary live OS is considerably more tricky.
You will need some method of entering a non-preemptible, privileged execution state, such as during an interrupt service routine. The name for this execution environment varies from OS to OS.
In Linux this is called an "atomic context", for obvious reasons, but also because if you mess up, the kernel goes nuclear. For example, you cannot call any blocking functions, and if you touch swapped-out virtual memory? Kaboom!
There are several examples of writing ISRs for Linux; find one you like. You may, depending on the OS, be able to use an atomic context outside of an ISR, in which case you might be able to implement a kernel module exposing a procfs interface to drive the test code. That makes running the tests and getting the results easy.
You would basically disable interrupts while your test code beats up on the RAM. You can still use the on-CPU high-resolution timer, which advances during an interrupt blackout. Provided the amount of time you stall the kernel is not too long (single-digit seconds), you should be OK.
  • asked a question related to Parallel Computing
Question
5 answers
I have run the CacheBench benchmark. Each test performs repeated accesses to data items over varying vector lengths. Timings are taken for each vector length over a number of iterations. Computing the product of iterations and vector length gives the total amount of data accessed in bytes. This total is then divided by the total time to compute a bandwidth figure in megabytes per second, where a megabyte is defined as 1024² = 1,048,576 bytes. In addition to this figure, the average access time in nanoseconds per data item is computed and reported.
But I only got results in the following form: why is there a sudden decrease in the time as the vector size increases?
Output:
C Size      Nanosec
256         11715.798191
336         11818.694309
424         11812.024314
512         11819.551479
...         ...
4194309     9133.330071
I need the results in the following form. How do I get this result?
The output should look like this:
Read Cache Test
C Size    Nanosec    MB/sec     % Change
-------   -------    -------    --------
4096      7.396      515.753    1.000
6144      7.594      502.350    1.027
8192      7.731      493.442    1.018
12288     17.578     217.015    2.274
Relevant answer
Answer
Dear Sir, thank you so much for your response. Why is there a sudden decrease in the time as the vector length increases?
C Size      Nanosec
256         11715.798191
336         11818.694309
424         11812.024314
512         11819.551479
...         ...
4194309     9133.330071
I am confused: if we increase the vector size, the reads and writes should come from main memory, because the data will no longer fit in the cache, and in that case the time should increase. So why does it decrease when we increase the vector size?
Dear Sir, if I am wrong, please correct me.
  • asked a question related to Parallel Computing
Question
3 answers
I was wondering whether anyone knows of an automated tool to collect GPU kernel features, i.e., stencil dimension, size, operations, etc. Such tools are widely available for CPU kernels.
Relevant answer
Answer
Well, I only work with CUDA GPUs for servers, so I'm not aware of how things work on embedded/mobile platforms.
But basically you need a profiler. If you are dealing with smartphones, you could try the Snapdragon/Adreno profiler.
Also, looking at their software, I see that they provide an LLVM compiler. If an open-source version is available, you can modify it to extract the features you want.
  • asked a question related to Parallel Computing
Question
7 answers
I am using Ubuntu 64-bit 14.04 and 32-bit 8.04. I am running a source code in Fortran and I get different results on the two machines. What can be the reasons?
I tried the gfortran and Intel compilers. The two compilers give identical results on their respective machines, but the results differ when compared across the two machines.
Relevant answer
Answer
Dear Colleagues,
It is known that when checking real numbers for equality, it is necessary to compare them with some tolerance. Please post a fragment of the source code.
BW
Alexander.
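For instance, a standard relative-tolerance check (a generic formulation, not taken from the code in question) treats a and b as equal when |a - b| <= eps * max(|a|, |b|) for a small eps such as 1e-12 in double precision.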
  • asked a question related to Parallel Computing
Question
2 answers
I am using OpenMP for parallel programming, and I want to execute my C code (with OpenMP) in gem5, but I was unable to do so. I tried to install m5threads, but there were lots of errors during installation. Can anybody help me with how to successfully install m5threads and how to execute my C code (with OpenMP) in gem5?
Please Help
Relevant answer
Answer
OpenMP and pthreads are not supported in SE mode;
you can use m5threads or FS mode.
These are the steps for installing m5threads:
1. Go to the directory where you want m5threads installed.
2. Download m5threads.
3. Compile libpthread:
make (this command will do the compilation)
(As you haven't described the exact error you're getting, uploading a snapshot or details would make the problem easier to solve.)
Sometimes the make command gives an error because of an architecture dependency. If you're using a 64-bit machine (uname -a will show whether it is 64- or 32-bit), you should make changes in the Makefile.
Edit the Makefile for the x86 target:
# 64-bit compiles
#Uncomment to use sparc/alpha cross-compilers
#CC := sparc64-unknown-linux-gnu-gcc
#CPP := sparc64-unknown-linux-gnu-g++
#CC := alpha-unknown-linux-gnu-gcc
#CPP := alpha-unknown-linux-gnu-g++
CC := arm-linux-gnueabi-gcc
CPP := arm-linux-gnueabi-g++
#CC := gcc
#CPP := g++
...
CFLAGS := -g -O3 $(ARM_FLAGS)
Change the above to:
# 64-bit compiles
#Uncomment to use sparc/alpha cross-compilers
#CC := sparc64-unknown-linux-gnu-gcc
#CPP := sparc64-unknown-linux-gnu-g++
#CC := alpha-unknown-linux-gnu-gcc
#CPP := alpha-unknown-linux-gnu-g++
#CC := arm-linux-gnueabi-gcc
#CPP := arm-linux-gnueabi-g++
CC := gcc
CPP := g++
4. Compile tests for x86
make
5. Run a test
cd test
./test_atomic 4
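Once the tests build, a typical SE-mode run looks like the line below (a hedged sketch: script paths and option names vary across gem5 versions; the binary path follows the steps above):
build/X86/gem5.opt configs/example/se.py -n 4 -c test/test_atomic -o 4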
Hope this will work for you
  • asked a question related to Parallel Computing
Question
6 answers
Hello, 
I am using GROMACS 5 and want to choose how many processors to use in a simulation. For choosing GPUs, the mdrun documentation page makes this easy, but I have not found how to do it for CPUs.
Regards
Relevant answer
Answer
Hi Quezada,
Hope this command helps:
mpirun -np X (where X is the number of processes)
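In GROMACS 5 the thread count can also be set on mdrun itself (a hedged note based on the mdrun options; topol is a placeholder run name):
gmx mdrun -nt 8 -deffnm topol (8 threads in total)
gmx mdrun -ntmpi 2 -ntomp 4 -deffnm topol (2 thread-MPI ranks x 4 OpenMP threads each)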
  • asked a question related to Parallel Computing
Question
14 answers
Hi, 
We have several isolated GPU units and plan to build a cluster with a queuing system. Does anyone have relevant experience? Can you tell me the hardware and software prerequisites, and how to complete the installation step by step?
Thanks for your time!
Cheng
Relevant answer
Answer
How you build your cluster depends on what you want to do with it.
Do you already have software? What are the requirements of the software?
If you intend to use CUDA or similar libraries with GPUs (as opposed to CPUs) you should refer to the recommendations in the Nvidia documentation. Not all systems are created equally.
Another aspect: yes, people are using the Raspberry Pi for clustering. **However**, what are you doing? Is it just an experiment to try clustering, or is it a production cluster for doing actual work?
Beowulf does not automatically use your GPU cores without the proper libraries, and the same goes for the others. There are also differences between MPICH, OpenMP, and GPU programming that you should research before spending your money. Those are all different things that can be used to develop parallel programs, but you need to understand where to use each of them to make them effective.
From my experience it's better to start with a clear plan for your software implementation and your goals. Decide what libraries you are using, etc..
Boost libraries are very good, but like all libraries you will probably need to write some other code to fully support your GPUs.
  • asked a question related to Parallel Computing
Question
8 answers
Suppose I have several nodes, each with 8 cores, and the nodes support hyper-threading. We found that running the code on 3 nodes with 16 threads is faster than on 4 nodes with 16 threads, yet 5 nodes with 16 threads is faster than both 3 and 4 nodes! Any explanation?
Relevant answer
Answer
Your problem is not clearly stated, so you received several confusing answers…
I'll try to restate it: you have up to 5 cluster nodes, probably interconnected through Gbit Ethernet, and an application running as a single process with 16 parallel threads. A possible alternative reading of your text is that your code runs one process per node, each process with 16 threads.
Each of your nodes has either a single 8-core x86-64 device or two 4-core x86-64 devices; if not too outdated, the second alternative gives a NUMA memory organization, with an impact on performance.
Your x86-64 device could be an i5, i7, or Xeon device that supports 2-way Simultaneous MultiThreading (SMT) at the hardware level (Intel calls this HyperThreading, HT). This support at the hardware level means that the core architecture has some duplicated units, namely the register bank (including the PC/IP), which allows the core to simultaneously run two independent pieces of code that may share the same memory address space (we call these threads), provided it has enough replicated arithmetic units. Note that current core architectures are superscalar, i.e., they can launch 2 to 6 parallel machine code instructions at the same time, and this helped Intel to design their multi-threaded architectures.
With so few cores, we can assume you are not using any Intel Many Integrated Core (MIC) device, neither the earlier Xeon Phi Knights Corner co-processor (with up to 61 cores) nor the newer Xeon Phi Knights Landing (KNL) processor (up to 72 cores), which support up to 4-way multi-threading (HT in the new KNL).
If your code is a single process with 16 threads, it could achieve the best performance running on a single node, with OpenMP or OpenMPI.
If you are running your code on 3, 4, or 5 nodes (with MPI?), you should also have specified the input data set (is it the same in all cases?), whether you accounted for the time to transfer code and data to the other nodes, whether the data fits in local/shared caches or spills to external RAM, and whether the application is compute-bound or memory-bound.
As you can see, there is no simple answer for your questions, since there are too many variables that may impact the overall performance of your code…
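Still, one practical first step when investigating such anomalies is to pin the threads, so that SMT and NUMA placement do not vary between runs. A minimal sketch using the standard OpenMP 4.0 affinity variables (values are illustrative):
export OMP_NUM_THREADS=16
export OMP_PLACES=cores
export OMP_PROC_BIND=close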
  • asked a question related to Parallel Computing
Question
2 answers
I have a problem with parallel computing. I can't open matlabpool. When I type matlabpool 2, I get the following error:
Error using matlabpool (line 144) Failed to open matlabpool. (For information in addition to the causing error, validate the profile 'local' in the Cluster Profile Manager.)
Caused by: Error using distcomp.interactiveclient/start (line 61) Failed to locate and destroy old interactive jobs.
This is caused by: The storage metadata file does not exist or is corrupt
Relevant answer
Answer
In Linux, delete the stale local-cluster job metadata:
rm -rf .matlab/local_cluster_jobs
  • asked a question related to Parallel Computing
Question
11 answers
Hi all,
I would like to set up parallel computing using the GPU of my NVIDIA M2000 graphics card. Does this work for all software (in my case, bioinformatics software such as PASTA, a Python platform), or is it limited to certain programming platforms or to particular software/libraries built on parallel computing APIs (e.g. CUDA)?
If it is limited, is there a way to achieve parallel computing without being restricted to certain software?
Thanks.
Relevant answer
Answer
Writing code in CUDA already limits your program to NVIDIA GPUs, and hence makes it not very portable (exactly what NVIDIA wants, of course). Most of the GPU codes I've seen use the OpenACC set of directives (Fortran) / pragmas (C/C++) instead. These have the advantage of (relative) ease of use, and of portability to non-GPU systems, where the directives are simply ignored. The latest versions of OpenMP also support GPU computing.
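As a minimal illustration of the directive style (a sketch, not taken from any particular code; the function and names are placeholders):

// saxpy-style loop with a single OpenACC pragma; on a non-GPU build the
// pragma is ignored and the loop runs serially on the CPU
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}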
  • asked a question related to Parallel Computing
Question
4 answers
Is it possible to load the operating system on a Jetson TK1/TX1 from an SD card? To the best of our knowledge, it is only possible using the internal 16 GB flash memory.
Relevant answer
Answer
Hi Jose,
It should be possible to boot from the SD card. However please note that even if you have setup everything correctly you may experience boot problems with some SD cards (unfortunately bootloaders are not as sophisticated and plug-n-play as we would all want :) ).
You can check for additional details the thread here:
Best regards,
 Stan
  • asked a question related to Parallel Computing
Question
3 answers
We have already used Trimaran, but it seems to be hard to work with.
Is there any alternative to Trimaran compiler?
we need the following features:
Very Long Instruction Word Processors (VLIW)
Explicitly Parallel Instruction Computing (EPIC)
Instruction-Level Parallelism (ILP)
Relevant answer
Answer
You are talking about two separate problems here: Obtaining a WCET estimate and minimizing the WCET. For the former - WCET analysis - there are basically two approaches: static WCET analysis and measurement-based WCET analysis. Static WCET analysis is required if you want to have safe and tight bounds, and usually required in certifiable, safety-critical (hard) real-time systems. A model for the platform (processor, memory subsystem, ..) is necessary, so there is no general solution for arbitrary processors, as the others have mentioned. You can find an overview of the WCET problem in [1].
But as you are asking about a compiler, I assume you have a platform at hand for which you are building a compiler. Here you need tight integration of the compiler and a WCET analysis tool. Things are difficult because slight changes in the generated code may have a huge impact on the WCET (memory layout changes etc).
There have been research efforts to produce a compiler that reduces the WCET. One prominent example is WCC [2], but it is a closed-source compiler and, AFAIK, it interfaces with aiT at the intermediate level.
In an effort to create a time-predictable platform within an EU FP7 project (http://www.t-crest.org), we used LLVM. Our processor was a dual-issue VLIW processor. The sources are on GitHub [3]. We interfaced aiT and also have a basic WCET analysis tool in the package.
Also Trimaran was used recently [4]. The researchers employed WCET-oriented hyperblock formation, and built their own IPET-based timing analyzer for a simple architecture model.
I hope this helps you to get started!
[1] The worst-case execution-time problem—overview of methods and survey of tools
[4] A Compile-Time Optimization Method for WCET Reduction in Real-Time Embedded Systems through Block Formation
  • asked a question related to Parallel Computing
Question
3 answers
I know that you use mpicc to compile MPI programs, e.g. > mpicc abc.cpp, and pgc++ to compile OpenACC directives. Is there a way to compile MPI+OpenACC, i.e. combine pgc++ and mpicc? Thanks
Relevant answer
Answer
The following link provides step-by-step instructions for what you are trying to accomplish. However, I would recommend that you consider OpenMPI/MPICH + OpenCL in order to avoid vendor lock-in for either your GPU or your compiler. Things may have changed since I last considered OpenACC, but I believe that the PGI acceleration directives only work for NVIDIA GPUs for now. The GPUs from AMD provide much higher double-precision performance and work well with OpenCL.
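That said, if you stay with OpenACC, a common pattern is to point the MPI wrapper compiler at the PGI compiler (a hedged sketch: OMPI_CXX is Open MPI's override for the underlying C++ compiler, -acc enables OpenACC and -Minfo=accel reports what was offloaded; my_app.cpp is a placeholder):
export OMPI_CXX=pgc++
mpicxx -acc -Minfo=accel my_app.cpp -o my_app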
  • asked a question related to Parallel Computing
Question
7 answers
How can I create a small cloud computing lab? I am working on task scheduling and load balancing in cloud computing, so I want to know the requirements: how many physical machines and which operating systems, how to install a hypervisor and which types exist, how to create virtual machines and send tasks to the VMs, and where to plug in my task scheduling algorithm. Is there any site or video that covers this?
Relevant answer
Answer
You can use one or more of the following:
DevStack
RocksOS
Eucalyptus
  • asked a question related to Parallel Computing
Question
1 answer
I would like to emulate various n-bit binary floating-point formats, each with a specified e_max and e_min, with p bits of precision. I would like these formats to emulate subnormal numbers, faithful to the IEEE-754 standard.
Naturally, my search has led me to the MPFR library, which is IEEE-754 compliant and able to support subnormals with the mpfr_subnormalize() function. However, I've run into some confusion using mpfr_set_emin() and mpfr_set_emax() to correctly set up a subnormal-enabled environment. I will use IEEE double precision as an example format, since this is the example used in the MPFR manual:
mpfr_set_default_prec (53);
mpfr_set_emin (-1073); mpfr_set_emax (1024);
The above code is from the MPFR manual linked above. Note that neither e_max nor e_min equals the expected value for double. Here, p is set to 53, as expected for the double type, but e_max is set to 1024 rather than the expected 1023, and e_min is set to -1073, well below the expected -1022. I understand that setting the exponent bounds too tightly results in overflow/underflow in intermediate MPFR computations, but I have found that setting e_min exactly is critical for ensuring correct subnormal numbers; too high or too low causes a subnormal MPFR result (updated with mpfr_subnormalize()) to differ from the corresponding double result.
My question is how should one decide which values to pass to mpfr_set_emax() and (especially) mpfr_set_emin(), in order to guarantee correct subnormal behaviour for a floating-point format with exponent bounds e_max and e_min? There doesn't seem to be any detailed documentation or discussion on the matter.
Here is a small program which demonstrates the choice of e_max and e_min for single-precision numbers.
#include <iostream>
#include <cmath>
#include <float.h>
#include <mpfr.h>
using namespace std;
int main (int argc, char *argv[]) {
    cout.precision(120);
    // Actual float emin and emax values don't work at all
    //mpfr_set_emin (-126);
    //mpfr_set_emax (127);
    // Not quite
    //mpfr_set_emin (-147);
    //mpfr_set_emax (127);
    // Not quite
    //mpfr_set_emin (-149);
    //mpfr_set_emax (127);
    // These float emin and emax values work in the subnormal range
    mpfr_set_emin (-148);
    mpfr_set_emax (127);
    cout << "emin: " << mpfr_get_emin() << " emax: " << mpfr_get_emax() << endl;
    float f = FLT_MIN;
    for (int i = 0; i < 3; i++) f = nextafterf(f, INFINITY);
    mpfr_t m;
    mpfr_init2 (m, 24);
    mpfr_set_flt (m, f, MPFR_RNDN);
    for (int i = 0; i < 6; i++) {
        f = nextafterf(f, 0);
        mpfr_nextbelow(m);
        cout << i << ": float: " << f << endl;
        //cout << i << ": mpfr: " << mpfr_get_flt (m, MPFR_RNDN) << endl;
        mpfr_subnormalize (m, 1, MPFR_RNDN);
        cout << i << ": mpfr: " << mpfr_get_flt (m, MPFR_RNDN) << endl;
    }
    mpfr_clear (m);
    return 0;
}
With thanks,
James
Relevant answer
Answer
There are different conventions to express significands and the associated exponents. IEEE 754 chooses to consider significands between 1 and 2, while MPFR (like the C language, see DBL_MAX_EXP for instance) chooses to consider significands between 1/2 and 1 (for practical reasons related to multiple precision). For instance, the number 17 is represented as 1.0001·2^4 in IEEE 754 and as 0.10001·2^5 in MPFR. As you can see, this means that exponents are increased by 1 in MPFR compared to IEEE 754, hence emax = 1024 instead of 1023 for double precision.
Concerning the choice of emin for double precision, one needs to be able to represent 2^−1074 = 0.1·2^−1073, so that emin needs to be at most −1073 (as in MPFR, all numbers are normalized).
As documented, the mpfr_subnormalize function considers that the subnormal exponent range is from emin to emin+PREC(x)−1, so that, for instance, you need to set emin = −1073 to emulate IEEE double precision.
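Applying the same rule to IEEE single precision reproduces the values found empirically in the question (a sketch derived from the explanation above): the smallest subnormal is 2^−149 = 0.1·2^−148, and the largest normal is 0.11111111111111111111111·2^128, just below 2^128, so:
mpfr_set_default_prec (24);   /* p = 24 for float */
mpfr_set_emin (-148);         /* 2^-149 = 0.1 * 2^-148 */
mpfr_set_emax (128);          /* FLT_MAX < 2^128 */
(emax = 127 happened to work in the question's test program only because that test never approaches FLT_MAX.)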
  • asked a question related to Parallel Computing
Question
14 answers
Dear all,
Would anyone give me a brief and useful review of VASP and Quantum Espresso? I need to know about their performance, speed, parallel computing, free licensing, and tools. Which one is more applicable in solid state physics and computational nanoscience? Please rank them 0-100.
Relevant answer
Answer
VASP is a commercial code and Quantum-Espresso (QE) is an open source code.
Both have their own issues:
1. VASP is faster than QE.
2. QE has a very good mailing support.
3. Being free (under GNU license), QE is accessible to everybody.
4. For calculations of vibrational properties, QE contains an implementation of DFPT (Density Functional Perturbation Theory), whereas VASP relies on third-party software.
5. There are well-tested pseudopotentials (PPs) for VASP, whereas for QE you have to choose a proper PP yourself, as there are many for the same element.
  • asked a question related to Parallel Computing
Question
15 answers
I am looking for works or research where the CPU actually outperforms the GPU