Questions related to Parallel and Distributed Computing
I want to simulate a discrete random medium with the FDTD method.
The simulation environment is air filled with random spherical particles (small compared to the wavelength) with a defined size distribution.
What is an efficient and simple way to create random scatterers in large numbers in an FDTD code?
I have placed some random scatterers in a small area, but I have trouble generating scatterers in large numbers.
Any tips, experiences, and suggestions would be appreciated.
Thanks in advance.
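A simple approach that scales to large counts is rejection sampling: draw a radius from the size distribution and a position in the box, and keep the sphere only if it overlaps nothing placed so far; the accepted list is then mapped onto the FDTD grid. A minimal Python sketch (the box size and size distribution are made up); for very large counts, replace the all-pairs overlap check with a uniform grid (cell list) so each candidate only checks nearby spheres:

```python
import random

def place_spheres(n, box, radius_sampler, max_tries=10000):
    """Randomly place n non-overlapping spheres in a cubic box of side `box`.

    radius_sampler() returns one radius drawn from the desired size
    distribution. Rejection sampling: a candidate is discarded if it
    overlaps an already-placed sphere; positions are drawn so the
    sphere stays inside the box walls.
    """
    spheres = []  # list of (x, y, z, r)
    tries = 0
    while len(spheres) < n and tries < max_tries:
        tries += 1
        r = radius_sampler()
        x, y, z = (random.uniform(r, box - r) for _ in range(3))
        # keep only if no overlap with any sphere placed so far
        if all((x - a) ** 2 + (y - b) ** 2 + (z - c) ** 2 >= (r + s) ** 2
               for a, b, c, s in spheres):
            spheres.append((x, y, z, r))
    return spheres

random.seed(0)  # reproducible geometry between runs
spheres = place_spheres(200, box=100.0,
                        radius_sampler=lambda: random.uniform(0.5, 1.5))
```

At low volume fractions (sub-wavelength particles in air) almost every candidate is accepted, so this stays fast even for many thousands of scatterers.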
I am running a finite element (FE) simulation using the FE software PAMSTAMP. I have noticed that when I run the solver in parallel on 4 cores, the repeatability of the results is poor; when I press 'run' twice for the same model, it gives me inconsistent answers.
However, when I run the simulation on a single core, it gives me consistent answers across multiple runs of the same model.
This has led me to believe that the simulation is divided differently between the multiple cores each time the solver is run.
Is there a way to still run the simulation in parallel (multiple cores) but have the solver divide the calculation in the same manner each time, to ensure consistency for multiple runs?
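This is expected behavior for parallel solvers: the domain is partitioned across cores, partial sums are combined in whichever order the cores finish, and floating-point addition is not associative, so the low-order bits differ from run to run. Whether PAMSTAMP offers a deterministic-parallel option is a question for its documentation, but the underlying effect is easy to demonstrate:

```python
# Floating-point addition is not associative: the same four numbers summed
# in a different order give a different result, because 1.0 is absorbed
# when added to 1e16 (it is below the resolution of a double at that scale).
vals = [1e16, 1.0, -1e16, 1.0]

left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]  # 1.0 absorbed early
reordered = (vals[0] + vals[2]) + (vals[1] + vals[3])      # cancellation first

print(left_to_right, reordered)  # two different answers from the same data
```

In a real FE solver the discrepancies start in the last bits and can grow through an explicit time integration, which is why two 4-core runs diverge while single-core runs (fixed summation order) agree exactly.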
I have been working on task allocation policies and load balancing algorithms (Best Fit, Worst Fit, Next Fit, First Fit, etc.). I have read some papers on online and offline scheduling techniques but found them difficult to understand. I would be grateful if someone could explain the difference or cite a link on the subject.
Thanks in advance.
I am currently interested in solving large linear systems on a distributed parallel computing grid using the ScaLAPACK library. I wonder if there is a quadruple-precision version (or any other alternative library that provides quadruple precision for parallel distributed computing)?
Suppose we change the default block size to 32 MB and the replication factor to 1, and the Hadoop cluster consists of 4 DataNodes (DNs) with 192 MB of input data. Now I want to place the data on the DNs as follows: DN1 and DN2 hold 2 blocks (32 + 32 = 64 MB) each, and DN3 and DN4 hold 1 block (32 MB) each. Is this possible? How can it be accomplished?
Does anyone have experience in deploying a Windows HPC private cluster, or executing an Abaqus job on two computers without a queuing server using MPI parallelisation?
I have managed to set-up the Cluster and to join the headnode running on Windows Server 2012 R2 and one compute node "W510" running on Windows 7 64bit Ultimate.
Unfortunately I keep getting the following error message:
Failed to execute "hostname" on host "w510". Please verify that you can execute commands remotely on "w510" without a password. Command used: "C:\Program Files\Microsoft MPI\Bin\mpiexec.exe -wdir C:\Windows -hosts 1 w510 1 -env PATH C:\SIMULIA\Abaqus\6.13-4\code\bin;C:\Windows\system32 C:\SIMULIA\Abaqus\6.13-4\code\bin\dmpCT_rsh.exe hostname".
As far as I know, a YARN container specifies a set of resources (i.e. vCores, memory, etc.), and we may also allocate containers sized according to the type of task they execute. In the same sense, what does a map or reduce slot specify in MRv1? Does it specify a CPU core, a small amount of memory, or both? And is it possible to specify the size of map and reduce slots?
Further, suppose we have two nodes with the same CPU power but different disk bandwidth, one with an HDD and the other with an SSD. Can we say that the power of a slot on the first node (HDD) is less than on the node with the SSD?
I am doing a computationally demanding time series analysis in R with a lot of for-loops which repeat the same analysis several times (e.g. for 164 patients, for 101 different time series per patient, or for different time lags). In the end, the results of these analyses are summarized into one score per patient, but up to that point they are absolutely independent of each other. To shorten the computing time, I would like to parallelize the analysis, running the independent parts on more than one of the 8 cores of my processor.
I have read some postings about running functions like apply on more than one core, but I am not sure how to implement the approaches.
Does anybody know a simple and comprehensible way of translating a classical sequential for-loop into a procedure which uses several cores simultaneously to run some of the analyses in parallel?
Thank you very much for every comment!
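Since each patient's analysis is independent, the loop is "embarrassingly parallel". In R this is typically done with parallel::parLapply / parallel::mclapply or foreach with doParallel; the pattern is always the same: wrap the loop body in a function of the loop index, then map that function over the indices with a pool of workers. A minimal sketch of the pattern in Python's multiprocessing (the per-patient analysis is a stand-in):

```python
from multiprocessing import Pool

def analyze_patient(patient_id):
    # Stand-in for one patient's independent time-series analysis;
    # returns that patient's summary score.
    return patient_id ** 2

def run_parallel(n_workers=4):
    # Workers pull patient ids from a shared queue; results come back
    # in the original order, just like the sequential for-loop produced.
    with Pool(n_workers) as pool:
        return pool.map(analyze_patient, range(164))

if __name__ == "__main__":
    scores = run_parallel()
```

The R equivalent is roughly `cl <- makeCluster(8); scores <- parLapply(cl, 1:164, analyze_patient); stopCluster(cl)`, with the only real work being to make the loop body a self-contained function of the index.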
I have run the CacheBench benchmark, in which each test performs repeated accesses to data items over varying vector lengths. Timings are taken for each vector length over a number of iterations. The product of iterations and vector length gives the total amount of data accessed in bytes; this total is then divided by the total time to compute a bandwidth figure in megabytes per second. Here a megabyte is defined as 1024^2 = 1,048,576 bytes. In addition to this figure, the average access time in nanoseconds per data item is computed.
But I only got results with the two columns below, and why is there a sudden decrease in the time as the vector size increases?
C Size Nanosec
I need the results in the following form; how do I get them? The output should look like this:
Read Cache Test
C Size Nanosec MB/sec % Change
------- ------- ------- -------
4096 7.396 515.753 1.000
6144 7.594 502.350 1.027
8192 7.731 493.442 1.018
12288 17.578 217.015 2.274
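For what it's worth, the two figures in the table are tied together by the item size: MB/sec = item_bytes / (ns_per_item × 1e-9) / 2^20, and with 4-byte items 7.396 ns/item gives ≈ 515.8 MB/s, matching the first row. A small sketch of the arithmetic (the jump between 8192 and 12288 bytes is the working set spilling out of a cache level):

```python
def cachebench_figures(item_bytes, items_per_iter, iterations, seconds):
    """Convert one timed run into the benchmark's two figures of merit:
    bandwidth in MB/s (1 MB = 2**20 bytes) and average ns per data item."""
    total_bytes = item_bytes * items_per_iter * iterations
    mb_per_sec = total_bytes / seconds / 2 ** 20
    ns_per_item = seconds * 1e9 / (items_per_iter * iterations)
    return mb_per_sec, ns_per_item

# Hypothetical run: 10**9 accesses of 4-byte items taking 7.396 s total,
# i.e. 7.396 ns per item, as in the first table row.
mb, ns = cachebench_figures(4, 1_000_000, 1000, 7.396)
```

So if your build only prints the C Size and Nanosec columns, the MB/sec and % Change columns can be derived afterwards exactly this way (% Change is each row's ns divided by the previous row's).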
Please, I have a question: I have installed 4 virtual machines (VMs) on my computer, and I have a simple program written in MATLAB. I would like to copy this program to each VM in a distributed environment and let them work in parallel. My question is: does MATLAB support 4 virtual machines on one computer working in parallel?
In the paper "Apache Hadoop YARN: Yet Another Resource Negotiator", it is claimed that YARN can extend the node count from 4000 in Hadoop 1.x to over 7000. My question is: what is the key principle behind YARN's improved scalability?
In Hadoop 1.x, there are a centralized NameNode and a centralized JobTracker. It seems that things haven't changed much in YARN, since it also has a centralized ResourceManager. The bottleneck in Hadoop is still the design of centralized control.
I would like to do parallel computing using the GPU of my Nvidia M2000 graphics card. I was wondering if this works for all software, in my case bioinformatics tools like PASTA (a Python platform), or is it limited to certain programming platforms or to particular software/libraries built on the parallel computing APIs (e.g. CUDA)?
If it is limited, is there a way to achieve parallel computing without being restricted to certain software?
I have a program written with parallel programming in Mathematica that runs well on one CPU. If I have several computers (CPUs) networked together, how do I run this program in that case? I want the run time of this program, under the new conditions, to be divided by the number of CPUs; is this possible?
Thank you all.
Actually I'm working on numerical modeling of grain growth with cellular automata, and I have some questions about deterministic cellular automata.
Does anyone work on deterministic transformation rules used in cellular automata (calculation of the velocity and its fraction)?
My result for two grains is attached to this question.
The triple junction should be at 120 degrees, while my result contradicts this fact!
I want to find the CPU time taken by tasks executed using MapReduce on Hadoop. When I open localhost:50070 under Overview, it doesn't show how much CPU time was taken to execute a particular task. Can anyone please help me? Thanks in advance.
For DG placement, the loss sensitivity factor is LSF = 2*Peff,k*R(k-1,k)/Vk^2. How can I calculate Peff,k correctly?
For the 15-bus system I get the correct load flow, as in the paper
"Particle Swarm Optimization Based Capacitor Placement on Radial Distribution Systems",
but my LSF is not correct. Please, any help!
For example, for bus 6 the paper gives LSF = 0.016437, but my work gives 0.0129525.
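In the capacitor/DG placement literature, Peff,k is commonly taken as the total effective real power fed through the branch into bus k, i.e. the load at bus k plus the loads of all buses downstream of it (some papers also add the downstream branch losses). A common cause of a mismatch like 0.0164 vs 0.0130 is using only the bus load instead of the downstream sum, or mixing per-unit and physical units. A hedged sketch with made-up numbers:

```python
def lsf(p_eff, r, v):
    """Loss sensitivity factor of the branch (k-1, k) feeding bus k:
    LSF = 2 * Peff,k * R(k-1,k) / Vk^2  (consistent units assumed)."""
    return 2.0 * p_eff * r / v ** 2

def effective_power(bus, load, children):
    """Peff at `bus`: its own load plus everything supplied through it.
    Downstream branch losses are neglected in this sketch; add them if
    your reference paper includes them in Peff."""
    return load[bus] + sum(effective_power(c, load, children)
                           for c in children.get(bus, []))

# Hypothetical 4-bus radial feeder: 1 -> 2, 1 -> 3, 3 -> 4 (made-up data)
children = {1: [2, 3], 3: [4]}
load = {1: 0.0, 2: 10.0, 3: 20.0, 4: 5.0}
```

For the root branch of this toy feeder, Peff is the sum of all four loads (35.0), not just the load at the receiving bus; checking your bus-6 value against a downstream sum like this is the first thing to try.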
Can you suggest how to calculate the expected time to compute a task which requires multiple heterogeneous resources?
For example: if we consider only the CPU requirement of a task, then the execution time of the task is the task length in MI (million instructions) divided by the CPU speed in MIPS.
Can you suggest how to account for multiple resources such as CPU, RAM, bandwidth, etc.?
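One common simplification (used in CloudSim-style models, but a modeling assumption rather than a standard) is to compute a service time per resource and then combine them: take the maximum if the resources are used concurrently (the bottleneck dominates) or the sum if they are used one after another. A sketch with made-up task and node parameters:

```python
def expected_time(task, node):
    """Per-resource service times of a task on a node, combined two ways.
    This max/sum model is a common simplification, not a standard."""
    t_cpu = task["length_mi"] / node["mips"]  # MI / MIPS -> seconds
    # data moved over the network: MB converted to Mb, divided by Mb/s
    t_io = (task["input_mb"] + task["output_mb"]) * 8.0 / node["bw_mbps"]
    phases = [t_cpu, t_io]
    # concurrent use of resources: bottleneck; sequential use: total
    return max(phases), sum(phases)

task = {"length_mi": 2000, "input_mb": 10, "output_mb": 10}  # made-up task
node = {"mips": 1000, "bw_mbps": 160}                        # made-up node
overlap_estimate, serial_estimate = expected_time(task, node)
```

RAM usually enters not as a time term but as a feasibility constraint (the task fits or it doesn't) or as a penalty factor when it forces swapping; how you combine the terms should match the scheduling model you are evaluating.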
Can I run a multi-threaded program and specify how many cores should be used (1, 2, 4, etc.) so as to enable a performance comparison?
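Yes. Besides whatever thread-count knob the program itself exposes (e.g. OMP_NUM_THREADS for OpenMP programs), you can restrict any process to a chosen set of cores with CPU affinity: `taskset -c 0,1 ./program` on Linux, or programmatically. A minimal Linux-only sketch using Python's wrapper around sched_setaffinity:

```python
import os

def run_on_cores(cores):
    """Restrict the calling process (and all its threads) to `cores`.
    Rerunning the same multi-threaded program pinned to 1, 2, 4, ... cores
    gives a clean scaling comparison. Linux-only."""
    os.sched_setaffinity(0, cores)  # pid 0 means "this process"
    return os.sched_getaffinity(0)

# Pin this process to a single available core (the lowest-numbered one):
allowed = run_on_cores({min(os.sched_getaffinity(0))})
```

With the thread count held equal to the core count at each step, timing the same run pinned to 1, 2, and 4 cores gives the speedup curve directly.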
I keep looking for software for high-performance cellular automata simulations, but I can't find anything specific. I need one that takes advantage of multi-core processors.
What software do researchers use? Is it a MATLAB toolbox or an R library, for example? And can I somehow extract accurate measurements of its performance?
Thank you in advance for your help!
Several techniques can be applied in peer-to-peer networks to handle the presence of malicious nodes (reputation systems, accountability, distributed consensus, etc.).
Which one do you think has the best trade-off between the ability to discriminate malicious nodes from honest ones and the cost of the technique (in terms of the number of messages, for example)? This is, of course, knowing that none of these systems can fully decide (with 100% accuracy) whether a node is malicious or not.
I am researching spinlocks used in multi-threaded applications running on modern multi-core processors. Unfortunately, most of the standard benchmarks like SPLASH-2, UTS, and PARSEC use mutex locks. I am thinking of replacing the mutex locks in these benchmarks with spinlocks and collecting the results. Will the results still be considered by the scientific community if I report the modifications made to the benchmarks? Will that be a valid claim? Any pointer in this matter is highly appreciated.
I need some useful references to parallelize my FDTD codes. I'm currently working on imaging and scattering analysis of random rough surfaces using numerical methods (FDTD, MoM, etc.). In most 3D problems (with large dimensions) there are restrictions on memory and CPU time. As a result, we need to use fast methods or to parallelize our conventional codes.
Please give me some simple, fast, and useful references which I can use to parallelize my codes. I'm a beginner in parallel processing and need help.
Some time ago I used the Wolfram Lightweight Grid Manager to set up Mathematica for parallel computing in my local network. All the computers had Mathematica installed and were configured to use the license server. This environment worked properly.
My question is whether it is possible to configure Mathematica (with a site licence) to work similarly to the setup described above, but without the Wolfram Lightweight Grid Manager and the license server. I am thinking of the following situation: one computer is the master, and the others (working together in the LAN) share their computing kernels.
Map-Reduce exists in MongoDB, but the documentation says:
For most aggregation operations, the Aggregation Pipeline provides better performance and more coherent interface. However, map-reduce operations provide some flexibility that is not presently available in the aggregation pipeline.
My questions are:
- Why is map-reduce so effective in Hadoop but discouraged in MongoDB?
- Is there any example of the "flexibility" of map-reduce that is not available in the aggregation pipeline?
Thanks in advance
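On the performance point: Hadoop runs map-reduce as distributed batch jobs across many nodes, while MongoDB's mapReduce executes user-supplied JavaScript, which is slower than the natively implemented pipeline stages. On the flexibility point, a toy illustration in plain Python (not the MongoDB API) of the two styles: mapReduce lets you run arbitrary code in the reduce step, whereas a pipeline is a fixed set of declarative stages (like $group with $sum) that the engine can optimize:

```python
from itertools import groupby

orders = [{"cust": "a", "amount": 5}, {"cust": "b", "amount": 3},
          {"cust": "a", "amount": 2}]

def map_reduce(docs, mapper, reducer):
    # Map each document to a (key, value) pair, group by key, then fold
    # each group with an ARBITRARY reduce function: that arbitrariness is
    # the "flexibility" the MongoDB docs refer to.
    pairs = sorted((mapper(d) for d in docs), key=lambda kv: kv[0])
    return {k: reducer([v for _, v in grp])
            for k, grp in groupby(pairs, key=lambda kv: kv[0])}

totals_mr = map_reduce(orders, lambda d: (d["cust"], d["amount"]), sum)

# The same query as a declarative "pipeline" stage (think $group + $sum):
totals_pipeline = {}
for d in orders:
    totals_pipeline[d["cust"]] = totals_pipeline.get(d["cust"], 0) + d["amount"]
```

Anything expressible as fixed stages belongs in the pipeline; mapReduce earns its keep only when the reduce logic is genuinely custom code that no stage operator covers.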
I'm currently working with the Linux version. I would like to print the meshes without the data calculated by the FMESH card. The purpose is to verify the mesh (with the simulated geometry in the background).
In CUDA C, I searched for how to create a table, but I am not getting how I can run group-by queries on that table, nor which functions can define them.
One of the reasons for slow performance in FEM analysis is the construction of the elastic stiffness matrix with loops.
There is a large for-loop over the number of elements (NE). This can quickly add significant overhead when NE is large, since each line in the loop body is interpreted at every iteration. Vectorization should be applied for the sake of efficiency.
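A sketch of the idea in NumPy (in MATLAB the equivalent is sparse(rows, cols, vals) or accumarray): assemble 1D two-node bar elements first with the interpreted loop, then vectorized by building all (row, col, value) triplets at once and scatter-adding them in a single call. The element stiffness here is made up:

```python
import numpy as np

def assemble_loop(n_el, k_el):
    """Interpreted per-element loop: simple but slow for large NE."""
    K = np.zeros((n_el + 1, n_el + 1))
    for e in range(n_el):
        K[e:e + 2, e:e + 2] += k_el * np.array([[1, -1], [-1, 1]])
    return K

def assemble_vectorized(n_el, k_el):
    """Build all (i, j, value) triplets at once, then scatter-add them
    in one call; duplicate indices accumulate, as assembly requires."""
    e = np.arange(n_el)
    rows = np.concatenate([e, e, e + 1, e + 1])
    cols = np.concatenate([e, e + 1, e, e + 1])
    vals = np.concatenate([np.full(n_el, k_el), np.full(n_el, -k_el),
                           np.full(n_el, -k_el), np.full(n_el, k_el)])
    K = np.zeros((n_el + 1, n_el + 1))
    np.add.at(K, (rows, cols), vals)
    return K
```

For large meshes the triplets would go straight into a sparse COO matrix instead of a dense array, but the assembly pattern (all triplets first, one accumulating insert second) is the same, and it is what removes the per-iteration interpreter overhead.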
I have installed a Hadoop 2.6.0 cluster using one NameNode (NN) and 3 DataNodes (DNs). Two DNs are physical machines running Ubuntu, while the 3rd DN is a virtual node running Ubuntu on a Windows server. The two physical DNs each have 2 GB RAM and an Intel Xeon 2.0 GHz processor x 4 (i.e. 4 cores). The 3rd DN is assigned 4 GB RAM and 1 processor x 4 (i.e. 4 cores), and it runs on a Windows server that has 32 GB RAM and 2 Intel Xeon 2.30 GHz processors x 6 (i.e. 12 cores).
Mappers running on the physical DNs are faster than mappers running on the virtual DN. Why?
The original version of the Intel Advanced Vector Extensions Programming Reference, from March 2008 (Document no. 319433-002), is quite often referenced in HPC papers that were early adopters of the AVX SIMD extensions. However, the original link for this document no longer exists.
I am trying to use a pulse program in Topspin 2.3, but it's not working. The pulse program works in an earlier version of Topspin (probably version 1.1), but when I tried to use it in Topspin 2.3, it showed a compilation error. The error is mainly for the line '#include <trigg.incl>' and the 'trigg' command in the body of the pulse program.
When I deactivated these two lines (by prefixing them with semicolons), the program didn't show any compilation error, so it looks like it will work.
Now I am concerned: is it OK to use the pulse program without the 'trigg' command or the trigg include file? Is there any chance of damaging the probe? What does the 'trigg' command do?
[I noticed that the other pulse programs that came with Topspin 2.3 do not have that 'trigg.incl' file or the trigg command.]
Can anybody help me?
Thanks in advance!
I have executed my MapReduce program on these two versions of Hadoop and observed that Hadoop-2.6.0 takes about twice as long as Hadoop-1.2.1. I am also confused about the jar files of Hadoop 2, which are complicated, whereas in Hadoop 1 it was easy to add jar files. I am using Eclipse Kepler and Ubuntu 14.04 64-bit. Please help me.
I'm currently a PhD student, and I'm trying to post-process output databases (ODBs) with Abaqus Python scripting, but the script uses only 22.5% of the memory and 1 CPU. I want to use the full memory and several CPUs (8, for example).
Thanks for your help,
The University where I work is planning to move some of its workload to public cloud environments. I am not sure if this is the right decision, because we have very low bandwidth. Next year, the bandwidth could be increased to 1 Gbps, which would provide access for about 10,000 people. I wish to write a well-supported document with the pros and cons of such a decision.
Thanks for your help,
My experiment was conducted on different sets of computers with varying numbers of cores and varying CPU speeds. My analysis is to be based on the impact of the number of cores, but the machines do not all have the same CPU speed.
A method is required to make the results from the various machines uniform, removing the effect of CPU speed.
Is there any known method of normalizing/standardizing the results of executions on different computers with varying CPU speeds?
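The crudest correction, defensible only if the workload is CPU-bound and its speed scales linearly with clock frequency (a strong assumption that ignores memory, cache, and IO effects), is to rescale each runtime to a common reference clock. A sketch:

```python
def normalize_runtime(seconds, machine_ghz, reference_ghz=2.0):
    """Scale a measured runtime to a reference clock speed, assuming purely
    CPU-bound, clock-proportional work. A 3 GHz machine's 10 s run is
    reported as the ~15 s it would have taken at the 2 GHz reference.
    Memory and IO effects are NOT corrected by this."""
    return seconds * machine_ghz / reference_ghz
```

A safer alternative, if your analysis is about the impact of core count, is to report speedup relative to the single-core run on the same machine: that ratio cancels the clock speed entirely and leaves only the scaling behavior you want to compare.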
I am using Hadoop-2.6.0 and Eclipse Kepler/Juno. I have executed my MapReduce job both ways and found that running the jar file from the terminal takes more time than direct execution from the Eclipse IDE. I cannot understand the reason behind this. Please help; I cannot decide which way I should follow.
Can we simulate every topic in the field of "Service Discovery" or "Resource Discovery" using CloudSim? Or are there other tools that need to be added to CloudSim?
Or does another simulator exist for this purpose?
I am using hadoop-2.6.0 and Eclipse Kepler on Ubuntu 14.04. I am confused about the library files of Hadoop, as there are so many jar files in contrast to Hadoop 1.x. So how many jar files should be added to a MapReduce project? For example, what should be added for a word count program?
I am studying the performance of spinlocks used in parallel programs running on modern multi-core processors. Though spinlocks are heavily used in multithreaded OS kernels, I see that in application space mutex locks are much more widely used than spinlocks. Hence I would like to check whether some applications do use spinlocks.
I would also like to know if there are any parallel computing benchmarks that use spinlocks. I found that the SPLASH-2 benchmark uses mutex locks.
In the driver class I have added the Mapper, Combiner, and Reducer classes, and I am executing on Hadoop 1.2.1 with NumReduceTasks set to 2. The output of my MapReduce code is generated in a single file, part-r-00000. I need to generate output in more than one file. Do I need to add my own Partitioner class?
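With the reducer count actually set to 2 (verify that `job.setNumReduceTasks(2)` is called before submission) you should already get two output files, part-r-00000 and part-r-00001; a custom Partitioner is only needed when you want to control which keys land in which file. A toy sketch of the default routing logic (a Python stand-in, not the Hadoop Java API):

```python
def partition(key, num_reduce_tasks):
    """Sketch of Hadoop's default HashPartitioner: each key is routed to
    reducer hash(key) mod numReduceTasks, and each reducer writes its own
    part-r-NNNNN file. Hadoop uses the Java key's hashCode(); Python's
    hash() stands in here, with integer keys for determinism."""
    return hash(key) % num_reduce_tasks

# With 2 reducers, keys spread over both output files:
targets = {partition(k, 2) for k in range(10)}
```

If all your real keys happen to hash to the same reducer (e.g. very few distinct keys), one part file can legitimately come out empty; a custom Partitioner overriding this mod-hash rule is how you force a different split.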
I've designed a novel static interconnection topology for parallel and distributed systems which shows considerable advantages over existing static interconnection topologies in terms of its physical properties. Now I must support my novel topology with a practical comparison. To do that, how can I choose a practical problem and categorize the results?
How can I add waiting time as a parameter in a priority-based scheduling algorithm? If we add this parameter, it should provide a more efficient scheduling list.
We would also add turnaround time as a parameter.
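One standard way to fold waiting time into priority scheduling is aging: the effective priority grows with waiting time, so long-waiting low-priority tasks eventually overtake freshly arrived high-priority ones and starvation is avoided. A sketch (the aging weight alpha is a tunable, made-up value):

```python
def pick_next(ready, now, alpha=0.1):
    """Priority scheduling with aging. `ready` holds tuples of
    (name, base_priority, arrival_time); higher value = higher priority.
    Effective priority = base priority + alpha * waiting time."""
    def effective(task):
        name, base, arrival = task
        return base + alpha * (now - arrival)  # aging term
    return max(ready, key=effective)[0]

# A fresh high-priority arrival wins at first, but a low-priority task
# that has waited long enough overtakes a later fresh arrival:
early = pick_next([("low", 1, 0), ("high", 5, 10)], now=10)  # -> "high"
late = pick_next([("low", 1, 0), ("high", 5, 60)], now=60)   # -> "low"
```

Turnaround time (completion minus arrival) then serves as the evaluation metric: compare mean waiting and turnaround times with alpha = 0 (plain priority) against alpha > 0 to show the schedule actually improved.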
MARE is a programming model and a run-time system that provides simple yet powerful abstractions for parallel, power-efficient software:
– A simple C++ API allows developers to express concurrency
– A user-level library that runs on any Android device, and on Linux, Mac OS X, and Windows platforms
I'm not sure how the throughput of a data center network can be calculated.
Could you please give me the exact formula for calculating the throughput of a data center network?
Thanks in advance.
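There is no single exact formula: the measured throughput of one flow is just the bits successfully delivered divided by the measurement interval, and for the network as a whole one usually reports the sum over all flows, often normalized by the bisection bandwidth the topology can offer. A sketch of the bookkeeping:

```python
def throughput_mbps(bytes_delivered, seconds):
    """Achieved throughput of one flow: successfully delivered bits
    divided by the measurement interval, in Mb/s."""
    return bytes_delivered * 8 / seconds / 1e6

def aggregate_throughput(flows):
    """Aggregate network throughput: the sum over all measured flows
    (often then normalized by the network's bisection bandwidth)."""
    return sum(throughput_mbps(b, s) for b, s in flows)

# Two hypothetical flows, each delivering 125 MB in 10 s (100 Mb/s each):
total = aggregate_throughput([(125_000_000, 10), (125_000_000, 10)])
```

Which traffic pattern the flows come from (all-to-all, incast, etc.) matters as much as the formula, which is why DCN papers report throughput per workload rather than one number.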
I am running Ubuntu in VMware on 4 nodes running Windows 7. I have configured each Ubuntu VM with 1 GB RAM, 1 processor, and a 20 GB hard disk. If I used these configurations directly on the physical nodes instead of in VMware, would the performance be better?
I am wondering if anybody has run calculations with NWChem on Mira (IBM's Blue Gene/Q system) at the ALCF. If you have done performance studies and collected some data, could you please share any suggestions or ideas? I am particularly interested in DFT, TDDFT, and the many-body correlated suites (CCSD, EOM-CC, MRCC, etc.).
We are looking for thread-safe container classes in C/C++; please let us know if you know of any. The library should support multi-threaded read-write vectors, lists, and similar containers.
Lock-free, wait-free implementations are also welcome.
1. Open-source and commercial-friendly license. (We are aware of TBB, but its free version is GPL, hence not suitable.)
2. C++11 ready if possible (GCC 4.8, MSVC 2013)
I came across STAPL (https://parasol.tamu.edu/stapl/), but couldn't find any download link for it.
A multi-writer shared storage is one that allows two writers to modify the same storage object concurrently without imposing any locking. Usually, in cases of write collision, one of the write operations is visible and the other is hidden, meaning that no read operation will later observe that write. I just wonder if such behavior is useful for any existing applications. For example, if the shared storage implemented a file system, would it be OK for one writer to overwrite the changes of a concurrent writer on a single file?
I'm asking those who have experience with MPI in various environments.
It can be used on simple Beowulf clusters, consisting of four PCs connected to a single switch (or even hub), where all info can be broadcasted (or multicasted) to all hosts. It can also be used on large supercomputers of complicated topology, with large diameters.
It seems that different cases need different program optimizations, don't they?
In particular - are the collective operations efficient on supercomputers, like BlueGene? AFAIK, modern algorithms (like SUMMA) often use MPI_Reduce, at least...
What are your experiences? How to design programs for MPI?
As part of a course on Computer Architecture, how can students be given a flavor of multicore/manycore architectures through real hands-on experience (perhaps through programming, using some specific tool, or using pre-designed boards or kits)?
First I tried to run mpiBLAST on a Rocks cluster, but it always ends up with errors. Can anyone suggest what I should do?
I just need a few examples of applicable algorithms, because I am looking for new research areas.
There are many simulators (e.g. DCNSim, NS-3) for simulating data center network architectures. Which one is more powerful and provides good flexibility?
In an Intel processor, does each of the cores have its own independent L1 and L2 cache? The Intel architecture diagram says "Shared L3" cache. Does this mean that memory allocation in this L3 cache is at the mercy of the kernel? What happens if a process on Core 1 demands more memory? Will this allocate more of the L3 cache (once it runs out of, or misses in, L1 and L2)? Will this hamper the performance of processes running on Core 2 or Core 3?
Is there any way to guarantee the performance of a process on a core when other processes are running at the same time on different cores? I am trying to avoid any kind of context switching for a process running on a single core.
What is the best way to schedule/execute two separate codes on a dual-core processor? Consider that the two codes don't have any dependency on each other or any shared resources. For example: one piece of code generates a Fibonacci series and the other square roots (both running from 1 to a billion). Can these two codes be run on two separate cores, entirely independent of each other? If one of the codes (assume the Fibonacci one) encounters an overflow error, the other must not be affected in any manner.
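Yes: run them as two separate OS processes. The scheduler is then free to place each on its own core, and because processes do not share an address space, an exception or overflow in one cannot corrupt the other. A sketch with Python's multiprocessing (the workloads are stand-ins):

```python
from multiprocessing import Process

def fib_worker(n=20000):
    # Stand-in CPU-bound workload: iterative Fibonacci.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b

def crasher():
    # This failure stays inside its own process.
    raise OverflowError("overflow in one code must not affect the other")

def run_pair():
    ok, bad = Process(target=fib_worker), Process(target=crasher)
    ok.start()
    bad.start()  # the OS may schedule each process on its own core
    ok.join()
    bad.join()
    # exitcode 0 = clean finish; nonzero = the child crashed
    return ok.exitcode, bad.exitcode

if __name__ == "__main__":
    codes = run_pair()
```

The crashing process exits with a nonzero code while the Fibonacci process finishes cleanly, which is exactly the isolation asked for; threads within one process would not give this guarantee, since an aborting thread can take the whole process down.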