
# Parallel and Distributed Computing - Science method

Parallel and distributed computing covers distributed systems and computations carried out in parallel
Questions related to Parallel and Distributed Computing
• asked a question related to Parallel and Distributed Computing
Question
Hello,
I am running a finite element (FE) simulation using the FE software PAMSTAMP. I have noticed that when I run the solver in parallel with 4 cores, the repeatability of the results is poor; when I press 'run' twice for the same model, it gives me inconsistent answers.
However when I run the simulation on a single core, it gives me consistent answers for multiple runs of the same model.
This has led me to believe that the calculations are divided differently between the cores each time the solver is run.
Is there a way to still run the simulation in parallel (multiple cores) but have the solver divide the calculation in the same manner each time, to ensure consistency for multiple runs?
Thanks,
Hamid
I had similar problems when simulating crash boxes in parallel. Although I am no expert in HPC, I think it is because each core performs the round-off operations in a different order, and the interactions between the cores at the domain boundaries also play an important role. I guess that if you could force exactly the same domain assignments to exactly the same cores as in a previous run, you would get the same results. However, I doubt the PAM software is that flexible.
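To see why the ordering matters at all, note that floating-point addition is not associative, so a different domain decomposition sums the same contributions in a different order. A small illustration (generic floating-point behaviour, nothing PAMSTAMP-specific):

```python
import random

# Floating-point addition is not associative:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False

# So summing the same contributions in a different order (as a different
# domain decomposition across cores would) gives a slightly different total.
random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
forward = sum(values)
backward = sum(reversed(values))
print(abs(forward - backward))   # tiny, but nonzero in general
```

In an explicit FE solver these tiny differences are amplified step by step, which is why pinning down the decomposition (if the solver allows it) restores run-to-run repeatability.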
• asked a question related to Parallel and Distributed Computing
Question
I have been working on task-allocation policies and load-balancing algorithms (Best Fit, Worst Fit, Next Fit, First Fit, etc.). I have read some papers on online and offline scheduling techniques, but I found the distinction difficult to understand. I would be grateful if someone could explain the difference or share a link on the topic.
Offline scheduling is the classical scheduling setting: we are given the complete set of jobs up front and must define a proper schedule of these jobs on a machine environment in order to optimize a certain objective function. An offline algorithm is therefore judged on its ability to obtain optimal solutions.
In online scheduling, by contrast, the algorithm has no knowledge about incoming jobs at the beginning of the decision process. It must make an immediate decision every time a job arrives (there are two arrival models: over-list and over-time). The decision can consist of scheduling a job immediately (and choosing on which machine) or waiting for a defined period of time in the hope that more interesting jobs arrive in the future (this applies to the over-time arrival model). To evaluate the performance of an online algorithm, one compares the objective value it obtains on any instance with the objective value obtained by an optimal offline algorithm that has complete knowledge of the instance (competitive analysis). By construction, an online algorithm can never perform better than the optimal offline one.
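A small sketch of the two settings, for makespan minimization on identical machines (the job sizes are made up, and `offline_lpt` is just one offline heuristic, used here as a stand-in for the better-informed offline solution):

```python
def online_list_scheduling(jobs, m):
    """Online: each job must be placed the moment it arrives,
    with no knowledge of the jobs still to come."""
    loads = [0] * m
    for p in jobs:
        i = loads.index(min(loads))   # greedy: least-loaded machine
        loads[i] += p
    return max(loads)                 # makespan

def offline_lpt(jobs, m):
    """Offline: the whole instance is known up front, so we may sort
    the jobs (Longest Processing Time first) before placing them."""
    return online_list_scheduling(sorted(jobs, reverse=True), m)

jobs = [2, 2, 2, 3, 3, 3, 6]             # an instance where greedy suffers
print(online_list_scheduling(jobs, 2))   # 13
print(offline_lpt(jobs, 2))              # 11
```

On this instance the online greedy, which must commit on arrival, ends with makespan 13, while the offline algorithm, which can inspect the whole instance first, reaches 11; the worst-case ratio of such values is exactly what competitive analysis bounds.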
• asked a question related to Parallel and Distributed Computing
Question
I am currently interested in solving large linear systems on a distributed parallel computing grid using the ScaLAPACK library. I wonder if there is a quadruple precision version (Or any other alternative library which provides quadruple precision for parallel distributed computing)?
Hi
Interesting question. My guess is that if you want double-double (quad) precision, the compiler must support the data type, and you also need the appropriate support for your AMD or x86 processor. Both seem to be available, and you will need the relevant flags when compiling the library.
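For context, "quad" in this setting often means double-double arithmetic: each value is stored as a pair of doubles, and error-free transformations recover the rounding error of every operation. The basic building block is Knuth's two-sum, sketched here in Python purely for illustration (a real library would implement this in C or Fortran):

```python
def two_sum(a, b):
    """Knuth's error-free transformation: s + err equals a + b exactly;
    err is the rounding error lost by the double-precision addition."""
    s = a + b
    bv = s - a
    err = (a - (s - bv)) + (b - bv)
    return s, err

s, err = two_sum(1.0, 1e-17)
print(s)    # 1.0   (the 1e-17 vanished in double precision)
print(err)  # 1e-17 (recovered exactly)
```

Chaining such pairs through every add and multiply is what gives double-double libraries roughly 32 significant digits without hardware quad support.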
• asked a question related to Parallel and Distributed Computing
Question
I need to parallelize gradient descent in a distributed fashion. Also, I don't want to partition the model, but to replicate it and, after training for some number of steps, reduce all the models into one. I want to know how to reduce those models.
I think that you are looking for something like the following:
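If what you want is data-parallel (replicated) training, the usual reduction is parameter averaging, which is what an all-reduce computes across machines. A minimal sketch with made-up parameter values:

```python
# Parameters of each replica after its local training steps (made-up values).
replicas = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]

def average_models(models):
    """Reduce replicated models into one by element-wise parameter
    averaging, i.e. what an all-reduce computes across machines."""
    n = len(models)
    return [sum(params) / n for params in zip(*models)]

print(average_models(replicas))   # ~[1.0, 2.0]
```

Frameworks differ in how often they average (every step for synchronous SGD, every k steps for local SGD), but the reduce itself is this element-wise mean.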
• asked a question related to Parallel and Distributed Computing
Question
Suppose we change the default block size to 32 MB and the replication factor to 1, and the Hadoop cluster consists of 4 DataNodes (DNs). The input data size is 192 MB. I now want to place the data on the DNs as follows: DN1 and DN2 hold 2 blocks each (32+32 = 64 MB), and DN3 and DN4 hold 1 block each (32 MB). Is this possible? How can it be accomplished?
As I understand it, Hadoop places replicas based on rack awareness.
• asked a question related to Parallel and Distributed Computing
Question
I want to simulate discrete random media with the FDTD method.
The simulation environment is air filled with randomly placed spherical particles (small compared to the wavelength) with a defined size distribution.
What is an efficient and simple way to create random scatterers in large numbers in an FDTD code?
I have placed some random scatterers in a small area, but I have trouble producing scatterers in large numbers.
Any tips, experiences and suggestions would be appreciated.
Parallelizing a code is not easy.  You may want to get hold of a commercial tool like Lumerical or an open-source tool like Meep if you need parallelization.
It is possible.  You will need to "voxelize" the STL file in order to identify the grid points that are inside or outside of your particles.  Search the MathWorks website for anything called Voxelize or similar.  There is a little more information about this in Lecture 19 here:
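If you generate the particles yourself instead of importing an STL, a simple approach is random sequential addition straight onto the voxel grid: draw a sphere, reject it if it overlaps one already placed, otherwise mark its cells. A rough sketch (grid size, radii and particle count are placeholders; an FDTD code would assign the particle permittivity to the marked cells):

```python
import random

# Random sequential addition of non-overlapping spheres onto a voxel grid.
random.seed(1)
N = 60                                       # cells per side (placeholder)
grid = [[[0] * N for _ in range(N)] for _ in range(N)]
spheres = []                                 # list of ((x, y, z), radius)

def overlaps(c, r):
    return any((c[0] - x) ** 2 + (c[1] - y) ** 2 + (c[2] - z) ** 2
               < (r + rr) ** 2 for (x, y, z), rr in spheres)

attempts = 0
while len(spheres) < 30 and attempts < 10_000:
    attempts += 1
    r = random.uniform(2.0, 4.0)             # radius from the size distribution
    c = tuple(random.uniform(r, N - r) for _ in range(3))
    if overlaps(c, r):
        continue                             # rejected: touches another sphere
    spheres.append((c, r))
    x0, y0, z0 = c
    # mark every grid cell inside the sphere
    for i in range(max(int(x0 - r), 0), min(int(x0 + r), N - 1) + 1):
        for j in range(max(int(y0 - r), 0), min(int(y0 + r), N - 1) + 1):
            for k in range(max(int(z0 - r), 0), min(int(z0 + r), N - 1) + 1):
                if (i - x0) ** 2 + (j - y0) ** 2 + (k - z0) ** 2 <= r * r:
                    grid[i][j][k] = 1

print(len(spheres))
```

For thousands of particles, replace the linear overlap scan with a cell list (bin the sphere centres), which keeps insertion roughly O(1) per particle.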
• asked a question related to Parallel and Distributed Computing
Question
Does anyone have experience in deploying a Windows HPC private cluster, or executing an Abaqus job on two computers without a queuing server using MPI parallelisation?
I have managed to set-up the Cluster and to join the headnode running on Windows Server 2012 R2 and one compute node "W510" running on Windows 7 64bit Ultimate.
Unfortunately I keep getting the following error message:
Failed to execute "hostname" on host "w510". Please verify that you can execute commands remotely on "w510" without a password. Command used: "C:\Program Files\Microsoft MPI\Bin\mpiexec.exe -wdir C:\Windows -hosts 1 w510 1 -env PATH C:\SIMULIA\Abaqus\6.13-4\code\bin;C:\Windows\system32 C:\SIMULIA\Abaqus\6.13-4\code\bin\dmpCT_rsh.exe hostname".
I also ran into a similar problem. Did you manage to overcome it?
• asked a question related to Parallel and Distributed Computing
Question
I have read many books and searched the Internet, but I did not find any answer.
Dear Nitin Mathur,
I will still try to explain why a conditional branch instruction takes 7 T-states when the condition is not met. Why isn't it 10 instead?
The explanation:
First of all we need to keep two things in mind
1. Each instruction begins with an opcode fetch cycle: during the first two clock pulses the memory is read for the opcode and the opcode is placed on the data bus; during the next clock pulses it is loaded into the instruction register and then passed to the instruction decoder, which interprets the instruction.
2. The program counter always points to the next memory location from which an 8-bit value will be fetched. The processor fetches the location the program counter points to, either as an opcode fetch or as a normal memory read depending on the previous fetch, and the program counter is automatically incremented by 1.
Now in connection with the answer of the question:
While the opcode of a conditional jump is being fetched, the PC points to the low byte of the jump address. After the 4th clock pulse the processor automatically reads the memory location pointed to by the PC (the low byte) as a simple memory read, and the PC is auto-incremented by 1. As soon as the instruction decoder finishes interpreting the conditional jump and finds the condition to be false, the processor sees no need to fetch the high byte of the address (now pointed to by the PC), so the PC is simply incremented by 1 again to point to the next instruction (i.e. an opcode instead of the high byte). So on a false condition there are two machine cycles: opcode fetch + memory read (unnecessary, but the cycle has to finish).
Hence it is 4 + 3 = 7.
I will be glad if this info helps.
Thanks
• asked a question related to Parallel and Distributed Computing
Question
Removed Question as account is inactive?
I'd highly recommend using a larger timestep for your goal of getting the most ns/day possible. I don't recall the details, but a colleague of mine is using a timestep of 6 fs, a threefold speedup compared to your setting of 2 fs.
Have a look at "virtual sites" or "hydrogen mass repartitioning" in  context of gromacs
• asked a question related to Parallel and Distributed Computing
Question
As I understand it, a YARN container specifies a set of resources (vCores, memory, etc.), and we can allocate containers of the required size based on the type of task they execute. In the same sense, what does a map or reduce slot specify in MRv1? Does it specify a CPU core, a fixed amount of memory, or both? And is it possible to specify the size of map and reduce slots?
Further, suppose we have two nodes with the same CPU power but different disk bandwidth, one with an HDD and the other with an SSD. Can we say that a slot on the HDD node is less powerful than one on the SSD node?
Not in all cases. If you have a large RAID array of SATA HDDs, data is typically written once and only occasionally rewritten or moved, so if your job mostly writes its output sequentially, the HDD node need not be at a disadvantage.
• asked a question related to Parallel and Distributed Computing
Question
Hi,
I am running a computationally demanding time-series analysis in R with a lot of for-loops that repeat the same analysis many times (e.g. for 164 patients, for 101 different time series per patient, or for different time lags). In the end, the results of these analyses are summarized into one score per patient, but up to that point they are completely independent of each other. To shorten the computing time, I would like to parallelize the analysis so that the independent parts can use more than one of the 8 cores of my processor.
I have read some postings about running functions like apply on more than one core, but I am not sure how to implement the approaches.
Does anybody know a simple and comprehensible way of translating a classical sequential for-loop into a procedure which uses different cores simultaneously to run a few of the analyses parallel?
Thank you very much for every comment!
Best,
Brian
The easiest way to take advantage of multiprocessors is the multicore package which includes the function mclapply(). mclapply() is a multicore version of lapply(). So any process that can use lapply() can be easily converted to an mclapply() process. However, multicore does not work on Windows. I wrote a blog post about this last year which might be helpful. The package Revolution Analytics created, doSMP, is NOT a multi-threaded version of R. It's effectively a Windows version of multicore.
If your work is embarrassingly parallel, it's a good idea to get comfortable with the lapply() type of structuring. That will give you easy segue into mclapply() and even distributed computing using the same abstraction.
Things get much more difficult for operations that are not "embarrassingly parallel".
• asked a question related to Parallel and Distributed Computing
Question
I have run the CacheBench benchmark. Each test performs repeated accesses to data items over varying vector lengths, and timings are taken for each vector length over a number of iterations. The product of iterations and vector length gives the total amount of data accessed in bytes. This total is then divided by the total time to compute a bandwidth figure in megabytes per second, where a megabyte is defined as 1024^2 = 1,048,576 bytes. In addition to this figure, the average access time in nanoseconds per data item is computed and reported.
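The arithmetic described above can be written down directly (the numbers below are hypothetical, just to exercise the formulas):

```python
def bandwidth_report(vector_len, iterations, total_time_s):
    """The benchmark's arithmetic: total bytes accessed is the product of
    iterations and vector length; dividing by total time gives MB/s
    (1 MB = 1024**2 = 1048576 bytes); time per item is reported in ns."""
    total_bytes = vector_len * iterations
    mb_per_s = (total_bytes / 1048576) / total_time_s
    ns_per_item = total_time_s * 1e9 / total_bytes
    return mb_per_s, ns_per_item

# hypothetical numbers: a 4096-byte vector traversed 100,000 times in 3 s
mb, ns = bandwidth_report(4096, 100_000, 3.0)
print(round(mb, 3), round(ns, 5))
```

Note that MB/s and ns-per-item are just reciprocals of each other up to a constant, which is why the desired table can be derived from the one you already have.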
But I got the result in the following form. Why is there a sudden decrease in the time when the vector size increases?
output:
C Size      Nanosec
256         11715.798191
336         11818.694309
424         11812.024314
512         11819.551479
...         ...
4194309     9133.330071
I need the results in the following form. How do I get this result? The output should look like this:
C Size    Nanosec    MB/sec     % Change
-------   -------    -------    --------
4096      7.396      515.753    1.000
6144      7.594      502.350    1.027
8192      7.731      493.442    1.018
12288     17.578     217.015    2.274
1. Dear Sir, thank you so much for your response. But why is there a sudden decrease in the time as the vector length increases (from roughly 11,820 ns at C Size 512 down to roughly 9,133 ns at 4,194,309, as shown above)?
I am confused about this: if we increase the vector size, the data will no longer fit in the cache and will have to come from main memory, so the time per access should increase. Why does it decrease instead when we increase the vector size?
Dear Sir, if I am wrong please do correct me.
• asked a question related to Parallel and Distributed Computing
Question
Dear all,
Please, I have a question: I installed 4 virtual machines (VMs) on my computer, and I have a simple program written in MATLAB. I would like to copy this program to each VM in a distributed environment and let them work in parallel. My question is: does MATLAB support 4 virtual machines on one computer working in parallel?
Kind regards
Ammar
Dear Farshid,
I would like to thank you for this answer. I read on the Internet about distributed computing in MATLAB. I can see there is a MATLAB product for distribution called "MATLAB Distributed Computing Server", and I think this product requires a separate license.
Please, if there is anyone who works with "MATLAB Distributed Computing Server", can you help?
Kind regards
Ammar
• asked a question related to Parallel and Distributed Computing
Question
In the paper "Apache Hadoop YARN: Yet Another Resource Negotiator", it is claimed that YARN can extend the node count from 4,000 in Hadoop 1.x to over 7,000. My question is: what is the key principle behind YARN's improved scalability?
In Hadoop 1.x there are a centralized NameNode and a centralized JobTracker. Things do not seem to have changed much in YARN, since it also has a centralized ResourceManager. The bottleneck in Hadoop is still the centralized-control design.
Both Suzanne McIntosh and Feras Awaysheh mention that the master node is freed from job management, and that these freed resources can be used to manage more nodes. That explains the scalability improvement well.
I am now trying to understand why the number of nodes can limit scalability in the first place. Could it be the resource limits of a single process, since it keeps the state of all slave nodes? If that is the case, would allocating the ResourceManager to a more powerful server help improve scalability?
• asked a question related to Parallel and Distributed Computing
Question
Hi all,
I would like to establish parallel computing using the GPU of my Nvidia M2000 graphics card. I was wondering whether this works for all software, in my case bioinformatics tools like PASTA (a Python platform), or whether it is limited to certain programming platforms or to particular software/libraries built on parallel computing APIs (e.g. CUDA)?
If it is limited, is there a way to achieve GPU parallel computing without being restricted to only certain software?
Thanks.
Writing code in CUDA makes your program limited to NVIDIA GPUs, and hence not very portable (exactly what NVIDIA wants, of course).  Most of the GPU codes I've seen use the OpenACC set of directives (Fortran) / pragmas (C/C++) instead. These have the advantage of (relative) ease of use, and of portability to non-GPU systems, where the directives are simply ignored.  The latest versions of OpenMP also support GPU offloading.
• asked a question related to Parallel and Distributed Computing
Question
I am evaluating different iPaaS solutions. I am looking for the following:
• iPaaS reference model
• Strengths of iPaaS
• Limitations of iPaaS
Hi Chandra
Integration Platform as a Service (iPaaS) evolved some 5 years ago in response to companies' need to integrate cloud-based services that address data, processes, service-oriented architecture (SOA) and other application use cases into existing company systems, using a cloud service model. It allows companies to use a suite of cloud-based tools to develop, execute and govern integration flows. Customers typically do not have to install or manage any hardware or middleware, which is particularly useful for SMEs.
As I recall, Gartner were the first to come out with an iPaaS reference model, which allows users to easily compare offerings from different service providers.
The main strengths of iPaaS are the ease of implementation, the lack of investment needed in new infrastructure, the lack of a requirement to manage additional hardware or middleware, and the fact that it is paid for on an accounting-friendly pay-as-you-go basis, i.e. Opex instead of Capex.
The limitations of iPaaS are the same as for the cloud generally. You are running on someone else's hardware, you may suffer from vendor lock-in, and of course all the usual cloud security, privacy, governance and risk issues will also apply. Clearly, the super secure corporate firewall will also not stretch to protect your iPaaS end, and the strength of the service level agreement will determine your fate to a large extent.
I have loaded some lists of references which address all the usual cloud issues, such as cloud security, cloud privacy, cloud accountability, cloud governance, and cloud risk.
I hope that helps.
Regards
Bob
enc
• asked a question related to Parallel and Distributed Computing
Question
I have a program written with parallel programming in Mathematica that runs well on one CPU. If I have several computers (CPUs) networked together, how can I run this program across them? I would like the run time to be divided by the number of CPUs; is this possible?
Thank you all .
It depends on the programming environment and the operating system. If you want to use several computers in a network you can use MPI. On a single computer you can use OpenMP (a good tutorial here: http://bisqwit.iki.fi/story/howto/openmp/ ).
There are also other tools like the one here: http://mpc.hpcframework.paratools.com/ that allows you to run parallel programs on clusters of  multiprocessor/multicore NUMA nodes.
Regards,
Bogdan
• asked a question related to Parallel and Distributed Computing
Question
Actually I'm working on numerical modeling of grain growth with cellular automata, and I have some questions about deterministic cellular automata.
Does anyone work on deterministic transformation rules used in cellular automata (calculation of the velocity and its fraction)?
My result for two grains is attached to this question.
The triple junction must be at 120 degrees, while my result contradicts this fact!
• asked a question related to Parallel and Distributed Computing
Question
I want to write a research proposal in the high-performance computing area for my PhD.
Reading in general is good, but an important part of original research is to come up, yourself or with a little help, with research topics to investigate, based on knowledge of the field. An important part of PhD research is finding the important, relevant, challenging subjects you can develop, solve and advance.
If you only get direct advice from others you can become a technician of some sort, but not a complete researcher!
• asked a question related to Parallel and Distributed Computing
Question
I want to find the CPU time taken by tasks executed using MapReduce on Hadoop. When I open localhost:50070, under Overview it does not show how much CPU time was taken to execute a particular task. Any help would be appreciated. Thanks in advance.
• asked a question related to Parallel and Distributed Computing
Question
For DG placement, the loss sensitivity factor is LSF = 2 * P_eff,k * R_(k-1,k) / V_k^2. How can I calculate P_eff,k correctly?
For the 15-bus system I get the correct load flow as in the paper
"Particle Swarm Optimization Based Capacitor
but my LSF is not correct. Please, any help!
My results, for example for bus 6: the paper gives LSF = 0.016437, but my work gives 0.0129525.
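The formula itself is easy to check in isolation; here is a sketch with hypothetical per-unit values. Note that P_eff,k is usually the total effective power flowing through the branch, i.e. the sum of all loads downstream of bus k, and using the local bus load instead is a common source of mismatches like this one:

```python
def lsf(p_eff, r_branch, v_bus):
    """Loss Sensitivity Factor of the branch (k-1, k) feeding bus k:
    LSF = 2 * P_eff,k * R_(k-1,k) / V_k^2   (consistent per-unit values).
    p_eff should be the total power fed through the branch (all loads
    beyond bus k), not just the load connected at bus k itself."""
    return 2.0 * p_eff * r_branch / v_bus ** 2

# hypothetical per-unit branch values, not taken from the 15-bus paper
print(lsf(p_eff=0.4, r_branch=0.01, v_bus=0.98))
```

If your load flow matches the paper but the LSF does not, comparing the P_eff,k values branch by branch is the first thing to try.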
I hope these links and papers can help you to solve this problem.
Regards
Md.Julkar Nayeen Mahi
• asked a question related to Parallel and Distributed Computing
Question
Can you suggest how to calculate the expected time to compute a task that requires multiple heterogeneous resources?
For example, if we consider only the CPU requirement of a task, the execution time is the task length in MI divided by the CPU speed in MIPS.
Can you suggest how to handle the consideration of multiple resources like CPU, RAM, bandwidth, etc.?
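One common simplification, offered here as an assumption rather than something established in this thread, is a bottleneck model: estimate a completion time per resource dimension and take the maximum, since the task cannot finish before its slowest requirement is met. The field names and numbers below are hypothetical:

```python
def expected_time(task, node):
    """Bottleneck model: each resource dimension yields a time estimate;
    the expected time is the slowest of them. (A simplifying assumption;
    real systems overlap CPU work with I/O.)"""
    t_cpu = task["length_mi"] / node["mips"]        # seconds of compute
    t_ram = task["ram_mb"] / node["ram_bw_mbps"]    # time to stage memory
    t_net = task["data_mb"] / node["net_bw_mbps"]   # time to move input data
    return max(t_cpu, t_ram, t_net)

task = {"length_mi": 40_000, "ram_mb": 512, "data_mb": 1_000}
node = {"mips": 10_000, "ram_bw_mbps": 2_000, "net_bw_mbps": 100}
print(expected_time(task, node))   # 10.0 -> this task is network-bound
```

The other frequently used approximation is additive (summing the per-resource terms), which is pessimistic when transfers and computation overlap; which model fits depends on your scheduler's assumptions.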
Hi Mohammed Benalla,
We have a set of tasks, and those tasks have different resource requirements (CPU time, RAM, bandwidth, etc.). How do we calculate the expected execution time of a task?
• asked a question related to Parallel and Distributed Computing
Question
Can I run a multi-threaded program and specify how many cores should be used: 1, 2, 4 etc. so as to enable a performance comparison?
Parallel processing
Multi-core systems
Hi Yigal,
You can map the threads to specific cores; refer to https://software.intel.com/en-us/node/522691 for details. I had been looking for this answer too and that article helped me. I hope it works.
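On Linux you can also restrict the whole process (and all its threads) to n cores from inside the program with `os.sched_setaffinity`, or externally with `taskset -c 0-3 ./prog`. A Python sketch (shrinking the allowed set only; the call does not exist on macOS or Windows):

```python
import os

def limit_to_cores(n):
    """Restrict this process (and all its threads) to n of the cores it is
    currently allowed to use. Linux-only: os.sched_setaffinity does not
    exist on macOS or Windows, so we return None there."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = sorted(os.sched_getaffinity(0))    # cores we may run on now
    os.sched_setaffinity(0, set(allowed[:n]))    # pid 0 = calling process
    return len(os.sched_getaffinity(0))

# Shrink the allowed set step by step and rerun the benchmark each time
# (shrinking only: to expand again, pass the original full set back).
for n in (4, 2, 1):
    print(n, "->", limit_to_cores(n))
```

Running the same multi-threaded benchmark under 1, 2 and 4 allowed cores then gives the scaling comparison you describe.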
• asked a question related to Parallel and Distributed Computing
Question
I want to know why most articles in the NoC reliability field assume that faults occur in the NoC router, and not in the resources, links, PEs, or other elements of the NoC.
In my humble opinion, the reason for assuming that faults occur in the router rather than in other elements is that errors are most critical in routing decisions, where a single flipped bit can lead to deadlock, livelock, or increased traffic on an already crowded link.
In addition, if you use error detecting/correcting mechanisms you have two main options:
- Put them in the network interface as an end-to-end error detecting/correcting mechanism
- Or put them in every router as a per-link error detecting/correcting mechanism
Each has its advantages and drawbacks; it depends on the robustness of the application running on top of the NoC.
Errors and faults occur everywhere, but they matter most in the routing process; therefore the common assumption is to detect errors or faults at the routers.
• asked a question related to Parallel and Distributed Computing
Question
Greetings everyone,
I keep looking for software for high-performance cellular automata simulations, but I can't find anything specific. I need one that takes advantage of multi-core processors.
What software do researchers use? Is it a Matlab toolbox or an R library for example..? And can I somehow extract accurate measurements of its performance?
If you are talking about classical synchronous CA, consider NVIDIA CUDA (it is, in effect, massively parallel C).
Nothing else offers such efficiency on a single machine.
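For a feel of what such massively parallel CA codes do, here is a fully vectorized synchronous update (Conway's rules, periodic boundaries) in NumPy: every cell is updated from the same snapshot of the grid, which is exactly the one-thread-per-cell pattern a CUDA kernel would use:

```python
import numpy as np

def life_step(grid):
    """One synchronous CA update (Conway's rules, periodic boundaries).
    All cells read the same snapshot, so the update order is irrelevant:
    the data-parallel pattern a GPU kernel exploits."""
    n = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))                    # 8-neighbour count
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(np.uint8)

g = np.zeros((5, 5), dtype=np.uint8)
g[2, 1:4] = 1                      # a "blinker": period-2 oscillator
print(np.array_equal(life_step(life_step(g)), g))   # True
```

Timing this step for growing grid sizes (and comparing against a plain double-loop version) also gives you the performance measurements you asked about.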
• asked a question related to Parallel and Distributed Computing
Question
Several techniques can be applied in Peer-to-Peer networks to handle the presence of malicious nodes (Reputation Systems, Accountability, Distributed Consensus, etc...).
Which one do you think has the best trade-off between the ability to discriminate malicious nodes from honest ones and the cost of the technique (in terms of the number of messages, for example)? This is, of course, knowing that none of the above systems can fully decide (with 100% accuracy) whether a node is malicious or not.
In summary: for very very large p2p systems (file-sharing), I would go with a reputation system / shared history; for small ones, I would go with accounting, bilateral interactions (private history).
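To make the reputation side concrete, one classic building block is the beta-reputation estimator, which scores a peer from counts of good and bad interactions. A sketch (one ingredient only, not a complete defence; Sybil and whitewashing attacks still need separate handling):

```python
def reputation(good, bad):
    """Beta reputation: expected probability that the peer behaves honestly,
    given `good` and `bad` observed interactions. The +1/+2 (Laplace)
    smoothing gives newcomers a neutral 0.5 rather than an extreme score."""
    return (good + 1) / (good + bad + 2)

print(reputation(0, 0))   # 0.5  -> unknown newcomer
print(reputation(9, 1))   # ~0.83
print(reputation(1, 9))   # ~0.17
```

In a shared-history system these counts are aggregated from many witnesses (which is where most of the message cost goes); in a private-history system each node keeps only its own counters.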
• asked a question related to Parallel and Distributed Computing
Question
I am researching spinlocks used in multi-threaded applications running on modern multi-core processors. Unfortunately, most of the standard benchmarks like SPLASH-2, UTS and PARSEC use mutex locks. I am thinking of replacing the mutex locks in these benchmarks with spinlocks and reporting the results. Will the results still be accepted by the scientific community if I document the modifications made to the benchmarks? Would that be a valid claim? Any pointers in this matter are highly appreciated.
Hello Karthik,
As much as possible (and for obvious reasons), it is best if authors are not the writers of the benchmarks they use in their papers. However, I believe that if you have a good reason for modifying a benchmark, you can definitely do so. I can only speak for HPC, though; I can hardly talk about what happens elsewhere. I have read countless papers where people did exactly that. It is in fact better to modify and document a well-known benchmark than to write your own from scratch. In any case, you wouldn't use an ill-fitted benchmark for your work anyway; and sometimes you're the first to come up with something for which no benchmark exists.
Just consider these few points:
• The modification is clearly justified (e.g. spinlocks are not covered and that is what you are testing)
• Clearly explain what you modified. Localized pseudo code, etc.; and if possible, make the modified sources available somewhere (this will add a lot of credibility)
• If the modification is substantial, you can even publish it; as it could benefit the community as well.
Now, a quick nuance! In some cases you have to use (unmodified) what is established to measure a well-defined metric; but I don't think your example falls in any of those cases. For instance, if you build your own cluster in your garage and want to show off its performance, everybody would expect you to use Linpack unmodified.
• asked a question related to Parallel and Distributed Computing
Question
I need some useful references for parallelizing my FDTD codes. I'm currently working on imaging and scattering analysis of random rough surfaces using numerical methods (FDTD, MoM, etc.). In most 3-D problems (with large dimensions) there are restrictions related to memory and CPU time, so we need to use fast methods or to parallelize our conventional codes.
Please give me some simple, fast and useful references that I can use to parallelize my codes. I'm a beginner in parallel processing and need help.
Go for massively parallel algorithms to get fast solvers. Parallel FDTD is a fast GPU-based FDTD solver; see https://sourceforge.net/directory/?q=fdtd
GPUs and GPGPU computing are well suited to these problems.
• asked a question related to Parallel and Distributed Computing
Question
Some time ago I used the Wolfram Lightweight Grid Manager to set up Mathematica for parallel computing in my local network. All the computers had installed Mathematica which have been configured to use the license server. This environment worked properly.
My question is whether it is possible to configure Mathematica (with a site licence) to work similarly to the setup described above, but without the Wolfram Lightweight Grid Manager and the license server. I am thinking of the following situation: one computer is the master, and the others (working together in a LAN) share their computing kernels.
Thank you for the link to the book. It is very interesting, but for building a computer cluster it prefers the standard solution with the Lightweight Grid Manager.
Best wishes
• asked a question related to Parallel and Distributed Computing
Question
Map-Redude exists in MongoDB, but the documentation says:
For most aggregation operations, the Aggregation Pipeline provides better performance and more coherent interface. However, map-reduce operations provide some flexibility that is not presently available in the aggregation pipeline.
My questions are:
1. Why is map-reduce so effective in Hadoop but discouraged in MongoDB?
2. Is there any example of "flexibility" of map-reduce not available in the aggregation pipeline?
Rafael Caballero
The pitfalls of MongoDB here can be summarised as follows.
Compared to MySQL, we are very limited when it comes to aggregating data with MongoDB. To MongoDB beginners it might seem obvious that using the map-reduce functionality to mimic some of MySQL's aggregating operations is a good idea.
With MongoDB, things are not as intuitive as one might hope; especially if you are coming from a relational-database background, we can only advise you to do the research rather than trust your instincts.
So again, we had to learn the hard way that MongoDB's map-reduce functionality just isn't meant for real-time computing: it is extremely slow, especially when you have a large amount of data in a sharded environment. The main reason for the weak performance of MongoDB's map-reduce is its JavaScript engine, SpiderMonkey.
Comparison and usage of Hadoop and MapReduce -
If you have requirements for processing low-latency real-time data, or are looking for a more encompassing solution (such as replacing your RDBMS or starting an entirely new transactional system), MongoDB may be a good choice. If you are looking for a solution for batch, long-running analytics while still being able to query data as needed then Hadoop is a better option. Depending on the volume and velocity of your data, Hadoop is known to handle larger scale solutions, so that should certainly be taken into account for scalability and expandability.
Either way, both can be excellent options for a scalable solution that processes large amounts of complex data more efficiently than a traditional RDBMS. In fact, it is also possible to use both systems in tandem, which is even recommended by the people at MongoDB. This combination addresses several of the strengths and weaknesses of each platform by delegating real-time data tasks to MongoDB and batch data processing to Hadoop.
I guess this should answer the first part of your question. I will try and post the answer to the second part also, if possible.
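Regarding the first part, it can also help to separate the programming model from the engines: both systems implement the same map / shuffle (group by key) / reduce pattern, and the performance gap comes from how each engine executes it. A language-neutral sketch of the pattern in plain Python, with a made-up orders collection:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """The pattern behind both Hadoop and MongoDB map-reduce:
    map -> shuffle (group values by key) -> reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):    # map phase
            groups[key].append(value)        # shuffle phase
    return {k: reducer(k, vs) for k, vs in groups.items()}   # reduce phase

orders = [{"cust": "a", "total": 10},
          {"cust": "b", "total": 5},
          {"cust": "a", "total": 7}]
result = map_reduce(orders,
                    mapper=lambda o: [(o["cust"], o["total"])],
                    reducer=lambda k, vs: sum(vs))
print(result)   # {'a': 17, 'b': 5}
```

The mapper may emit several key/value pairs per record and the reducer may return arbitrary structures, which is the "flexibility" an aggregation pipeline of fixed stages cannot always express.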
• asked a question related to Parallel and Distributed Computing
Question
I'm currently working on a Linux version. I would like to print meshes without the data calculated in the FMESH card. The purpose is to verify the mesh (with the simulated geometry in the background).
Thanks.
• asked a question related to Parallel and Distributed Computing
Question
In CUDA C, I have found how to make a table, but I am not getting how I can do GROUP BY queries on that table, or which functions define them.
I don't fully understand your question, but:
1) the PDF file is from 2009! Those were the ancient times of GPGPU. Maybe this is more up to date: https://wiki.postgresql.org/wiki/PGStrom
2) you should decide on which level you will write your programs: at the CUDA level or at SQL level?
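On GPUs, GROUP BY is typically implemented as a parallel sort by key followed by a segmented reduction over runs of equal keys (this is what Thrust's `sort_by_key` and `reduce_by_key` provide in CUDA C++). A CPU sketch of that idea, just to show the structure:

```python
from itertools import groupby

# GROUP BY key, SUM(value) over (key, value) rows, GPU-style:
rows = [(3, 10.0), (1, 5.0), (3, 2.5), (2, 1.0), (1, 4.0)]

rows.sort(key=lambda r: r[0])                 # step 1: (parallel) sort by key
grouped = {k: sum(v for _, v in run)          # step 2: segmented reduction
           for k, run in groupby(rows, key=lambda r: r[0])}
print(grouped)   # {1: 9.0, 2: 1.0, 3: 12.5}
```

Both steps map well onto the GPU: sorting is massively parallel, and once equal keys are adjacent, each group can be reduced independently.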
• asked a question related to Parallel and Distributed Computing
Question
One of the reasons for slow performance in FEM analysis is the construction of the elastic stiffness matrix with loops.
There is a large for-loop over the number of elements (NE). This quickly adds significant overhead when NE is large, since each line in the loop is interpreted in every iteration. Vectorization should be applied for the sake of efficiency.
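As a toy illustration of the vectorization idea (a 1-D scatter-add rather than a real stiffness assembly; the numbers are made up):

```python
import numpy as np

# Toy 1-D "assembly": scatter per-element contributions into a global array.
n_nodes = 5
conn = np.array([[0, 1], [1, 2], [2, 3], [3, 4]])   # element connectivity
ke = np.array([1.0, -1.0])                          # per-element contribution

# Loop version: interpreted once per element, slow for large NE.
K_loop = np.zeros(n_nodes)
for e in conn:
    K_loop[e] += ke

# Vectorized version: one scattered accumulation, no per-element loop.
# np.add.at accumulates correctly even when indices repeat across elements.
K_vec = np.zeros(n_nodes)
np.add.at(K_vec, conn.ravel(), np.tile(ke, len(conn)))

print(np.allclose(K_loop, K_vec))   # True
```

In interpreted environments like MATLAB, the analogous trick is to build the index/value triplets for all elements at once and call sparse() a single time, rather than inserting entries inside the element loop.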
Dear Ali
There are some files as well as a link, as attachment. It is hoped that these sources would be useful.
best regards
Panji
• asked a question related to Parallel and Distributed Computing
Question
I have installed a Hadoop 2.6.0 cluster using one NameNode (NN) and 3 DataNodes (DN). Two DNs are on physical machines running Ubuntu, while the 3rd DN is a virtual node running Ubuntu on Windows Server. The two physical DNs have 2 GB RAM and an Intel Xeon 2.0 GHz processor x 4 (i.e. 4 cores). The 3rd DN is assigned 4 GB RAM and 1 processor x 4 (i.e. 4 cores), and it runs on a Windows Server host which has 32 GB RAM and 2 Intel Xeon 2.30 GHz processors x 6 (i.e. 12 cores).
Mappers running on Physical DNs are faster than Mappers running on Virtual DN. Why ?
Hi
VIDHYASAGAR
That depends on the virtualization you build. If you use direct disk access you get good performance, but your VM is not movable. If you use a virtual disk you can move the whole virtual machine relatively quickly, but you pay for the virtual-disk management and for copy-on-write (COW) if you use snapshots.
This is no dream: IBM invented the virtual machine in 1970, and everyone knows the cost.
So it is a balance between ease of management and performance cost. You can't have both.
Sorry!
Pierre Léonard
• asked a question related to Parallel and Distributed Computing
Question
The original version of the Intel Advanced Vector Extensions Programming Reference, from March 2008 (Document no. 319433-002), is quite often referenced in HPC papers that were early adopters of the AVX SIMD extensions. However, the original link for this document no longer exists.
• asked a question related to Parallel and Distributed Computing
Question
I am trying to use a pulse program in Topspin 2.3, but it is not working. The pulse program works in an earlier version of Topspin (probably version 1.1), but when I try to use it in Topspin 2.3 it shows a compilation error. The error is mainly for the line '#include <trigg.incl>' and the 'trigg' command in the body of the pulse program.
When I deactivate these two lines (by putting a semicolon in front), the program shows no compilation error, so it looks like it will work.
Now I am concerned that, is it OK to use the pulse program without 'trigg' command or trigg include file. Is there any chance of damaging the probe? What does the 'trigg' command do?
[I noticed that the other pulse programs came with the Topspin 2.3 do not have that 'trigg.incl' file or trigg command.]
Can anybody help me?
The trigg command is only there to provide a trigger for an oscilloscope; it was used to view the pulse sequence on a scope. It is perfectly safe to run the program with those lines commented out.
Clemens
• asked a question related to Parallel and Distributed Computing
Question
Try to look at the log files to characterize your job - for instance, in which part of the job execution does the YARN/MR2 run take longer - map phase? reduce phase? You can also check this article: http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas
• asked a question related to Parallel and Distributed Computing
Question
Hi all,
I am currently a PhD student. I am trying to post-process output databases (ODBs) with Abaqus Python scripting, but the script uses only 22.5% of memory and 1 CPU. I want to use the full memory and several CPUs (8, for example).
Thanks for your help,
Best regards,
Anh-Tuan DAU
Anh-Tuan,
There is an entire chapter dedicated to the C++ API in the ABAQUS user manual. It has a number of examples on how to read and write to an odb file.
However, if you do not know C++ you might want to stick with Python. It is easier to learn and if you get stuck you can always have a look at the code in the .rpy file.
Michael
• asked a question related to Parallel and Distributed Computing
Question
The University where I work is planning to move some of its workload to public cloud environments. I am not sure if it is the right decision, because we have very low bandwidth. Next year, the bandwidth could be increased to 1 Gbps, which would provide access for about 10,000 people. I wish to write a [well supported] document with pros and cons of such a decision.
No, I am not familiar with models or simulators.  I would strongly advocate collecting real, live data from your existing systems.  Linux systems, in particular, are very good about making system-behavior statistics available, for both bare-metal and virtualized systems.  I've never found Windows systems to be other than opaque, but thankfully also never had to work with them.  It's also often possible to get a very good idea of what's going on by just looking at raw device counters (such as network stats available via SNMP).
• asked a question related to Parallel and Distributed Computing
Question
My experiment was conducted on different sets of computers with varying core counts and varying CPU speeds. My analysis is based on the impact of cores, but the machines do not all have the same CPU speed.
A method is required to make all the results from the various machines uniform to remove the effect of CPU speed.
Is there any known method of normalizing/standardizing results of execution on varying computers with varying CPU speed?
I agree with Anthony that comparing cycles is the best you can do with the data you have. Take the CPU with the lowest core count as your baseline. Then compute the number of cycles divided by the number of cores for each experiment. Now, you can compute the speed-up by taking the ratio of each experiment and your baseline experiment. This resolves the problem of different clock frequencies.
Nevertheless, you might not get very meaningful data by using this method. Your different CPUs would need to have the same internal architecture and just different clock speeds. I expect that you took different CPUs for your tests. For different CPU models the number of execution units and the time it takes to execute an integer or floating point instruction may vary. Also, CPUs internally use pipelining. Different lengths of the pipelines will also have an impact on performance. See, CPUs usually spit out more than one instruction per clock cycle. Some will have about 2 instruction per clock cycle, others more, and different ones less. So, it is not just the clock frequency which will influence results. Still another thing is that for multicore processors several cores will share L2 and L3 caches. A quad-core system might have two L2 caches shared by two cores each and one L3 cache shared by all four cores. A 6-core system might share the L2 cache between 3 cores. And a different model again between just 2 cores each. Different sets of caches will have different impact on performance depending on your memory access patterns. All these things will make your results very questionable.
Because of all this the correct answer to your question is that there is no known method to truly unify the results.
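The cycles-per-core normalization described above is simple arithmetic; a sketch with made-up runtimes and clock rates (the numbers below are illustrative, not measurements):

```python
def cycles_per_core(runtime_s, clock_hz, cores):
    """Total clock cycles consumed by the run, divided by the core count."""
    return runtime_s * clock_hz / cores

def speedup(baseline, experiment):
    """Ratio of baseline cycles-per-core to the experiment's."""
    return baseline / experiment

# Hypothetical measurements of the same workload on two machines:
base = cycles_per_core(runtime_s=100.0, clock_hz=2.0e9, cores=2)  # baseline
expt = cycles_per_core(runtime_s=30.0,  clock_hz=2.5e9, cores=8)  # 8-core run
print(speedup(base, expt))  # > 1 means the 8-core run used fewer cycles/core
```

As the answer warns, this removes the clock-frequency effect but not differences in microarchitecture, pipelines, or cache sharing.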
• asked a question related to Parallel and Distributed Computing
Question
I am using Hadoop-2.6.0 and Eclipse Kepler/Juno. I have executed my MapReduce job both ways, and I found that running the jar file in the terminal takes more time than direct execution from the Eclipse IDE. I cannot understand the reason behind it. Please help; I cannot decide which way I should follow.
1. I don't believe running from the IDE is faster than running the jar from the command line for serious Hadoop applications with considerable data sizes (not toy examples);
2. If there may be exceptions in your runs, your comparisons no longer make sense, i.e. your so-called 'fast/slow' is meaningless!
• asked a question related to Parallel and Distributed Computing
Question
Can we simulate every topic in the field of "Service Discovery" or "Resource Discovery" using CloudSim? Or are other tools necessary in addition to CloudSim?
Or does another simulator exist for this purpose?
While you wait for better answers (among other things, I have not played with CloudSim) there is a network emulator (EmulNet) provided in the Cloud Computing Concepts MOOC from The University of Illinois that is good enough to implement and test membership protocols. Therefore I am sure it can be used to test discovery and other distributed algorithms as well.
• asked a question related to Parallel and Distributed Computing
Question
I am using hadoop-2.6.0 and Eclipse Kepler on Ubuntu 14.04. I am confused about Hadoop's library files, as there are so many jar files in contrast to hadoop-1.x. How many jar files should be added to a MapReduce project? For example, what should be added for a word-count program?
Thanks once again. I followed your steps and video, but how do I install Maven in Eclipse in order to create a Maven project? I tried "Install New Software" under Help in the Eclipse IDE and added http://download.eclipse.org/technology/m2e/releases, but it did not work. Please help.
• asked a question related to Parallel and Distributed Computing
Question
I would like to find a mathematical foundation of communication complexity.
• asked a question related to Parallel and Distributed Computing
Question
I am studying the performance of spinlocks used in parallel programs running on modern multi-core processors. Though spinlocks are mostly used in multithreaded OS kernels, I see that in the application space mutex locks are more widely used than spinlocks. Hence I would like to check whether some applications do use spinlocks.
I would also like to know if there are any parallel computing benchmarks that use spinlocks. I found that the SPLASH-2 benchmark uses mutex locks.
OpenMPI in its default implementation can tend to spin, or busy wait.
Overall, what's happened in recent years is that the low-contention case for mutexes has been heavily optimized in most environments. Back in the day, some implementations were exceedingly naive and always caused a system call transition, just to conclude that the resource was free. I would only say that a spinlock could make sense these days if you indeed expect to wait, but for a really short time. That's really quite rare. If you expect to wait for a short time, generally the resource is also experiencing low contention, so the most likely case is that you won't need to wait at all and the mutex is just fine. When you are reaching the tipping point to high contention, you might need a complete rearchitecture of the solution or a fair queuing system, rather than putting your threads in eternal spinlock races on who will get the resource.
One main reason to spinlock in a kernel is when it would not make sense for your thread to sleep due to the state you are in (servicing an interrupt etc). That will simply not happen in userspace.
In any case when you do use a spinlock, you should use one provided by a library rather than trying to roll your own. Waiting in the right way for being friendly to Hyperthreading CPUs etc is not trivial.
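Purely for illustration (as the answer notes, use a library-provided lock in real code), a spinlock is just a busy-wait on an atomic test-and-set. A toy Python sketch, using a non-blocking acquire as the atomic flag (the class and workload below are assumptions for demonstration):

```python
import threading

class SpinLock:
    """Toy spinlock: busy-waits instead of sleeping. Illustrative only;
    a real implementation needs CPU pause hints and backoff."""
    def __init__(self):
        self._flag = threading.Lock()   # used purely as an atomic flag
    def acquire(self):
        while not self._flag.acquire(blocking=False):
            pass                        # spin: burn CPU until the flag frees
    def release(self):
        self._flag.release()

counter = 0
lock = SpinLock()

def work():
    global counter
    for _ in range(10_000):
        lock.acquire()
        counter += 1                    # critical section
        lock.release()

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000: the lock kept the increments atomic
```

Under contention every spinning thread wastes a core, which is exactly why the answer recommends mutexes for all but very short expected waits.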
• asked a question related to Parallel and Distributed Computing
Question
In the driver class I have added Mapper, Combiner and Reducer classes, executing on Hadoop 1.2.1, and I have set NumReduceTasks to 2. The output of my MapReduce code is generated in a single file, part-r-00000. I need to generate output in more than one file. Do I need to add my own Partitioner class?
HashPartitioner is the default partitioner in Hadoop. You may use another partitioner provided by Hadoop, or create a custom partitioner class as per your requirements.
The following link may be useful to you. Kindly go through all the content, because it will answer almost all your questions related to Hadoop MapReduce.
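For reference, Hadoop's default HashPartitioner (a Java class) boils down to hashing the key modulo the number of reduce tasks. The logic can be mimicked in a short Python sketch, with Python's hash standing in for Java's key.hashCode() (function name and keys are illustrative):

```python
def hash_partition(key, num_reduce_tasks):
    """Mimic Hadoop's HashPartitioner: (hash & Integer.MAX_VALUE) % tasks.
    The mask keeps the value non-negative before the modulo."""
    return (hash(key) & 0x7FFFFFFF) % num_reduce_tasks

# With 2 reduce tasks, every key lands in partition 0 or 1, producing
# two output files: part-r-00000 and part-r-00001.
parts = {hash_partition(k, 2) for k in ["apple", "banana", "cherry", "date"]}
print(parts <= {0, 1})  # True
```

A custom Partitioner only needs to implement this key-to-partition mapping differently, e.g. by key range instead of by hash.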
• asked a question related to Parallel and Distributed Computing
Question
If you plan to build the binary yourself, just download hadoop-2.6.0-src.tar.gz, extract, and build. If you use Windows and need executable files, download hadoop-2.6.0.tar.gz and extract it; it should contain binaries which can run on Windows. The *.mds files are checksums.
• asked a question related to Parallel and Distributed Computing
Question
I've designed a novel static interconnection topology for parallel and distributed systems which shows considerable advantages over existing static interconnection topologies in terms of its physical properties. Now I must support my novel topology with a practical comparison. How can I choose a practical problem and categorize the results?
This is a good question.
One way to make the comparison between a new topology and any other topology is to do the following:
1.   Let your new topology be a set $\tau$ of points in some space $X$ that satisfy the union and intersection conditions for a topological space and guarantee that $\tau$ includes $X$ and the empty set.
2.   Let $Y$ be  another topology.
3.   Endow $\tau$ and $Y$ with a proximity relation $\delta$ such as the Wallman proximity.   Recall that the Wallman proximity satisfies the 4 Cech conditions for a proximity and the closure rule holds true.   Let $\mbox{cl}A$ denote the closure of $A$.  That is, for example, let $A,B\subset \tau$ and guarantee that $\delta$ satisfies the following condition:
\[
A \delta B\ \mbox{if and only if}\ \mbox{cl}A \cap  \mbox{cl}B \neq \emptyset.
\]
4.   Define a mapping $f$ on  $\tau$  into $Y$ such  that $f(A)\delta f(B)$, provided $A \delta B$.
This framework will make it possible to compare topologies.
• asked a question related to Parallel and Distributed Computing
Question
How can we add waiting time as a parameter in a priority-based scheduling algorithm? Adding this parameter should produce a more efficient scheduling list. The parameters are:
Waiting time,
Turnaround time,
Execution time.
To clarify: my work is to design efficient techniques for load balancing in cloud computing.
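One standard way to fold waiting time into priority scheduling is aging: a task's effective priority improves the longer it waits, which prevents starvation. A hypothetical sketch (the aging factor, task names, and the lower-value-runs-first convention are all assumptions):

```python
import heapq

def schedule_with_aging(tasks, aging_factor=0.1):
    """tasks: list of (name, base_priority, waiting_time); lower value runs
    first. Effective priority = base_priority - aging_factor * waiting_time,
    so long-waiting tasks drift toward the front of the queue."""
    heap = [(base - aging_factor * wait, name) for name, base, wait in tasks]
    heapq.heapify(heap)
    return [name for _, name in (heapq.heappop(heap) for _ in range(len(heap)))]

order = schedule_with_aging([
    ("low-priority-but-old", 5, 40),   # effective 5 - 4.0 = 1.0
    ("high-priority-fresh", 2, 0),     # effective 2 - 0.0 = 2.0
    ("mid", 3, 5),                     # effective 3 - 0.5 = 2.5
])
print(order)  # ['low-priority-but-old', 'high-priority-fresh', 'mid']
```

The aging factor trades responsiveness against strict priority order; tuning it is part of making the scheduling list "efficient" for a given load-balancing goal.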
• asked a question related to Parallel and Distributed Computing
Question
.
Dear Sajal, I suggest you read about the BiCGStab or BiCGStab(2) methods. I have some papers that describe such methods. Feel free to contact me anytime if you wish to clarify any doubt.
Best wishes and good luck with your research!
• asked a question related to Parallel and Distributed Computing
Question
Hi! Is there a way that I can access a Hadoop Map/Reduce cluster available for researchers for free to run some experiments?
Hello,
I think you can try and apply for an account on Grid5000 research platform https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home , after that you could set up Hadoop cluster yourself as described on the link below: https://www.grid5000.fr/mediawiki/index.php/Run_Hadoop_On_Grid%275000 .
Regards,
Khalid
• asked a question related to Parallel and Distributed Computing
Question
Any literature, systematic reviews, or surveys of resource allocation in cloud computing?
• asked a question related to Parallel and Distributed Computing
Question
I need to study the Intel Broadwell processor and need relevant data on the architecture of this processor. Can anybody help?
A simple google search on 'broadwell processor specs' gives many references.  You could start with the wikipedia entry, or just go to the Intel website.
• asked a question related to Parallel and Distributed Computing
Question
MARE is a programming model and a run-time system that provides simple yet powerful abstractions for parallel, power-efficient software:
– A simple C++ API allows developers to express concurrency
– A user-level library that runs on any Android device, and on Linux, Mac OS X, and Windows platforms
Hi there,
I'm working on a similar library. Mine is called HPX. It provides a C++11/14-compliant interface for task-based programming (async, future, etc.). We extended the standard in a straightforward manner to also support distributed-memory parallelism through the same interfaces. In addition we added some more API functions to make everything slightly more composable. Check it out at http://github.com/STE||AR-GROUP/hpx
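HPX itself is C++, but the async/future programming model it builds on can be illustrated with Python's standard-library analogue (this only sketches the model, not HPX's API):

```python
from concurrent.futures import ThreadPoolExecutor

def fib(n):
    """A small CPU-bound task to launch asynchronously."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

with ThreadPoolExecutor() as pool:
    # 'async' launches the work and immediately returns a future ...
    fut = pool.submit(fib, 20)
    # ... which the caller composes with or blocks on later via .result()
    print(fut.result())  # 6765
```

In HPX the same pattern (hpx::async returning a future) also works across distributed memory, which is the extension the answer describes.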
• asked a question related to Parallel and Distributed Computing
Question
I'm not sure about how the throughput in data center networks can be calculated.
Could you please give me the exact formula that calculates the throughput of a data center network?
The bisection bandwidth provides an upper bound on throughput for random traffic, in which (by being random) half of the traffic is sent to the other side. In most cases, throughput under random traffic will be lower, as the calculation assumes contention won't prevent packets from reaching the bisection.
On the other hand, if the traffic is local, throughput can be much higher than the bisection bandwidth limit, as most traffic will use only one local link, so the volume of data sent/received per unit of time will be high.
A very rough calculation will use the average distance travelled and the network capacity to calculate an upper bound for a given traffic load.
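The rough bound above can be written down explicitly: under uniform random traffic, half of all injected traffic must cross the bisection, so N/2 nodes' worth of injection cannot exceed the bisection bandwidth B, capping per-node throughput at 2*B/N. A small sketch with made-up numbers:

```python
def random_traffic_bound(bisection_bw_gbps, num_nodes):
    """Upper bound on per-node injection rate under uniform random traffic:
    half the traffic crosses the bisection, so (N/2) * rate <= B,
    hence rate <= 2*B/N."""
    return 2.0 * bisection_bw_gbps / num_nodes

# E.g. a hypothetical 1024-node network with 5120 Gbit/s bisection bandwidth:
print(random_traffic_bound(5120, 1024))  # 10.0 Gbit/s per node, at best
```

As the answer stresses, this is only an upper bound; contention usually keeps real random-traffic throughput below it, while local traffic can exceed it.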
• asked a question related to Parallel and Distributed Computing
Question
Is there any faster way than the direct way of computing Smith Normal Form of polynomial Matrices?
Hello, Cihan! I suggest you to take a look at these papers:
Best wishes and good luck with your research!
• asked a question related to Parallel and Distributed Computing
Question
I need to analyse their performance in terms of fault tolerance, energy consumed etc.
Data center (DC) design has become increasingly significant with the swift progress of cloud computing and online services. There are many simulation tools for DCs. We simulated a DC using ns2, but I felt it was a really cumbersome process, so I would not recommend ns2. OMNET++ is a reasonable choice, but not the best. In addition, discrete-event queuing simulation can be used to generate task-arrival events through random draws, depending on the nature of your proposed DC work.
To understand the idea of DC simulation, you should read the following helping material.
Honestly speaking, if I got another chance to simulate features of a DC, I would go with "CoolSim", which is a better choice for airflow, design and modeling. It also resolves the issues of traditional stand-alone modeling solutions.
Sriram Sankar, Aman Kansal, and Jie Liu, "Towards a Holistic Data Center Simulator",18 September 2013.
• asked a question related to Parallel and Distributed Computing
Question
I am using Ubuntu in VMware on 4 nodes running Windows 7. I have configured each Ubuntu VM with 1 GB RAM, 1 processor and 20 GB of disk. If I had used these configurations directly on physical nodes instead of VMware, would the performance have been better?
David - your assumption is incorrect. what a Virtual Machine allows you to do is run multiple "computers" simultaneously on the same hardware. From your description, it sounds like you could *replace* all the other computers around the house by running VMs on your main 24 core server. You could easily allocate 3 cores for one Virtual Machine, 1 core for another, and so on - the only limitation would be memory and the performance you are willing to tolerate either on the main system or inside the Virtual Machines.
Any other kind of sharing amongst disparate machines would be done via network protocols (CIFS or NFS for filespace). There is a commercial product, ScaleMP, that supposedly allows you to aggregate memory from several different machines into one larger configuration, but again - that is not what you are describing.
• asked a question related to Parallel and Distributed Computing
Question
If yes, how is it different from MapReduce? What else can be used?
Certainly, it can be. There exist bindings for MPI, here: http://cran.r-project.org/web/packages/pbdMPI/ and it seems to scale rather well with problem size (from personal experience, admittedly single-node only). Perhaps you could be more specific about what type of analysis is required - R already has the basic statistics field covered, but for specific needs you may need to do some programming.
• asked a question related to Parallel and Distributed Computing
Question
I am wondering if anybody has run calculations with NWchem on Mira (IBM's Blue Gene/Q System) at ALCF. Please if you have or have done some performance studies and have collected some data, could you share any suggestions or ideas. I am particularly interested in the DFT, TDDFTs, and the Many body correlated suites (CCSD,EOM-CC, MRCC etc).
Jerry thanks. I made the contact and should be getting some data
• asked a question related to Parallel and Distributed Computing
Question
We are looking for thread-safe container classes in C/C++; please let us know if you know of any. The library should support multi-threaded read-write vectors, lists and similar containers.
Lock-free, wait-free implementations are also welcome.
Requirements:
1. Open-sourced and commercial friendly license. (We are aware of TBB, but its free version is GPL, hence not suitable)
2. C++11 ready if possible (GCC 4.8, MSVC 2013)
You may have a look at this one - http://libcds.sourceforge.net/
• asked a question related to Parallel and Distributed Computing
Question
A multi-writer shared storage is the one that allows two writers to modify the same storage object concurrently without imposing any locking. Usually in cases of write collision one of the write operations is visible and the other is hidden, meaning that no read operation will later observe this write. I just wonder if such behavior is useful for any existing applications. For example if it was that the shared storage would implement a file system then would it be OK if one writer would overwrite the changes of a concurrent writer on a single file?
@Peter: I think that the way this question is asked is not so much about consistency in the DB. Though consistency is certainly a problem with this approach.
I can't think of any situation where this kind of multi-writer approach is useful. Suppose the current state is A, writer1 wants to write B, and writer2 wants to write C. In many cases writers assume that the current state is what they have written. Sometimes this cannot be assumed, so each writer reads back the current state directly after writing. Suppose that writer1 wins the battle and writer2's C will never be visible (i.e. it is hidden, just as you explained). Now writer2 wants to read back the current state, but writer1 has not finished writing. Which state does writer2 get: A, B or C? (B only works if writer2 waits until writer1 is finished, so you need synchronization again.) Since C will be hidden it can only be A or B. We can't return A, because writer2 knows it is older than C, and B is not available yet. What do you do now?
In another scenario, maybe when writing to a log file where there is no reader during execution of the application, I don't see why we would throw away any writes. We could use one thread that handles a queue and both writers can enqueue their write commands almost without any delay (lock-free data structures come to mind).
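The queue-plus-single-writer idea in the last paragraph can be sketched as follows (names are illustrative): producers enqueue write commands without blocking each other, and one consumer thread applies them in arrival order, so no write is hidden:

```python
import queue
import threading

log, writes = [], queue.Queue()

def consumer():
    """The single writer: drains the queue and applies every command."""
    while True:
        item = writes.get()
        if item is None:        # sentinel: shut down
            break
        log.append(item)        # no write is ever lost or hidden

t = threading.Thread(target=consumer)
t.start()
# Commands from two writers are enqueued; neither one is thrown away.
for w in ("B-from-writer1", "C-from-writer2"):
    writes.put(w)
writes.put(None)
t.join()
print(sorted(log))  # both writes survive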
• asked a question related to Parallel and Distributed Computing
Question
I'm asking those who have experiences with MPI working on various environments.
It can be used on simple Beowulf clusters, consisting of four PCs connected to a single switch (or even hub), where all info can be broadcasted (or multicasted) to all hosts. It can also be used on large supercomputers of complicated topology, with large diameters.
It seems various cases need different program optimizations, don't they?
In particular - are the collective operations efficient on supercomputers, like BlueGene? AFAIK, modern algorithms (like SUMMA) often use MPI_Reduce, at least...
What are your experiences? How to design programs for MPI?
MPI does have some topology routines that you can use to create MPI communicators that reflect the machine topology; I'm not sure how much they are used.
I do know that Cray has made a great effort to optimize their MPI collective routines. I assume that IBM and other major vendors have also, but I do not have direct experience.
Our group has reported performance problems to vendors in the form: 1) we see this performance in your library routine, but 2) we can obtain better performance using an alternate version of the collective routine written in terms of point-to-point routines.
One way to get good performance for collectives involving large messages among large numbers of processors is to pipeline the process and use the fact that communication connections are generally bi-directional. E.g.:
To broadcast a large array of size NxN from process 0 to all others on p processors:
1) Send the first row block (or column block for Fortran) from 0 to 1.
2) Send the 2nd block from 0 to 1.
3) Simultaneously on proc 1, send the 1st block and recv the 2nd block with sendrecv.
4) Keep adding blocks to the pipeline on proc 0 until all blocks have been sent.
5) Keep adding new processes to the pipeline until it is full.
This makes a broadcast which is O(1) in terms of the number of processors, assuming the message size is large enough that the p communication latencies are small compared to the time to send a block. Communication between proc k and k+1 is the same for all processes, and all adjacent processes can communicate in parallel (a communication ring would suffice). Latency goes up due to the pipelining, so this is not appropriate for small messages.
This can all be coded in an MPI_Bcast_alternate subroutine/function, which you can compare with your current MPI_Bcast.
I did discover this on my own, but later found the result already published for the Intel Delta machine.
A similar technique can be applied to all collective routines which send large amounts of data.
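The payoff of the pipelined scheme can be checked with a back-of-the-envelope timing model (plain arithmetic; the processor count, message time, and block count are made-up numbers, and per-message latency is ignored): the last of k blocks reaches the last of p processors after (p - 1 + k - 1) block-times, versus p - 1 full-message times for a naive hop-by-hop relay.

```python
def naive_relay_time(p, msg_time):
    """Relay the whole message hop by hop: p-1 sequential full sends."""
    return (p - 1) * msg_time

def pipelined_time(p, msg_time, num_blocks):
    """Split into num_blocks blocks; the pipeline fills, then streams.
    Last block arrives after (p - 1 + num_blocks - 1) block-times."""
    block_time = msg_time / num_blocks
    return (p - 1 + num_blocks - 1) * block_time

p, msg_time = 64, 1.0
print(naive_relay_time(p, msg_time))      # 63.0
print(pipelined_time(p, msg_time, 1000))  # ~1.06: nearly independent of p
```

With many blocks the pipelined time approaches one full message time regardless of p, which is the O(1) behaviour claimed above; with few blocks (small messages) the p-1 fill latency dominates, matching the caveat.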
• asked a question related to Parallel and Distributed Computing
Question
While running mpiblast on a Rocks cluster, this error appeared on node 10.1.254.236.
• asked a question related to Parallel and Distributed Computing
Question
As a part of the course on Computer Architecture, how can students be provided a flavor of multicore/manycore architecture through some real hands on experience (may be through programming/using some specific tool/using some pre-designed boards or arrangements etc.)?
Is the purpose to train students in designing the hardware, or rather to understand the hardware in order to best use it? The latter is of course a much larger group.
For the latter group, programming would be the way they interact with this. One way would be to expose them to OpenMP in Fortran, C or C++. OpenMP is directive-based, so it is a fairly simple extension to the language.
The big difference with multicore is that shared variables can be accessed simultaneously, so you get race conditions when updating variables, and you can get performance problems from false sharing. Both of these feed back into the architecture.
At our university, the suggested computer for incoming freshmen is an Intel i5, which is a quad-core processor; even an i3 would be dual-core.
If you need more, you could get a computer with an Intel Knights Corner card, which has 60 cores, for about $1500 a card, though you need to have this designed in so that there is sufficient power for the card. This card runs OpenMP programs and even runs its own OS, so you could even think of it as an attached computer.
• asked a question related to Parallel and Distributed Computing
Question
First I tried to run mpiblast on a Rocks cluster, but it always ends up with errors. Can anyone suggest what I should do?
I waited 2 hours for a result, but there was no outcome.
• asked a question related to Parallel and Distributed Computing
Question
Links to papers, or ppts for computer science courses would be appreciated.
There are many schedulers used in major computing resources. Some of the big-name ones are SGE (Sun Grid Engine, now Oracle) and LSF (by IBM). I am also a coauthor on a paper describing a large, thousand-job scheduling system.
The type of scheduler used for any computing system depends on the specific architecture. LSF, for example, works great in SMP-type environments with many CPUs and fewer node machines. Grid Engine works better with thousands of nodes.
Resources:
Personal adaptive clusters as containers for scientific jobs.
Edward Walker, Jeffrey P. Gardner, Vladimir Litvin, Evan L. Turner
Cluster Computing (Impact Factor: 0.78). 01/2007; 10:339-350. DOI:10.1007/s10586-007-0028-5
Source: DBLP
• asked a question related to Parallel and Distributed Computing
Question
I just need few examples of applicable algorithms because I am looking for a new research areas.
Almost any algorithm in computational geometry gives incorrect results for certain inputs due to lack of precision. See Chapter 1 of the textbook by de Berg, Cheong, van Kreveld, and Overmars for an introductory discussion of this.
• asked a question related to Parallel and Distributed Computing
Question
Especially for ZFS and other filesystems, Including SmartOS?
Here's one article my students wrote with me. Maybe it will be helpful.
• asked a question related to Parallel and Distributed Computing
Question
There are many simulators (eg: DCNSim, NS3) to simulate the architecture of data center networks. Which one is more powerful and provide good flexibility?
I would suggest NS3 as the more powerful simulator for data centre architectures. On the other hand, it is not that easy to work with, and one should have programming skills. DCNSim has GUI support, which is a plus point, but in my recommendation NS3 is much better. In the end the choice is yours, because you know the scenario and your test requirements.
• asked a question related to Parallel and Distributed Computing
Question
In an Intel processor, does each core have its own independent L1 and L2 cache? The Intel architecture diagram says "Shared L3" cache. Does this mean that memory allocation in this L3 cache is at the mercy of the kernel? What happens if a process on core 1 demands more memory? Will this allocate more memory in the L3 cache (once it runs out of, or misses, memory references in L1 and L2)? Will this hamper the performance of processes running on core 2 or core 3?
Is there any way to guarantee the performance of a process on a core when other processes are running at the same time on different cores? I am trying to avoid any kind of context switching for a process running on a single core.
Well, the answer depends on the architecture of the Intel processor.
If we consider a modern Intel processor with 4 cores (quad-core) and hyper-threading (not within the scope of this discussion), then each core has its own L1 and L2 caches, but the L3 cache is shared among the 4 cores. L1 sits closest to each core, L2 a little further away, and L3 further still; hence the access time increases with the level of the cache (L3 > L2 > L1). For example, if the latency between L1 and the core is 1, the latency to L2 may be 4 and to L3 around 20, whereas the latency between main memory and the processor is around 50 or more (these numbers are only illustrative; the real values depend on the processor's architecture, but the concept is the same).
Since the L1 and L2 caches are on-chip, they cannot accommodate large memories because of the fixed die size of each core. The closer the cache is to the core, the smaller it is (L1 < L2 < L3), and cache size grows with distance from the core. So although the L1 cache is small, its access speed from the core is very high.
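The latency numbers above (1, 4, 20 and 50 cycles, all illustrative) can be combined into the standard average memory access time (AMAT) formula, which shows why the hierarchy pays off; the miss rates below are assumptions for the example:

```python
def amat(l1_hit, l1_miss, l2_hit, l2_miss, l3_hit, l3_miss, mem):
    """Average memory access time: each miss falls through to the next
    level, so penalties nest multiplicatively."""
    return l1_hit + l1_miss * (l2_hit + l2_miss * (l3_hit + l3_miss * mem))

# Hit latencies in cycles (1, 4, 20, memory 50) with assumed miss rates:
print(amat(1, 0.10, 4, 0.50, 20, 0.20, 50))  # 2.9 cycles on average
```

Even with a 10% L1 miss rate, the average access stays under 3 cycles instead of the 50-cycle trip to main memory, which is the whole point of the cache hierarchy.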
Now if we think in term of one core in a multi-core processor system, then first the more frequently accessed data will be kept in L1 cache but if the size is not sufficient enough the it will be kept in the L2 cache and even if it is not sufficient enough then it will be kept on L3 cache. If the data is still large enough to be cached then part of the data which is most frequently computed or used is kept to the closest cache level possible so that the core have fast access to that part of the data. But this actually again depends on the associativity of the cache (2-way associative or 4-way associative) and how it is implemented. [Answer to one of your question]
Now, since the L3 cache is shared: when a process is assigned to the processor and computed across multiple cores, the data belonging to that process which all the cores access frequently is cached in L3. Then, depending on what each thread of the main process touches, the relevant data from L3 is cached in the respective core's L1 or L2. Done properly, this should not hamper the performance of the other cores.
N.B. The above describes the case where the process is computed across multiple cores with a locking mechanism in place, so that no deadlock arises while the process is being computed on the multi-core architecture.
Hence, the true answer to your question depends on the program itself and on how it is implemented on the multi-core system. For this reason it is very important to understand how locking (fine-grained or coarse-grained) is implemented in order to prevent deadlock, and it also helps to understand how non-blocking algorithms or transactional memory work on multi-core systems. Depending on these choices, the cache can be used effectively to get the desired performance out of the program.
But these are all part of a big subject in computer science called parallel and multi-core computing (or high-performance computing). Until you understand how parallelism is implemented on multi-core systems and how those mechanisms use each core's caches, questions like this remain too generic to answer precisely, so it is wise to be specific.
Again, to answer the question 'does this mean that the memory allocation in the L3 cache is at the mercy of the kernel':
It depends on whether the cache is implemented in hardware or virtually (managed by software).
Hardware caches are really part of processor organisation rather than processor architecture, because their operation cannot be modified or observed by a programmer (unless it is a virtual cache); they are logic built into the processor to provide higher performance.
Hence, the kernel should not be able to affect the cache, unless the cache is implemented at the virtual (software) level or the kernel deliberately uses some part of the cache to improve OS performance.
Worth reading for parallel and multi-core computing:
Herlihy, M., Shavit, N., "The Art of Multiprocessor Programming", Elsevier.
Hennessy, J. L., Patterson, D. A., "Computer Architecture: A Quantitative Approach".
If you want to go deeper, it is also worth taking a look at how cache coherency protocols work.