Cluster Computing - Science topic

Explore the latest questions and answers in Cluster Computing, and find Cluster Computing experts.
Questions related to Cluster Computing
  • asked a question related to Cluster Computing
Question
6 answers
I have recently gained access to a research cluster which has Gaussian 16. I am relatively new to both Gaussian and cluster computing. I am currently optimizing a combination of small first-row metal-chalcogenide clusters with ligands, using the TPSSh functional and the TZVP basis set. I currently have access to several nodes, each with 48 cores and at least 8 GB of RAM per core.
In the .gjf file I have specified some scratch space, 32 cores, and 64 GB of memory:
%LindaWorkers=str-c23
%NProcShared=32
%rwf=a1,20GB,a2,20GB
%NoSave
%nprocshared=32
%mem=64000MB
In the .slurm file I have allocated 1 node, 32 cores, and 2 GB of memory per core:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --time=48:00:00
#SBATCH --mem-per-cpu=2GB
#SBATCH --output=%x_%j.out
Yet when running my optimizations I see dramatically different amounts of memory utilized.
A run on a single ligand under these conditions used 31.4 GB of the 64 GB given.
A larger ligand with the same conditions used 20 GB.
The small ligand with a larger basis set (def2TZVP) used only 12.7 GB.
A metal cluster with 4 ligands (note: convergence failure) used only 2-3 GB.
Is there something I can do to better utilize the memory I have available?
Relevant answer
Answer
For those still interested, a new desktop version of GaussMem is available for calculating the amount of memory required by Gaussian calculations as a function of the type of calculation and the number of processors. It can be freely downloaded at the program page:
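A practical footnote on matching the two files above: %mem is a ceiling Gaussian may use, not a target, so varying observed usage between jobs is normal, and %rwf controls disk scratch files, not RAM. Because the Gaussian process needs some overhead beyond %mem, %mem is usually set 10-20% below what the scheduler grants, otherwise the job risks being killed by the out-of-memory handler. A consistent (hypothetical) pairing for one 32-core node might be:
%NProcShared=32
%Mem=56GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --mem=64GB
(Also note that the .gjf above sets %nprocshared twice; duplicate Link 0 lines are harmless, but worth cleaning up.)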
  • asked a question related to Cluster Computing
Question
1 answer
Hello, I'm a COMSOL newbie working on cluster computing.
I'm using distributed parametric sweep mode, in which each server node solves a single problem (ex. Node 1 - Inlet fluid velocity 1m/s, Node 2 - Inlet fluid velocity 2m/s, etc...)
In distributed parametric sweep mode,
(1) I want to monitor multiple convergence plots, since multiple problems are solved simultaneously, but I can only see a single convergence plot. Please give me some advice.
(2) How can I monitor multiple probe plots (probe parameter vs. iteration number plots)? I can only see the accumulated probe table, which holds the final converged values, but I want to check the probe value trend with respect to iteration number.
Thank you very much! Every comment gives me energy, please help me!
Relevant answer
Answer
I want to monitor multiple convergence plots, since multiple problems are solved simultaneously, but I can only monitor a single convergence plot. Please give me some advice.
  • asked a question related to Cluster Computing
Question
5 answers
Hello everyone,
Let me explain my situation. In the lab, we have enough computers for general use. Recently, we started DFT studies of our materials. My personal computer has an i5 and 20 GB of RAM, which is mostly enough for the calculations; however, I cannot use my computer while the calculations run. Moreover, accessing the government supercomputer clusters is a bit of a pain and needs a lot of paperwork.
My goal is to create a simple cluster for calculations and free up my personal computer. Speed is important, but not that much. So, is investing money in 4 Raspberry Pi 4 8 GB boards a good idea? I am also open to other solutions.
Relevant answer
Answer
Berkay-Sungur: I DID NOT say that a Beowulf Cluster would be unsuitable. What I did say is that one built from small computers such as RaspberryPi or BeagleBone would not be suitable.
What you describe would require what I did suggest: a Beowulf cluster built from reasonable personal computers, a switch, and cables that nobody wants anymore. The computers do not have to be homogeneous, though I do recommend sticking with all 32-bit or all 64-bit machines. Whether a computer is a desktop, floortop, or laptop does not matter, as long as it has at least one ethernet port. It is possible to mix machines with different word sizes, but that requires additional work.
I've built several such clusters. The first one used machines that were pulled from a scrap pile; some were put together from parts coming from that same pile. That bunch was all 32-bit, with varying CPU speeds; one node even ran below 1 GHz. Another was so old it eventually caught fire, but it was easily replaced. The software would allocate workload to a node according to its cores' processing speed, and node-level software allocated threads according to the number of cores. Thus, the advantages of distributed and shared processing and memory were brought to bear.
Another cluster was built from scrap to mimic the one used in a customer facility that could not be remotely accessed. We rented time on our cluster to a company that needed a remote testbed whose results could be directly installed manually on the customer's facility.
Beowulf clusters can be very handy as a laboratory's computational needs grow beyond what one computer can reasonably handle. All of the clusters I have built required NO capital outlay.
  • asked a question related to Cluster Computing
Question
11 answers
I want to learn a new computing technology. I heard that fog computing is the current one. Are there other recent computing technologies available?
Relevant answer
Answer
Dew computing is very trendy nowadays. You can check this very recent paper on this.
  • asked a question related to Cluster Computing
Question
3 answers
I am trying to fit a cluster model in spatstat in R using the locations of archaeological sites.
The resulting parameter values are:
trend: 2.514429e-06
kappa: 1.606992e-08
scale: 481.5851
mu: 155.5071
Does this mean that my set of points corresponds to clusters with a radius of 481 m, with a mean number of offspring of 155.5 and very few parents (kappa)? Are these parameters physically meaningful?
Relevant answer
Answer
FYI: I deposited the R script code on Zenodo. I hope it is of interest to you!
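For later readers: those parameters map onto the cluster process fitted by kppm in spatstat. kappa is the intensity of parent points and mu the mean number of offspring per parent, and the fit is internally consistent here, since kappa x mu = 1.607e-08 x 155.5 = 2.50e-06, close to the fitted trend (the overall intensity). Whether scale is a true radius depends on the model: in the default Thomas model it is the standard deviation of the Gaussian offspring displacement (so most offspring lie within roughly 2 x 481 = 962 m of the parent), while in the Matern cluster model it is a hard disc radius. A minimal fitting sketch in R (assuming X is a ppp object of the site locations, with coordinates in metres):

library(spatstat)
fit <- kppm(X ~ 1, clusters = "Thomas")  # X: point pattern of site locations
fit  # reports trend, kappa, scale and mu, as in the question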
  • asked a question related to Cluster Computing
Question
4 answers
How do I compile the trj_cavity module in GROMACS version 5.X on a cluster server?
Actually, I want to determine cavities and their volume with respect to time.
So I am trying to install the trj_cavity program in GROMACS on a cluster server as well as on my workstation, but it is not installing.
Can anyone help me?
Thanks!
Relevant answer
Answer
Shahid Ali If you were able to install it, please help me out.
  • asked a question related to Cluster Computing
Question
2 answers
I am trying to run an SCF calculation on a small cluster of 3 CPUs with a total of 12 processors. But when I run my calculation with the command mpirun -np 12 ' /home/ufscu/qe-6.5/bin/pw.x ' -in scf.in> scf.out it takes longer than it did on a single CPU with 4 processors. It would be really great if you could guide me on this, since I am new in this area. Thank you so much for your time.
Relevant answer
Answer
It's not running on the other CPUs.
I think there is some mistake in the command.
Thank you @Masab Ahmad
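One thing to check first: as written, the quotes make the shell pass ' /home/ufscu/qe-6.5/bin/pw.x ' (with leading and trailing spaces) as the program name, which can prevent mpirun from resolving the executable. A cleaner invocation (same paths) would be:
mpirun -np 12 /home/ufscu/qe-6.5/bin/pw.x -in scf.in > scf.out
Also note that if this pw.x binary was compiled without MPI support, mpirun will simply start 12 independent serial copies of the same calculation, which looks exactly like "no speedup over one CPU".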
  • asked a question related to Cluster Computing
Question
14 answers
I am doing research on load balancing for web server clusters. Please suggest which simulator we can use for this.
Relevant answer
Answer
If you want to do research then just try this: https://github.com/lalithsuresh/absim/tree/table and paper: "C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection"
  • asked a question related to Cluster Computing
Question
5 answers
Dear all,
My question is the following:
I have a large dataset: 100,000 observations, 20 numerical and 2 categorical variables (i.e. mixed variables).
I need to cluster these observations based on the 22 variables, and I have no idea how many clusters/groups to expect a priori.
Because of the large dataset I use the clara() function in R (based on "pam").
Because of the large number of observations, there is no way to compare distance matrices (R does not allow such calculations, and it is not a problem of RAM), therefore the common route of cluster selection using treeClust() and pamk() and comparing silhouettes does not work.
My main question is: can I use quantities like total SS, within SS, and between SS to get an idea of the best-performing tree (in terms of number of clusters)? Do you have any other idea of how I can select the right number of clusters?
Best regards
Alessandro
Relevant answer
Answer
The book "Experimental Design" by Federer will be helpful in this situation.
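A more direct route than total/within/between SS: clara() itself reports average silhouette widths computed on its subsamples, so you can scan over k without ever forming the full distance matrix. A minimal R sketch (assuming x is your data with the categorical variables coded numerically, which clara requires; sample sizes are illustrative):

library(cluster)
ks  <- 2:10
asw <- sapply(ks, function(k)
  clara(x, k, samples = 50, sampsize = 500)$silinfo$avg.width)
best_k <- ks[which.max(asw)]  # k with the highest average silhouette width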
  • asked a question related to Cluster Computing
Question
12 answers
I'm getting slightly different random numbers depending on the OS (Windows vs Linux) although I have specified the seed using set.seed.
Is there any way to guarantee reproducibility across platforms?
More specifically, I am looking for a way to force the R on Linux produce the same random numbers the R on Windows produces.
You could test the following code on both platforms:
set.seed(1234); print(rnorm(1), 22)
Relevant answer
Answer
I think Vahid already posted the answer referencing stackoverflow [3].
It is suggested to use a certain implementation such as mt19937.
Apparently R has a package which interfaces the dieharder suite (http://webhome.phy.duke.edu/~rgb/General/dieharder.php):
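For completeness, the generators can also be pinned down in base R itself, so that both platforms use the same uniform generator and the same normal transformation (in practice, cross-platform differences more often come from differing R versions than from the OS):

set.seed(1234, kind = "Mersenne-Twister", normal.kind = "Inversion")
print(rnorm(1), 22)

If the two installations still disagree, the remaining discrepancy is typically last-digit floating-point behaviour of the underlying math library rather than the RNG stream itself.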
  • asked a question related to Cluster Computing
Question
5 answers
To solve a computation-intensive problem I am working with a lot of recursive functions; however, I don't get an answer in a short amount of time, and sometimes I run into stack overflow problems. Since the problem seems to be complex, I would like to distribute it across a cluster of computers.
To do that, I am wondering if I have to switch to an iterative approach rather than a recursive one?
Relevant answer
Answer
Use a stack data structure, with operations Push(...), Pop(...), IsEmpty(). Many programming frameworks come with ready-made stack classes.
Regards,
Joachim
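To make that concrete, a minimal R sketch (hypothetical nested-list example) of replacing recursion with a loop over an explicit stack, so the depth is bounded by heap memory instead of the call stack:

# Sum all numeric leaves of a nested list iteratively, without recursion.
tree_sum <- function(tree) {
  stack <- list(tree)
  total <- 0
  while (length(stack) > 0) {
    node <- stack[[length(stack)]]   # peek at the top element
    stack[[length(stack)]] <- NULL   # pop it
    if (is.list(node)) {
      stack <- c(stack, node)        # push the children
    } else {
      total <- total + sum(node)
    }
  }
  total
}

tree_sum(list(1, list(2, 3), list(list(4), 5)))  # returns 15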
  • asked a question related to Cluster Computing
Question
1 answer
Suppose we change the default block size to 32 MB and the replication factor to 1, and the Hadoop cluster consists of 4 DNs with an input data size of 192 MB. Now I want to place the data on the DNs as follows: DN1 and DN2 hold 2 blocks (32+32 = 64 MB) each, and DN3 and DN4 hold 1 block (32 MB) each. Is this possible? How can it be accomplished?
Relevant answer
Answer
As I understand it, Hadoop places replicas based on rack awareness.
You can read more about this here:
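Two concrete pieces, in case they help. The block size and replication factor are plain hdfs-site.xml settings:
dfs.blocksize = 33554432   (32 MB)
dfs.replication = 1
But HDFS deliberately does not let clients pin individual blocks to chosen DataNodes; the NameNode's pluggable BlockPlacementPolicy decides that. So the exact layout described above can only be guaranteed with a custom placement policy, or approximated by writing each file from a specific DataNode, since the default policy puts the first replica on the writer's local node.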
  • asked a question related to Cluster Computing
Question
16 answers
I have a set of given points and I am going to cluster (i.e. group) them based on the Kmeans++ algorithm (an extended version of the Kmeans algorithm). However, Kmeans++ has a random initialization step. Due to this, for a fixed set of points I get different results (clusters) every time I run my code!
Is there any way to fix this problem of Kmeans (and Kmeans++)? Thanks!
Relevant answer
Answer
Thank you so much Vladimir for your information and the useful link. No, it is not really important how many clusters I have, I just want to find the best clusters. Thank you so much for the useful link as well!
Best regards,
Manijeh
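For later readers, the two standard remedies, sketched in R (assuming the stats::kmeans implementation; the same ideas apply to any k-means++ variant): fix the RNG seed for exact repeatability, and/or run many restarts and keep the best solution so the random initialization stops mattering:

set.seed(42)                                # same seed -> same initialization -> same clusters
fit <- kmeans(x, centers = 5, nstart = 25)  # 25 random starts; best within-SS solution is kept
fit$tot.withinss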
  • asked a question related to Cluster Computing
Question
6 answers
I tried to make AutoDock Vina run on a cluster, but without much success.
Relevant answer
Answer
If you are trying to do a virtual screening, instead of assigning maximum number of CPUs per job, running maximum number of parallel processes on the cluster might be a better option. There are several ways to achieve that as discussed in the attached thread.
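As a concrete sketch of that strategy with GNU parallel (file layout hypothetical; --cpu 1 keeps each docking on a single core so many of them can run side by side):
ls ligands/*.pdbqt | parallel -j 48 vina --config conf.txt --cpu 1 --ligand {} --out docked/{/}
Here {} is each ligand file and {/} its basename, so every docking run writes its own output file.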
  • asked a question related to Cluster Computing
Question
2 answers
Removed Question as account is inactive?
Relevant answer
Answer
I'd highly recommend using a higher timestep for your goal of getting the most ns/day possible. I don't recall the details, but a colleague of mine is using a timestep of 6 fs, which is a 300% speedup compared to your setting of 2 fs.
Have a look at "virtual sites" or "hydrogen mass repartitioning" in the context of GROMACS.
  • asked a question related to Cluster Computing
Question
5 answers
I am using the SLEPc library to solve for the first k (k = 3 or 4) eigenvalues and their corresponding vectors of a sparse, symmetric matrix sized 200k x 200k. I intend to compute the eigenvalues without using MPI, as a result of which the entire computation takes more than 2 minutes. I have tried a lot of eigensolvers, like Krylov-Schur, Jacobi-Davidson, and Rayleigh quotient CG, but each of them takes an awful amount of time to finish.
On the contrary, the Lanczos/Krylov-Schur solver returns the extreme eigenvalues of the spectrum quite fast. It would be awesome if I could replicate such behavior for the interior values of the spectrum, but so far I have not met with any success.
Is there any way to accelerate the convergence using a single MPI process? Tweaking the tolerance and maximum iterations does not help. Do I need to try some other library for a quick computation of the eigenvalues? Has anyone tried what I am trying to do?
Relevant answer
Answer
Some remarks:
1. Use the Krylov iteration method (Lanczos) for approximately solving the auxiliary linear systems.
2. If your matrix is diagonally dominant, use the Jacobi preconditioner. If not, try the Gauss-Seidel preconditioner. Then you do not need a matrix factorization (as in the case of an incomplete LU preconditioner).
3. Use the deflation of a computed eigenvalue to compute the next one. I suppose here you are not interested in eigenvectors.
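One standard SLEPc technique worth naming explicitly for interior eigenvalues is the shift-and-invert spectral transformation: eigenvalues nearest a chosen target sigma are mapped to the extremes of the transformed spectrum, where Krylov methods converge quickly. Assuming a stock SLEPc build (program name hypothetical; sigma is a value you pick inside the region of interest), it can be switched on purely from the command line:
./eigensolver -eps_nev 4 -eps_target <sigma> -st_type sinvert
The cost moves into sparse solves with (A - sigma*I), which for a 200k x 200k symmetric sparse matrix is usually far cheaper than waiting for an unshifted iteration to reach interior values.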
  • asked a question related to Cluster Computing
Question
3 answers
I use OpenSuse 13.2 and I am interested in performing chemical quantum calculations.
Relevant answer
Answer
If you're interested in pure QM, there's really no need to build clusters (except to use a queueing system, which helps with management of jobs, or for larger storage). The reason is that QM jobs usually scale pretty badly with the number of processors, and it rarely makes sense to run any jobs with more than 16 cores. Even when I use a cluster for QM calculations I usually limit myself to 8 or 16 cores and make sure that they are on the same node, to keep the speed optimal.
There are currently a number of very fast and quite reasonably priced (well...) processors with 6/8/10/12 cores, and you can build very efficient dual-processor machines from them, which should handle most QM calculations.
  • asked a question related to Cluster Computing
Question
8 answers
I am on a Linux platform with MySQL NDB 5.7. I am trying to monitor all traffic related to MySQL clustering - between the data nodes, management node and SQL nodes. To that end, I used netstat to list all open listening ports on my machine before starting the MySQL cluster. Then I started the MySQL cluster and ran netstat again. I assumed that the ports that were listening the second time around, but not the first time, were related to MySQL clustering.
But there are two problems with this. First, there could be ports opened by other processes between the two netstat runs. Second, MySQL might open other ports after I ran the netstat command the second time.
What is the best way to go about finding all ports being used by MySQL for clustering purposes?
I believe ephemeral ports are picked dynamically, so perhaps if I knew all the MySQL clustering related processes that would be running, I can figure out every port that they are using. Pointers will be very welcome.
Relevant answer
Answer
Hi,
I'm not sure I fully understand your question, but I wonder if you are not mixing several concepts.
Indeed, unless an application runs with root privileges, it cannot listen on a privileged port (below 1024) itself, and ports can additionally be opened or closed at the OS level, for instance via the firewall. In other words, MySQL (or any other app) doesn't need to be running for its related ports to be open or closed, and MySQL will not run properly if the port(s) it uses are blocked before it is launched. Thus, checking whether ports are open or closed will probably not be a solution for you. If I'm right (but again, I'm not sure I have fully understood your question), you need to check the traffic on the (open) ports. The right tool for this is tcpdump. You'll find several examples around, for instance at http://superuser.com/questions/604998/monitor-tcp-traffic-on-specific-port
Hope this helps.
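For the inventory part specifically, it is more robust to ask which sockets the cluster processes themselves hold than to diff netstat snapshots. For example (standard Linux tools; 1186 is the default ndb_mgmd port and 3306 the default mysqld port):
ss -tlnp | grep -E 'mysqld|ndb'   # listening sockets, with the owning process
tcpdump -i any port 1186          # watch traffic to/from the management node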
  • asked a question related to Cluster Computing
Question
5 answers
Most software is designed for UNIX
Relevant answer
Answer
Thank you all. Happy new year!
  • asked a question related to Cluster Computing
Question
2 answers
How to get the NMR from the gaussian_virgo cluster computational study?
Relevant answer
Answer
Hi,
Apart from the above-mentioned links, I would like to add one thing: please dig into the literature for benchmarking on similar systems, as NMR is a sensitive property to calculate.
  • asked a question related to Cluster Computing
Question
3 answers
I have installed two single-node Hadoop clusters on two different machines using VMware, but I am having difficulty connecting the guest machines (i.e. connecting the two virtual machines). I would be grateful if anyone could provide a detailed description of a Hadoop multi-node cluster setup, including the networking required.
Relevant answer
Answer
From your description, your problem originates in the VMware network configuration, not in Hadoop itself. Since you are using two machines, you need to use the same network card with DHCP enabled on it, but on the machines themselves, instead of using dynamic IP addresses, you need to configure static ones in the same IP range. Once both machines can ping each other via their static IP addresses, you need to configure SSH (share a common key between the machines), and then you can use Apache Ambari to install your desired Apache Hadoop cluster.
Good luck!
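A minimal sketch of the networking step (addresses hypothetical; pick them from your VMware host-only or bridged subnet):
# /etc/hosts on BOTH virtual machines
192.168.56.101  hadoop-master
192.168.56.102  hadoop-worker
# passwordless SSH from the master to the worker
ssh-keygen -t rsa
ssh-copy-id user@hadoop-worker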
  • asked a question related to Cluster Computing
Question
6 answers
Dear researchers!
Please help me: how can I solve the NP-hard bin packing problem without using exact mathematical methods? And if there are mathematical solutions, can you teach me how they work?
Relevant answer
Answer
thanx jose...
  • asked a question related to Cluster Computing
Question
2 answers
I would like to perform docking studies with the Discovery Studio 2.1 LigandFit.
It is already installed on a Windows Server 2008 machine with a 2.4 GHz quad-core Xeon and 16 GB RAM.
But it takes a long time to complete the task. Please let me know whether I can run my task in a distributed environment across several PCs/servers.
Discovery Studio provides a facility for parallel processing, but I need some help to arrange it.
Waiting for a perfect and intelligent answer.
Relevant answer
Answer
I am having the same problem. I am working with DS 2016 on Windows 7 and have been trying to dock using flexible docking, but CPU utilization was 20%. I re-ran the job using the parallel processing parameter > Server: localhost > processes: 8, and now utilization is 50% across all 16 threads I have.
  • asked a question related to Cluster Computing
Question
5 answers
I want to calculate CuBTC charges using Gaussian 09, so I separated out a cluster (C18H3Cu6O24) and need to configure the multiplicity, but I do not know how to determine the multiplicity of a cluster.
Could you please anyone suggest how to do it?
Thank you so much in advance.
Relevant answer
Answer
Basically, you should know the oxidation state of copper in your system.
Then, check whether the number of electrons in your system is odd or even.
Then you can calculate the multiplicity, taking the charge state into account.
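As a worked example of that recipe: the multiplicity is 2S + 1, i.e. the number of unpaired electrons plus one. A single Cu(II) centre is d9, so one unpaired electron gives S = 1/2 and a doublet (multiplicity 2). With several open-shell Cu(II) centres in one cluster you must also decide how the spins couple: two ferromagnetically coupled Cu(II) give S = 1 (a triplet), while antiferromagnetic coupling gives a singlet. A useful parity check: if the total electron count (including the overall charge) is even, the multiplicity must be odd (1, 3, ...), and if it is odd, the multiplicity must be even (2, 4, ...).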
  • asked a question related to Cluster Computing
Question
7 answers
I want to study cluster formation on a highway using MATLAB. On the highway, 900 vehicles are used in both directions.
Relevant answer
Answer
Number of clusters in each direction = 16, cluster radius = 300 m, number of vehicles per cluster = 20, random distances between vehicles.
  • asked a question related to Cluster Computing
Question
3 answers
Hello,
I want to know whether it is possible to analyse RNA-seq data locally on Linux, for lack of a computing cluster, and whether it is possible to do a de novo assembly, since there is no reference genome for my species.
If possible, please give me links to tutorials on using RNA-seq analysis tools locally.
Thanks for your help.
Relevant answer
Answer
This is an industrial solution to your query; it is a good way to start RNA analysis.
Take a look at the attached files.
Good luck,
  • asked a question related to Cluster Computing
Question
3 answers
How do you calculate the availability of a 1-out-of-2 system?
Relevant answer
Answer
Consider two components in standby configuration. Component 1 is the active component with a Weibull failure distribution (beta=1.5, eta=1000). Component 2 is the standby component. When Component 2 is operating, it has a Weibull failure distribution with beta=1.5 and eta=1000. Note: Even though Components 1 and 2 have the same distribution and parameters, it is possible that the two can be different. For the quiescent distribution, consider three different scenarios:
1. Same distribution as when in operation (hot standby).
2. beta=1.5, eta=2000 (warm standby).
3. Cannot fail in quiescent mode (cold standby).
What is the system reliability at 1000 hours? Note: For this example, we will only consider the non-repairable case, i.e. when a component fails, it is not repaired/replaced.
The reliability of the system at some time t can be calculated using the following equation:

R(t) = R1(t) + Integral[0..t] f1(x) * R2,SB(x) * R2,A(t - x + te) / R2,A(te) dx    (1)

where:
R1 is the reliability of the active component,
f1 is the pdf of the active component,
R2,SB is the reliability of the standby component when in standby mode (quiescent reliability),
R2,A is the reliability of the standby component when in active mode, and
te is the equivalent operating time for the standby unit if it had been operating in active mode, such that

R2,A(te) = R2,SB(x)    (2)

Solving Eqn. (2) with respect to te, you obtain an expression for the equivalent time, which can then be substituted into Eqn. (1).
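A quick numerical check of Eqns. (1)-(2), as a base-R sketch using the example's parameters (for equal shape parameters, Eqn. (2) gives te = eta_active * x / eta_standby):

beta <- 1.5; eta_a <- 1000                      # active-mode Weibull parameters
Rw <- function(t, eta) exp(-(t / eta)^beta)     # Weibull reliability
fw <- function(t, eta) (beta / eta) * (t / eta)^(beta - 1) * Rw(t, eta)  # pdf

standby_R <- function(t, eta_sb) {              # Eqn (1) with te from Eqn (2)
  integrand <- function(x) {
    te <- eta_a * x / eta_sb
    fw(x, eta_a) * Rw(x, eta_sb) * Rw(t - x + te, eta_a) / Rw(te, eta_a)
  }
  Rw(t, eta_a) + integrate(integrand, 0, t)$value
}

standby_R(1000, 1000)   # hot standby (same distribution), ~ 0.600
standby_R(1000, 2000)   # warm standby (eta = 2000)
Rw(1000, eta_a) +       # cold standby: cannot fail while quiescent (te = 0)
  integrate(function(x) fw(x, eta_a) * Rw(1000 - x, eta_a), 0, 1000)$value

The hot-standby value can be checked by hand: R1 + F1*R2 = 0.368 + 0.632*0.368 = 0.600.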
  • asked a question related to Cluster Computing
Question
4 answers
Some time ago I used the Wolfram Lightweight Grid Manager to set up Mathematica for parallel computing in my local network. All the computers had Mathematica installed and configured to use the license server. This environment worked properly.
My question is whether it is possible to configure Mathematica (under a site licence) to work similarly to the setup described above, but without the Wolfram Lightweight Grid Manager and the license server. I am thinking of the following situation: one computer is the master, and the others (working together in the LAN) share their computing kernels.
Relevant answer
Answer
Dear Ariadne,
thank you for the link to the book. It is very interesting, but for building a computer cluster it prefers the standard solution with the Lightweight Grid Manager.
Best wishes
Radoslaw Jedynak
  • asked a question related to Cluster Computing
Question
7 answers
Hello All,
My department (in Pakistan) has a small budget (up to $50,000) to buy a computing cluster. I have experience working on a cluster, but only as a user. Can any expert help me analyze the strength of the following specifications for a small cluster, and suggest any improvements (file attached)? The codes that will mostly run on the cluster are Gaussian 09, Crystal 09, and some protein X-ray data refinement software.
Thank you very much in advance for your precious time and valuable comments.
Relevant answer
Answer
Before I start I want to make it clear I work for Intel and was the original lead architect of the Intel Cluster Ready specification.
Specifying the cluster is only part of the problem. It needs to be designed, an instance of the design created, and the health of that instance maintained. All of these phases make your total experience good or bad. Since there were many issues in the cluster space, Intel wanted to make it easier for end users to succeed. Thus Intel developed a cluster architecture spanning small to enterprise and even Top 10 style systems. This architecture, Intel Cluster Ready (aka ICR), is available for anyone to download (see: http://www.intel.com/go/clusters). The idea is to solve two problems:
1.) Make it easy to purchase a cluster that is known to work and keeps working -- i.e. your vendor creates a system that has been correctly integrated and known to work when it arrives at your site, and that will continue to work well in the future.
2.) Ensure that the application codes that you want to run can be installed and will function on that cluster. [ISVs that are part of the program have validated that their codes "just work" and do not violate architectural considerations -- i.e. installing application A does not somehow wreak havoc on a previously installed application B.]
But an architecture only goes so far. Intel has created a tool, the Intel Cluster Checker, which it makes available for free to its OEM and manufacturing partners to help them in three ways:
1.) Certify that the design will actually meet the specification (design mode)
2.) Verify in manufacturing that the cluster actually is an instance of that design (manufacturing mode)
3.) Validate/diagnose (health mode) that the currently running cluster has not deviated from the way it was when it was installed (entropy has not taken over). [The partners are required to deliver the tool to you, the user, so you have it to keep the cluster running properly.]
I highly encourage you to go to the web site and get the documents that are there. Then work with your specific vendor and make sure they put together an ICR-compliant cluster (if you want, do what many users do and add ICR compliance and certification as a requirement in your bid). If they do not already have it, your vendor can get the Cluster Checker tool from Intel and use it to create a system for you. Note they are required to deliver a copy of Intel CLCK and its associated "fingerprint files" on the system as part of certification, so you have them in the future to keep your system in top health.
Best wishes,
Clem Cole
HPC Architect, Intel
A note about Genuine Intel vs "or compatible", and the difference between an architecture and an instance of that architecture. As the manufacturer of the CPU chips and creator of the specification and associated certification program, ICR defines the use of Intel*64-based processors in ICR-certified clusters. We cannot speak for the validity of using non-Intel-manufactured CPUs in the cluster; thus the specification calls for Intel*64 (aka Xeon or Xeon Phi) processors. When the Cluster Checker tool runs, it will detect and warn of places where the cluster instance deviates from the specification, from incorrectly set up IB connections to missing software components and the like. If the tool detects a processor that is not what the specification defines, a warning will be issued. That said, ICR is an >>architecture<<, and you are obviously free to pick whatever components make you comfortable, that you see as best for your project, and that meet your needs. However, Intel's certification program, by definition, cannot certify that an instance which deviates from the specification will function, and that, again, includes a cluster made with CPUs that we do not manufacture.
  • asked a question related to Cluster Computing
Question
7 answers
I have a big dataset which has a lot of zeros. I want to use a fuzzy clustering algorithm. Do you know a fuzzy clustering algorithm that works well for sparse datasets?
Relevant answer
Answer
interesting!
  • asked a question related to Cluster Computing
Question
10 answers
I am interested in any papers written on the loss of performance in HPC (High Performance Computing) applications due to thermal throttling. My current R&D project goal is to replace passive heat sinks with active liquid cooling that will match the keep out, life and cost while providing about 6 times the thermal performance. I need to translate the thermal performance improvement into an estimate for increased computer performance.
Relevant answer
Answer
Well, the problem is that HPC machines are normally designed not to throttle; if they did throttle, the only interpretation is that the heatsink is inadequate, or the airflow, or the incoming air temperature. Further, current HPC chips (say, the E5-2690 v3) have a kind of built-in throttling, in that you can only achieve max clock by using a subset of cores. To really take advantage of improved cooling, you'd have to somehow get Intel to let you exceed this programmed TDP-based capping. And as an HPC person, I'm not sure I'd go for a product like that, since although you might support higher clocks than normal, any chip's power-performance efficiency drops off as you really push it. (A pitch based on reducing cooling costs would be more successful, I think. And/or improved cluster density...)
I also note that the range for Intel's "turbo" modulation is about 10%. That's not a knock-your-socks-off kind of competitive advantage...
  • asked a question related to Cluster Computing
Question
9 answers
In research papers, I have come across terms like Grid sites or clusters. I was wondering if these two words can be used interchangeably? 
Relevant answer
Answer
A grid is a collection of resources. Those resources might be clusters, or clusters plus other resources. In general, a grid is a geographically dispersed and, more importantly, organizationally and administratively diverse collection, i.e. a virtual organization. It is usually about more than just scheduling jobs on compute clusters, although for some grids (such as NSF's TeraGrid) that turns out to be the most important aspect. It can be a mushy idea since it was taken up and used as a marketing term, often by folks who didn't fully grok it. The classic reference paper is http://toolkit.globus.org/alliance/publications/papers/anatomy.pdf
  • asked a question related to Cluster Computing
Question
15 answers
hello all,
I need a clustering algorithm that satisfies these conditions:
  • It does not require the number of clusters to be specified a priori.
  • It handles clusters of differing density in the data.
  • Its complexity is minimal.
I read about DBSCAN, which satisfies some of these conditions, but I don't know whether another algorithm exists.
Best regards,
Thank you
Relevant answer
Answer
Hi!
Regarding your question, I would suggest a density-based method.
Normally, in density-based methods, you do not set a number of clusters; instead you configure a threshold density and a minimal number of neighbors.
DBSCAN works fine to detect clusters with similar density. However, if you want to detect clusters based on the local/neighboring density, I suggest OPTICS or any of the further improved algorithms of the OPTICS family.
The attached image shows clusters with different density.
There are many implementations of OPTICS. For Java, you can use ELKI or Weka, for example.
If you want to speed up the process, you can use an index structure for your neighborhood queries, e.g., a quadtree, R-tree, or kd-tree.
Best regards,
Robert
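If R is an option, both algorithms are also available there in the dbscan package; a minimal sketch with hypothetical data x and parameters:

library(dbscan)
db  <- dbscan(x, eps = 0.5, minPts = 5)  # flat clustering, single density threshold
opt <- optics(x, minPts = 5)             # density ordering, no global eps needed
res <- extractXi(opt, xi = 0.05)         # extract clusters of varying density
res$cluster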
  • asked a question related to Cluster Computing
Question
4 answers
I am trying to do post-analysis of a GROMACS simulation, with tools like g_rdf, g_hbond, etc. As my system is very large, the analysis takes a very long time to finish. I would like to run these jobs in batch mode, submitting them to a computing cluster. The problem is that during these analyses I need to choose options interactively. How can I handle the interactive choices when the tools run in batch mode?
Relevant answer
Answer
In order to run the make_ndx command in batch I have written a very useful script in Perl. You can find it at my repo https://github.com/dcorrada/RICEDDARIO
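Besides the script route, the stock GROMACS analysis tools read their interactive group selections from standard input, so in a batch job you can simply pipe the answers in. For example, for g_rdf, which asks for two groups (group numbers hypothetical, taken from your index file):
echo 1 1 | g_rdf -f traj.xtc -s topol.tpr -n index.ndx -o rdf.xvg
(printf '1\n1\n' works the same way if a tool insists on one answer per line.)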
  • asked a question related to Cluster Computing
Question
7 answers
How can I classify a set of concepts in conceptual graphs? Do you have any idea? Thanks.
Relevant answer
Answer
Just a different perspective on Stefan's answer: hypergraphs would be projections of the incidence relation of the concept graph, so you would be losing structure by using them. 
Incidentally, if you haven't seen it already, the Applications book that Stefan linked has a paper dealing directly with Conceptual Graphs and FCA.
  • asked a question related to Cluster Computing
Question
4 answers
Is there any sample implementation in OPNET?
Or sample code for a cluster routing protocol?
Please point me to examples from which I can learn.
Relevant answer
Answer
You can do that by setting the hop count limit of the broadcast message.
  • asked a question related to Cluster Computing
Question
17 answers
I would like to know how I can speed up a CCSD(T) calculation. Should I add more CPUs or increase the RAM, as in an MP2 calculation? What is a suitable approach? Thank you very much.
Relevant answer
Answer
Dear Joaquim,
As you know, CCSD(T) is very time demanding. I think it would be better to first submit a CCSD calculation and find out how many hours one CCSD step takes; when you have that, multiply the time by the number of occupied orbitals to get an estimate of how many hours CCSD(T) will take with the mem and nproc you have specified. For correlated calculations you need to increase both RAM and CPUs. If you are running your calculation on a cluster, specify more CPUs to have more memory, and I suggest keeping all the CPUs on the same node.
  • asked a question related to Cluster Computing
Question
10 answers
I would like to set up a computational laboratory. I would like to know how much it would cost (including electricity, maintenance) to start up a computational laboratory with 200 processors to begin with.
Relevant answer
Answer
Ahmed Farid has done a good job of identifying the main decisions you need to make. What you decide affects the answer to your question.  ARM processors generally consume less electricity and generate less heat than x86 processors, so which you choose will affect your need for power and cooling.  
One other power-related issue is how reliable your local power grid is.  If it is not very reliable (e.g., power failures or brownouts), you'll want to include a battery-based uninterruptible power supply (UPS) into your design and use it to power your cluster, or you'll lose all your work in progress any time there is a momentary power failure.
If you want to build it from inexpensive system-on-a-board components (e.g., Raspberry Pi), I would look at Nvidia's Jetson board.  It's more expensive than the Pi, but has a quad-core CPU and a 192 CUDA core GPU, so it delivers quite a bit more processing power. There are other options out there too.
The networking decision is an important one that depends on the nature of the programs you'll be running.  If the programs running on your system will be doing lots of communication between the distributed processes, you want a high-speed low-latency network, or those processes will be doing a lot of waiting for messages to arrive. 
One other question you need to resolve: do you need 200 processors (CPUs) or 200 cores?  With 10 or 12 core Xeon CPUs, you can build a 200 core system in a relatively small form factor (e.g., half a rack).  The challenge becomes providing enough memory and network bandwidth to keep them all busy.
And once you have  designed your cluster, built it, and installed it in a room with sufficient power and cooling, it is very helpful to have a system administrator to maintain it.  For a small cluster, this can consume 1/4 to 1/2 of a full time equivalent (FTE) position, depending on how many people are using it, requiring software installation, etc.
  • asked a question related to Cluster Computing
Question
4 answers
Considering there are multiple attributes:
Is there a way of saying that one cluster is related to another in terms of such-and-such?
How do I determine a confidence interval, or the probability with which a member lies in a cluster?
Relevant answer
Answer
Dear Sai,
Maybe Silhouette plots can help you:
Rousseeuw, Peter J. "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis." Journal of computational and applied mathematics 20 (1987): 53-65.
Kind regards,
Olivier PARISOT
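As a concrete starting point for the silhouette route, in R (assuming a numeric data matrix x and any integer cluster labels cl):

library(cluster)
sil <- silhouette(cl, dist(x))  # per-point silhouette widths in [-1, 1]
summary(sil)                    # average width per cluster and overall
plot(sil)

For an actual membership probability, a model-based method is the more natural tool, e.g. Gaussian mixtures via the mclust package, which return a posterior probability of each member belonging to each cluster.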
  • asked a question related to Cluster Computing
Question
4 answers
We have two HP workstations, one with an 8-core Intel processor and 16 GB RAM and another with a 12-core processor and 16 GB RAM. Can we cluster these two workstations and run OpenMP-threaded C code across them?
Relevant answer
In addition to Simon's response: if your problem can be decomposed into small pieces, you can use the map-reduce approach. Note that plain OpenMP is a shared-memory model and cannot span the two workstations by itself; to compute on many machines you must also distribute the jobs, for example using something like MPI or http://discoproject.org/
  • asked a question related to Cluster Computing
Question
4 answers
I ran a cluster analysis using STRUCTURE on Windows 8.0 and saved the project. When I tried to retrieve the project via the File > Recent project menu, an error message popped up: "cannot read project file". I need suggestions on how to visualize or re-access a completed and saved project in STRUCTURE. Can anyone help?
Relevant answer
Answer
Hi Paul,
you need to find the saved project file of type *.spj. When you open it, all your results will be displayed and available in the Parameter Set.
  • asked a question related to Cluster Computing
Question
2 answers
I need to use two different oxidation states for a metal in a cluster in a VASP calculation. For example, in a 16TiO2 cluster, the first 8 titanium atoms are in the +2 oxidation state and the other 8 are in the +4 state. What parameters do I need in the INCAR, POTCAR and POSCAR files? Kindly help. Thank you.
Relevant answer
Answer
Yes, I agree with Jian. In ADF, you may be able to set the oxidation states manually for specific atoms.
  • asked a question related to Cluster Computing
Question
2 answers
Hello,
Recently I ran a CCSD(T) calculation, but the output only shows results for each process, not the global ones (total energy or the thermochemistry results). Should I add an input keyword, is this a problem with my input, or am I supposed to calculate the thermochemistry independently?
I used:
$BASIS GBASIS=N311 NGAUSS=6 NDFUNC=2 NPFUNC=2 DIFFSP=.TRUE. DIFFS=.TRUE. $END
$CONTRL SCFTYP=RHF RUNTYP=OPTIMIZE CCTYP=CCSD(T) NZVAR=1 $END
$SYSTEM MWORDS=300 MEMDDI=300 PARALL=.TRUE. $END
$SCF DIRSCF=.TRUE. diis=.true. damp=.true. $END
$STATPT OPTTOL=0.0005 NSTEP=50 HESS=CALC hssend=.true. $END
$FORCE METHOD=FULLNUM $END
$zmat dlc=.true. auto=.true. $end
Thank you for your time
Relevant answer
Answer
Thank you for the help. I found the error in the ".dat" file.
  • asked a question related to Cluster Computing
Question
11 answers
I am working on the success rate of DNA barcoding in the identification of species using distance- and tree-based methods.
Regarding the distance-based method, I have used AdHoc and SpeciesIdentifier and confirmed species identification using the best close match (BCM) criterion.
Please suggest a program/software/method for tree-based identification (using NJ, parsimony, Bayes), exhibiting clustering (number of clusters formed for a particular species), that will flag singleton (= ambiguous) species. (Please see the attachment.)
(Please note: it is not possible to do this manually, as I have >4000 specimens.)
Answers detailing methods or software are appreciated! Thank you!
Relevant answer
Answer
I believe the only accurate method would be alignment-based phylogenetic analysis based on likelihood (e.g. RAxML on a public server such as CIPRES for such a large dataset), combined with a GMYC approach. Anything else, including in particular clustering and NJ methods, has been shown to be inaccurate, sometimes substantially so. Also, if you actually want to TEST barcoding accuracy, you need a testable expectation. For example, you have a set of barcoding sequences for KNOWN species (or at least species-level clades), based on a reconstructed phylogeny. Then you randomly select a sequence and blast it against the entire dataset and see how often the correct species is returned. That is kind of trivial, but it would be the only way of testing. If you have a set of unidentified barcoding sequences and just want to see how many distinct species you have in there, that would not be a barcoding test but an assessment of species delimitation. A very quick approach to sorting out species would be using MAFFT with the sorting option, which will align the sequences and also sort them based on similarity. It is not 100% accurate but usually, especially if all sequences are complete, does a great job in outlining possible species-level clades even before running a tree. And 4000 sequences can easily be done in MAFFT.
  • asked a question related to Cluster Computing
Question
13 answers
I am using Ubuntu in VMware on 4 nodes running Windows 7. I have configured each Ubuntu VM with 1 GB RAM, 1 processor, and a 20 GB hard disk. If I used the same configuration directly on the physical nodes instead of in VMware, would performance be better?
Relevant answer
Answer
David, your assumption is incorrect: what a virtual machine allows you to do is run multiple "computers" simultaneously on the same hardware. From your description, it sounds like you could *replace* all the other computers around the house by running VMs on your main 24-core server. You could easily allocate 3 cores to one virtual machine, 1 core to another, and so on; the only limitation would be memory and the performance you are willing to tolerate, either on the main system or inside the virtual machines.
Any other kind of sharing amongst disparate machines would be done via network protocols (CIFS or NFS for filespace). There is a commercial product, ScaleMP, that supposedly allows you to aggregate memory from several different machines into one larger configuration, but again - that is not what you are describing.
  • asked a question related to Cluster Computing
Question
7 answers
I need to set up a computing environment for several moderately demanding applications - some pure processing and some data acquisition, storage and analysis. The applications include numerical weather prediction (using WRF) and 3D rendering of an entire city. There will be two separate systems, one linux one windows. My current plan is to create a "high performance computing cluster" HPCC. For the linux side I would like to use CentOS, but it may not be the best choice for scheduling compute nodes for distributed processing. I would like opinions on 1) which linux OS is best to use and 2) which windows OS is best and 3) has anyone done this entirely using remote services like those offered by Amazon? 4) any hardware suggestions? Many thanks!
Relevant answer
Answer
I did a comparison of cloud HPC approaches including Amazon ECC here -
I think you can build a modest cluster of Linux machines yourself and get more out of it if you plan to use for a good length of time (say 2+ years). Some options to consider include GPU co-processing and use of CUDA and/or OpenCL as well as distributed computing tools (noted in the paper). CentOS is a great choice, but for ease of installation and use I like Ubuntu 12.04 LTS myself - it's just easier to install and configure with tools like CUDA, OpenCL, etc.
I suspect your workload would benefit from a vector co-processor and CUDA as well as GPUs for rendering. However, if you just need lots of cores, another option that may be of interest is MICA (the Intel Xeon Phi co-processor).
The advantage of AWS EC2 and other pay as you go HPC options like Penguincomputing (http://www.penguincomputing.com/services/hpc-cloud ) is that you can solve a problem and not own the equipment, but again, if you in this for the long haul, that's perhaps also the disadvantage - the idea of HPC on-demand is an interesting compromise you might consider - build your own small cluster and then use Cloud HPC for really big runs - this is my favorite approach. I find that I need to develop and test on my own small installation anyway (or it goes best if I do) and then use university resources or Cloud HPC once I'm ready for production runs.
As far as storage, Linux software RAID is an option along with Ceph if you're willing to pull together a JBOD and do the SAN/NAS configuration yourself.
If you build your own small scale system, I find it helps you make better use of Cloud HPC since you'll know better what to ask for and look for from any HPC on-demand service. You might find your private system is good enough.
Hope that helps.
  • asked a question related to Cluster Computing
Question
12 answers
I want to know whether the MapReduce paradigm is better than MPI (Message Passing Interface). Which type of parallelism, i.e. data or task parallelism, is used in MPI and in MapReduce?
Relevant answer
Answer
Hi Sudhakar,
Have a look at this paper:
Chen, W.-Y.; Song, Y.; Bai, H.; Lin, C.-J. & Chang, E. Y. Parallel Spectral Clustering in Distributed Systems. IEEE Trans. Pattern Anal. Mach. Intell., 2011, 33, 568-586.
MPI is a message-passing library interface specification for parallel programming. MapReduce is a Google parallel computing framework based on user-specified map and reduce functions.
It also says: "In general, MapReduce is suitable for non-iterative algorithms where nodes require little data exchange to proceed (non-iterative and independent); MPI is appropriate for iterative algorithms where nodes require data exchange to proceed (iterative and dependent)."
Best Wishes,
Ashkan
  • asked a question related to Cluster Computing
Question
3 answers
First I tried to run mpiBLAST on a Rocks cluster, but it always ends up with errors. Can anyone suggest what I should do?
Relevant answer
Answer
I waited 2 hours for a result, but there was no outcome.
  • asked a question related to Cluster Computing
Question
6 answers
I have access to a computer cluster made of 44 nodes. Each node has 12 cores and 48 GB of RAM. The main problem is that jobs have a maximum walltime of 6 h, after which they get killed, but I can use as many nodes as I like, which means that for a job I could use e.g. up to 10 nodes = 120 cores and 480 GB of RAM, or more.
So in order to make a TopHat job finish within the 6 hours, I wanted to parallelize it over a very large number of cores, with the appropriate RAM per core, and specify the number of cores through the -p parameter.
My problem is that I can't get TopHat to recognize the cores of the multiple nodes, and I think this is because I have to launch it using "mpirun" (mpirun -np XX tophat etc.). But when I run TopHat this way it fails to start. So I'm wondering whether TopHat is MPI-capable or not. Do I need a recompiled version? I found clues on the internet of TopHat being run through MPI (http://seqanswers.com/forums/archive/index.php/t-11472.html).
Can anyone give me suggestions or alternatives?
Thanks.
Relevant answer
Answer
I received an answer to the topic on SeqAnswers. I'll paste it here just in case someone is interested.
Devon Ryan:
"Since tophat isn't written to use MPI, the instances run on each node will be blind to each other. There is no way around this without rewriting the program (actually, you'd need to rewrite bowtie as well). If you really want to use tophat, then just split your fastq files into small enough versions and then push a large number of tophat jobs onto the cluster (normally one would create a script to do this). If you're going that route anyway, just switch to STAR (you might have enough RAM, I'm not sure) and you'll get your results in a fraction of the time. The last alternative here would by Myrna, which seems to be designed with this sort of scenario in mind.
You'll have the same problems with cufflinks, namely that increasing the number of instances with MPI won't decrease runtime, since the instances won't talk to each other. Your best bet there would be eXpress, or to just use count-based methods."
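A minimal sketch of the splitting route from the quoted answer (file names hypothetical; a FASTQ record is 4 lines, so the line count per chunk must be a multiple of 4):
split -l 4000000 reads.fastq chunk_            # 1M reads per chunk
for f in chunk_*; do
  sbatch --time=06:00:00 --wrap "tophat -p 12 -o out_$f bowtie_index $f"
done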
  • asked a question related to Cluster Computing
Question
6 answers
I am working on streams of data which yield 256-bit strings. I want to cluster them in real time (i.e. I don't want to store them in memory; I want to assign each instance a cluster id in real time).
My similarity measure for these strings is the number of 1's in the AND of the two strings divided by the number of 1's in their OR (essentially the Jaccard similarity).
I initially worked with sequential leader clustering for this, and it gave great results. The only problem is that it cannot be applied in distributed systems.
I have found that minhashing and LSH can be used for real-time text clustering. As per my understanding, minhashing is used to generate signatures, and the signatures are then clustered by banding; and if the hash functions are the same, the minhash signatures are the same on every node in a distributed setting.
Can there be an LSH approach for clustering the bitstrings in my case, with the similarity measure defined above?
Or can I apply minhashing/LSH with some trick to get the desired results?
Relevant answer
Answer
Minhash is indeed a locality-sensitive hashing scheme when locality is defined according to the Jaccard similarity; see the following references (in increasing completeness) for an example of implementation:
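Since the |AND|/|OR| measure is exactly the Jaccard similarity of the set bits, MinHash applies directly. A minimal R sketch of the signature step; seeding the permutations identically on every node guarantees identical signatures in a distributed setting:

make_minhasher <- function(n_bits = 256, n_hash = 64, seed = 42) {
  set.seed(seed)                                   # identical seed on every node
  perms <- replicate(n_hash, sample.int(n_bits))   # n_bits x n_hash permutation table
  function(bits) apply(perms, 2, function(p) min(p[bits == 1]))  # one minhash per column
}

mh  <- make_minhasher()
sig <- mh(sample(0:1, 256, replace = TRUE))   # 64-number signature for one bit string
bands <- split(sig, rep(1:16, each = 4))      # LSH banding: 16 bands of 4 rows each

Strings whose signatures collide in at least one band become merge candidates, so each node can assign cluster ids on the fly and candidates can be reconciled across nodes.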
  • asked a question related to Cluster Computing
Question
6 answers
I want to know the architecture of cluster computers and the layered design of such systems.
Relevant answer
Answer
Sudhakar,
you will find quite a lot of literature out there if you search for the terms "Beowulf" and "MPI". I am assuming here that you are looking for number-crunching (high-performance) clusters, and not high availability, which would be a totally different use case.
Some "classical" places to start out are the web page at ClusterMonkey, http://clustermonkey.net/ , and the mailing list archives at Beowulf.org, http://beowulf.org/ .
As for the practical aspects of system engineering, I would suggest taking a look at the Centos and RedHat administrator documents, http://www.centos.org/docs/5/ and https://access.redhat.com/site/documentation/Red_Hat_Enterprise_Linux/ . Quite a lot of this stuff is relevant to your query, both for High-Performance and for High-Availability.
Cheers,
-Alan
  • asked a question related to Cluster Computing
Question
2 answers
I want to know how to check how robust the clustering in a phenogram (not a cladogram) is.
Relevant answer
Answer
I suggest two things. First, perform a formal test to see if your data has evidence that clusters actually exist. Second, use bootstrap-aggregation to evaluate the robustness of your results.
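For the bootstrap part, one practical R route is the pvclust package, which resamples the data, re-clusters, and attaches approximately unbiased p-values to every branch of the dendrogram (x is assumed to be a matrix with the objects to cluster as columns):

library(pvclust)
pv <- pvclust(x, method.hclust = "average",
              method.dist = "euclidean", nboot = 1000)
plot(pv)                   # dendrogram with AU/BP support values per branch
pvrect(pv, alpha = 0.95)   # boxes around clusters with AU support >= 95%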
  • asked a question related to Cluster Computing
Question
1 answer
Are there any LRM software packages which support measurement and prediction of the performance of batch jobs?
Relevant answer
Answer
What do you mean? The behavior of any batch system is knowable exactly to the extent that you know the distribution and properties of the running and queued jobs. Do you know the actual run length of the jobs? Do the jobs have differing resource requirements (cores-per-process, memory-per-process, processes-per-node, etc.)?
  • asked a question related to Cluster Computing
Question
12 answers
Can we combine cluster computing and cloud computing? That is, in a cloud environment, if we want to balance the load of VMs across physical machines, can we apply concepts from cluster computing? What software is available to accomplish this?
Relevant answer
Answer
It depends on what you are trying to accomplish. For simple large-scale data processing tasks, take a look at Hadoop. But if you are doing clustered parallel processing in more time-sensitive applications, then a Storm cluster is great. Both can run out of the box on AWS. If it's simply a matter of load balancing, you can do this natively on AWS to manage load across VMs connected to an Elastic Load Balancer.
  • asked a question related to Cluster Computing
Question
4 answers
I want to connect some PCs together to form a grid. Using Globus is one option. I would like to hear about other options, the challenges in this work, and whether it has a bright future in India.
Relevant answer
Answer
OpenStack is one of the best available options for now. As it is open source and freeware, we can do all kinds of R&D on it. Its implementation is a little hard, but we can try to understand it. Here is the reference link: http://www.openstack.org/
  • asked a question related to Cluster Computing
Question
3 answers
I am using R to generate a pairwise distance matrix of a large dataset (3000 grid cells), and I am interested in grouping the cells into ten clusters.
I have successfully used the "cutree" function from the CLUSTER package in R to get 10 clusters, but I got stuck on how to create a new matrix based on the mean of all pairwise grid-cell values for the new clusters defined by the cutree call. What I want is to use a summary (i.e. the mean) of the new clusters to get a new dendrogram based ONLY on the means: a dendrogram of the means of the summarized clusters (i.e. the 10 clusters) retrieved via cutree, so that I can do further analyses, e.g. metaMDS in vegan.
I have looked up different R help pages to see whether I can get a new "summary" dendrogram based on the "mean of all pairwise grid cells observed between the clusters", but unsuccessful.
I attached a picture as an example of what I hope to represent.
Any help especially in the form of R code will be highly appreciated.
Relevant answer
Answer
The function meandist finds the mean within- and between-block dissimilarities.
Along with meandist(), there is the function mrpp (in the vegan library): the Multiple Response Permutation Procedure provides a test of whether there is a significant difference between two or more groups of sampling units.
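Putting that together with the cutree step from the question, a compact R sketch (d is the existing pairwise distance object, hc the hclust tree built from it):

library(vegan)
cl <- cutree(hc, k = 10)           # 10 groups from the full dendrogram
md <- meandist(d, grouping = cl)   # mean between-cluster dissimilarities (within on the diagonal)
hc10 <- hclust(as.dist(md), method = "average")
plot(hc10)                         # dendrogram of only the 10 cluster means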
  • asked a question related to Cluster Computing
Question
4 answers
I need clear step-by-step guidelines that I can follow. Any relevant links would be useful.
Relevant answer
Answer
It depends on the purpose. Parallelisation over ethernet is not very useful, communication slows down the calculation. We use MPI on multicore computers so that ethernet communication is avoided. We just use computers with shared /home via NFS. On one cluster Grid Engine was used. It serves our purpose but may not be sufficient for CFD.
  • asked a question related to Cluster Computing
Question
5 answers
I am using the CLUTO clustering toolkit to cluster software faults from multiple versions of a software project. The goal is to find a clustering solution that provides insight into the characteristics of the software, and its defects. I would like to show that I get consistent clusters on different versions, and identify differences as they occur over time.
I have not seen any papers where this has been done. So far I am thinking of a chi-squared test for homogeneity. The clusters are ranked, and the ranks seem to be fairly consistent as well. Adding the ranks into the comparison may help my argument, but I'm not sure which test would be most appropriate. Any suggestions on the best way to do this? Any references to similar comparisons in other applications of clustering?
Relevant answer
Answer
People have implemented chi-squared for such comparisons before (I googled it and found a couple of papers), but you can also try agglomerative clustering and look at the trends in the clusters instead of increasing k by 1. Then you can see at which stage the results match up. Hope that helps!
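For the cross-version comparison itself, a small R sketch (cl_v1 and cl_v2 are hypothetical cluster labels for the faults shared by two versions):

tab <- table(cl_v1, cl_v2)       # contingency table of the two clustering solutions
chisq.test(tab)                  # the homogeneity test discussed above

library(mclust)
adjustedRandIndex(cl_v1, cl_v2)  # chance-corrected agreement; 1 = identical clusterings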