Cluster Computing - Science topic
Explore the latest questions and answers in Cluster Computing, and find Cluster Computing experts.
Questions related to Cluster Computing
I have recently gained access to a research cluster which has Gaussian 16. I am relatively new to both Gaussian and cluster computing. I am currently optimizing a combination of small first-row metal-chalcogenide clusters with ligands using the TPSSh functional and the TZVP basis set. I currently have access to several nodes, each with 48 cores and at least 8 GB of RAM per core.
In the .gjf file I have specified some scratch space, 32 cores, and 64 GB of memory.
%LindaWorkers=str-c23
%NProcShared=32
%rwf=a1,20GB,a2,20GB
%NoSave
%mem=64000MB
In the .slurm file I have allocated 1 node, 32 cores, and 2 GB of memory per core.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --time=48:00:00
#SBATCH --mem-per-cpu=2GB
#SBATCH --output=%x_%j.out
Yet when running my optimizations I see dramatically different amounts of memory utilized.
For a run on a single ligand under these conditions, it used 31.4 GB of the 64 GB given.
A larger ligand with the same conditions used 20 GB.
The small ligand with a larger basis set (def2TZVP) used only 12.7 GB.
A metal cluster with 4 ligands (note: convergence failure) used only 2-3 GB.
Is there something I can do to better utilize the memory I have available?
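For reference, a minimal sketch of how the two requests are often kept consistent, with SLURM given a margin above %mem for process overhead; the numbers here are illustrative, not a recommendation:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --mem=70GB
%NProcShared=32
%Mem=64GB
As far as I understand, Gaussian treats %mem as a ceiling rather than a target, so the varying usage above mostly reflects how much workspace each job's integrals and RWF files actually need, not a misconfiguration.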
Hello, I'm a COMSOL newbie working on cluster computing.
I'm using the distributed parametric sweep mode, in which each server node solves a single problem (e.g., node 1: inlet fluid velocity 1 m/s; node 2: inlet fluid velocity 2 m/s; etc.).
In distributed parametric sweep mode,
(1) I want to monitor multiple convergence plots, since multiple problems are solved simultaneously, but I can only monitor a single convergence plot. Please give me some advice.
(2) How can I monitor multiple probe plots (plots of probe parameter vs. iteration number)? I can only monitor the accumulated probe table, which holds the final converged values, but I want to check the trend of the probe value with respect to the iteration number.
Thank you very much! Every comment gives me energy, so please help me!
Hello everyone,
Let me explain my situation. In the lab, we have enough computers for general use. Recently, we started DFT studies of our materials. My personal computer has an i5 and 20 GB of RAM, which is mostly enough for the calculations. However, I cannot use my computer during the calculations. Moreover, accessing the government supercomputer clusters is a bit of a pain and needs a lot of paperwork.
My goal is to create a simple cluster for the calculations and free up my personal computer. Speed is important, but not that much. So, is investing money in four Raspberry Pi 4 (8 GB) boards a good idea? I am also open to other solutions.
I want to learn a new computing technology. I have heard that fog computing is the current one. Are there other up-to-date computing technologies available?
I am trying to fit a cluster model in spatstat in R using the locations of archaeological sites.
The resulting parameter values are:
trend: 2.514429e-06
kappa: 1.606992e-08
scale: 481.5851
mu: 155.5071
Does this mean that my set of points corresponds to clusters with a radius of 481 m, with a mean number of offspring of 155.5, and very few parents (kappa)? Are these parameters physically meaningful?
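For context, a minimal sketch of the kind of fit being described, assuming a Thomas cluster process fitted to a point pattern X (the object names are illustrative):
library(spatstat)
# fit a Thomas cluster process to the site locations by minimum contrast
fit <- kppm(X ~ 1, clusters = "Thomas")
parameters(fit)  # trend, kappa (parent intensity), scale, mu (mean offspring per parent)
One caveat worth checking against the spatstat documentation: for the Thomas model, scale is the standard deviation of the Gaussian offspring displacement, so 481.6 reads as a characteristic dispersion distance rather than a hard cluster radius of 481 m.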
How do I compile the trj_cavity module for GROMACS 5.x on a cluster server?
I want to determine cavities and their volume with respect to time,
so I am trying to install the trj_cavity program in GROMACS on a cluster server as well as on my workstation, but it is not installing.
Can anyone help me?
Thanks!
I am trying to run an SCF calculation on a small cluster of 3 CPUs with a total of 12 processors. But when I run my calculation with the command mpirun -np 12 /home/ufscu/qe-6.5/bin/pw.x -in scf.in > scf.out, it takes longer than it did on a single CPU with 4 processors. It would be really great if you could guide me on this, since I am new to this area. Thank you so much for your time.
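For what it's worth, pw.x accepts parallelization levels on the command line; a hedged sketch (the pool count is a placeholder, and must divide the process count and suit the number of k-points in the calculation):
mpirun -np 12 /home/ufscu/qe-6.5/bin/pw.x -nk 4 -in scf.in > scf.out
If the three machines are connected by ordinary Ethernet, the MPI communication between them can easily cost more than the extra cores gain, which would explain 12 distributed processes being slower than 4 local ones.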
I am doing research on load balancing for web server clusters. Please suggest which simulator we can use for this.
Dear all,
My question is the following:
I have a large dataset: 100,000 observations, 20 numerical and 2 categorical variables (i.e. mixed variables).
I need to cluster these observations based on the 22 variables; I have no a priori idea how many clusters/groups I should expect.
Given the large dataset, I use the clara() function in R (based on "pam").
Because of the large number of observations, there is no way to compare distance matrices (R does not allow such calculations, and it is not a problem of RAM), therefore the common route to cluster selection using treeClust() and pamk() and comparing silhouettes does not work.
My main question is: can I use measures like total SS, within SS, and between SS to get an idea of the best-performing tree (in terms of number of clusters)? Do you have any other idea of how I can select the right number of clusters?
Best regards
Alessandro
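A minimal sketch of the kind of selection loop that avoids the full distance matrix, assuming the mixed variables have been numerically coded into a data frame x (all names are illustrative):
library(cluster)
# clara() computes silhouette info on its best subsample, not on all
# 100,000 observations, so the average width is cheap to extract
avg.sil <- sapply(2:10, function(k) clara(x, k, samples = 50)$silinfo$avg.width)
best.k  <- (2:10)[which.max(avg.sil)]
The SS-based measures mentioned above can be collected in the same loop, but the subsample silhouette widths are usually the more direct criterion for pam-style clustering.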
I'm getting slightly different random numbers depending on the OS (Windows vs. Linux), although I have specified the seed using set.seed.
Is there any way to guarantee reproducibility across platforms?
More specifically, I am looking for a way to force R on Linux to produce the same random numbers that R on Windows produces.
You could test the following code on both platforms:
set.seed(1234); print(rnorm(1), 22)
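One hedged possibility is to pin every RNG setting explicitly, since in my experience mismatches more often come from differing R versions (e.g., the sample.kind change in R 3.6.0) than from the OS itself:
set.seed(1234, kind = "Mersenne-Twister", normal.kind = "Inversion")
print(rnorm(1), 22)
With identical R versions and RNG kinds, the streams should match across Windows and Linux; printing 22 digits can still expose last-bit floating-point differences between platform math libraries, though, so comparisons are safer at slightly lower precision.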
To solve a computationally intensive problem I am working with a lot of recursive functions; however, I don't get a good answer in a short amount of time, and I sometimes run into stack overflow problems. Since the problem seems to be complex, I would like to distribute it across a cluster of computers.
To do that, I am wondering whether I have to switch to an iterative approach rather than a recursive one (see the sketch below).
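As a sketch of the usual transformation, recursion can be replaced by an explicit worklist, which removes the stack-depth limit and produces independent work items that are easy to ship to cluster workers (the tree structure here is invented for illustration):
# depth-first sum over a tree without recursion; each node is a list
# with $value and $children (a list of child nodes)
iter_sum <- function(root) {
  stack <- list(root)
  total <- 0
  while (length(stack) > 0) {
    node <- stack[[length(stack)]]      # pop the last item
    stack[[length(stack)]] <- NULL
    total <- total + node$value
    stack <- c(stack, node$children)    # push the children
  }
  total
}
Because each popped item is self-contained, a distributed version can hand batches of worklist items to different machines instead of processing them locally.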
Suppose we change the default block size to 32 MB and the replication factor to 1, and suppose the Hadoop cluster consists of 4 DataNodes (DNs) and the input data size is 192 MB. Now I want to place the data on the DNs as follows: DN1 and DN2 hold 2 blocks (32+32 = 64 MB) each, and DN3 and DN4 hold 1 block (32 MB) each. Is this possible, and how can it be accomplished?
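On the block size and replication half, a hedged example of setting both per file at load time (the paths are placeholders):
hdfs dfs -D dfs.blocksize=33554432 -D dfs.replication=1 -put input.dat /data/
Which DataNode receives which block is decided by the NameNode's placement policy, however, so pinning specific blocks to DN1-DN4 would, as far as I know, require plugging in a custom BlockPlacementPolicy rather than any configuration setting.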
I have a set of given points and I am going to cluster (i.e. group) them based on the Kmeans++ algorithm (an extended version of the Kmeans algorithm). However, Kmeans++ has a random initialization step, and because of this, for a fixed set of points I get different results (clusters) every time I run my code!
Is there any way to fix this problem of Kmeans (and Kmeans++)? Thanks!
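If the Kmeans++ implementation at hand exposes its random number generator, fixing the seed before each run makes the initialization, and hence the clusters, reproducible. A minimal R sketch with the built-in kmeans() (which uses random initial centers rather than Kmeans++ seeding, so this only illustrates the idea):
set.seed(42)                                # identical starting state on every run
fit <- kmeans(x, centers = 5, nstart = 25)  # keep the best of 25 random starts
Fixing the seed makes results repeatable; the nstart-style restart strategy additionally makes them stable in substance, since the best of many initializations is far less sensitive to any single random draw.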
I tried to make AutoDock Vina run on a cluster, but without much success. Can anyone suggest what I should do?
I am leveraging the SLEPc library to solve for the first "k" (k = 3 or 4) eigenvalues and their corresponding vectors of a matrix of size 200k x 200k. The matrix is sparse and symmetric. I intend to compute the eigenvalues without using MPI, as a result of which the entire computation takes more than 2 minutes. I have tried a lot of eigensolvers, such as Krylov-Schur, Jacobi-Davidson, and Rayleigh quotient CG, but each of them takes an awful amount of time to finish the computation.
On the contrary, Lanczos/Krylov-Schur returns the extreme eigenvalues of the spectrum quite fast. It would be awesome if I could replicate that behavior for the interior values of the spectrum, but so far I have not met with any success.
Is there any way to accelerate the convergence using a single MPI process? Tweaking the tolerance and maximum iterations does not help. Do I need to try some other library for a quick computation of the eigenvalues? Has anyone tried what I am trying to do?
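For interior eigenvalues, the standard SLEPc device is a shift-and-invert spectral transformation aimed at a target value; a hedged sketch using runtime options (the executable name and the target 0.5 are placeholders):
./solver -eps_nev 3 -eps_target 0.5 -eps_target_magnitude -st_type sinvert
Shift-and-invert maps the eigenvalues nearest the target to the extremes of the transformed spectrum, which is exactly the regime where Krylov-Schur converges quickly; the cost moves into a single sparse factorization performed by the ST object, usually far cheaper than thousands of slow iterations.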
I use openSUSE 13.2 and I am interested in performing quantum chemical calculations.
I am on a Linux platform with MySQL NDB 5.7. I am trying to monitor all traffic related to MySQL clustering, between the data nodes, management node and SQL nodes. To that end, I used netstat to list all open ports listening on my machine before starting the MySQL cluster. Then, I started the MySQL cluster and ran netstat again. I assumed that the ports that were listening the second time around, but not the first time, were related to MySQL clustering.
But there are two problems with this. First, there could be ports opened by other processes between the two netstat runs. Second, MySQL might open other ports after I ran the netstat command the second time.
What is the best way to go about finding all ports being used by MySQL for clustering purposes?
I believe ephemeral ports are picked dynamically, so perhaps if I knew all the MySQL clustering related processes that would be running, I can figure out every port that they are using. Pointers will be very welcome.
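One hedged alternative to discovering the ports afterwards is to make them deterministic up front, since the data nodes otherwise pick dynamic ports for their transporters; a sketch of the relevant config.ini fragment (the port number is a placeholder):
[ndbd default]
ServerPort=50501
With ServerPort pinned, plus the well-known defaults (1186 for the management server, 3306 for the SQL nodes), the complete port set is known in advance and can be monitored directly instead of being inferred from netstat diffs.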
How can I get the NMR results from the gaussian_virgo cluster computational study?
I have installed two single-node Hadoop clusters on two different machines using VMware, but I am having difficulty connecting the guest machines (that is, connecting the two virtual machines). It would be helpful if anyone could provide a detailed description of a Hadoop multi-node cluster setup, including the required networking.
Dear researchers of the world,
Please help me: how can I tackle the NP-hard bin packing problem without using exact mathematical solutions? And if mathematical solutions exist, could you teach me how they work?
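As one illustration of the non-exact route, greedy heuristics such as first-fit decreasing sidestep the NP-hardness by accepting slightly suboptimal packings; a minimal R sketch (the capacity and item sizes are made-up data):
# first-fit decreasing: sort items largest-first, place each in the
# first bin with room, opening a new bin only when none fits
ffd <- function(sizes, capacity) {
  bins <- list()
  for (s in sort(sizes, decreasing = TRUE)) {
    placed <- FALSE
    for (i in seq_along(bins)) {
      if (sum(bins[[i]]) + s <= capacity) {
        bins[[i]] <- c(bins[[i]], s)
        placed <- TRUE
        break
      }
    }
    if (!placed) bins[[length(bins) + 1]] <- s
  }
  bins
}
ffd(c(4, 8, 1, 4, 2, 1), capacity = 10)   # packs into 2 bins
First-fit decreasing is guaranteed to use at most roughly 11/9 of the optimal number of bins (plus a small constant), which is often acceptable in practice.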
I would like to perform docking studies with Discovery Studio 2.1 LigandFit.
I have already installed it on a Windows Server 2008 machine with a 2.4 GHz quad-core Xeon and 16 GB RAM.
But it takes a long time to complete the task. Please let me know whether I can perform my task in a distributed environment across several PCs/servers.
Discovery Studio provides a facility for parallel processing, but I need some help setting this facility up.
Waiting for a perfect and intelligent answer.
I want to calculate CuBTC charges using Gaussian 09. I separated out a cluster (C18H3Cu6O24) and need to set its multiplicity, but I do not know how to determine the multiplicity of a cluster.
Could anyone please suggest how to do it?
Thank you so much in advance.
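For reference, the multiplicity Gaussian asks for is 2S + 1, so the practical first step is an electron count. For C18H3Cu6O24 taken as a neutral fragment:
electrons = 18(6) + 3(1) + 6(29) + 24(8) = 108 + 3 + 174 + 192 = 477
An odd count like this rules out a singlet: a neutral cluster of this formula must have an even multiplicity (a doublet at minimum), while changing the overall charge by one makes the count even and a singlet possible. Which spin state is actually lowest usually has to be settled by comparing the energies of a few candidate multiplicities.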
I want to study cluster formation on a highway using MATLAB. On the highway, 900 vehicles are used, in both directions.
Hello,
I would like to know whether it is possible to analyze RNA-seq data locally on Linux, given the lack of a computer cluster, and whether it is possible to do de novo assembly, given the lack of a genome for my species.
If possible, please give me links to tutorials on using the RNA-seq analysis tools locally.
Thanks for your help.
Some time ago I used the Wolfram Lightweight Grid Manager to set up Mathematica for parallel computing in my local network. All the computers had Mathematica installed and configured to use the license server. This environment worked properly.
My question is whether it is possible to configure Mathematica (under a site licence) to work similarly to the setup described above, but without the Wolfram Lightweight Grid Manager and the license server. I am thinking of the following situation: one computer is the master, and the others (working together in the LAN) share their computing kernels.
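If I remember the API correctly, remote kernels can be launched directly over ssh without the grid manager; a sketch to be checked against the SubKernels`RemoteKernels` documentation, with the hostnames and kernel counts as placeholders:
Needs["SubKernels`RemoteKernels`"]
LaunchKernels[RemoteMachine["node1.local", 4]]  (* 4 kernels on node1 *)
LaunchKernels[RemoteMachine["node2.local", 4]]
ParallelTable[PrimeQ[2^i - 1], {i, 40}]         (* now runs on the remote kernels *)
Whether this is permitted without the license server is a licensing question (how many controller and compute kernels the site license covers) rather than a technical one.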
Hello All,
My department (in Pakistan) has a small budget (up to $50,000) to buy a computing cluster. I have experience working on a cluster, but only as a user. Can any expert help me analyze the strength of the following specifications for a small cluster, and suggest any improvements (file attached)? Mostly the codes that will be run on the cluster are Gaussian 09, Crystal 09 and some protein X-ray data refinement software.
Thank you very much in advance for your precious time and valuable comments.
I have a big dataset which has a lot of zeros. I want to use a fuzzy clustering algorithm. Do you know a useful fuzzy clustering method for sparse datasets?
I am interested in any papers written on the loss of performance in HPC (High Performance Computing) applications due to thermal throttling. My current R&D project goal is to replace passive heat sinks with active liquid cooling that will match the keep-out, life and cost while providing about 6 times the thermal performance. I need to translate the thermal performance improvement into an estimate of the increase in compute performance.
In research papers, I have come across terms like Grid sites or clusters. I was wondering if these two words can be used interchangeably?
Hello all,
I need a clustering algorithm that satisfies these conditions:
- does not require specifying the number of clusters a priori;
- handles clusters of different densities in the data;
- has minimal complexity.
I have read about DBSCAN, which satisfies some of these conditions, but I don't know whether another algorithm exists.
Best regards,
Thank you
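Among the usual candidates, the hierarchical variant of DBSCAN is the one aimed at the varying-density condition; a minimal sketch with the R dbscan package (the data object x and the minPts value are illustrative):
library(dbscan)
fit <- hdbscan(x, minPts = 10)  # no k and no global eps; adapts to local density
fit$cluster                     # cluster labels; 0 marks noise points
Plain DBSCAN satisfies the first condition but relies on a single global eps, which is precisely what fails when cluster densities differ; HDBSCAN and OPTICS were designed around that limitation.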
I am trying to do post-analysis of a GROMACS simulation with tools like g_rdf, g_hbond, etc. As my system is very large, the analysis takes a very long time to finish. I would like to run these jobs in batch mode, e.g., by submitting them to a computing cluster. The problem is that during these analyses I need to choose options interactively. How can I handle the interactive selection when the job runs in batch mode?
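The usual workaround is to pipe the interactive answers into the tool on stdin, so the group selection happens non-interactively inside the batch script; a hedged sketch (file names and group numbers are placeholders that depend on your system and index groups):
echo "1 1" | g_rdf -f traj.xtc -s topol.tpr -o rdf.xvg
echo "1 1" | g_hbond -f traj.xtc -s topol.tpr -num hbnum.xvg
Each number on the echoed line answers one selection prompt, in the same order you would type them interactively.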
How can I classify a set of concepts in conceptual graphs? Do you have any idea? Thanks.
Is there any sample implementation in OPNET?
I am looking for sample code for a cluster routing protocol.
Please give examples from which I can learn.
I would like to know how I can speed up a CCSD(T) calculation. Should I add more CPUs or increase the RAM, as in an MP2 calculation? What is a suitable approach? Thank you very much.
I would like to set up a computational laboratory, and I would like to know how much it would cost (including electricity and maintenance) to start one up with 200 processors to begin with.
Considering that there are multiple attributes:
Is there a way of saying that one cluster is related to another in terms of such-and-such?
How can I determine a confidence interval, or the probability that a given member lies in a cluster?
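On the membership-probability part, model-based clustering provides exactly that as a by-product, since every point gets a posterior probability for each cluster; a minimal R sketch (x is an illustrative data matrix):
library(mclust)
fit <- Mclust(x)       # fits Gaussian mixture models, choosing among them by BIC
head(fit$z)            # row i = probabilities that point i belongs to each cluster
fit$uncertainty        # 1 - max probability, per point
Hard-label methods like k-means cannot answer the probability question directly, so a probabilistic statement about membership generally needs a model-based or fuzzy method.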
We have two HP workstations, one with an 8-core Intel processor and 16 GB RAM and another with a 12-core processor and 16 GB RAM. Can we cluster these two workstations and run OpenMP-threaded C code across them?
I ran a cluster analysis using STRUCTURE on Windows 8 and saved the project. When I tried to retrieve the project via the File - Recent project menu, an error message popped up: "cannot read project file". I need suggestions on how to visualize or re-access a completed and saved project in STRUCTURE. Can anyone help?
I need to use two different oxidation states for a metal in a cluster in a VASP calculation. For example, in a 16TiO2 cluster, the first 8 titanium atoms are in the +2 oxidation state and the other 8 are in the +4 oxidation state. What parameters do I need in the INCAR, POTCAR and POSCAR files? Kindly help. Thank you.
Hello,
Recently I ran a CCSD(T) calculation, but the output only contains the results for every process, not a global result (total energy or the thermochemistry results). Should I add an input keyword to request this, or am I supposed to calculate the thermochemistry independently?
I used:
$BASIS GBASIS=N311 NGAUSS=6 NDFUNC=2 NPFUNC=2 DIFFSP=.TRUE. DIFFS=.TRUE. $END
$CONTRL SCFTYP=RHF RUNTYP=OPTIMIZE CCTYP=CCSD(T) NZVAR=1 $END
$SYSTEM MWORDS=300 MEMDDI=300 PARALL=.TRUE. $END
$SCF DIRSCF=.TRUE. DIIS=.TRUE. DAMP=.TRUE. $END
$STATPT OPTTOL=0.0005 NSTEP=50 HESS=CALC HSSEND=.TRUE. $END
$FORCE METHOD=FULLNUM $END
$ZMAT DLC=.TRUE. AUTO=.TRUE. $END
Thank you for your time
I am working on the success rate of DNA barcoding in the identification of species using distance- and tree-based methods.
Regarding the distance-based method, I have used Adhoc and Species Identifier and confirmed species identification using the best close match (BCM) criterion.
Please suggest a program/software/method for tree-based identification (using NJ, parsimony, or Bayesian trees) that exhibits clustering (the number of clusters formed for a particular species) and will flag singleton (= ambiguous) species (please see the attachment).
(Please note: it is not possible to do this manually, as I have >4000 specimens.)
Answers detailing methods or software are appreciated! Thank you!

I am using Ubuntu in VMware on 4 nodes running Windows 7. I have configured each Ubuntu VM with 1 GB RAM, 1 processor and a 20 GB hard disk. If I used these configurations directly on physical nodes instead of in VMware, would the performance be better?
I need to set up a computing environment for several moderately demanding applications - some pure processing and some data acquisition, storage and analysis. The applications include numerical weather prediction (using WRF) and 3D rendering of an entire city. There will be two separate systems, one Linux, one Windows. My current plan is to create a high performance computing cluster (HPCC). For the Linux side I would like to use CentOS, but it may not be the best choice for scheduling compute nodes for distributed processing. I would like opinions on 1) which Linux OS is best to use, 2) which Windows OS is best, 3) whether anyone has done this entirely using remote services like those offered by Amazon, and 4) any hardware suggestions. Many thanks!
I want to know whether the MapReduce paradigm is better than MPI (Message Passing Interface). Which type of parallelism, i.e. data or task parallelism, is followed in MPI and in MapReduce?
First I tried to run mpiBLAST on a Rocks cluster, but it always ends up with errors. Can anyone suggest what I should do?
I have access to a computer cluster made of 44 nodes. Each node has 12 cores and 48 GB RAM. The main problem is that jobs have a maximum walltime of 6 h, after which they get killed, but I can use as many nodes as I like, which means that for a job I could use, e.g., up to 10 nodes = 120 cores and 480 GB RAM, or more.
So, in order to make a TopHat job finish in 6 hours, I wanted to parallelize it across a very large number of cores with the appropriate RAM per core, and specify the number of cores through the -p parameter.
My problem is that I can't get TopHat to recognize the cores of multiple nodes, and I think this is because I have to launch it using "mpirun" (mpirun -np XX tophat etc.). But when I run TopHat this way, it fails to start. So I'm wondering whether TopHat is MPI-capable or not. Do I need a recompiled version? I found clues on the internet of TopHat being run through MPI (http://seqanswers.com/forums/archive/index.php/t-11472.html).
Can anyone give me suggestions or alternatives?
Thanks.
I am working on a stream of data which yields 256-bit strings. I want to cluster them in real time (i.e., I don't want to save them in memory, and I want to give each instance a cluster id in real time).
I have a similarity measure for these strings, defined as the number of 1s in the AND of the two strings divided by the number of 1s in their OR (very similar to Jaccard similarity).
I initially worked with sequential leader clustering for this, and it gave great results. The only problem is that it cannot be applied to distributed systems.
I have found that minhashing and LSH can be implemented for real-time text clustering. As per my understanding, minhashing is used to generate signatures, and then the signatures are clustered by banding; and if the hash functions are the same, the minhash signatures are the same on every node in a distributed setting.
Can there be an LSH approach for clustering the bitstrings in my case, with the defined similarity measure?
Or can I apply minhashing/LSH with some trick to get the desired results?
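A minimal sketch of minhashing in this setting, treating each 256-bit string as the set of its 1-positions (the signature length of 64 and the band layout are illustrative choices):
# one minhash per random permutation of the 256 bit positions
set.seed(1)
perms <- replicate(64, sample(256), simplify = FALSE)
minhash <- function(bits, perms) {   # bits: logical vector of length 256,
  ones <- which(bits)                # assumed to contain at least one 1
  sapply(perms, function(p) min(p[ones]))
}
# band the 64-value signature into 16 bands of 4 values each;
# strings sharing any complete band become candidates for the same cluster
bands <- function(sig) split(sig, rep(1:16, each = 4))
Since your AND/OR measure is exactly the Jaccard similarity of the 1-position sets, the standard guarantee applies unchanged: two strings agree on a single minhash with probability equal to their similarity, so the usual LSH banding analysis carries over, and identical permutations on every node give identical signatures in a distributed setting.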
I want to know the architecture of cluster computers and the layered design of such systems.
I want to know how to check how robust the clustering in a phenogram (not a cladogram) is.
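One common answer is multiscale bootstrap resampling of the hierarchical clustering, which attaches a support value to every node of the tree; a minimal R sketch with the pvclust package (the data object is illustrative, and note that pvclust clusters the columns):
library(pvclust)
fit <- pvclust(data, method.hclust = "average",
               method.dist = "euclidean", nboot = 1000)
plot(fit)                   # AU/BP support values shown at each node
pvrect(fit, alpha = 0.95)   # boxes around clusters with >= 95% AU support
Nodes with high approximately-unbiased (AU) support are the ones that survive resampling and can reasonably be called robust.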
Is there any LRM software which supports measurement and prediction of the performance of batch jobs?
Can we combine cluster computing and cloud computing? That is, in a cloud environment, if we want to balance the load of VMs on physical machines, can we apply concepts from cluster computing? What software is available to accomplish this?
I want to connect some PCs together to form a grid. Using Globus is one option; I would like to hear about other options, the challenges in this work, and whether it has a bright future in India.
I am using R to generate a pairwise distance matrix of a large dataset (3000 grid cells), and I am interested in grouping the cells into ten clusters.
I have successfully used the cutree() function in R to get 10 clusters, but I am stuck on how to create a new matrix based on the mean of all pairwise grid-cell values for the new clusters defined by cutree. What I want is to use a summary (i.e. the mean) of the new clusters to get a new dendrogram based ONLY on the means. I only want to show a dendrogram of the means of the summarized clusters (i.e. the 10 clusters) retrieved using cutree, so that I can then do further analyses, e.g. metaMDS in vegan.
I have looked through various R help pages to see whether I can get a new "summary" dendrogram based on the mean of all pairwise grid-cell distances between the clusters, but without success.
I attached a picture as an example of what I hope to represent.
Any help especially in the form of R code will be highly appreciated.
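A minimal sketch of one way to do this, assuming d is the full pairwise distance object, hc the hclust tree it produced, and k = 10 (all names are illustrative):
grp <- cutree(hc, k = 10)
dm  <- as.matrix(d)
# mean of all pairwise cell-to-cell distances between each pair of clusters
m <- matrix(0, 10, 10)
for (i in 1:10) for (j in 1:10)
  m[i, j] <- mean(dm[grp == i, grp == j])
hc2 <- hclust(as.dist(m))   # dendrogram of the 10 cluster summaries only
plot(hc2)
The same 10 x 10 matrix can be passed straight to vegan::metaMDS(as.dist(m)) for the downstream ordination mentioned above.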
I need clear step-by-step guidelines that I can follow. Any relevant links would be useful.
I am using the CLUTO clustering toolkit to cluster software faults from multiple versions of a software project. The goal is to find a clustering solution that provides insight into the characteristics of the software, and its defects. I would like to show that I get consistent clusters on different versions, and identify differences as they occur over time.
I have not seen any papers where this has been done. So far I am thinking of a chi-square test for homogeneity. The clusters are ranked, and the ranks seem to be fairly consistent as well. Adding ranks into the comparison may help my argument, but I'm not sure which test would be most appropriate. Any suggestions on the best way to do this? Any references to see similar comparisons in other applications of clustering?
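For the consistency claim itself, one widely used measure is the adjusted Rand index, which scores the agreement between two clusterings of the same items and corrects for chance; a minimal R sketch (the label vectors for faults shared by two versions are illustrative):
library(mclust)
adjustedRandIndex(labels_v1, labels_v2)  # 1 = identical partitions, ~0 = chance level
An ARI near 1 on the shared faults would directly support the claim of stable clusters across versions, and it is commonly reported alongside chi-square homogeneity tests in clustering-comparison studies.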