# Clustering Algorithms - Science topic

Questions related to Clustering Algorithms

I ran k-means clustering on the same dataset with two different tools (Weka and RapidMiner) and got different clusters. Which result should I use? Your suggestions are welcome.
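Before choosing between two tools, it may help to check whether the two partitions actually differ or only use different cluster IDs. Below is a minimal sketch of the Rand index, which ignores how clusters are numbered. The label lists are hypothetical stand-ins for assignments exported from each tool, not real Weka or RapidMiner output.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:  # both tools group the pair, or both split it
            agree += 1
    return agree / len(pairs)

# Hypothetical cluster assignments exported from two tools.
tool_a = [0, 0, 0, 1, 1, 1]
tool_b = [1, 1, 1, 0, 0, 0]  # same partition, cluster IDs swapped
tool_c = [0, 0, 1, 1, 1, 1]  # one point assigned differently

same_partition = rand_index(tool_a, tool_b)
different_partition = rand_index(tool_a, tool_c)
```

A value of 1 means the partitions are identical up to renaming. Lower values usually point to different random initialisation, preprocessing, or default settings rather than to one tool being wrong.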

Please suggest how to handle an ordinal categorical dataset with a clustering algorithm.

Specifically, what is the difference between

A) POPDATA=1, LOCPRIOR=1, LOCDATA=0, LOCISPOP=1 and

B) POPDATA=0, LOCPRIOR=1, LOCDATA=1, LOCISPOP=0 ?

For the clustering algorithm and calculating Q, I think it is the same. In both cases, the Locprior model uses the sampling location information, only from formally different sources (POPDATA or LOCDATA column). The further difference is that the A settings generate Q values not only for individuals but also for pre-defined populations.

Am I right?

Thank you.

I want to use this algorithm to group users based on their similarities.

I have a longitudinal dataset of readings from 500 sensors at 25 time points.

I want to cluster those sensor readings at each time point, from T1 to T25,

but I also want an easy way to see which cluster each sensor falls in at each time point.

Assuming I have found 3 clusters at T1 and 2 at T2, is there a way to measure whether the sensors in cluster 1 at T1 are significantly similar to the sensors in clusters 1 and 2 at T2?

Another question: which machine learning clustering algorithms are good for my case? Which ones do not require hyperparameter tuning (such as defining the number of clusters)?

Lastly, is there a visualisation method to plot each sensor's cluster over time?
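One simple way to relate clusters found at two time points is to compare their membership sets. The sketch below computes the Jaccard index between every T1 cluster and every T2 cluster. The label lists are made-up examples with 8 sensors rather than 500, and the index is only descriptive: a significance statement would need something extra, such as a permutation test.

```python
from collections import defaultdict

def cluster_members(labels):
    """Map each cluster ID to the set of sensor indices assigned to it."""
    members = defaultdict(set)
    for sensor, cluster in enumerate(labels):
        members[cluster].add(sensor)
    return dict(members)

def jaccard_overlap(labels_t1, labels_t2):
    """Jaccard index between every T1 cluster and every T2 cluster."""
    m1, m2 = cluster_members(labels_t1), cluster_members(labels_t2)
    return {
        (c1, c2): len(s1 & s2) / len(s1 | s2)
        for c1, s1 in m1.items()
        for c2, s2 in m2.items()
    }

# Hypothetical assignments: position i holds the cluster of sensor i.
labels_t1 = [0, 0, 0, 1, 1, 2, 2, 2]  # 3 clusters at T1
labels_t2 = [0, 0, 0, 0, 0, 1, 1, 1]  # 2 clusters at T2
overlap = jaccard_overlap(labels_t1, labels_t2)
```

Keeping one label list per time point also answers the bookkeeping part of the question: stacking the 25 lists gives a sensors-by-time table of cluster IDs that can be plotted as a heatmap.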

I have a set of 2000 sequences. In that set there are two sequences

>Seq1

AAAAAAAAAAAAAAAAAAAAAAA

>Seq2

UUUUUUUUUUUUUUUUUUUUU

When clustering with CD-HIT-EST with default options, these two are put in the same cluster, with Seq1 as the representative and Seq2 as.... -/100%

Is this correct?
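A possible explanation, offered with some caution: the "-" in "-/100%" appears to mark a match on the reverse-complement strand, and the reverse complement of a poly-A run is a poly-U run. The sketch below only illustrates that relationship for RNA letters. Whether CD-HIT-EST itself treats U like T internally is an assumption here, and the 10-base sequences are shortened stand-ins for the ones in the question.

```python
# Translation table for RNA bases: A<->U and C<->G.
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement_rna(seq):
    """Complement every base, then reverse the sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

seq1 = "A" * 10
seq2 = "U" * 10

# True when seq2 is exactly the minus strand of seq1.
matches_minus_strand = reverse_complement_rna(seq1) == seq2
```

If strand-insensitive matching is not wanted, the CD-HIT documentation describes an option that restricts comparison to the plus strand; it is worth checking there before relying on the default.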

We are applying the k-means clustering algorithm to unlabeled data. Our aim is to end up with a result that shows two possibilities. Is it then necessary to carry out k-NN classification after the clustering?

I currently use .csv files to work with pandas dataframes and perform UMAP analyses and I would like to use Scanpy moving forward. Can anyone help me with converting .csv files into Anndata files for Scanpy?

I am planning to conduct a cluster analysis of EEG PSD values and respective synchronicity estimates (e.g. alpha/theta) using deep learning methods. I am still in the literature review stage and I would like to know the latest deep learning algorithms used for EEG clustering.

I have a dataset like this:

users T1 T2 … Tn

1 [1,2,1.5] [1,3,3] … [2,2,6]

2 [1,5,1.5] [1,3,4] … [2,8,6]

.

n [1,5,7.5] [5,3,4] … [2,9,6]

Each list is a distinct incident, and the incidents change over time.

My aim is to find distinct incidents that might happen to users over time.

I thought of feeding the full dataset to clustering algorithms, but I need advice on the best algorithms for such a 2D dataset, or the best approach to solving this problem.

I want to learn and include the temporal relationship alongside my variables at each time series instance.

I learnt that earlier work used models to represent all variables by a single variable or to reduce the dimensionality of the variables.

I also read that some authors created a time-indexing variable containing first differences of time (for the first instance, T2-T1). I want to know more techniques for including the temporal relationship as a variable, or by any other means, before feeding my dataset to clustering algorithms. Do you know any other techniques to represent this as a feature, or to transform existing features so that they include temporal/spatial patterns (what some call inter/intra patterns)?

For example, suppose there are 50 data points which need to be clustered into seven parts, in a way so that each cluster contains at least five and at most 10 data points.

I understand that one of the methods to initialize the k medoids is by finding the minimum sum of the distance to every data point. In terms of the SWAP stage, some resources stop the iteration after a predefined number of iterations. However, the original PAM algorithm stops iterating when the Total Deviation (SSE) is minimized.

So, could you please confirm my method of doing the SWAP stage?

I sort the data points from the lowest sum of the distance to the highest. If the K is set to 3 then the first three data points are the initial medoids.

{M3, M1, M5, x0, x2, x7, x9, x8, x6, x4}

In the SWAP stage, first I calculate the Total Deviation (TD) for the initial medoids and set it aside for comparison.

Then I calculate the TD when swapping M3 with x0, x2, x7, x9, x8, x6, x4 and stop this process when the previous TD (within the loop) is smaller than the current TD. And I set the outcome of this process aside for comparison.

I repeat the last process for M1 and M5 and set the outcomes aside for comparison.

Finally, I will end up with four TD; one from the initial medoids and the other three from checking the initial medoids against the non-medoids data points. So, I just pick the outcome of the smallest TD.

Is this the correct way to do the SWAP stage?
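For comparison, here is a minimal sketch of the SWAP stage as it is commonly described: evaluate every (medoid, non-medoid) pair, apply the single best swap, and repeat from the new medoid set until no swap lowers the total deviation. This differs from scanning each medoid once and stopping at the first increase. The 1-D data, the deliberately poor starting medoids, and the function names are illustrative assumptions, not taken from the original PAM code.

```python
def total_deviation(points, medoids, dist):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_swap(points, medoids, dist):
    """Apply the best single swap repeatedly until TD stops improving."""
    medoids = list(medoids)
    best_td = total_deviation(points, medoids, dist)
    while True:
        best_swap = None
        for i in range(len(medoids)):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Replace medoid i with the candidate and re-evaluate TD.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                td = total_deviation(points, trial, dist)
                if td < best_td:
                    best_td, best_swap = td, trial
        if best_swap is None:  # no swap improves TD: local optimum reached
            return medoids, best_td
        medoids = best_swap

# Hypothetical 1-D data with two obvious groups and a poor starting choice.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
medoids, td = pam_swap(points, [1.0, 2.0], dist=lambda a, b: abs(a - b))
```

Stopping on "no improving swap" is what guarantees a local minimum of TD; an iteration cap can still be added as a safety limit on large datasets.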

How can optimization algorithms be combined with clustering algorithms?

I am doing research on image clustering and need a suitable image dataset for it.

Hello,

I would like to cluster my dataset, which contains approximately 35,000 protein sequences. Be that as it may, I need to make some clusters in terms of their superfamilies. Thus, I would like to set the number of clusters; for example, we can specify the number of clusters in the K-Means cluster algorithm.

Can I define the number of clusters (i.e., 10, 50, 250) using the MMseqs2/Linclust tool? Thank you!

Hi, I am working on UAV deployment in a wireless communication scenario for caching. Does anybody know how to simulate/deploy UAVs in Matlab using weighted or simple k-means clustering? I have also attached a snapshot below.

Thank you

Hi,

I have several .fastq files from Nanopore sequencing technology. I have already done trimming and filtering. I was also able to do clustering using isONclust. Nevertheless, I have the following concern. How do I generate a single "otu table" with the clustering files' outputs separated by samples? In short, I run the following commands, but I do not know how to merge everything into one file.

isONclust --fastq sample1_filter.fastq --ont --medaka --outfolder sample1_clust/

isONclust --fastq sample2_filter.fastq --ont --medaka --outfolder sample2_clust/

isONclust --fastq sample3_filter.fastq --ont --medaka --outfolder sample3_clust/

Hello,

I'm working on a clustering analysis and would be curious if anyone has ideas about how to deal with **nested categorical** variables. Normally I would calculate a distance/dissimilarity matrix (Gower when some variables are categorical), and then feed this to a clustering algorithm of choice. Now what happens when some categorical variables are nested?

**Fictitious example**

If measuring characteristics of water samples like turbidity, temperature, dissolved gases, and presence/absence of 50 chemical compounds in the water.

* presence/absence of chemical compounds can be treated as 50 separate binary/categorical variables

* but say that these chemicals belong to 4 groups of compounds?

**Thoughts**

We could simply add an additional categorical variable "group" and, for more complex nesting, "subgroup", "subsubgroup"... OK, but as far as I understand, Gower distance is a bit like Manhattan distance in that it calculates a distance for each variable and then takes a weighted sum. But then part of the information will be redundant, and even more so if there are more levels of nesting. I was wondering whether anyone has come up with something else to specifically deal with that. Maybe some form of weighting of the variables?

Looking forward to your inputs!

Mick
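The weighting idea can be sketched directly. Below is a minimal weighted Gower distance in which each compound group receives a total weight of 1, split evenly among its member variables, so that a large group does not dominate the distance. The variable layout, the ranges, and the equal-weight-per-group rule are assumptions made for illustration, not an established recipe for nested variables.

```python
def gower_distance(a, b, kinds, ranges, weights):
    """Weighted Gower distance between two records.

    kinds[k] is "num" or "cat"; ranges[k] is the observed range of a numeric
    variable (ignored for categorical ones); weights[k] scales its share.
    """
    total, weight_sum = 0.0, 0.0
    for x, y, kind, rng, w in zip(a, b, kinds, ranges, weights):
        if kind == "num":
            d = abs(x - y) / rng if rng else 0.0  # range-normalised difference
        else:
            d = 0.0 if x == y else 1.0            # simple matching
        total += w * d
        weight_sum += w
    return total / weight_sum

# Hypothetical water samples: temperature, then presence/absence of four
# compounds. Compounds 1-3 belong to group A, compound 4 to group B.
kinds = ["num", "cat", "cat", "cat", "cat"]
ranges = [20.0, None, None, None, None]
# Each group gets a total weight of 1, split evenly among its members.
weights = [1.0, 1 / 3, 1 / 3, 1 / 3, 1.0]

s1 = (10.0, 1, 1, 0, 1)
s2 = (15.0, 1, 0, 0, 0)
d = gower_distance(s1, s2, kinds, ranges, weights)
```

Deeper nesting could be handled the same way by dividing a parent's weight among its children, though how well that reflects the chemistry is a modelling decision.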

I am trying to cluster a few categorical & continuous observations using an unsupervised cluster algorithm. I have used two-step clustering since I have categorical predictors! Could someone help me with algorithms to compute predictor importance in unsupervised learning?

I need to run a clustering algorithm with the RPL protocol in the COOJA simulator. Could you please provide clustering code or any other information? How can I create the clusters and choose the cluster heads? And how do the sensor nodes communicate with the sink via the cluster heads? Any suggestions are welcome.

I am developing a dataset with scientific articles. The dataset is a matrix where the rows are the different articles and the columns are the keywords extracted from the articles themselves. The value of each cell is the relevance of each word for each article normalized in L2.

I want to use clustering techniques to find out groups of articles based on the keywords extracted but the sparsity of the dataset is huge as you can imagine. However, the matrix itself is not sparse, as the zeroes mean that a word has no relevance in an article.

Having this in mind, I used UMAP for dimensionality reduction to visualize the dataset but I am not sure when to use the clustering algorithm. Should I use it before I reduced the dataset or after I reduced it?

I have seen in other posts that it is better to first apply the dimensionality reduction and then use the clustering algorithm but I'm not sure if this affects the output of the clustering.

I work with extremely short texts which span over 1 to 7 words. I have tried simple embeddings based clustering using simple clustering algorithms but the results are not satisfactory. Types of embeddings I have used are fasttext, glove etc. So I was wondering if there is any special work for such specific short texts to cluster and learn their underlying structures?

Please feel free to provide anything ranging from a blog to a code to a proper scientific paper.

I have several raster files and I need to perform a comparative analysis between them. In this context, a cluster analysis could be a good way to explore the possible grouping between raster files. Is there an R or Matlab script for this goal?

For unsupervised text clustering, the key thing is the init embedding for text.

If we want to use https://github.com/facebookresearch/deepcluster for text, the problem for text is how to get the init embedding from deep model.

BERT can not get good init embedding.

If we do not use deep model, is there better way to get embedding better than glove wordvec?

Thank you very much.

In my experience, pre-trained models are not suitable for unsupervised tasks. Especially in deep clustering when pre-trained models are used, they often have worse results than without pre-trained models.

What is the scientific reason for this?

Why are the learned representations in pre-trained models not suitable for clustering?

I'm a PGR student and I would like to implement a new clustering algorithm using NS3. I have basic programming skills and have read about object-oriented C++ and NS3. I tried to create my network and deploy the nodes and the sink. Does anyone have a good idea of how I can do this in NS3, or how I can get started? I'm confused and do not know what to do. Thank you for helping.

Hello everyone,

Could you recommend papers, books or websites about unsupervised neural networks?

Thank you for your attention and valuable support.

Regards,

Cecilia-Irene Loeza-Mejía

Hello to everyone. I am trying to use k-nearest-neighbour (KNN) analysis to fix minPts in the DBSCAN clustering algorithm. My dataset is composed of only 4 variables and 935 observations. I have found that with k = 5 (no. of variables + 1) DBSCAN outputs 2 clusters: one of 911 observations and one of 8 observations. If I use a larger k, such as sqrt(no. of observations) as many papers suggest, I get 909 observations in a single cluster and the others are classified as noise points.

Both could be possible results, but their meaning is fundamentally different. How can I get rid of this arbitrary choice of minPts, and hence k?

Thanks!
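One way to see how sensitive the result is to k is to compute the sorted k-distance curve for several values of k and compare where the knee falls. The sketch below computes that curve with plain Euclidean distance; the five 2-D points are placeholders for the 935 observations, and reading the knee off the curve is left to a plot.

```python
import math

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Placeholder data: a tight square of four points plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
curve = k_distances(points, k=2)
```

If the knee stays in roughly the same place across a range of k, the choice of minPts matters less; if it moves a lot, a hierarchical variant such as HDBSCAN may be worth considering, since it does not need a single eps.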

I want to develop an ensemble approach where the final layer of a CNN model(Flatten layer in this case) will be followed by a K-Means Clustering algorithm where I want to cluster inputs into a number of categories same as required number of categories in a task. I want help regarding how to apply K-Means Clustering with a CNN.

Hello researchers. Suppose we have an ensemble model that applies different association mining algorithms (such as Apriori and Tertius) and clustering algorithms (such as K-Means and Cobweb) to the same dataset, and I want to analyse which algorithm is the best. Is there a simple automated way to compare the performance of these association mining and clustering algorithms in terms of accuracy, time, etc. in the Weka tool?

Dear researchers,

I want to apply clustering tasks for city traffic time series for a research in Intelligent Transportation Systems.

Can you recommend a city traffic dataset that contains the speed, flow, or occupancy measurements for roads and also contains the type of traffic pattern for each road? This type can be provided as a class label for each road, or a class label for each day measurements on each road (a class label for the whole time series).

Any help is much appreciated, thanks in advance.

I want to evaluate the robustness of the clustering algorithm to noise ... How can I add noise to the data... Is there a well known method (such as salt and pepper in the image data)?
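For tabular data, two common perturbations are additive Gaussian jitter on every feature and uniformly distributed outlier points, the latter being loosely analogous to salt-and-pepper noise. The sketch below shows both; the function names, the noise level, and the bounding range are illustrative assumptions to be tuned to the data.

```python
import random

def add_gaussian_noise(data, sigma, seed=0):
    """Return a copy of data with N(0, sigma^2) noise added to each value."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in data]

def add_uniform_outliers(data, n_outliers, low, high, seed=0):
    """Return data plus n_outliers points drawn uniformly from [low, high]."""
    rng = random.Random(seed)
    dim = len(data[0])
    outliers = [
        [rng.uniform(low, high) for _ in range(dim)] for _ in range(n_outliers)
    ]
    return data + outliers

clean = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
jittered = add_gaussian_noise(clean, sigma=0.1)
with_outliers = add_uniform_outliers(clean, n_outliers=2, low=-10.0, high=10.0)
```

Robustness can then be reported as the agreement between the clustering of the clean data and the clustering of the perturbed data, over increasing sigma or outlier counts.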

I usually use Latent Dirichlet Allocation to cluster texts. What do you use? Can someone give a comparison between different text clustering algorithms?

There is an idea to design a new algorithm for the purpose of improving the results of software operations in the fields of communications, computers, biomedical, machine learning, renewable energy, signal and image processing, and others.

So what are the most important ways to test the performance of smart optimization algorithms in general?

Hi

We have two time series datasets, for example the mean daily temperature from two stations. Each of them has 30 values for a month.

$T_1 = \{t_1, t_2, \ldots, t_{30}\}$

$T_2 = \{t_1, t_2, \ldots, t_{30}\}$

Now we want to calculate the similarity between them. Some people may suggest the correlation coefficient for this task, but I think we use it when we consider the relation between datasets, not their similarity.

Is there an index for measuring similarity?

Thanks
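The distinction raised here can be shown with a small sketch: two series with identical shape but a constant offset have a correlation of 1 while their pointwise distance is clearly non-zero. The temperatures below are invented values, and RMSE is just one possible distance; dynamic time warping is another option when the series may be shifted in time.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    """Root-mean-square difference between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Invented daily means: station 2 follows station 1 but is 5 degrees warmer.
station_1 = [10.0, 12.0, 11.0, 15.0, 14.0]
station_2 = [t + 5.0 for t in station_1]

r = pearson(station_1, station_2)  # similarity of shape
d = rmse(station_1, station_2)     # similarity of values
```

Which index is appropriate depends on whether "similar" means moving together (correlation) or being close in value (a distance such as RMSE).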

dear researchers

I would be appreciative if you let me know your opinion about the disadvantages of the SOM clustering algorithm.

I want a matlab code for distributed energy-efficient clustering Protocol (DEEC) in wireless sensor network (WSN) with explanation please

thanks

In this equation how do I choose p (prob of cluster head)? I need to know what value I have to choose.

Hi

I am trying to segment a sentinel2 image.

At this stage, I want to run a binary classifier that assigns each pixel to either farm or non-farm pixel. For this purpose, I have 4 10m bands including R/G/B/NIR. I also have generated an NDVI raster for each month (8 months in total) that has values ranging from -1 to 1 (it can be normalized to 0 to 255).

I am looking for a classifier that can accurately classify the pixels using NDVI and/or any combination of my 4 10m bands.

Thanks in advance.

I am conducting exploratory research about users on the Ethereum blockchain (I obtain the data from big query), and I would like to cluster the users, mostly by transactional features, for persona/archetype development.

However, the data is not normally distributed, many of the variables have a power-law distribution and some have no clear distribution pattern. It is very likely that I would like to include more than five variables.

Besides the question of what algorithm fits best, is it reasonable to normalize all variables (to a more normal distribution) and to perform a z-transformation?
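A common treatment for heavy-tailed, non-negative features is a log transform followed by standardisation. The sketch below applies log1p and then a z-score to one invented feature; whether this is appropriate for every variable in the Ethereum data is a judgment call, and rank-based or quantile transforms are alternatives.

```python
import math
import statistics

def log_zscore(values):
    """Apply log(1 + x), then scale to zero mean and unit variance."""
    logged = [math.log1p(v) for v in values]
    mean = statistics.fmean(logged)
    std = statistics.pstdev(logged)
    return [(v - mean) / std for v in logged]

# Invented transaction counts with a heavy right tail.
tx_counts = [1, 2, 3, 5, 8, 1000, 50000]
scaled = log_zscore(tx_counts)
```

The log step compresses the tail so that a handful of very active addresses no longer dominate distance-based algorithms, while the ordering of users is preserved.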

Hi,

When using the scikit-learn function sklearn.model_selection.train_test_split, it is often recommended to set the parameter random_state=42 to get the same results across different runs.

Why do we use the integer 42?

Can we use another number?

thanks
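The value 42 has no special meaning; any fixed integer gives a reproducible split. The sketch below is a standard-library analogue of a seeded shuffle-and-split, written only to illustrate the idea. It is not scikit-learn's implementation, and `seeded_split` is a hypothetical helper.

```python
import random

def seeded_split(items, test_fraction, seed):
    """Shuffle with a fixed seed, then cut off the first part as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(20))
train_a, test_a = seeded_split(data, 0.25, seed=42)
train_b, test_b = seeded_split(data, 0.25, seed=42)  # same seed, same split
train_c, test_c = seeded_split(data, 0.25, seed=7)   # another valid seed
```

What matters is that the seed is fixed and reported. Repeating an experiment over several seeds is a reasonable way to check that conclusions do not depend on one particular split.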

I know there are plenty of methods for evaluating the clustering result on a single dataset. I am trying to apply the same clustering technique to two different datasets and then compare the similarity of the resulting clusters.

For example, I want to compare the results of the same clustering algorithm on two consecutive time intervals with different data.

To evaluate the performance of evolutionary clustering algorithms, there are internal and external evaluation metrics. Apart from these, are there any statistical measures for comparing clustering algorithms in terms of their best, average, and worst performance on a particular dataset?

Can heuristic or meta-heuristic fuzzy clustering algorithms help me? Any suggestions generally? I want to create learner’s profiles based on computational intelligence methods. The number of the groups (profiles) is unknown.

Does anyone know of a WORKING **implementation** of the original **DENCLUE** density-based clustering algorithm, and NOT its extensions? I have explored a few from GitHub but they are not working. Moreover, if anyone is interested in a **collaboration** in this area, he/she is welcome.

Regards

Apart from this, suggest few benchmark datasets for high utility mining.

Think of a scenario where there's a dataset with 15 attributes(i.e. columns heading). I want to apply the clustering algorithm on that dataset but not taking all 15 attributes but taking any of the combinations of those 15 attributes(that's what I mean by dynamic attributes clustering). How can I build any model which could be able to determine the optimal number of clusters and do clustering for any combination of the attributes provided?

I found numerous internal evaluation applicable to DBSCAN. Some researchers said that there is no available internal evaluation for DBSCAN, while some papers utilized the indexes such as DBI, DUNN, or S_Dbw.

Can I use the classical indexes for the internal evaluation of DBSCAN? Or, are there any limitations or challenges to apply those metrics?

I have already implemented a clustering algorithm for WSN (IoT) based on the remaining energy of the CHs, but I need to improve the algorithm for my academic project. Please suggest possible improvements.

Thanks

Normalized Mutual Information (NMI) and B3 are used for extrinsic clustering evaluation metrics when each instance (sample) has only one label.

What are the equivalent metrics when each instance (sample) has more than one label?

For example, in the first image we see [apple, orange, pears], in the second image [orange, lime, lemon], in the third image [apple], and in the fourth image [orange]. Putting the first and last images in one cluster is good, while putting the third and fourth images in one cluster is bad.

Application: Many popular datasets for object detection or image segmentation have multi labels for each image. If we used this data for classification (not detection and not segmentation), we have multiple labels for each image.

Note: My task is unsupervised clustering, not supervised classification. I know that for supervised classification, we can use top-5 or top-10 score. But I do not know what will be in unsupervised clustering.

Do we need to do feature scaling on each dimension of a multi-dimensional data distribution before applying K-Means clustering algorithm for it to be effective ?

Before starting on the K-Means clustering algorithm, is it advisable to convert the data to have zero mean and unit covariance?
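Because K-Means relies on Euclidean distance, a feature with a large numeric range can dominate the others. The sketch below shows a nearest neighbour changing after per-feature standardisation. The age and income rows are invented, and z-scoring each column is one common choice; full whitening to unit covariance is a stronger transform that is not shown here.

```python
import math
import statistics

def standardize(rows):
    """Scale every column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Invented records: [age in years, income in currency units].
rows = [
    [25.0, 50000.0],
    [60.0, 50500.0],
    [26.0, 51000.0],
    [45.0, 90000.0],
]

# Index of the row closest to row 0, considering rows 1 and 2 only.
raw_nearest = min((1, 2), key=lambda j: math.dist(rows[0], rows[j]))
scaled = standardize(rows)
scaled_nearest = min((1, 2), key=lambda j: math.dist(scaled[0], scaled[j]))
```

On the raw values income decides the neighbour, while after scaling the 35-year age gap does. Whether scaling is desirable depends on whether all features should count equally.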

I would like to know if there is any kind of research on the parameter selection (minpoints) for the HDBSCAN algorithm, in the context of having a training set with only normal samples and a test set that contains anomalies/outliers.

I want to apply the GLOSH algorithm to find the outliers, but since I have a "clean" dataset available, I suppose it is possible to exploit it.

I know that many other algorithms exist for novelty detection (OC-SVM, etc.), but I would like to try solutions based on HDBSCAN.

I would like to simulate bio-inspired, optimization-based geocast routing for VANET using a clustering algorithm in ns3 or Omnet++ with SUMO, and evaluate my approach against other existing approaches.

I need the source code of any geocast-based routing protocol for VANET (such as Gytar or Rover) or any other position-based routing protocol (GPSR, GPCR) that uses a clustering algorithm. Furthermore, I need to integrate an optimization algorithm such as ANT within the routing protocol to deal with optimal path selection.

I look forward to hearing from you.

thanks in advance.

Dear expert,

I am working on machine learning clustering algorithms for IoT sensor data fault detection and correction by considering non-spherical and unbalanced data in incremental way. Can you please suggest the best algorithm in this regard or best research article for sensor data fault detection and prevention.

I am using Hierarchical Agglomerative Clustering (HAC) and DBSCAN to find clusters in my data.

Please specify some validation technique to validate the results of clustering.

I have used the Silhouette Score to validate the results of DBSCAN. Will it work well with both clustering algorithms? I am using Python for the implementation.
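The silhouette can be computed for the output of either algorithm, but DBSCAN's noise points need a decision. The sketch below is a plain-Python mean silhouette that drops points labelled -1 before scoring. Excluding noise is an assumption made here, not a standard rule, and the silhouette tends to favour compact, convex clusters, so it may undervalue the elongated shapes that DBSCAN can find.

```python
import math

def silhouette_score(points, labels, noise_label=-1):
    """Mean silhouette over non-noise points (needs at least two clusters)."""
    idx = [i for i, lab in enumerate(labels) if lab != noise_label]
    clusters = {}
    for i in idx:
        clusters.setdefault(labels[i], []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i in idx:
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for lab, other in clusters.items()
            if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
labels = [0, 0, 1, 1, -1]  # -1 marks a DBSCAN noise point
score = silhouette_score(points, labels)
```

For density-based results, a density-aware index such as DBCV is sometimes suggested as a complement to the silhouette.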

I don't want a recommendation system.

I want to group similar users who select similar items .

I want to then cluster them and label those clusters.

Jaccard similarity is not efficient because I have millions of users and items.

Matrix factorization gives the recommendation model. I don't want to recommend any items. I just want to group them and label.

I don't know if k-means clustering works well for multiple feature vectors.

Does anybody know to implement this?

Hi, I have a big GPS data of many users, my goal is to find key locations where users spent most of their time. So, basically I want to find key location of every user. Which clustering algorithm would be best to do this?
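Density-based clustering with a great-circle (haversine) distance is a frequent choice for this kind of task. As a simpler stand-in, the sketch below uses stay-point detection: it groups consecutive fixes that remain within a radius of the first fix of the run and returns their centroids. The coordinates, the 100 m radius, and the minimum of three fixes are illustrative assumptions, and timestamps are ignored for brevity.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def stay_points(fixes, radius_km, min_fixes):
    """Centroids of runs of consecutive fixes that stay near the run's first fix."""
    stays, start = [], 0
    while start < len(fixes):
        end = start + 1
        while end < len(fixes) and haversine_km(*fixes[start], *fixes[end]) <= radius_km:
            end += 1
        if end - start >= min_fixes:
            run = fixes[start:end]
            stays.append((
                sum(p[0] for p in run) / len(run),
                sum(p[1] for p in run) / len(run),
            ))
        start = end
    return stays

# Invented track for one user: linger, move, linger again.
track = [
    (52.5200, 13.4050), (52.5201, 13.4051), (52.5199, 13.4049),
    (52.5300, 13.4200),
    (52.5400, 13.4400), (52.5401, 13.4401), (52.5400, 13.4399),
]
stays = stay_points(track, radius_km=0.1, min_fixes=3)
```

Running this per user and then clustering the resulting stay points across days would give each user's recurring key locations.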

How to apply ensemble clustering when the clustering algorithm produces a different number of clusters?
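One approach that does not require the base clusterings to agree on the number of clusters is a co-association matrix: for each pair of points, record the fraction of clusterings that put them together. The sketch below builds that matrix; the three label lists are invented, and the final consensus step (for example hierarchical clustering on one minus the matrix) is left out.

```python
def co_association(labelings):
    """Fraction of base clusterings that place each pair of points together."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]

# Invented base clusterings of five points with differing cluster counts.
labelings = [
    [0, 0, 1, 1, 2],  # 3 clusters
    [0, 0, 0, 1, 1],  # 2 clusters
    [0, 0, 1, 1, 1],  # 2 clusters
]
co = co_association(labelings)
```

Because only pairwise co-membership is used, cluster IDs never need to be matched across runs.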

I have implemented the k-means to find the centroids of a cluster, now I want to find the "kpi" of the algorithm.

Hello,

I am a novice researcher in the field of clustering algorithms. DBSCAN is the algorithm whose mathematical background drew my attention, but unfortunately, apart from biological and medical fields, I could not think of other use cases for it. I would appreciate it if anyone could point me to useful resources in which DBSCAN is used in fields other than medicine or biology. Given my bachelor's degree (Industrial Engineering), I am particularly looking for industrial and manufacturing examples.

Regards.

I want to ask whether anyone here has an idea about stable approaches, and in particular how I can compare my newly developed stream clustering algorithm with, for instance, the stable approach proposed by Carlsson and Mémoli in light of such order-invariant stability:

thank you in advance Dr. Hany

Rowanda Ahmed

Hello everyone,

Currently I am trying to do k-means clustering on a microarray dataset which consists of 127 columns and 1000 rows. When I plot the graph, it gives an error like "figure margins too large". Then I wrote this in the R console:

par("mar") # shows the current margin sizes

par(mar=c(1,1,1,1)) # tried to update the margins

But it did not work. Can anyone suggest another way of fixing this problem? (Part of the code is attached below.)

Thanks,

Hasan

--------------------------------------------------------------------------------------------------------------

x = as.data.frame(x)

km_out = kmeans(x, 2, nstart = 20)

km_out$cluster

plot(x, col=(km_out$cluster+1), main="K - Means Clustering Results with K=2", xlab="", ylab="", pch=20, cex=2)

>Error in plot.new() : figure margins too large

Hi,

I want to reduce the number of rows of my data set. Until now I have used some clustering algorithms (kmeans, kmedoids, SOM), but recently I discovered some papers:

- lp row sampling with Lewis weights ( Cohen, Peng),
- Iterative row sampling (Li, Miller, Peng),
- Compressive sampling (Candes).

I would like to know what is the best method taking into account the density of the variables of the data set?

Is my question meaningful? Or does it make no sense?

I mean, I want a true representation of my data set.

Thanks,

Robin

Hello everybody,

Currently, I am working on a text clustering algorithm that combines the potential of Self-Organizing Maps (SOM) and K-means as to create valid clusters and to show their neighborhood. So far it works well, but I would like to see other potential approaches to compare results.

The idea is to improve the selection of documentation while preparing for research in a specific area. For example, you have a huge databank of papers related to a specific subject and you want to see the relationship between the publications and select just the area of interest.

Attached is an image of the SOM after organizing 1061 publication titles.

Does any of you have experience in the text clustering area?

Thank you in advance.

Hello, I'm a biologist interested in machine learning applications in genomic data; specifically, **I'm trying to apply clustering techniques to differential gene expression data.** I started by learning the basics of unsupervised learning and clustering algorithms with random datasets, but now I need to apply some of those algorithms (k-means, PAM, CLARA, SOM, DBSCAN...) to differential gene expression data and, honestly, I don't know where to begin, so **I'd be grateful if someone could recommend some tutorials or textbooks, or give me some tips.** Thank you for your time!

PS: I'm mainly using the **R language**, but **Python** tutorials are also OK for me.

The semi-supervised clustering framework MPCK-Means integrates both constraints and metric learning approaches (Bilenko et al., 2004). Later, several semi-supervised clustering algorithms were proposed that incorporate pairwise constraints to improve the clustering solution. All these methods take a long time to cluster large data (a few thousand data objects) when incorporating pairwise constraints, and incorporating the constraints is the main source of the running time.

Is there any paper that introduces an intuitive method for clustering evaluation?

I would like to use the most intuitive method, such as minimizing the within-cluster distance and maximizing the distance between neighboring clusters, but I am not sure whether this method has a name. Is there any related paper on it? The only method I'm using now is the Silhouette score. Many thanks.

Dear,

I am working on improving clustering efficiency by reducing the number of clusters and centers. Can you please suggest the best technique/algorithm for large, mixed-type IoT data?

I really do not understand the concept of a micro-cluster. I know it is a temporal extension of the Cluster Feature used in the BIRCH algorithm, but I am still confused about whether it is a data structure or something else.
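A micro-cluster can be viewed as a small data structure: a running summary of the points it has absorbed, not the points themselves. The sketch below stores the count, the per-dimension linear and squared sums as in a BIRCH cluster feature, and two timestamp sums in the spirit of CluStream. The class and attribute names are illustrative, not taken from any library.

```python
import math

class MicroCluster:
    """Additive summary of a group of points and their timestamps."""

    def __init__(self, dim):
        self.n = 0                # number of absorbed points
        self.ls = [0.0] * dim     # linear sum per dimension
        self.ss = [0.0] * dim     # squared sum per dimension
        self.t_sum = 0.0          # sum of timestamps
        self.t_sq_sum = 0.0       # sum of squared timestamps

    def add(self, point, timestamp):
        self.n += 1
        for k, x in enumerate(point):
            self.ls[k] += x
            self.ss[k] += x * x
        self.t_sum += timestamp
        self.t_sq_sum += timestamp * timestamp

    def merge(self, other):
        """Summaries are additive, so merging is element-wise addition."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]
        self.t_sum += other.t_sum
        self.t_sq_sum += other.t_sq_sum

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square distance of the absorbed points from the centroid."""
        variance = sum(ss / self.n - (ls / self.n) ** 2 for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(variance, 0.0))

mc = MicroCluster(dim=2)
for t, p in enumerate([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]):
    mc.add(p, timestamp=float(t))
```

Centroid, radius, and recency can all be derived from these sums, which is why a stream algorithm can discard the raw points.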

For everyone who wants the code of the ISODATA algorithm: it has been written and is now available to help you and save you time. Just send me a message and it will be sent to you.

I need a hierarchical clustering algorithm with the single linkage method. Everything I find is code that uses scikit-learn, but I don't want that. I want code that shows every detail of the algorithm.
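Here is a minimal from-scratch sketch of agglomerative clustering with single linkage, using only the standard library. It is the naive version that recomputes cluster distances at every step, so it is meant for reading rather than for large datasets. The sample points and the stopping rule (a target number of clusters) are illustrative choices.

```python
import math

def single_linkage(points, n_clusters):
    """Naive agglomerative clustering with single linkage."""
    clusters = [[i] for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters, merges

points = [(0, 0), (0, 1), (0, 2), (5, 5), (5, 6)]
clusters, merges = single_linkage(points, n_clusters=2)
```

The `merges` list records which clusters were joined and at what distance, which is the information a dendrogram is drawn from. Setting `n_clusters=1` produces the full merge history.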

Hello, I have a project on clustering using K-Means.

My question is how to name the clusters in K-Means. So far, researchers just label the clusters generically, e.g. cluster_1, cluster_2, cluster_3, cluster_n.

How can the clusters be given meaningful names instead, e.g. cluster_germany, cluster_us, cluster_british, etc.?